First Impressions
Georg Wrede
georg.wrede at nospam.org
Fri Sep 29 15:24:51 PDT 2006
Chad J wrote:
> Georg Wrede wrote:
>
>> The secret is, there actually is a delicate balance between UTF-8 and
>> the library string operations. As long as you use library functions to
>> extract substrings, join or manipulate them, everything is OK. And
>> very few of us actually either need to, or see the effort of
>> bit-twiddling individual octets in these "char" arrays.
>>
>
> But this is what I'm talking about... you can't slice them or index
> them. I might actually index a character out of an array from time to
> time. If I don't know about UTF, and I do just keep on coding, and I do
> something like this:
>
> char[] str = "some string in nonenglish text";
> for ( int i = 0; i < str.length; i++ )
> {
>     str[i] = doSomething( str[i] );
> }
>
> and this will fail right?
>
> If it does fail, then everything is not alright. You do have to worry
> about UTF. Someone has to tell you to use a foreach there.
Yes. That's why I talked about you falling down once you realise Daddy's
not holding the bike.
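To make the failure concrete, here is a minimal sketch. It assumes a current D compiler with Phobos (not the 2006 toolchain this thread was written against), and countCodePoints is a made-up helper name. It shows that .length and str[i] work on UTF-8 code units, while a foreach with a dchar loop variable decodes whole code points:

```d
import std.stdio : writeln;

// Hypothetical helper: count code points by letting foreach
// decode the UTF-8 sequence for us.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s) // one iteration per decoded code point
        ++n;
    return n;
}

void main()
{
    // "naive" with U+00EF in the middle; that letter needs two code units.
    string s = "na\u00EFve";

    writeln("code units:  ", s.length);
    writeln("code points: ", countCodePoints(s));

    // s[2] and s[3] are the two halves of one letter. Handing either
    // of them to a per-character function, as the quoted for loop
    // does, operates on a fragment rather than on a character.
}
```

So the quoted loop only behaves as long as the text happens to be pure ASCII.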
Part of UTF-8's magic lies in that it is amazingly easy to get working
smoothly with truly minor tweaks to "formerly ASCII-only" libraries --
so that even the most exotic languages have no problem.
Your concerns about the for loop are valid, and expected. Now, IMHO, the
standard library should take care of "all" the situations where you
would ever need to split, join, examine, or otherwise use strings,
"non-ASCII" or not. (And I really have no complaint (Walter!) about
this.) Therefore, in no normal circumstances should you have to twiddle
them yourself -- unless.
And this "unless" is exactly why I'm unhappy with the situation, too.
Problem is, _technology_wise_ the existing setup may actually be the
best, considering ease of writing the library, ease of using it,
robustness of both the library and users' code, and the headaches saved
for programmers who either haven't heard of the issue (whether they're
American or Chinese!) or who simply trust their lives to the machinery.
So, where's the actual problem???
At this point I'm inclined to say: the documentation, and the stage
props! The latter meaning: exposing the fact that our "strings" are just
arrays is psychologically wrong, and even more so is the fact that we're
shamelessly storing entities of variable length in arrays which have no
notion of such -- and, even worse, doing so while we brag about slices!
If this had been a university course assignment, we'd have been thrown
out of class, both for half-baked work and for arrogance towards our
client, victimizing the coder.
The former meaning: we should not be like "we're bad enough to overtly
use plain arrays for variable-length data, now if you have a problem
with it, then go home and learn stuff, or else just trust us".
Both "documentation" and "stage props" ultimately meaning that the
largest problem here is psychology, pedagogy, and education.
---
A lot would already be won by:
merely aliasing char[] to string, and discouraging all but guru-level
folks from screwing with the internals. This alone would save a lot of
Fear, Uncertainty and D-phobia.
The documentation should take pains to explain up front that if you
_really_ want to do Character-by-Character ops _and_ you live outside of
America, then the Right way to do it (ehh, actually the Canonical Way),
is to first convert the string to dchar[]. Period.
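A minimal sketch of that Canonical Way, assuming a current D compiler with Phobos. upperEachChar is a hypothetical name, and std.uni.toUpper merely stands in for the doSomething of the quoted loop:

```d
import std.conv : to;
import std.stdio : writeln;
import std.uni : toUpper;

// Hypothetical helper: convert to dchar[], work character by
// character with plain indexing, then convert back to UTF-8.
string upperEachChar(string s)
{
    dchar[] wide = s.to!(dchar[]);   // one array element per code point
    for (size_t i = 0; i < wide.length; i++)
        wide[i] = toUpper(wide[i]);  // indexing a dchar[] never splits a code point
    return wide.to!string;           // re-encode as UTF-8
}

void main()
{
    // U+00E5 and U+00F6 each occupy two code units in UTF-8.
    writeln(upperEachChar("\u00E5ngstr\u00F6m"));
}
```

The price is a copy and four bytes per character while the dchar[] is alive, which is exactly the trade-off the guru in the next paragraph may want to avoid.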
Then, if somebody else knows enough of UTF-8 and knows he can handle bit
twiddling more efficiently than using the Canonical Way, with plain
char[] and "foreignish", then let him. But let that be undocumented and
Un-Discussed in the docs. Precisely like a lot of other things are. (And
should be.) And will be. He's on his own, and he ought to know it.
---
In other words, the normal programmer should believe he's working with
black-box Strings, and he will be happy with it. That way he'll survive
whether he's in Urduland or Boise, Idaho -- without ever needing to have
heard about UTF or other crap.
Not until Appendix Z of the manual should we ever admit that the
Emperor's Clothes are just plain arrays, and we apologize for the breach
of manners of storing variable length data in simple naked arrays. And
here would be the right place to explain how come this hasn't blown up
in our faces already. And, exactly how you'll avoid it too. (This
_needs_ to contain an adequate explanation about the actual format of
UTF-8.)
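As a starting point for such an explanation, here is a sketch of the one rule that does most of the work: the first byte of every encoded character announces how many bytes the character occupies, and every following byte starts with the bits 10. Assumptions: a current D compiler, and sequenceLength is a hypothetical helper, not a library function. It checks bit patterns only and does not reject every invalid byte value.

```d
import std.stdio : writeln;

// Hypothetical helper: length in bytes of the UTF-8 sequence that
// starts with `lead`, or 0 if `lead` is a continuation byte or
// matches none of the lead-byte patterns.
size_t sequenceLength(char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx: continuation byte
}

void main()
{
    // One-, two- and three-byte characters in a row.
    string s = "a\u00E9\u20AC";

    // Hop from lead byte to lead byte. A string literal is valid
    // UTF-8, so sequenceLength never returns 0 at these positions.
    for (size_t i = 0; i < s.length; i += sequenceLength(s[i]))
        writeln("offset ", i, ": ", sequenceLength(s[i]), " byte(s)");
}
```

This is also why a slice taken at an arbitrary index can cut a character in half, while library routines that honour the lead bytes never do.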
---
TO RECAP
The _single_ biggest strings-related disservice to our pilgrims is to
lead them to believe that D stores strings in something like utf8[]
internally.
Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_
implemented, it would probably have to be an alias of char[][]. Right?
Right? What we have instead is ubyte[], which is _not_ the same as
utf8[].) (Oh, and if it ever becomes obvious that not _everybody_
understood this, then that in itself simply proves my point here.)
(*1)
And the fault lies in the documentation, not the implementation!
This results in wasted braincell-hours, precisely as many as everybody
has to waste before they realise that the acronym RAII is a filthy
lie. Akin only to the former "German _Democratic_ Republic". Only a
politician should be capable of this kind of deception.
Ok, nobody is doing it on purpose. Things being too clear to oneself
often makes it difficult to find ways to express them to new people.
(Happens every day at the Math department! :-( ) And since all
in-the-know are unable to see it, and all not-in-the-know are too, then
both groups might think it's the thing itself that is "the problem", and
not merely the chosen _presentation_ of it.
#################
Sorry for sounding Righteous, arrogant and whatever. But this really is a
5 minute thing for one person to fix for good, while it wastes entire
days or months _per_person_, from _every_ non-defoiled victim who
approaches the issue. Originally I was one of them: hence the aggression.
-------------------------------------------
(*1) Even I am not simultaneously both literally and theoretically right
here. Those who saw it right away, probably won't mind, since it's the
point that is the issue here.
Now, having to write this disclaimer IMHO simply underlines, once again,
the very point attempted here.