First Impressions

Georg Wrede georg.wrede at nospam.org
Fri Sep 29 15:24:51 PDT 2006


Chad J wrote:
> Georg Wrede wrote:
> 
>> The secret is, there actually is a delicate balance between UTF-8 and 
>> the library string operations. As long as you use library functions to 
>> extract substrings, join or manipulate them, everything is OK. And 
>> very few of us actually either need to, or see the effort of 
>> bit-twiddling individual octets in these "char" arrays.
>>
> 
> But this is what I'm talking about... you can't slice them or index 
> them.  I might actually index a character out of an array from time to 
> time.  If I don't know about UTF, and I do just keep on coding, and I do 
> something like this:
> 
> char[] str = "some string in nonenglish text";
> for ( int i = 0; i < str.length; i++ )
> {
>   str[i] = doSomething( str[i] );
> }
> 
> and this will fail right?
> 
> If it does fail, then everything is not alright.  You do have to worry 
> about UTF.  Someone has to tell you to use a foreach there.

Yes. That's why I talked about you falling down once you realise Daddy's 
not holding the bike.
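
To make the failure concrete, here is a minimal sketch. It assumes a 
present-day D compiler (D2) rather than the 2006 one, and 
countCodePoints is a name made up for this illustration. It shows that 
.length and plain indexing see UTF-8 code units, while a foreach over 
dchar decodes whole code points.

```d
import std.stdio : writeln;

// Hypothetical helper: counts code points by letting foreach decode
// the UTF-8 sequence one code point at a time.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)   // one iteration per decoded code point
        ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve";       // U+00EF takes two UTF-8 code units
    writeln(s.length);             // prints 6: code units, not characters
    writeln(countCodePoints(s));   // prints 5: code points
}
```

An index loop over s would visit six elements, two of which are only 
halves of one character, which is exactly where Chad's loop goes wrong.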

Part of UTF-8's magic lies in that it is amazingly easy to get working 
smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- 
so that even the most exotic languages have no problem.

Your concerns about the for loop are valid, and expected. Now, IMHO, the 
standard library should take care of "all" the situations where you 
would ever need to split, join, examine, or otherwise use strings, 
"non-ASCII" or not. (And I really have no complaint (Walter!) about 
this.) Therefore, in no normal circumstances should you have to twiddle 
them yourself -- unless.

And this "unless" is exactly why I'm unhappy with the situation, too.

Problem is, _technology_wise_ the existing setup may actually be the 
best, considering ease of writing the library, ease of using it, 
robustness of both the library and users' code, and the headaches saved 
for programmers who either haven't heard of the issue (whether they're 
American or Chinese!), or who simply trust their lives to the machinery.

So, where's the actual problem???

At this point I'm inclined to say: the documentation, and the stage 
props! The latter meaning: exposing the fact that our "strings" are just 
arrays is psychologically wrong, and even more so is the fact that we're 
shamelessly storing entities of variable length in arrays which have no 
notion of such -- even worse, while we brag with slices!

If this had been a university course assignment, we'd have been thrown 
out of class, both for half-baked work and for arrogance towards our 
client, victimizing the coder.

The former meaning: we should not be like "we're bad enough to overtly 
use plain arrays for variable-length data, now if you have a problem 
with it, then go home and learn stuff, or else just trust us".

Both "documentation" and "stage props" ultimately meaning that the 
largest problem here is psychology, pedagogy, and education.

---

A lot would already be won by:

merely aliasing char[] to string, and discouraging anyone but guru-level 
folks from screwing with the internals. This alone would save a lot of 
Fear, Uncertainty and D-phobia.

The documentation should take pains in explaining up front that if you 
_really_ want to do Character-by-Character ops _and_ you live outside of 
America, then the Right way to do it (ehh, actually the Canonical Way), 
is to first convert the string to dchar[]. Period.
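
A sketch of what that Canonical Way could look like, again assuming 
present-day D2 and Phobos (std.utf.toUTF32, std.utf.toUTF8 and 
std.uni.toUpper). Here doSomething is only a stand-in for whatever 
Chad's function does, and transformEachChar is a made-up name.

```d
import std.stdio : writeln;
import std.uni : toUpper;
import std.utf : toUTF32, toUTF8;

// Hypothetical stand-in for the doSomething() in Chad's loop.
dchar doSomething(dchar c)
{
    return toUpper(c);
}

// Convert to dchar[], work on it one element at a time, convert back.
string transformEachChar(string s)
{
    dchar[] wide = toUTF32(s).dup;   // one array element per code point
    for (size_t i = 0; i < wide.length; i++)
        wide[i] = doSomething(wide[i]);   // indexing is safe here
    return toUTF8(wide);
}

void main()
{
    writeln(transformEachChar("na\u00EFve"));
}
```

The price is one conversion each way plus a temporary array, which is 
the cost of not having to think about encoding inside the loop.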

Then, if somebody else knows enough of UTF-8 and knows he can handle bit 
twiddling more efficiently than using the Canonical Way, with plain 
char[] and "foreignish", then let him. But let that be undocumented and 
Un-Discussed in the docs. Precisely like a lot of other things are. (And 
should be.) And will be. He's on his own, and he ought to know it.

---

In other words, the normal programmer should believe he's working with 
black-box Strings, and he will be happy with it. That way he'll survive 
whether he's in Urduland or Boise, Idaho -- without ever needing to 
have heard about UTF or other crap.

Not until Appendix Z of the manual should we ever admit that the 
Emperor's Clothes are just plain arrays, and we apologize for the breach 
of manners of storing variable length data in simple naked arrays. And 
here would be the right place to explain how come this hasn't blown up 
in our faces already. And, exactly how you'll avoid it too. (This 
_needs_ to contain an adequate explanation about the actual format of 
UTF-8.)
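
As a taste of what that Appendix Z explanation might contain, here is a 
sketch of the UTF-8 lead-byte patterns, assuming present-day D2. 
sequenceLength is a name made up for this illustration (Phobos has 
std.utf.stride for real use), and it checks only the bit pattern of the 
first byte, so it does not reject overlong or otherwise invalid 
sequences.

```d
import std.stdio : writeln;

// Hypothetical helper: length in code units of the UTF-8 sequence that
// starts with `lead`, judged by bit pattern alone. Returns 0 when
// `lead` cannot start a sequence (a continuation byte, or a pattern
// outside the four below).
size_t sequenceLength(char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // 10xxxxxx: continuation byte
}

void main()
{
    string s = "\u20AC5";            // euro sign (3 code units), then '5'
    writeln(sequenceLength(s[0]));   // prints 3: lead byte of the euro sign
    writeln(sequenceLength(s[1]));   // prints 0: middle of a character
    writeln(sequenceLength(s[3]));   // prints 1: ASCII '5'
}
```

Every byte of a multi-byte sequence has its high bit set, so an ASCII 
byte can never turn up inside one. That is a large part of why library 
code that searches for or splits on ASCII delimiters keeps working on 
char[] without knowing anything about UTF-8.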

---

TO RECAP

The _single_ biggest strings-related disservice to our pilgrims is to

     lead them to believe, that D stores
     strings in something like utf8[]

internally.

Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_ 
implemented, it would probably have to be an alias of char[][]. Right? 
Right? What we have instead is ubyte[], which is _not_ the same as 
utf8[].) (Oh, and if it ever becomes obvious that not _everybody_ 
understood this, then that in itself simply proves my point here.)

(*1)

And the fault lies in the documentation, not the implementation!

This results in braincell-hours wasted, precisely as many as everybody 
has to waste before they realise that the acronym RAII is a filthy 
lie. Akin only to the former "German _Democratic_ Republic". Only a 
politician should be capable of this kind of deception.

Ok, nobody is doing it on purpose. Things being too clear to oneself 
often results in difficulty finding ways to express them to new people. 
(Happens every day at the Math department! :-( ) And since all 
in-the-know are unable to see it, and all not-in-the-know are too, then 
both groups might think it's the thing itself that is "the problem", and 
not merely the chosen _presentation_ of it.

#################

Sorry for sounding Righteous, arrogant and whatever. But this really is a 
5 minute thing for one person to fix for good, while it wastes entire 
days or months _per_person_, from _every_ non-defoiled victim who 
approaches the issue. Originally I was one of them: hence the aggression.

-------------------------------------------


(*1) Even I am not simultaneously both literally and theoretically right 
here. Those who saw it right away, probably won't mind, since it's the 
point that is the issue here.

Now, having to write this disclaimer, IMHO simply again underlines the 
very point attempted here.
