First Impressions

Sat Sep 30 17:15:56 PDT 2006

Anders F Björklund wrote:
>>> What is not powerful enough about the foreach(dchar c; str) ?
>>> It will step through that UTF-8 array one codepoint at a time.
>>
>>
>> I'm assuming 'str' is a char[], which would make that very nice.  But 
>> it doesn't solve correctly slicing or indexing into a char[].  
> 
> 
> Well, it's also a lot "trickier" than that... For instance, my last name
> can be written in Unicode as Björklund or Bj¨orklund, both of which are 
> valid - only that in one of them, the 'ö' occupies two full code points!
> It's still a single character, which is why Unicode avoids that term...
> 

So it seems to me the problem is that those two code points are both 2 
characters and 1 character at the same time.

In this case, I'd prefer being able to index to a safe default (like the 
ö, instead of the umlaut next to the o), or not being able to index at 
all.
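
To make the mismatch concrete, here is a minimal sketch that counts the 
same name three ways: UTF-8 bytes, code points, and user-perceived 
characters (graphemes).  It assumes a present-day D compiler with 
Phobos' std.uni.byGrapheme, which nobody in this thread mentions; the 
Counts struct and the measure function are made-up names for 
illustration only.

```d
import std.range : walkLength;
import std.uni : byGrapheme;   // assumption: modern Phobos, not referenced in this thread

/// Hypothetical record of three different "lengths" of one string.
struct Counts { size_t bytes, codePoints, graphemes; }

/// Hypothetical helper: measure a UTF-8 string three ways.
Counts measure(string s)
{
    size_t cps = 0;
    foreach (dchar c; s)   // decodes one code point per iteration
        ++cps;
    // s.length is the number of UTF-8 code units (bytes);
    // byGrapheme groups a base letter with its combining marks.
    return Counts(s.length, cps, s.byGrapheme.walkLength);
}

void main()
{
    import std.stdio : writeln;
    writeln(measure("Bj\u00F6rklund"));    // precomposed ö (U+00F6)
    writeln(measure("Bjo\u0308rklund"));   // o followed by combining diaeresis (U+0308)
}
```

Only the grapheme count is the same for both spellings, so indexing by 
byte or by code point can still land in the middle of what a reader 
would call one character.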

> As you know, if you need to access your strings by codepoint (something 
> that the Unicode group explicitly recommends against, in their FAQ) then 
> char[] isn't a very nice format - because of the conversion overhead...
> But it's still possible to translate, transform, and translate back ?
> 

I read that FAQ at the bottom of this post, and didn't see anything 
about accessing strings by codepoint.  Maybe you mean a different FAQ 
here, in which case, could I have a link please?  I've been to the 
Unicode site before and all I remember was being confused and having a 
hard time finding the info I wanted :(

Also I still am not sure exactly what a code point is.  And that FAQ at 
the bottom used the word "surrogate" a lot; I'm not sure about that one 
either.

When you say char[] isn't a nice format, I wasn't thinking about having 
the string class I mentioned earlier store the data ONLY as char[].  It 
might be wchar[].  Or dchar[].  Then it would be automatically converted 
between the two either at compile time (when possible) or dynamically at 
runtime (hopefully only when needed).  So if someone throws a Chinese 
character literal at it, there is a very big clue there to use UTF-32 or 
something that can store all of the characters in a uniform width sort 
of way, to speed indexing.  Algorithms could be used so that a program 
'learns' at runtime what kind of strings are dominating the program, and 
uses algorithms optimized for those.  Maybe this is a bit too complex, 
but I can dream, hehe.
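
A rough sketch of the "pick the storage from the contents" idea, under 
the assumption that the goal is the narrowest encoding in which every 
code point occupies exactly one code unit, so that plain indexing lands 
on code point boundaries.  The Width enum and narrowestUniformWidth are 
hypothetical names, not part of Phobos or of any class mentioned in this 
thread.

```d
/// Hypothetical tag for the three storage choices (char[], wchar[], dchar[]).
enum Width { utf8, utf16, utf32 }

/// Hypothetical helper: narrowest encoding in which every code point of s
/// fits in a single code unit.
Width narrowestUniformWidth(string s)
{
    auto w = Width.utf8;
    foreach (dchar c; s)            // walk the string one code point at a time
    {
        if (c > 0xFFFF)
            return Width.utf32;     // outside the BMP: UTF-16 would need a surrogate pair
        if (c > 0x7F)
            w = Width.utf16;        // beyond ASCII: UTF-8 would need several bytes
    }
    return w;
}

void main()
{
    import std.stdio : writeln;
    writeln(narrowestUniformWidth("hello"));
    writeln(narrowestUniformWidth("\u4E2D"));       // a common Chinese character, inside the BMP
    writeln(narrowestUniformWidth("\U0001D11E"));   // musical G clef, outside the BMP
}
```

One thing the sketch shows is that a common Chinese character such as 
U+4E2D already fits in one wchar; only code points above U+FFFF force 
dchar[].  Combining sequences still span several code units in any of 
the three.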

>> If nothing was done about this and I absolutely needed UTF support,
>> I'd probably make a class like so: [...]
> 
> 
> In my own mock String class, I cached the dchar[] codepoints on demand.
> (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
> 
>> All in all it is a drag that we should have to learn all of this UTF 
>> stuff.  I want char[] to just work!
> 
> 
> Using Unicode strings and characters does require a little learning...
> (where http://www.unicode.org/faq/utf_bom.html is a very good page)
> And D does force you to think about string implementation, no question.
> This has both pros and cons, but it is a deliberate language decision.
> 
> If you're willing to handle the "surrogates", then UTF-16 is a rather
> good trade-off between the default UTF-8 and wasteful UTF-32 formats ?
> A downside is that it is not "ascii-compatible" (has embedded NUL chars)
> and that it is endian-dependent unlike the more universal UTF-8 format.
> 
> --anders

My impression has gone from being quite scared of UTF to being not so 
worried, but only for myself.  D seems to be good at handling UTF, but 
only if someone tells you to never handle strings as arrays of 
characters.  Unfortunately, the first thing you see in a lot of D 
programs is "int main( char[][] args )" and there are some arrays of 
characters being used as strings.  This also means that some array 
capabilities like indexing and the braggable slicing are more dangerous 
than useful for string handling.  It's a newbie trap.

Like I said earlier, I either want to be able to index/slice strings 
safely, or not at all (or better yet, not by any intuitive means).
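
For what it's worth, a minimal sketch of what slicing by code point 
could look like on top of a plain UTF-8 string, assuming a present-day 
Phobos where std.utf.stride reports how many bytes the sequence at a 
given offset occupies.  byteOffset and sliceCodePoints are hypothetical 
helpers, not library functions, and the input is assumed to be valid 
UTF-8.

```d
import std.utf : stride;   // assumption: modern Phobos

/// Hypothetical helper: byte offset of the n-th code point,
/// clamped to s.length when n runs past the end.
size_t byteOffset(string s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        i += stride(s, i);   // skip one whole UTF-8 sequence
        --n;
    }
    return i;
}

/// Hypothetical helper: slice by code point positions [from, to)
/// instead of by byte offsets.
string sliceCodePoints(string s, size_t from, size_t to)
{
    assert(from <= to, "from must not exceed to");
    return s[byteOffset(s, from) .. byteOffset(s, to)];
}

void main()
{
    import std.stdio : writeln;
    string name = "Bj\u00F6rklund";
    writeln(sliceCodePoints(name, 0, 3));   // keeps the ö whole
    writeln(name[0 .. 3].length);           // raw slice: 3 bytes, cut inside the ö
}
```

This costs a linear scan per slice, and it only protects code point 
boundaries: a decomposed o + U+0308 can still be split between the two 
code points.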