First Impressions

Fri Sep 29 13:45:30 PDT 2006

Chad J wrote:

> Probably 7-bit.  Anything where the size of one character is ALWAYS one 
> byte.  I am already assuming that ASCII is a subset or at least is 
> mostly a subset of UTF8.  However, I talk about it in an exclusive 
> manner because if you handle UTF8 strings properly then the code will 
> probably run at least slightly slower than with ASCII-only strings.

It's mostly about looking out for the UTF-8 multi-byte sequences (any
byte with the high bit set), which is not more than a simple assertion
in your ASCII-only functions really...
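To make that concrete, here is a minimal sketch of such an assertion. The names isAsciiOnly and asciiUpper are made up for illustration, and the code assumes a current D compiler; it is not taken from any library.

```d
/// Returns true if every code unit is 7-bit ASCII, i.e. the string
/// contains no UTF-8 lead or continuation bytes (those are all >= 0x80).
bool isAsciiOnly(const(char)[] s)
{
    foreach (char c; s)          // iterates raw code units, no decoding
        if (c >= 0x80)
            return false;
    return true;
}

/// Hypothetical ASCII-only function: upper-cases a..z in place and
/// asserts up front that the input really is plain ASCII.
char[] asciiUpper(char[] s)
{
    assert(isAsciiOnly(s), "asciiUpper expects 7-bit ASCII input");
    foreach (ref char c; s)
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - 32);   // 'a' - 'A' == 32
    return s;
}
```

The check is a single pass over the bytes, and since asserts are compiled out with -release the fast path stays as fast as before.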

I don't think handling UTF-8 properly is a burden for string functions, 
when you compare it with the enormous gain that it has over ASCII-only.

>> What is not powerful enough about the foreach(dchar c; str) ?
>> It will step through that UTF-8 array one codepoint at a time.
> 
> I'm assuming 'str' is a char[], which would make that very nice.  But it 
> doesn't solve correctly slicing or indexing into a char[].  
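For slicing and indexing, the same foreach can be used to find byte offsets, since the index it hands out is the position of the first code unit of each code point. A rough sketch, assuming a current D compiler; countCodePoints, byteOffset and sliceCodePoints are hypothetical helper names, not Phobos functions:

```d
/// Counts code points by letting foreach decode the UTF-8.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Byte offset of the n-th code point, or s.length if n is past the end.
size_t byteOffset(const(char)[] s, size_t n)
{
    size_t seen = 0;
    foreach (size_t i, dchar c; s)   // i = index of c's first code unit
    {
        if (seen == n)
            return i;
        ++seen;
    }
    return s.length;
}

/// Slices by code point positions [from, to) instead of byte positions.
const(char)[] sliceCodePoints(const(char)[] s, size_t from, size_t to)
{
    return s[byteOffset(s, from) .. byteOffset(s, to)];
}
```

Each lookup is a linear scan, so this is fine for the occasional slice but not for heavy random access.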

Well, it's also a lot "trickier" than that... For instance, my last name
can be written in Unicode as Björklund with a precomposed 'ö' (U+00F6),
or with a plain 'o' followed by a combining diaeresis (U+0308). Both are
valid, only that in the second one the 'ö' occupies two full code points!
It's still a single character, which is why Unicode avoids that term...
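The two spellings can be compared in code. This is only a sketch: it assumes today's Phobos (std.uni and std.range), and the helper names are my own.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Number of code points (walkLength decodes narrow strings).
size_t codePointLength(string s)
{
    return s.walkLength;
}

/// Number of graphemes, i.e. what a reader would call characters.
size_t graphemeLength(string s)
{
    return s.byGrapheme.walkLength;
}

/// True if both strings are the same text once normalized to NFC.
bool sameTextNFC(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}
```

With "Bj\u00F6rklund" and "Bjo\u0308rklund" the code point lengths differ by one, while the grapheme lengths agree and the NFC comparison treats them as equal.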

As you know, if you need to access your strings by codepoint (something 
that the Unicode group explicitly recommends against, in their FAQ) then 
char[] isn't a very nice format - because of the conversion overhead...
But it's still possible to convert to dchar[], transform, and convert back.
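A small sketch of that round trip, assuming the toUTF32 and toUTF8 functions from today's std.utf; replaceCodePoint is a hypothetical example of a "transform":

```d
import std.utf : toUTF32, toUTF8;

/// Replaces the n-th code point of a UTF-8 string.
/// Decodes to dchar[], edits by plain index, then encodes back.
string replaceCodePoint(const(char)[] s, size_t n, dchar replacement)
{
    dchar[] cps = toUTF32(s).dup;   // decode: one element per code point
    cps[n] = replacement;           // bounds-checked random access
    return toUTF8(cps);             // encode back to UTF-8
}
```

The cost is two full conversions and a temporary array four bytes wide per code point, which is the overhead mentioned above.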

> If nothing was done about this and I absolutely needed UTF support,
> I'd probably make a class like so: [...]

In my own mock String class, I cached the dchar[] codepoints on demand.
(viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)
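The idea of caching on demand looks roughly like this. Note that this is a fresh sketch with invented names (CachedString, slice), not the class from the link, and it assumes toUTF32 from today's std.utf:

```d
import std.utf : toUTF32;

/// Hypothetical wrapper: keeps the UTF-8 text and decodes it to
/// code points only the first time indexed access is requested.
final class CachedString
{
    private string utf8;
    private dstring cache;          // filled in lazily
    private bool decoded = false;

    this(string s) { utf8 = s; }

    private dstring codePoints()
    {
        if (!decoded)
        {
            cache = toUTF32(utf8);  // one-time conversion
            decoded = true;
        }
        return cache;
    }

    /// Length in code points, not bytes.
    size_t length() { return codePoints().length; }

    /// The i-th code point.
    dchar opIndex(size_t i) { return codePoints()[i]; }

    /// Code points in [from, to).
    dstring slice(size_t from, size_t to) { return codePoints()[from .. to]; }

    /// The original UTF-8 text, untouched.
    override string toString() { return utf8; }
}
```

Strings that are only passed around or printed never pay for the decode; the first index or slice does, and later ones reuse the cache.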

> All in all it is a drag that we should have to learn all of this UTF 
> stuff.  I want char[] to just work!

Using Unicode strings and characters does require a little learning...
(where http://www.unicode.org/faq/utf_bom.html is a very good page)
And D does force you to think about string implementation, no question.
This has both pros and cons, but it is a deliberate language decision.

If you're willing to handle the "surrogates", then UTF-16 is a rather
good trade-off between the default UTF-8 and the wasteful UTF-32 format.
A downside is that it is not "ASCII-compatible" (has embedded NUL bytes)
and that it is endian-dependent, unlike the more universal UTF-8 format.
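All three points can be seen in a few lines. A sketch only, assuming toUTF16 from today's std.utf; utf16UnitCount and utf16Bytes are hypothetical helpers:

```d
import std.utf : toUTF16;

/// Number of UTF-16 code units; code points above U+FFFF take two
/// units (a surrogate pair).
size_t utf16UnitCount(const(char)[] s)
{
    return toUTF16(s).length;
}

/// Serializes the UTF-16 code units to bytes in the requested order.
ubyte[] utf16Bytes(const(char)[] s, bool bigEndian)
{
    ubyte[] result;
    foreach (wchar unit; toUTF16(s))      // iterates code units
    {
        ubyte hi = cast(ubyte)(unit >> 8);
        ubyte lo = cast(ubyte)(unit & 0xFF);
        if (bigEndian)
            result ~= [hi, lo];
        else
            result ~= [lo, hi];
    }
    return result;
}
```

Plain "A" comes out as 00 41 or 41 00 depending on byte order, with the NUL byte that breaks C-style string handling, and U+1D11E needs the surrogate pair D834 DD1E.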

--anders