First Impressions

Sun Oct 1 00:54:06 PDT 2006

Chad J wrote:

> I read that FAQ at the bottom of this post, and didn't see anything 
> about accessing strings by codepoint.  Maybe you mean a different FAQ 
> here, in which case, could I have a link please?  I've been to the 
> unicode site before and all I remember was being confused and having a 
> hard time finding the info I wanted :(

I meant http://www.unicode.org/faq/utf_bom.html#12

> Also I still am not sure exactly what a code point is.  And that FAQ at 
> the bottom used the word "surrogate" a lot; I'm not sure about that one 
> either.

Code point is the closest thing to a "character", although it might take 
more than one Unicode code point to represent a single Unicode grapheme.

Surrogates are used with UTF-16, to represent "too large" code points...
i.e. they always occur in "surrogate pairs", which combine to a single
code point (U+10000 and above).
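
To make the pairing concrete, here is a minimal sketch of how a high and
a low surrogate combine. The helper name combineSurrogates is made up
for illustration (it is not a Phobos function), and the code assumes a
current D compiler:

```d
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
dchar combineSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF); // high (lead) surrogate
    assert(lo >= 0xDC00 && lo <= 0xDFFF); // low (trail) surrogate
    // Each surrogate carries 10 bits; the result starts at U+10000.
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

unittest
{
    // U+1D11E (musical G clef) is stored as D834 DD1E in UTF-16.
    assert(combineSurrogates(cast(wchar)0xD834, cast(wchar)0xDD1E) == 0x1D11E);
}
```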

> When you say char[] isn't a nice format, I wasn't thinking about having 
> the string class I mentioned earlier store the data ONLY as char[].  It 
> might be wchar[].  Or dchar[].  Then it would be automatically converted 
> between the two either at compile time (when possible) or dynamically at 
> runtime (hopefully only when needed).  So if someone throws a Chinese 
> character literal at it, there is a very big clue there to use UTF32 or 
> something that can store all of the characters in a uniform width sort 
> of way, to speed indexing.  Algorithms could be used so that a program 
> 'learns' at runtime what kind of strings are dominating the program, and 
> uses algorithms optimized for those.  Maybe this is a bit too complex, 
> but I can dream, hehe.

Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway...
(UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above)

We already have char[] as the string default in D, but most models for
a String class use wchar[] (i.e. UTF-16), for instance Mango or Java:
* http://mango.dsource.org/classUString.html (uses the ICU lib)
* http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html

All formats do use Unicode, so converting from one UTF to another is 
mostly a question of memory/performance and not about any data loss.
However, it is not converted at compile time (without using templates)
so mixing and matching different representations is somewhat of a pain.
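
As a rough illustration of the "no data loss" part, the sketch below
converts at runtime with std.utf from Phobos. It assumes a current D
compiler, where the string types are spelled string/wstring/dstring;
roundTrips is a hypothetical name used only for this example:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: true if s survives UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
bool roundTrips(string s)
{
    wstring w = toUTF16(s);   // UTF-8  -> UTF-16
    dstring d = toUTF32(w);   // UTF-16 -> UTF-32
    return toUTF8(d) == s;    // UTF-32 -> UTF-8, compared with the original
}

unittest
{
    assert(roundTrips("h\u00F6j"));     // non-ASCII, inside the BMP
    assert(roundTrips("\U0001D11E"));   // outside the BMP
}
```

Each conversion allocates a new array, which is the memory/performance
cost mentioned above.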

I think that char[] for string and wchar[] for String are good defaults.

> My impression has gone from being quite scared of UTF to being not so 
> worried, but only for myself.  D seems to be good at handling UTF, but 
> only if someone tells you to never handle strings as arrays of 
> characters.  Unfortunately, the first thing you see in a lot of D 
> programs is "int main( char[][] args )" and there are some arrays of 
> characters being used as strings.  This also means that some array 
> capabilities like indexing and the braggable slicing are more dangerous 
> than useful for string handling.  It's a newbie trap.

It is, since it isn't really "arrays of characters" but "arrays of code 
units". What muddies the waters further is that sometimes they're equal.
That is, with ASCII characters each character fits into a D char unit.
Without surrogates, each character (from BMP) fits into one wchar unit.
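
A small sketch of when the counts agree and when they don't, assuming a
current D compiler and Phobos (the UnitCounts struct and the unitCounts
function are invented for this example):

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical record: code units needed for the same text per encoding.
struct UnitCounts
{
    size_t utf8;   // char units
    size_t utf16;  // wchar units
    size_t utf32;  // dchar units
}

UnitCounts unitCounts(string s)
{
    // .length counts code units, not characters.
    return UnitCounts(s.length, toUTF16(s).length, toUTF32(s).length);
}

unittest
{
    assert(unitCounts("abc") == UnitCounts(3, 3, 3));        // ASCII: all equal
    assert(unitCounts("\u00F6") == UnitCounts(2, 1, 1));     // BMP, non-ASCII
    assert(unitCounts("\U0001D11E") == UnitCounts(4, 2, 1)); // needs surrogates
}
```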

However, all code that handles the shorter formats should be prepared to 
handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32:
bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }
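
One way to stay out of trouble without switching the storage to UTF-32
is to let foreach decode for you. This is only a sketch, and
countCodePoints is a made-up name, but the decoding foreach itself is
part of the language:

```d
// Hypothetical helper: count code points (not code units) in UTF-8 text.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s) // asking for dchar makes foreach decode the UTF-8
        ++n;
    return n;
}

unittest
{
    string s = "h\u00F6j";
    assert(s.length == 4);            // four UTF-8 code units
    assert(countCodePoints(s) == 3);  // three code points
}
```

Note that this still counts code points, so a grapheme built from
several code points is counted more than once.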

But a warning that D uses multi-byte strings might be in order, yes...
Another warning that it only supports UTF-8 platforms* might also be in order?

--anders

* "main(char[][] args)" does not work for any non-UTF consoles,
   as you will get invalid UTF sequences for the non-ASCII chars.
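
A program can at least detect that case before touching the arguments.
The sketch below assumes a current Phobos, where std.utf.validate throws
a UTFException on malformed input; isValidUtf8 is a hypothetical wrapper
written for this example:

```d
import std.utf : validate, UTFException;

// Hypothetical wrapper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on an invalid sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    assert(isValidUtf8("abc"));
    // 0xF6 is "ö" in Latin-1, but on its own it is not valid UTF-8.
    assert(!isValidUtf8([cast(char)0xF6]));
}
```

Checking each element of args this way only reports the problem; it
cannot tell which legacy encoding the console actually used.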