until strange behavior

Jonathan M Davis jmdavisProg at gmx.com
Sun Jun 2 16:12:09 PDT 2013


On Monday, June 03, 2013 01:04:28 Jack Applegame wrote:
> On Sunday, 2 June 2013 at 20:50:31 UTC, Jonathan M Davis wrote:
> > http://stackoverflow.com/questions/12288465
> 
> Let's say we have a string of chars containing UTF-8 text.
> Does front(str[]) automatically convert the first Unicode character
> to UTF-32 and return it?
> I made a test case and the answer is: "Yes, it does!"
> Maybe this makes sense, but such an implicit conversion confuses
> everyone I have asked.
> Therefore, a string is not an ordinary array (in the Phobos context),
> but a special array with special processing rules.
> 
> I'm moving from C++ and often ask myself: "Why does D have so many
> hidden, confusing things?"

The language treats strings as arrays of code units. The standard library 
treats them as ranges of code points. Yes, this can be confusing, but we need 
both. In order to operate on strings efficiently, they need to be made up of 
code units, but correctness requires code points. This means that the 
complexity is to a great extent an intrinsic part of dealing with strings 
properly. In C++, people usually just screw it up and treat char as if it were
a character when in fact it's not. It's a code unit, which is often just a
piece of a character.
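
To make the split concrete, here is a minimal sketch, assuming a Phobos
version that still decodes narrow strings in its range primitives. The sample
string "héllo" is just an illustration. Indexing and .length see code units,
while the range primitives (front, walkLength) see code points:

```d
import std.range;   // front, walkLength, ElementType
import std.stdio : writeln;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    // Language view: an array of code units (char).
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 6);

    // Phobos range view: a range of code points (dchar).
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'h');
    assert(s.walkLength == 5);

    writeln(s.length, " code units, ", s.walkLength, " code points");
}
```

So front(str) returning a dchar is the range view at work; str[0] still
gives you the raw char.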

Whether we went about handling the complexity of code units vs code points in 
the best manner is debatable, but it can't be made simple if you want both 
efficiency and correctness. A better approach might have been a string type
that operated on code points and held the code units internally, so that
everything operated on code points by default. But the library stuff was added
later, and Walter Bright tends to think that everyone should understand
Unicode well, so the decisions he makes with regard to that aren't always the
best (since most people don't understand Unicode well and don't want to care).
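
For illustration only, such a type could look roughly like the sketch below.
CodePointString and its codeUnits property are hypothetical names, not
anything in Phobos. The only library call assumed is std.utf.decode, which
decodes one code point and advances the index past it:

```d
import std.utf : decode;

// Hypothetical type, not part of Phobos: it stores UTF-8 code units
// internally but exposes code points by default.
struct CodePointString
{
    private string units; // code units held internally

    this(string s) { units = s; }

    // Input range primitives that work in code points.
    @property bool empty() { return units.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(units, i); // decode the first code point
    }

    void popFront()
    {
        size_t i = 0;
        decode(units, i);        // i now points past the first code point
        units = units[i .. $];
    }

    // Length in code points; O(n) because UTF-8 is variable width.
    @property size_t length()
    {
        size_t n = 0;
        for (size_t i = 0; i < units.length; ++n)
            decode(units, i);
        return n;
    }

    // Explicit access to the code units for when efficiency matters.
    @property string codeUnits() { return units; }
}

void main()
{
    auto s = CodePointString("héllo");
    assert(s.length == 5);           // code points
    assert(s.codeUnits.length == 6); // code units
    s.popFront();
    assert(s.front == '\u00E9');     // 'é' comes back as one dchar
}
```

The trade-off shows up in length: counting code points means walking the
whole string, which is why access to the code units still has to be there.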

What we have actually works quite well, but it does require that you come to 
at least a basic understanding of the difference between code units and code 
points.

- Jonathan M Davis


More information about the Digitalmars-d-learn mailing list