Inconsistency
nickles
ben at world-of-ben.de
Sun Oct 13 09:31:56 PDT 2013
> This will _not_ return a trailing surrogate of a Cyrillic
> letter. It will return the second code unit of the "ä"
> character (U+00E4).
True. It's UTF-8, not UTF-16.
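To make that concrete, here is a minimal sketch (the literal
"zä" is my own example, not the one from the earlier post):

    import std.stdio;

    void main()
    {
        string s = "zä"; // 'z' is one UTF-8 code unit, 'ä' (U+00E4) is two (C3 A4)
        writeln(s.length);                  // 3 -- code units, not characters
        writefln("%02X", cast(ubyte) s[2]); // A4 -- second code unit of 'ä'
    }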
> However, it could also yield the first code unit of the umlaut
> diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to byte-order
("endianness") issues.
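For illustration: the same code point always serializes to the
same UTF-8 byte sequence on every platform, while a UTF-16 code
unit does have a byte order in memory. A quick sketch (again my
own example):

    import std.stdio;

    void main()
    {
        string s = "\u00E4";   // UTF-8: always the byte sequence C3 A4
        writefln("%(%02X %)", cast(immutable(ubyte)[]) s);

        wstring w = "\u00E4"w; // UTF-16: one code unit, machine byte order
        writefln("%(%02X %)", cast(immutable(ubyte)[]) w); // E4 00 on little-endian
    }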
> If the string were in UTF-32, [2] could yield either the
> Cyrillic character, or the umlaut diacritic.
> The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. A UTF-32 code unit simply is the
code point; there is no interpretation involved (except for the
endianness, which a library/the core could take care of).
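In D terms, a dstring makes this directly visible:

    import std.stdio;

    void main()
    {
        dstring d = "zäб"d; // UTF-32: one code unit per code point
        writeln(d.length);  // 3
        writeln(d[1]);      // 'ä' -- indexing yields the whole code point
    }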
> There are multiple reasons why .length and index access is
> based on code units rather than code points or any higher level
> representation, but one is that the complexity would suddenly
> be O(n) instead of O(1).
See my last statement below.
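For reference, that O(n) walk is exactly what a dchar foreach
over a string does today:

    import std.stdio;

    void main()
    {
        string s = "zäб";
        // A dchar loop variable makes foreach decode UTF-8 on the fly;
        // reaching the n-th code point is an O(n) walk, not O(1) indexing.
        foreach (byteIndex, dchar c; s)
            writefln("byte offset %s: U+%04X", byteIndex, c);
    }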
> In-place modifications of char[] arrays also wouldn't be
> possible anymore
They would be, but
> as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D constantly creates
and throws away new strings anyway, so this isn't much of an
argument.
The current solution puts the programmer in charge of dealing
with UTF-x, whereas a more consistent implementation would put
the burden on the implementors of the libraries/core, i.e. the
ones who usually have a better understanding of Unicode than the
average programmer.
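As far as I can tell, Phobos already exposes some of this
code-point-level machinery, e.g. std.range.walkLength, which
counts decoded code points rather than code units:

    import std.range : walkLength;
    import std.stdio;

    void main()
    {
        string s = "zäб";
        writeln(s.length);     // 5 -- UTF-8 code units
        writeln(s.walkLength); // 3 -- code points, computed in O(n)
    }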
Also, implementing such semantics would not per se preclude
byte-wise access, would it?
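As far as I understand, something like std.string.representation
already covers that case, i.e. raw code units on demand:

    import std.stdio;
    import std.string : representation;

    void main()
    {
        string s = "zä";
        // representation() exposes the raw UTF-8 code units as ubyte[],
        // so byte-wise access stays available next to a higher-level API.
        foreach (b; s.representation)
            writef("%02X ", b);
        writeln(); // 7A C3 A4
    }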
So, how do you guys handle UTF-8 strings in D? What are your
solutions to the problems described? Does it all come down to
converting "string"s and "wstring"s to "dstring"s, manipulating
them, and re-converting them to "string"s? Btw, what would this
mean in terms of speed?
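For concreteness, the round trip I have in mind would look
roughly like this:

    import std.conv : to;
    import std.stdio;

    void main()
    {
        string s = "zäб";
        dstring d = s.to!dstring;  // decode once: O(n), allocates
        d = d[0 .. 2];             // code-point slicing on dstring is O(1)
        string back = d.to!string; // re-encode to UTF-8
        writeln(back);             // zä
    }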
There is no irony in my questions. I'm really looking for
solutions...