Inconsistency
nickles
ben at world-of-ben.de
Sun Oct 13 09:31:56 PDT 2013
> This will _not_ return a trailing surrogate of a Cyrillic
> letter. It will return the second code unit of the "ä"
> character (U+00E4).
True. It's UTF-8, not UTF-16.
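To make that concrete, here is a minimal sketch (the literal
"zä" is my own example, not the one from the earlier post):

    import std.stdio;

    void main()
    {
        string s = "zä"; // 'z' is one UTF-8 code unit, 'ä' (U+00E4) is two (C3 A4)
        writeln(s.length);                  // 3 -- code units, not characters
        writefln("%02X", cast(ubyte) s[2]); // A4 -- second code unit of 'ä'
    }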
> However, it could also yield the first code unit of the umlaut
> diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to byte-order
("endianness") issues.
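For illustration: the same code point always serializes to the
same UTF-8 byte sequence on every platform, while a UTF-16 code
unit does have a byte order in memory. A quick sketch (again my
own example):

    import std.stdio;

    void main()
    {
        string s = "\u00E4";   // UTF-8: always the byte sequence C3 A4
        writefln("%(%02X %)", cast(immutable(ubyte)[]) s);

        wstring w = "\u00E4"w; // UTF-16: one code unit, machine byte order
        writefln("%(%02X %)", cast(immutable(ubyte)[]) w); // E4 00 on little-endian
    }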
> If the string were in UTF-32, [2] could yield either the
> Cyrillic character, or the umlaut diacritic.
> The .length of the UTF-32 string could be either 3 or 4.
Neither is true for UTF-32. A UTF-32 code unit simply is the
code point; there is no interpretation involved (except for the
endianness, which a library/the core could take care of).
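In D terms, a dstring makes this directly visible:

    import std.stdio;

    void main()
    {
        dstring d = "zäб"d; // UTF-32: one code unit per code point
        writeln(d.length);  // 3
        writeln(d[1]);      // 'ä' -- indexing yields the whole code point
    }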
> There are multiple reasons why .length and index access is
> based on code units rather than code points or any higher level
> representation, but one is that the complexity would suddenly
> be O(n) instead of O(1).
See my last statement below.
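For reference, that O(n) walk is exactly what a dchar foreach
over a string does today:

    import std.stdio;

    void main()
    {
        string s = "zäб";
        // A dchar loop variable makes foreach decode UTF-8 on the fly;
        // reaching the n-th code point is an O(n) walk, not O(1) indexing.
        foreach (byteIndex, dchar c; s)
            writefln("byte offset %s: U+%04X", byteIndex, c);
    }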
> In-place modifications of char[] arrays also wouldn't be
> possible anymore
They would be, but
> as the size of the underlying array might have to change.
Well, that's a point; on the other hand, D constantly creates
and throws away new strings anyway, so this isn't much of an
argument.
The current solution puts the programmer in charge of dealing
with UTF-x, whereas a more consistent implementation would put
the burden on the implementors of the libraries/core, i.e. the
ones who usually have a better understanding of Unicode than the
average programmer.
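As far as I can tell, Phobos already exposes some of this
code-point-level machinery, e.g. std.range.walkLength, which
counts decoded code points rather than code units:

    import std.range : walkLength;
    import std.stdio;

    void main()
    {
        string s = "zäб";
        writeln(s.length);     // 5 -- UTF-8 code units
        writeln(s.walkLength); // 3 -- code points, computed in O(n)
    }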
Also, implementing such semantics would not per se preclude
byte-wise access, would it?
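As far as I understand, something like std.string.representation
already covers that case, i.e. raw code units on demand:

    import std.stdio;
    import std.string : representation;

    void main()
    {
        string s = "zä";
        // representation() exposes the raw UTF-8 code units as ubyte[],
        // so byte-wise access stays available next to a higher-level API.
        foreach (b; s.representation)
            writef("%02X ", b);
        writeln(); // 7A C3 A4
    }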
So, how do you guys handle UTF-8 strings in D? What are your
solutions to the problems described? Does it all come down to
converting "string"s and "wstring"s to "dstring"s, manipulating
them, and re-converting them to "string"s? Btw, what would this
mean in terms of speed?
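For concreteness, the round trip I have in mind would look
roughly like this:

    import std.conv : to;
    import std.stdio;

    void main()
    {
        string s = "zäб";
        dstring d = s.to!dstring;  // decode once: O(n), allocates
        d = d[0 .. 2];             // code-point slicing on dstring is O(1)
        string back = d.to!string; // re-encode to UTF-8
        writeln(back);             // zä
    }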
There is no irony in my questions. I'm really looking for
solutions...