Inconsistency

Sönke Ludwig sludwig at outerproduct.org
Sun Oct 13 07:48:17 PDT 2013


On 13.10.2013 16:14, nickles wrote:
> Ok, I understand, that "length" is - obviously - used in analogy to any
> array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing
> "char"s as UTF-8 which means that a "char" in D can be of any length
> between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
> then this (i.e. the character's length) be the "unit of measurement" for
> "char"s - like e.g. the size of the underlying struct in an array of
> "struct"s? The story continues with indexing "string"s: In a consistent
> implementation, shouldn't
>
>     writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this cyrillic letter?

This will _not_ return a trailing surrogate of the Cyrillic letter
(surrogates only exist in UTF-16). In UTF-8 it returns the second code
unit of the "ä" character (U+00E4), provided "ä" is stored precomposed.
If "ä" is stored decomposed, as "a" followed by the combining diaeresis
(U+0308), [2] instead yields the first code unit of that combining
mark. If the string were in UTF-32, [2] could yield either the Cyrillic
character or the combining diaeresis, and the .length of the UTF-32
string could be either 3 or 4.
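
To make the two cases concrete, here is a small sketch. It assumes the
literal is written with \u escapes so that the normalization form does
not depend on how the source file was saved; the variable names are
only for illustration.

```d
import std.stdio : writefln;

void main()
{
    // Escapes make the normalization form explicit.
    string precomposed = "s\u00E4\u0434";   // s, ä (U+00E4), д (U+0434)
    string decomposed  = "sa\u0308\u0434";  // s, a, U+0308, д (U+0434)

    // .length counts UTF-8 code units (bytes), not characters.
    assert(precomposed.length == 5); // 1 + 2 + 2
    assert(decomposed.length == 6);  // 1 + 1 + 2 + 2

    // Index 2 is a single code unit in both cases.
    assert(precomposed[2] == 0xA4);  // second code unit of ä (0xC3 0xA4)
    assert(decomposed[2] == 0xCC);   // first code unit of U+0308 (0xCC 0x88)

    // The same text as UTF-32: one code unit per code point.
    dstring pre32 = "s\u00E4\u0434"d;
    dstring dec32 = "sa\u0308\u0434"d;
    assert(pre32.length == 3 && pre32[2] == '\u0434'); // д
    assert(dec32.length == 4 && dec32[2] == '\u0308'); // combining diaeresis

    // Dump the raw UTF-8 code units of the precomposed string as hex.
    writefln("%(%02X %)", cast(immutable(ubyte)[]) precomposed);
}
```

Even in UTF-32, an index addresses a code point, which is still not
necessarily what a reader would call a character.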

There are multiple reasons why .length and index access are based on
code units rather than code points or any higher-level representation.
One is that both operations would become O(n) instead of O(1), since
finding the n-th code point means decoding everything before it.
In-place modification of char[] arrays also wouldn't be possible
anymore, as the size of the underlying array might have to change
whenever a code point is replaced by one with a different encoded
length.
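
A rough sketch of both points follows. codePointAt is a hypothetical
helper written just for this example, not something from Phobos; it
uses std.utf.stride and std.utf.decode to show the linear scan that
code point indexing would require.

```d
import std.range : walkLength;
import std.utf : codeLength, decode, stride;

// Hypothetical helper: return the n-th code point by scanning from the start.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i);  // skip one code point; cost grows with n
    return decode(s, i);    // decode the code point starting at unit i
}

void main()
{
    string s = "s\u00E4\u0434";          // s, ä, д
    assert(s.length == 5);                // O(1): stored with the slice
    assert(s.walkLength == 3);            // O(n): walks the whole string
    assert(codePointAt(s, 2) == '\u0434'); // O(n): д, found by scanning

    char[] buf = "abc".dup;
    buf[1] = 'x';                         // one code unit replaces one code unit
    assert(buf == "axc");

    // Replacing 'b' with ä in place would need two code units where
    // only one exists, so the array would have to grow.
    assert(codeLength!char('b') == 1);
    assert(codeLength!char('\u00E4') == 2);
}
```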

