char array weirdness

Jonathan M Davis via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Mar 28 16:07:22 PDT 2016


On Monday, March 28, 2016 22:34:31 Jack Stouffer via Digitalmars-d-learn 
wrote:
> void main () {
>      import std.range.primitives;
>      char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>      pragma(msg, ElementEncodingType!(typeof(val)));
>      pragma(msg, typeof(val.front));
> }
>
> prints
>
>      char
>      dchar
>
> Why?

static assert(is(ElementType!(typeof(val)) == dchar));

The range API considers all strings to have an element type of dchar. char,
wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
respectively. One or more code units make up a code point, which is actually
something displayable but not necessarily what you'd call a character (e.g.
it could be an accent). One or more code points then make up a grapheme,
which is really what a displayable character is. When Andrei designed the
range API, he didn't know about graphemes - just code units and code points,
so he thought that code points were guaranteed to be full characters and
decided that that's what we'd operate on for correctness' sake.
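To make the three levels concrete, here is a minimal sketch that counts code units, code points, and graphemes for the same string. The helper names (codeUnits, codePoints, graphemes) are made up for illustration and are not part of Phobos; the sketch assumes Phobos's usual autodecoding behavior for strings.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers for illustration only; not part of Phobos.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }            // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // displayed characters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    string s = "e\u0301";
    assert(codeUnits(s) == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(codePoints(s) == 2); // 'e' and the combining accent
    assert(graphemes(s) == 1);  // a single grapheme
}
```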

In the case of UTF-8, a code point is made up of 1 - 4 code units of 8 bits
each. In the case of UTF-16, a code point is made up of 1 - 2 code units of
16 bits each. And in the case of UTF-32, a code unit is guaranteed to be a
single code point. So, by having the range API decode UTF-8 and UTF-16 to
UTF-32, strings then become ranges of dchar and avoid having code points
chopped up by stuff like slicing. So, while a code point is not actually
guaranteed to be a full character, certain classes of bugs are prevented by
operating on ranges of code points rather than code units. Of course, for
full correctness, graphemes need to be taken into account, and some
algorithms generally don't care whether they're operating on code units,
code points, or graphemes (e.g. find on code units generally works quite
well, whereas something like filter would be a complete disaster if you're
not actually dealing with ASCII).
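As a rough illustration of those code unit counts, and of how slicing by code unit can chop a code point in half, something like the following should hold. It uses std.utf.codeLength and std.utf.validate; the sample characters are arbitrary choices for the example.

```d
import std.exception : assertThrown;
import std.utf : codeLength, validate, UTFException;

void main()
{
    // UTF-8: 1 - 4 code units per code point.
    assert(codeLength!char('a') == 1);
    assert(codeLength!char('\u00E9') == 2);     // e with acute accent
    assert(codeLength!char('\u20AC') == 3);     // euro sign
    assert(codeLength!char('\U0001F600') == 4); // outside the BMP

    // UTF-16: 1 - 2 code units per code point.
    assert(codeLength!wchar('\u20AC') == 1);
    assert(codeLength!wchar('\U0001F600') == 2);

    // UTF-32: always exactly one code unit.
    assert(codeLength!dchar('\U0001F600') == 1);

    // Slicing works on code units, so it can split a code point
    // and leave behind invalid UTF-8.
    string s = "\u00E9";
    assert(s.length == 2);
    assertThrown!UTFException(validate(s[0 .. 1]));
}
```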

Arrays of char and wchar are termed "narrow strings" - hence isNarrowString
is true for them (but not arrays of dchar) - and the range API does not
consider them to have slicing, be random access, or have length, because as
ranges of dchar, those operations would be O(n) rather than O(1). However,
because of this mess of whether an algorithm works best when operating on
code units or code points and the desire to avoid decoding to code points if
unnecessary, many algorithms special case narrow strings in order to
operate on them more efficiently. So, ElementEncodingType was introduced for
such cases. ElementType gives you the element type of the range, and for
everything but narrow strings ElementEncodingType is the same as
ElementType, but in the case of narrow strings, whereas ElementType is
dchar, ElementEncodingType is the actual element type of the array - which is
why ElementEncodingType!(typeof(val)) is char in your code above.
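The traits described above can be checked at compile time. This is only a sketch of what the range primitives report for arrays of the three character types, assuming the autodecoding behavior described in this post:

```d
import std.range.primitives;
import std.traits : isNarrowString;

// Narrow strings are arrays of char and wchar, but not dchar.
static assert( isNarrowString!(char[]));
static assert( isNarrowString!(wchar[]));
static assert(!isNarrowString!(dchar[]));

// As a range, a narrow string has dchar elements...
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));
// ...while ElementEncodingType reports the array's real element type.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(ElementEncodingType!(wchar[]) == wchar));

// For everything else the two traits agree.
static assert(is(ElementType!(dchar[]) == dchar));
static assert(is(ElementEncodingType!(dchar[]) == dchar));
static assert(is(ElementEncodingType!(int[]) == ElementType!(int[])));

// The range API does not treat narrow strings as having length,
// slicing, or random access; arrays of dchar keep all three.
static assert(!hasLength!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert( hasLength!(dchar[]));
static assert( hasSlicing!(dchar[]));
static assert( isRandomAccessRange!(dchar[]));

void main() {}
```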

The correct way to deal with this is really to understand Unicode well
enough to know when you should be dealing at the code unit, code point, or
grapheme level and write your code accordingly, but that's not exactly easy.
So, in some respects, just operating on strings as dchar simplifies things
and reduces bugs relating to breaking up code points, but it does come with
an efficiency cost, and it does make the range API more confusing when it
comes to operating on narrow strings. And it isn't even fully correct,
because it doesn't take graphemes into account. But it's what we're stuck
with at this point.

std.utf provides byCodeUnit and byChar to iterate by code unit or specific
character types, and std.uni provides byGrapheme for iterating by grapheme
(along with plenty of other helper functions). So, the tools to deal with
ranges of characters more precisely are there, but they do require some
understanding of Unicode, and they don't always interact with the rest of
Phobos very well, since they're newer (e.g. std.conv.to doesn't fully work
with byCodeUnit yet, even though it works with ranges of dchar just fine).
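As a sketch of how these helpers change what the range API sees, here is the array from your example wrapped with byCodeUnit, plus a byGrapheme count. The exact set of Phobos functions that accept such ranges depends on the Phobos version, so treat this as illustrative only.

```d
import std.range : walkLength;
import std.range.primitives : ElementType, hasLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];

    // byCodeUnit yields a range of char: no decoding to dchar, and
    // length and indexing are available again.
    auto units = val.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == char));
    static assert(hasLength!(typeof(units)));
    assert(units.length == 9);
    assert(units[2] == 'h');

    // byGrapheme groups code points into displayable characters.
    assert("e\u0301".byGrapheme.walkLength == 1);
}
```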

- Jonathan M Davis
