Char representation

Jonathan M Davis via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue Nov 22 06:23:28 PST 2016


On Tuesday, November 22, 2016 13:29:47 RazvanN via Digitalmars-d-learn 
wrote:
> Given the following code:
>
>   char[5] a = ['a', 'b', 'c', 'd', 'e'];
>   alias Range = char[];
>   writeln(is(ElementType!Range == char));
>
> One would expect that the program will print true. In fact, it
> prints false and I noticed that if Range is char[], wchar[],
> dchar[], string, wstring, dstring
> Unqual!(ElementType!Range) is dchar. I find it odd that the
> internal representation for char and string is dchar. Is this a
> bug?

You misunderstand. char[] is a dynamic array of char, wchar[] is a dynamic
array of wchar, and dchar[] is a dynamic array of dchar. There is nothing
funny going on with the internal representation. Rather, the problem is with
the range API and the traits that go with it. And it's not a bug; it's a
design mistake.
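
A quick way to convince yourself of that (just the language, no Phobos):

char[] c;
wchar[] w;
dchar[] d;
static assert(is(typeof(c[0]) == char));   // char[] really holds char
static assert(is(typeof(w[0]) == wchar));  // wchar[] really holds wchar
static assert(is(typeof(d[0]) == dchar));  // dchar[] really holds dchar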

I don't know how much you know about Unicode, but for a quick explanation,
you have code units, code points, and graphemes. A grapheme is made up of
one or more code points, and a code point is made up of one or more code
units. In UTF-8, a code unit is 8 bits; in UTF-16, a code unit is 16 bits;
and in UTF-32, a code unit is 32 bits. Those are represented in D by char,
wchar, and dchar, respectively. There is no guarantee that a char, wchar,
or dchar represents a full character. A code unit is just a piece of a
character, except in the cases where it happens to be a full character. :|
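
To make the sizes concrete, here's a quick sketch (the counts assume the
precomposed form of é, U+00E9):

string  s = "é";  // UTF-8:  one code point, but 2 code units
wstring w = "é"w; // UTF-16: one code point, 1 code unit
dstring d = "é"d; // UTF-32: one code point, 1 code unit

// .length counts code units, so it differs per encoding:
assert(s.length == 2 && w.length == 1 && d.length == 1);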

A code point, on the other hand, actually makes up something composable and
printable. It's something like the letter A, or é, or の, etc. It could
also be an accent, a superscript, subscript, etc. In the case of UTF-8 and
UTF-16, it can take several code units to form a single code point. In the
case of UTF-32, a single code unit is always a code point, because code
points take up 32 bits.

However, that's still not necessarily a full character. After all, an accent
or a superscript is not really a character. Rather, it's a modifier for a
character. So, one or more code points can be combined to form graphemes
which _are_ actual characters. Unfortunately, there are several
normalization schemes for the order of code points in a grapheme, and some
graphemes can be represented as a single code point or as several (most
notably, characters which commonly have accents on them, such as é, exist
both as a single code point and as a combining sequence). So, this whole
thing gets stupidly complicated. It's even worse when you want to handle it
all _efficiently_.
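
For instance, here's a quick sketch (\u0301 is the combining acute accent;
walkLength counts the elements a range iterates):

import std.range : walkLength;
import std.uni : byGrapheme;

string precomposed = "\u00E9"; // é as one code point
string combined = "e\u0301";   // é as 'e' plus a combining accent

assert(precomposed.walkLength == 1); // one code point
assert(combined.walkLength == 2);    // two code points
assert(precomposed.byGrapheme.walkLength == 1); // but each is
assert(combined.byGrapheme.walkLength == 1);    // one grapheme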

Well, when Andrei added ranges to D, he tried to simplify things so that the
default was correct and reasonably efficient while allowing for code to
specialize where appropriate to get the full efficiency. That's a noble
goal, but unfortunately, he didn't know about graphemes at the time. He
thought that code points were guaranteed to be full characters and that if
you operated at the code point level, you were guaranteed full correctness.
So, in order to avoid errors related to chopping up strings of char or wchar
in the middle of code points, he came up with the concept of "narrow"
strings - i.e. strings which are made up of char or wchar rather than dchar
(so strings where each code unit is not guaranteed to be a code point), and
he restricted what narrow strings could do by default per the range API and
its associated traits. So, we get fun like this.

import std.range.primitives : ElementType, hasLength, isRandomAccessRange;

assert(!hasLength!string);
assert(!hasLength!wstring);
assert(hasLength!dstring);

assert(!isRandomAccessRange!string);
assert(!isRandomAccessRange!wstring);
assert(isRandomAccessRange!dstring);

assert(is(ElementType!string == dchar));
assert(is(ElementType!wstring == dchar));
assert(is(ElementType!dstring == dchar));

And front, popFront, back, and popBack all automatically decode the code
units in a string to code points. So, front and back both return dchar even
if the string is a string of char or wchar. The arrays themselves do not
change. However, the way that the traits in std.range.primitives treat them
is then fundamentally different from how the language treats them. So, even
though

import std.range.primitives; // for empty, front, popFront on strings

string str = "hello world";
for(auto r = str; !r.empty; r.popFront())
{
    auto e = r.front;
}

will iterate by dchar

string str = "hello world";
foreach(e; str)
{
}

will iterate by char. If you want it to iterate by dchar, then you make it
explicit.

string str = "hello world";
foreach(dchar e; str)
{
}
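
With a multi-byte character in the string, those two loops don't even run
the same number of times. A quick sketch (counts assume é in its
precomposed, two-code-unit form):

string s = "héllo"; // 5 code points, 6 UTF-8 code units

size_t units, points;
foreach(char c; s)  ++units;  // iterates code units
foreach(dchar c; s) ++points; // decodes and iterates code points
assert(units == 6 && points == 5);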

The result of all of this is that by default, when you treat strings as
ranges, you operate at the code point level. This avoids certain bugs where
code would otherwise chop up code points by operating on code units, but
since it doesn't actually go to the grapheme level, it still isn't actually
correct, and it's easier to miss the fact that it's wrong, since more cases
work. It's also inefficient, because the code units are always decoded to
code points regardless of whether the algorithm in question actually needs
to do that or not. It also creates confusion and questions like yours.
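
As a sketch of the kind of bug that still gets through: reversing a string
at the code point level detaches combining characters from the letters they
modify (\u0301 is the combining acute accent).

import std.conv : to;
import std.range : retro;

string s = "ne\u0301e";        // displays as "née"
string r = s.retro.to!string;  // reverses by code point
// r is "e\u0301en", which displays as "éen" -- the accent has jumped to
// the wrong letter. A grapheme-aware reversal would give "eén".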

Most of us agree at this point that all of this was a mistake and that
narrow strings should not have been treated specially. Rather, it should be
required for the programmer to wrap them in other ranges to decode code
units to code points or graphemes so that the programmer has full control
over it. But unfortunately, changing it at this point would be a _huge_
breaking change. So, it's unlikely that we're going to be able to. We hope
that we'll find a way, but for now, we're stuck.

To work around this, Phobos tends to special-case algorithms on strings in
order to avoid the auto-decoding. find would be a prime example of this. As
long as the code points are normalized, you can do a find using code units
rather than code points. Decoding to code points is just a waste. However,
some algorithms such as filter can't do that, because there is no obviously
correct solution. The programmer really needs to be the one to decide, so
they just always do the auto-decoding.
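
You can see the decoding reflected in the types. A minimal sketch:

import std.algorithm : filter;

string s = "hello";
auto r = s.filter!(c => c != 'l'); // the predicate receives decoded dchars
static assert(is(typeof(r.front) == dchar)); // not char -- it auto-decoded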

Traits like isNarrowString and ElementEncodingType can be used to detect
when you're dealing with narrow strings and to operate on them as strings
rather than via the range API, and Phobos uses them heavily.
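
For example (a sketch; isNarrowString lives in std.traits, and
ElementEncodingType in std.range.primitives):

import std.range.primitives : ElementEncodingType, ElementType;
import std.traits : isNarrowString;

static assert(isNarrowString!string && isNarrowString!wstring);
static assert(!isNarrowString!dstring);

// ElementType reflects the auto-decoding; ElementEncodingType does not:
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementEncodingType!(char[]) == char));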

In addition, std.utf has byCodeUnit, byChar, byWchar, and byDchar, and
std.uni has byGrapheme. So, a lot of range-based code should really be using those
rather than operating on strings directly, though there are a number of
parts of Phobos that don't yet fully support arbitrary ranges of char or
wchar (since previously, it was assumed that all ranges of character types
were ranges of dchar). So, sometimes stuff that should work doesn't (the
situation is improving though). Alternatively, there's
std.string.representation, which can be used to cast a string of char, wchar,
or dchar to an array of ubyte, ushort, or uint with the proper constness.
Code can then operate on those integer types without auto-decoding, but that
doesn't work very well if you want to use functions intended specifically
for strings (e.g. most of std.string doesn't work with arrays of ubyte,
ushort, or uint).
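
For instance, a quick sketch of both approaches (byte values assume UTF-8):

import std.string : representation;
import std.utf : byCodeUnit;

string s = "héllo";

auto cu = s.byCodeUnit;  // a range of the code units -- no auto-decoding
assert(cu.length == 6);  // length and random access work again

immutable(ubyte)[] b = s.representation; // raw code units as integers
assert(b[0] == 'h');     // 0x68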

So, yes. This is a bit of a mess. It works fairly well overall in spite of
the problems, but it's still a mess. And you're far from alone in being
confused by it.

- Jonathan M Davis



