String Type Usage. String vs DString vs WString

Jonathan M Davis newsgroup.d at jmdavisprog.com
Mon Jan 15 04:27:15 UTC 2018


On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > The size of a code point is 1, 2 or 4 bytes.
>
> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
> (UTF-32) bytes are referred to as "code units" and the size of a
> code point varies in UTF-8 and UTF-16.

Yes, for UTF-8, a code unit is 8 bits, and there can be up to 4 of them in a
code point (the original design allowed up to 6, but RFC 3629 limits it to
4). For UTF-16, a code unit is 16 bits, and there are either 1 or 2 code
units per code point. For UTF-32, a code unit is 32 bits, and there is
always 1 code unit per code point.
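
As a rough sketch of those sizes, the snippet below counts the code units
that the same text needs in each encoding. The names UnitCounts and
countUnits are made up for this illustration; toUTF8 and toUTF16 are the
std.utf conversion functions.

```d
import std.utf : toUTF8, toUTF16;

/// Hypothetical helper type: code units needed per encoding.
struct UnitCounts
{
    size_t utf8;   // number of char  (8-bit) code units
    size_t utf16;  // number of wchar (16-bit) code units
    size_t utf32;  // number of dchar (32-bit) code units
}

/// Re-encodes `text` and reports how many code units each encoding uses.
UnitCounts countUnits(dstring text)
{
    return UnitCounts(text.toUTF8.length, text.toUTF16.length, text.length);
}

unittest
{
    // U+0061 'a': one code unit everywhere.
    assert(countUnits("a"d) == UnitCounts(1, 1, 1));
    // U+00E9 (e with acute): two UTF-8 code units, one UTF-16, one UTF-32.
    assert(countUnits("\u00E9"d) == UnitCounts(2, 1, 1));
    // U+1F600 lies outside the BMP: four UTF-8 units, a UTF-16 surrogate pair.
    assert(countUnits("\U0001F600"d) == UnitCounts(4, 2, 1));
}
```

Note that .length on a string, wstring or dstring always counts code units,
never code points.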

For better or worse (mostly worse), ranges then treat all strings as ranges
of code points, decoding them so that you get a range of dchar (which means
fun things like isRandomAccessRange!string and hasLength!string are false).
As I understand it, each code point is then something which can be
physically printed, but either way, it's not necessarily a full character.
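
A minimal sketch of what that looks like in practice, assuming the default
(auto-decoding) behaviour of Phobos. The helper name codePointCount is
invented for the example; the traits and walkLength come from
std.range.primitives.

```d
import std.range.primitives : ElementType, front, hasLength,
    isRandomAccessRange, walkLength;

// As a range, a string yields decoded dchars, not its char code units.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// A dstring needs no decoding, so it keeps length and random access.
static assert(hasLength!dstring);
static assert(isRandomAccessRange!dstring);

/// Hypothetical helper: counts code points by walking (and decoding) `s`.
/// This is O(n), which is why hasLength!string is false.
size_t codePointCount(string s)
{
    return s.walkLength;
}

unittest
{
    string s = "h\u00E9llo";
    assert(s.length == 6);            // array length: UTF-8 code units
    assert(codePointCount(s) == 5);   // range length: code points
    assert("\u00E9".front == '\u00E9'); // front decodes two code units
}
```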

Multiple code points can then be combined to make a grapheme cluster (which
then corresponds to what we'd normally consider a full character - e.g. a
letter and an accent can each be a code point which are then combined to
create an accented character). std.uni provides the functionality for
operating on graphemes.
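
For illustration, here is a small sketch of the letter-plus-accent case using
std.uni.byGrapheme. The helper name graphemeCount is made up for this
example.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts grapheme clusters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    // 'e' followed by U+0301 (combining acute accent).
    string s = "e\u0301";
    assert(s.length == 3);           // UTF-8 code units
    assert(s.walkLength == 2);       // code points
    assert(graphemeCount(s) == 1);   // what a reader sees as one character
}
```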

And std.utf.byCodeUnit can be used to treat strings as ranges of code units
instead of code points (and a fair bit of Phobos takes the solution of
specializing range-based code for strings to avoid the auto-decoding).
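
A short sketch of byCodeUnit in use. The alias CodeUnits and the helper
countAscii are invented for the example, and countAscii assumes that the
character being searched for is ASCII, since it compares individual code
units.

```d
import std.algorithm.searching : count;
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.utf : byCodeUnit;

// The wrapped string is a range of code units and regains the
// range capabilities that auto-decoding takes away.
alias CodeUnits = typeof("".byCodeUnit);
static assert(is(ElementType!CodeUnits : char));
static assert(hasLength!CodeUnits);
static assert(isRandomAccessRange!CodeUnits);

/// Hypothetical helper: counts occurrences of the ASCII character `c`
/// in `s` without decoding any code points.
size_t countAscii(string s, char c)
{
    return s.byCodeUnit.count(c);
}

unittest
{
    assert("h\u00E9llo".byCodeUnit.length == 6); // code units, not code points
    assert(countAscii("h\u00E9llo, world", 'l') == 3);
}
```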

All in all, the whole thing is annoyingly complicated, though at least D is
much more explicit about it than most languages, and I suspect that your
average D programmer is better educated about Unicode than your average
programmer. And having to figure out why the heck strings and wstrings act
so bizarrely as ranges does have the positive side effect of putting it even
more in your face than it would be otherwise, making it that much more
likely that folks are going to learn about Unicode - though I still think
that we'd be better off if we could ever figure out how to treat all strings
as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis


