sorting a string

Jonathan M Davis via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri Jul 14 13:52:54 PDT 2017


On Friday, July 14, 2017 7:50:17 PM MDT Anton Fediushin via Digitalmars-d-
learn wrote:
> On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer
>
> wrote:
> > Don't do this, because it's not what you think. It's not
> > actually calling std.algorithm.sort, but the builtin array sort
> > property. This will be going away soon.
>
> This sucks. I know, that `.sort` will be removed, but I thought
> it won't break any code.
>
> > With 2.075, it won't compile even without the parentheses,
> > because a char[] is not an array according to std.algorithm...
>
> But why? This should be true for `char[]`, isn't it?
> -----
> if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range
>
> || hasAssignableElements!Range) || ss != SwapStrategy.unstable &&
>
> hasAssignableElements!Range) && isRandomAccessRange!Range &&
> hasSlicing!Range && hasLength!Range)
> -----
> (It's from
> https://dlang.org/phobos/std_algorithm_sorting.html#sort)

It has to do with Unicode and Andrei's attempt to make ranges Unicode
correct without having to think about it. It was a nice thought, but it
really doesn't work, and it causes a number of annoying problems. As a quick
explanation, in Unicode, you have code units, which combine to make code
points. In UTF-8, a code unit is 8 bits. In UTF-16, it's 16 bits, and in
UTF-32, it's 32 bits. char is a UTF-8 code unit. wchar is a UTF-16 code
unit, and dchar is a UTF-32 cod unit. 32 bits is enough to represent any
code point, so a valid dchar is not only a code unit, it's guaranteed to be
a code point. So, indexing or slicing a dstring will not cut into the middle
of any code points. However, because UTF-8 and UTF-16 potentially require
multiple code units in order to represent a code point, you can't just
arbitrarily index a string or wstring (or arbitrarily slice either of them)
or you risk breaking up a code point.

To avoid this problem and guarantee Unicode correctness, Andrei made it so
that the range API does not treat strings or wstrings as either random
access or sliceable. So, isRandomAccessRange!string and hasSlicing!string
are false. And front returns dchar for all string types, because it decodes
the first code point from the code units, which means that popFront could
pop one char or wchar off, or it could pop several. Similarly, because the
number of code points can't be known without iterating them,
hasLength!string is false.

This does prevent you from blindly doing things to your strings using the
range API which will make them have invalid code points, but it also makes
it really annoying for stuff like sort, because sort requires a random
access range to sort, but char[] and wchar[] are not considered to be random
access ranges, because they're considered to be ranges of dchar rather than
char or wchar. It also hurts performance, because many algorithms don't
actually need to decode the code points - which is why many Phobos functions
have special overloads for strings; they do the decinding only when required
or skip it entirely.

To make matters worse, not only is all of this frustrating to deal with, but
it's not even fully correct. When Andrei added the range API to Phobos, he
thought that code points where the character you see on the screen (and
annoyingly, Unicode _does_ call them characters for some stupid reason), but
they really aren't. They do represent printable items, but they don't just
include characters such as A. They also include stuff like accents or
subscripts such that you sometimd actually need multiple code points to
represent an actual character on the screen - and a group of code points
which represent a character or glyph that you'd see on the screen are called
grapheme clusters. So, to be fully correct, ranges would have to decode
clear to grapheme clusters by default, which would horribly inefficient. The
correct way to handle Unicode requires that the programmer understand it
well enough to know when they should be operating at the code unit level,
when they should be operating at the code point level, and when they should
be operating at the grapheme level. You really can't automate it -
especially not if you want to be efficent. And you rarely want the code
point level, making Andrei's choice particularly bad.

So really, ranges should not be treating strings in a special manner; they
should be treated as ranges of code units and require the programmer to wrap
them in ranges of code points or graphemes as appropriate. But
unfortunately, making that change now would break a lot of code. So, we seem
to be stuck. The result is that you're forced to either specialize your
functions on strings or use functions like std.string.representation or
std.utf.byCodeUnit to work around the problem. And of course, this whole
issue is incredibly confusing to anyone coming to D - especially those who
aren't well-versed in Unicode. :(

- Jonathan M Davis



More information about the Digitalmars-d-learn mailing list