Sort characters in string

Jonathan M Davis newsgroup.d at jmdavisprog.com
Wed Dec 6 09:24:33 UTC 2017


On Wednesday, December 06, 2017 08:59:09 Fredrik Boulund via Digitalmars-d-learn wrote:
> Hi,
>
> I'm having some trouble sorting the individual characters in a
> string. Searching around, I found this thread
> (http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-learn at puremagic.com)
> about a similar issue, but it feels quite old so I
> wanted to check if there is a clear cut way of sorting the characters of
> a string nowadays?
>
> I was expecting to be able to do something like this:
>
> string word = "longword";
> writeln(sort(word));
>
> But that doesn't work because I guess a string is not the type of
> range required for sort?
> I tried converting it to a char[] array (to!(char[])(word)) but
> that doesn't seem to work either. I get:
>
> Error: template std.algorithm.sorting.sort cannot deduce function
> from argument types !()(char[])

Okay. "Narrow strings" are not considered random access and aren't
considered to have length by the range API. This is because a code unit in
UTF-8 or UTF-16 is not guaranteed to be a full code point (IIRC, 1 - 6 code
units for UTF-8 and 1 - 2 for UTF-16), and slicing them risks cutting into
the middle of a code point. UTF-32 on the other hand is guaranteed to have a
code unit be a full code point. So, arrays of char and wchar are not
considered random access or to have length by the range API, but arrays of
dchar are. As part of that, front and back return a dchar for all string
types, so front, popFront, back, and popBack do decoding of code units to
code points and never cut up code points. The idea was to make it so that
Unicode correctness would be guaranteed by default.
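
To make that concrete, here's a small sketch of what the range API reports
for the various string types, using the traits from std.range.primitives:

import std.range.primitives;

// char/wchar arrays are treated as bidirectional ranges of dchar,
// not as random-access ranges of their actual element type:
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(is(ElementType!string == dchar));

// dchar arrays get the full random-access treatment:
static assert(isRandomAccessRange!(dchar[]));
static assert(hasLength!(dchar[]));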

While this design was well-intended, it doesn't really work, and it was a
huge mistake. While a code point is something that can actually be displayed
on the screen, it's not guaranteed to be a full character (e.g. the letter e
followed by U+0301, the combining acute accent, renders as the single
character é but is two code points). You need a grapheme cluster to get a
full character, and that potentially means multiple code points. So, having
ranges of characters operate at the code point level does not guarantee
Unicode correctness. You could still cut up a character (it would just be at
the code point level rather than the code unit level). For full-on Unicode
correctness by default, everything would have to operate at the grapheme
level, which would be horribly inefficient (especially for ASCII).
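
A quick sketch with std.uni.byGrapheme shows the mismatch (using the
decomposed form of é, i.e. e followed by U+0301):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // one character, two code points
    assert(s.walkLength == 2);            // auto-decoding counts code points
    assert(s.byGrapheme.walkLength == 1); // graphemes count full characters
}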

In reality, whether an algorithm needs to operate at the code unit, code
point, or grapheme level depends on what it's doing (e.g. so long as the
Unicode is normalized properly, find can operate at the code unit level, but
if you want to actually sort a range of full Unicode characters properly,
you'd need to operate at the grapheme level). So, what the range API
_should_ do is just treat strings as ranges of their element type and let
the programmer wrap them as necessary to deal with code points or graphemes
when appropriate. And ultimately, the programmer just plain needs to
understand what they're doing, because a solution that reasonably guarantees
Unicode correctness really doesn't exist (not if you care about efficiency
anyway). Unfortunately, changing the range API at this point would break a
lot of code, and no one has figured out how to make the change without that
breakage. So, it currently seems likely that we'll forever be stuck with
this design mistake.
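
To illustrate the find case: a substring search over normalized UTF-8 can
stay entirely at the code unit level. A sketch using std.utf.byCodeUnit
(discussed below):

import std.algorithm.searching : find;
import std.utf : byCodeUnit;

void main()
{
    // No decoding to code points happens here; find compares the
    // UTF-8 code units directly.
    auto haystack = "hello world".byCodeUnit;
    auto needle = "world".byCodeUnit;
    assert(!haystack.find(needle).empty);
}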

So, the best way to work around it depends on what you're doing.
std.utf.byCodeUnit creates a wrapper range that turns all strings into
ranges of their actual element types rather than dchar. So, that's one
solution. Another is std.string.representation, which casts a string to the
corresponding integral type (e.g. typeof("foo".representation) is
immutable(ubyte)[]). And std.uni has facilities for handling code points
and graphemes if need be.
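
For instance, both of these hold (a small sketch; the exact wrapper type
that byCodeUnit returns is an implementation detail):

import std.range.primitives : isRandomAccessRange;
import std.string : representation;
import std.utf : byCodeUnit;

// representation reinterprets the string as its integral code units:
static assert(is(typeof("foo".representation) == immutable(ubyte)[]));

// byCodeUnit's wrapper is random access, unlike the raw string:
static assert(isRandomAccessRange!(typeof("foo".byCodeUnit)));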

If you have a string, and you _know_ that it's only ASCII, then either use
representation or byCodeUnit to wrap it for the call to sort, but it _will_
have to be mutable, so string won't actually work. e.g.

import std.algorithm.sorting : sort;
import std.string : representation;

char[] str = "hello world".dup;
sort(str.representation);
// str is now sorted in place

However, if the string could actually contain Unicode, then you'll have to
use std.uni's grapheme support to wrap the string in a range that operates
at the grapheme level; otherwise you could rip pieces of characters apart.
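
Here's a rough sketch of what that could look like. The predicate is a
naive code-point-by-code-point comparison purely for illustration; proper
sorting of human-readable text would need locale-aware collation, which
Phobos doesn't currently provide:

import std.algorithm.iteration : map, joiner;
import std.algorithm.sorting : sort;
import std.array : array;
import std.conv : to;
import std.uni : byGrapheme;

void main()
{
    // é written as e + U+0301 must travel as one unit when sorted.
    auto graphemes = "be\u0301a".byGrapheme.array;
    graphemes.sort!((a, b) => a[].array < b[].array);
    auto sorted = graphemes.map!(g => g[].array).joiner.to!string;
    assert(sorted == "abe\u0301");
}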

- Jonathan M Davis


