The Case Against Autodecode

Marco Leise via Digitalmars-d digitalmars-d at puremagic.com
Mon May 30 14:12:58 PDT 2016


On Mon, 30 May 2016 17:14:47 +0000,
Andrew Godfrey <X at y.com> wrote:

> I like "make string iteration explicit" but I wonder about other 
> constructs. E.g. What about "sort an array of strings"? How would 
> you tell a generic sort function whether you want it to interpret 
> strings by code unit vs code point vs grapheme?

You are just scratching the surface! Unicode strings are
sorted according to the Unicode Collation Algorithm, which is
described in the 86-page document at
http://www.unicode.org/reports/tr10/
and implemented in the ICU library mentioned before.

Some obvious considerations from the description of the
algorithm:

In Sweden, z comes before ö, while in Germany it is the reverse.
In Germany, words in a dictionary are sorted differently from
lists of names in a phone book:
  dictionary: of < öf
  phone book: öf < of
Spanish sorts 'll' as a single character right after 'l'.
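
To illustrate the locale dependence, here is a minimal C sketch
using ICU4C's C collation API (ucol.h). The locale names and the
use of ucol_strcollUTF8 are my choice of illustration, not
something prescribed by the algorithm itself:

  /* Compare "z" and "ö" under Swedish vs. German collation.
     Assumes ICU4C is installed; link with -licuuc -licui18n. */
  #include <stdio.h>
  #include <unicode/ucol.h>

  static void demo(const char *locale)
  {
      UErrorCode status = U_ZERO_ERROR;
      UCollator *coll = ucol_open(locale, &status);
      if (U_FAILURE(status)) return;

      /* UTF-8 input; "ö" is the byte sequence 0xC3 0xB6 */
      UCollationResult r =
          ucol_strcollUTF8(coll, "z", -1, "\xC3\xB6", -1, &status);
      printf("%s: z %s \xC3\xB6\n", locale,
             r == UCOL_LESS ? "<" : r == UCOL_GREATER ? ">" : "==");
      ucol_close(coll);
  }

  int main(void)
  {
      demo("sv_SE");  /* Swedish: ö sorts after z */
      demo("de_DE");  /* German dictionary order: ö sorts next to o */
      return 0;
  }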

The default collation is selected on Windows through the
control panel's localization applet and on Linux (POSIX) via
the LC_COLLATE environment variable.
The actual string sorting in the user's locale can then be
done with the C library's strcoll
(http://www.cplusplus.com/reference/cstring/strcoll/)
or with OS-specific functions like CompareStringEx on Windows
(https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx).
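
As a minimal sketch of that C library route (my own example,
not taken from any particular manual): honor the user's
LC_COLLATE and let qsort order the strings through strcoll:

  #include <locale.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static int coll_cmp(const void *a, const void *b)
  {
      return strcoll(*(const char * const *)a,
                     *(const char * const *)b);
  }

  int main(void)
  {
      const char *words[] = { "\xC3\xB6""f", "of", "zebra" };
      size_t n = sizeof words / sizeof words[0];

      /* Pick up LC_COLLATE from the environment,
         e.g. sv_SE.UTF-8 or de_DE.UTF-8 */
      setlocale(LC_COLLATE, "");

      qsort(words, n, sizeof words[0], coll_cmp);
      for (size_t i = 0; i < n; ++i)
          puts(words[i]);
      return 0;
  }

The resulting order depends on whatever locale the environment
selects, which is exactly the point.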

TL;DR: neither code points nor grapheme clusters are adequate
for string sorting. Also, two strings may compare unequal byte
for byte while actually being the same text in different
normalization forms (e.g. umlauts on OS X (NFD) vs. the rest
of the world (NFC)).

Admittedly I find myself using str1 == str2 without first
normalizing both, because it is frigging convenient and fast.
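
As a tiny sketch of that pitfall (the byte literals are my own):
"ö" precomposed (NFC, U+00F6) and decomposed (NFD, o followed by
U+0308 COMBINING DIAERESIS) are the same text but different
bytes, so a byte-wise comparison calls them unequal:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      const char *nfc = "\xC3\xB6";   /* ö as one precomposed code point */
      const char *nfd = "o\xCC\x88";  /* o followed by U+0308 */

      printf("byte-wise comparison: %s\n",
             strcmp(nfc, nfd) == 0 ? "equal" : "not equal");
      return 0;
  }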

-- 
Marco


