The Case Against Autodecode

Mon May 30 09:03:03 PDT 2016

On Mon, 30 May 2016 09:26:09 +0000,
Chris <wendlec at tcd.ie> wrote:

> If it's true that auto decode is unnecessary in many cases, then 
> it shouldn't affect the whole code base. But I might be mistaken 
> here. Maybe we should make a list of the functions where auto 
> decode does make a difference, see how common they are, and work 
> out a strategy from there. Destroy.

It makes a difference for every function. But it still isn't
necessary in many cases. It's fairly simple:

code unit  == bytes/chars
code point == auto-decode
grapheme*  == .byGrapheme
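
To make the three levels concrete, here is a minimal sketch that
counts the same string each way. The helper names codeUnits,
codePoints and graphemes are made up for this mail; the Phobos
calls they wrap (byCodeUnit, walkLength, byGrapheme) are the real
ones.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers: count the elements of `s` at each level.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; } // no decoding
size_t codePoints(string s) { return s.walkLength; }            // auto-decode
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived

void main()
{
    // 'ö' spelled as 'o' followed by U+0308 COMBINING DIAERESIS.
    string decomposed = "o\u0308";
    assert(codeUnits(decomposed) == 3);  // 1 byte for 'o', 2 for U+0308
    assert(codePoints(decomposed) == 2);
    assert(graphemes(decomposed) == 1);

    // For pure ASCII all three levels agree.
    string ascii = "<a>";
    assert(codeUnits(ascii) == 3);
    assert(codePoints(ascii) == 3);
    assert(graphemes(ascii) == 3);
}
```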

So if you have relied on auto-decode so far, you iterated over
code points, which works correctly for most scripts in NFC**.
And here lies the rub, and the reason people say auto-decoding
is unnecessary most of the time: if you are working with XML,
CSV, JSON or another structured text format, these all use
ASCII characters for their syntax elements. For those
characters, code units, code points and graphemes all coincide,
and auto-decoding just slows you down.
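
As an illustration of the structured-text case, the following
sketch splits a CSV-like line on ',' by looking at code units
only. splitFields is a hypothetical helper, not a Phobos
function, and it ignores quoting. It is safe because in UTF-8 no
byte of a multi-byte sequence falls in the ASCII range, so a
byte equal to ',' is always a real comma.

```d
// Hypothetical helper: split on an ASCII delimiter without decoding.
string[] splitFields(string line)
{
    string[] fields;
    size_t start = 0;
    // `char c` makes foreach iterate code units; nothing is decoded.
    foreach (i, char c; line)
    {
        if (c == ',')
        {
            fields ~= line[start .. i];
            start = i + 1;
        }
    }
    fields ~= line[start .. $]; // last field, possibly empty
    return fields;
}

void main()
{
    // Non-ASCII payload passes through untouched.
    assert(splitFields("Törn,Überblick,42") == ["Törn", "Überblick", "42"]);
    assert(splitFields("a,,b") == ["a", "", "b"]);
}
```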

When, on the other hand, you work with real-world international
text, you'll want to work with graphemes. One example is
truncating long text with an ellipsis:

"Alle Segeltörns im Überblick" (in NFD, e.g. OS X file name)
may display as this with auto-decode:
"Alle Segelto…¨berblick"
and this with byGrapheme:
"Alle Segeltö…Überblick"
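
The two results above can be reproduced with a sketch like this
one, which keeps the first 12 and the last 9 elements. Both
function names and the 12/9 split are assumptions chosen to match
the example; the grapheme version follows the
byGrapheme.array ... byCodePoint.text pattern from the std.uni
documentation.

```d
import std.array : array;
import std.conv : text, to;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: cuts between code points, so it may
// separate a combining mark from its base letter.
string elideCodePoints(string s, size_t head, size_t tail)
{
    auto cps = s.to!dstring; // one element per code point
    if (cps.length <= head + tail) return s;
    return (cps[0 .. head] ~ "…"d ~ cps[$ - tail .. $]).to!string;
}

// Hypothetical helper: cuts only between graphemes.
string elideGraphemes(string s, size_t head, size_t tail)
{
    auto gs = s.byGrapheme.array; // one element per grapheme
    if (gs.length <= head + tail) return s;
    return gs[0 .. head].byCodePoint.text
         ~ "…"
         ~ gs[$ - tail .. $].byCodePoint.text;
}

void main()
{
    // NFD spelling: 'o' and 'U' each followed by U+0308.
    string name = "Alle Segelto\u0308rns im U\u0308berblick";

    // The diaeresis of 'Ü' ends up stranded after the ellipsis.
    assert(elideCodePoints(name, 12, 9) == "Alle Segelto…\u0308berblick");

    // Base letters and their marks stay together.
    assert(elideGraphemes(name, 12, 9) == "Alle Segelto\u0308…U\u0308berblick");
}
```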

But at that point you likely also need localized sorting of
strings, a set of algorithms that may change with the rise and
fall of nations or with spelling reforms. So you'll use the
platform's go-to Unicode library instead of what Phobos
offers. For Java and Linux that would be ICU***.

That last point makes me think we should not bother much with
decoding in Phobos at all. Odds are we lack the other
capabilities needed to make good use of it. Users of
auto-decode should review their code to see whether code
points are really what they want, and potentially switch to
no decoding or .byGrapheme.
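
What such a review might end up with, as a rough sketch: state
the intended level explicitly at each call site instead of
relying on the default. countCommas and displayLength are
hypothetical names; byCodeUnit, count, byGrapheme and walkLength
are existing Phobos functions.

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// ASCII syntax element: opt out of decoding explicitly.
size_t countCommas(string s)
{
    return s.byCodeUnit.count(',');
}

// Text shown to a person: opt in to graphemes explicitly.
size_t displayLength(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    assert(countCommas("Törn,Überblick,42") == 2);

    string nfd = "U\u0308berblick"; // 'Ü' as 'U' + U+0308
    assert(nfd.walkLength == 10);   // auto-decoded code points
    assert(displayLength(nfd) == 9);
}
```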

* What we typically perceive as one unit in written text.
** A normalization form where e.g. 'ö' is a single code point,
   as opposed to NFD, where 'ö' is assembled from the two
   code points 'o' and '¨', as in OS X file names.
*** http://site.icu-project.org/home#TOC-What-is-ICU-

-- 
Marco