The Case Against Autodecode

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Thu May 12 16:16:23 PDT 2016


On Thu, May 12, 2016 at 08:24:23PM +0000, Vladimir Panteleev via Digitalmars-d wrote:
[...]
> 12. The result of autodecoding, a range of Unicode code points, is
> rarely actually useful, and code that relies on autodecoding is rarely
> actually, universally correct. Graphemes are occasionally useful for a
> subset of scripts, and a subset of that subset has all graphemes
> mapped to single code points, but this only applies to some
> scripts/languages.
> 
> In the majority of cases, autodecoding provides only the illusion of
> correctness.

A range of Unicode code points is not the same as a range of graphemes
(a grapheme is what a layperson would consider to be a "character").
Autodecoding returns dchar, a code point, rather than a grapheme.

Therefore, autodecoding produces intuitively correct results only when
your string has a 1-to-1 correspondence between graphemes and code
points. In general, this is true only for a small subset of languages,
mainly a few common European languages and a handful of others.  It
doesn't work for Korean when Hangul syllables are stored as decomposed
jamo, and doesn't work for any language that uses combining diacritics
or other modifiers.  You need byGrapheme to get correct results.
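
To make the mismatch concrete, here is a minimal sketch, assuming a
Phobos in which narrow strings still autodecode. The helper names
(codeUnits, codePoints, graphemes) are made up for illustration; only
walkLength and byGrapheme come from Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named here for illustration only.
size_t codeUnits(string s)  { return s.length; }               // raw UTF-8 code units
size_t codePoints(string s) { return s.walkLength; }           // autodecodes to dchar
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // explicit segmentation

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a layperson sees one "character" here.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);   // 1 byte for 'e', 2 bytes for U+0301
    assert(codePoints(s) == 2);  // what autodecoding iterates over
    assert(graphemes(s) == 1);   // what the user actually perceives
}
```

The autodecoded count is neither the storage size nor the number of
user-perceived characters, which is the "illusion of correctness"
described in the quoted text.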

So basically autodecoding, as currently implemented, fails to meet its
goal of segmenting a string by "character" (i.e., grapheme), and yet
imposes a performance penalty that is difficult to "turn off" (you have
to sprinkle your code with byCodeUnit everywhere, and many Phobos
algorithms just return a range of dchar anyway). Not to mention that a
good number of string algorithms don't actually *need* autodecoding at
all.
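
As a rough sketch of what "turning it off" looks like, again assuming a
Phobos with autodecoding in place: counting an ASCII delimiter never
needs decoding, but you have to opt out explicitly with byCodeUnit. The
helper name countCommas is hypothetical.

```d
import std.algorithm.searching : count;
import std.range.primitives : ElementType;
import std.utf : byCodeUnit;

// Hypothetical helper: counts ',' without decoding any code points.
size_t countCommas(string s)
{
    // Bytes of multi-byte UTF-8 sequences are all >= 0x80, so comparing
    // code units against an ASCII character cannot give false matches.
    return s.byCodeUnit.count(',');
}

void main()
{
    // Treated as a range, a string yields dchar (autodecoding) ...
    static assert(is(ElementType!string == dchar));

    // ... whereas byCodeUnit yields the raw UTF-8 code units.
    static assert(is(ElementType!(typeof("".byCodeUnit)) == immutable char));

    assert(countCommas("a,b,c") == 2);
}
```

The opt-out has to be repeated at every call site where decoding is
unwanted, which is the "sprinkling" complained about above.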

(One could make a case for auto-segmenting by grapheme, but that's even
worse in terms of performance: it requires a non-trivial Unicode
algorithm involving lookup tables, and may need memory allocation. At
the end of the day, we're back to square one: iterate by code unit, and
explicitly ask for byGrapheme where necessary.)


T

-- 
"I'm running Windows '98." "Yes." "My computer isn't working now." "Yes, you already said that." -- User-Friendly

