The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 04:46:50 PDT 2016


On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via Digitalmars-d wrote:
> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
> >> However, this
> >> meant that some precomposed characters were "redundant": they
> >> represented character + diacritic combinations that could
> >> equally well
> >> be expressed separately. Normalization was the inevitable
> >> consequence.
> >
> > It is not inevitable. Simply disallow the 2 codepoint sequences
> > - the single one has to be used instead.
> >
> > There is precedent. Some characters can be encoded with more
> > than one UTF-8 sequence, and the longer sequences were declared
> > invalid. Simple.
> >
> > I.e. have the normalization up front when the text is created
> > rather than everywhere else.
>
> I don't think it would work (or at least, the analogy doesn't
> hold). It would mean that you can't add new precomposited
> characters, because that means that previously valid sequences
> are now invalid.

I would have argued that no precomposed characters should ever have existed,
regardless of what was done in previous encodings. They're redundant, and you
need the non-composited forms (base character plus combining marks) anyway to
avoid a combinatorial explosion of code points, so you can't consistently give
every character only a precomposed version. However, the Unicode folks
obviously didn't go that route. Given where we sit now, we're stuck with some
precomposed characters, but I'd argue that we should at least never add any
new ones. Who knows what the Unicode folks are actually going to do, though.

As it is, you should probably normalize strings in many cases at the point
where they enter the program, just as you'd ideally validate them there.
Regardless, you have to deal with the fact that multiple normalization forms
exist (NFC, NFD, NFKC, NFKD) and that there's no guarantee that you're even
going to get valid Unicode, let alone Unicode that's normalized the way you
want.
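
To make the "validate and normalize at the boundary" idea concrete, here is a
minimal sketch in Python (not D, purely for brevity). The function name
decode_and_normalize and the default of NFC are illustrative assumptions, not
a recommendation for any particular codebase:

```python
import unicodedata


def decode_and_normalize(raw: bytes, form: str = "NFC") -> str:
    """Validate UTF-8 at the input boundary, then normalize to one form.

    Hypothetical helper for illustration; 'form' may be any of
    "NFC", "NFD", "NFKC", or "NFKD".
    """
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8,
    # which includes overlong encodings such as b"\xc0\xaf".
    text = raw.decode("utf-8", errors="strict")
    # Pick one normalization form and apply it once, up front.
    return unicodedata.normalize(form, text)


# The same user-visible character, encoded two different ways.
precomposed = "\u00e9".encode("utf-8")   # single code point U+00E9
combining = "e\u0301".encode("utf-8")    # 'e' followed by U+0301

# The raw bytes differ, but the normalized strings compare equal.
assert precomposed != combining
assert decode_and_normalize(precomposed) == decode_and_normalize(combining)
```

Everything past that boundary can then compare strings directly, as long as
the whole program agrees on the same form.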

- Jonathan M Davis


