The Case Against Autodecode
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 3 05:04:39 PDT 2016
On Friday, 3 June 2016 at 11:46:50 UTC, Jonathan M Davis wrote:
> On Friday, June 03, 2016 10:10:18 Vladimir Panteleev via
> Digitalmars-d wrote:
>> On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
>> > On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
>> >> However, this meant that some precomposed characters were
>> >> "redundant": they represented character + diacritic
>> >> combinations that could equally well be expressed separately.
>> >> Normalization was the inevitable consequence.
>> >
>> > It is not inevitable. Simply disallow the 2 codepoint
>> > sequences - the single one has to be used instead.
>> >
>> > There is precedent. Some characters can be encoded with more
>> > than one UTF-8 sequence, and the longer sequences were
>> > declared invalid. Simple.
>> >
>> > I.e. have the normalization up front when the text is
>> > created rather than everywhere else.
>>
>> I don't think it would work (or at least, the analogy doesn't
>> hold). It would mean that you can't add new precomposed
>> characters, because that means that previously valid sequences
>> are now invalid.
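The redundancy under discussion is easy to see concretely: "é" exists both as a single precomposed code point and as a base letter plus combining accent, and normalization maps between the two. A short sketch, using Python's unicodedata module for illustration:

```python
import unicodedata

# "é" as one precomposed code point vs. "e" + combining acute accent.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The two are distinct code point sequences, yet render identically...
assert precomposed != combining

# ...and the canonical normalization forms convert between them:
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining
```

If Unicode later added a new precomposed character, previously valid combining sequences for it would retroactively become "the disallowed spelling" under Walter's scheme, which is Vladimir's objection.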
>
> I would have argued that no precomposed characters should ever
> have existed, regardless of what was done in previous encodings,
> since they're redundant. You need the decomposed, combining
> characters to avoid a combinatorial explosion of characters, so
> you can't have characters that exist only in precomposed form
> and still be consistent. However, the Unicode folks obviously
> didn't go that route. But given where we sit now, even though
> we're stuck with some precomposed characters, I'd argue that we
> should at least never add any new ones. But who knows what the
> Unicode folks will actually do.
>
> As it is, you should probably normalize strings in many cases
> when they enter the program, just as, ideally, you'd validate
> them on entry. But regardless, you have to
> deal with the fact that multiple normalization schemes exist
> and that there's no guarantee that you're even going to get
> valid Unicode, let alone Unicode that's normalized the way you
> want.
>
> - Jonathan M Davis
I do exactly this. Validate and normalize.
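The validate-then-normalize boundary that Jonathan recommends and Chris practices can be sketched in a few lines. This is an illustrative Python sketch (function name `ingest` and the choice of NFC are assumptions, not anything from the thread):

```python
import unicodedata

def ingest(raw: bytes) -> str:
    # Validate: strict UTF-8 decoding (Python's default) raises
    # UnicodeDecodeError on malformed input, including overlong sequences.
    text = raw.decode("utf-8")
    # Normalize once at the boundary so the rest of the program can assume
    # a single canonical form. NFC is used here; NFD, NFKC, and NFKD are
    # the other standard forms, and which one you want is app-specific.
    return unicodedata.normalize("NFC", text)

# "e" + combining acute arrives on the wire; after ingest it is the
# single precomposed code point U+00E9.
print(ingest("e\u0301".encode("utf-8")))  # é
```

Doing this once at the edges is cheaper than defending against mixed normalization forms throughout the program.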