The Case Against Autodecode
Patrick Schluter via Digitalmars-d
digitalmars-d at puremagic.com
Sat Jun 4 00:22:26 PDT 2016
On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:
> Even the Greek sigma has two forms depending on whether it's at
> the end of a word or not -- so should it be two code points or
> one? If you say two, then you'd have a problem with how to
> search for sigma in Greek text, and you'd have to search for
> either medial sigma or final sigma. But if you say one, then
> you'd have a problem with having two different letterforms for
> a single codepoint.
In Unicode there are 2 different codepoints for lower case sigma,
ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3.
Codepoint U+03A2 is undefined. So your objection is not
hypothetical; it is an actual issue for uppercase() and
lowercase() functions.
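This is easy to check in any language with Unicode-aware case
mappings; a minimal sketch in Python (the codepoints are the ones
discussed above, the use of the `unicodedata` module is my own
illustration):

```python
import unicodedata

# Two distinct lowercase sigmas, one uppercase sigma.
final_sigma = "\u03C2"    # ς GREEK SMALL LETTER FINAL SIGMA
medial_sigma = "\u03C3"   # σ GREEK SMALL LETTER SIGMA
capital_sigma = "\u03A3"  # Σ GREEK CAPITAL LETTER SIGMA

# Both lowercase forms uppercase to the same capital letter...
print(final_sigma.upper())   # Σ
print(medial_sigma.upper())  # Σ

# ...so round-tripping through upper() cannot restore a final sigma
# without looking at the surrounding context.

# The slot "between" the two, U+03A2, is unassigned ("Cn" = not assigned).
print(unicodedata.category("\u03A2"))  # Cn
```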
Another difficulty, besides the dotless and dotted i of Turkic
languages, is the digraph letters used in the Latin transcription
of Cyrillic text in eastern and southern Europe: dž, lj, nj and
dz, which have an uppercase form (DŽ, LJ, NJ, DZ) and a distinct
titlecase form (Dž, Lj, Nj, Dz).
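These digraphs exist as single codepoints in all three case forms,
so the distinction is visible directly in standard case-mapping
functions; a small Python sketch (the codepoints are from
UnicodeData, the example itself is mine):

```python
# The digraph dž exists as a single codepoint in all three cases:
dz_lower = "\u01C6"  # dž LATIN SMALL LETTER DZ WITH CARON
dz_title = "\u01C5"  # Dž LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
dz_upper = "\u01C4"  # DŽ LATIN CAPITAL LETTER DZ WITH CARON

# upper() and title() give different, equally valid "capitalized" forms:
print(dz_lower.upper())  # DŽ (U+01C4)
print(dz_lower.title())  # Dž (U+01C5) -- titlecase, distinct from uppercase
```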
>
> Besides, that still doesn't solve the problem of what
> "i".uppercase() should return. In most languages, it should
> return "I", but in Turkish it should not. And if we really
> went the route of encoding Cyrillic letters the same as their
> Latin lookalikes, we'd have a problem with what "m".uppercase()
> should return, because now it depends on which font is in
> effect (if it's a Cyrillic cursive font, the correct answer is
> "Т", if it's a Latin font, the correct answer is "M" -- the
> other combinations: who knows). That sounds far worse than
> what we have today.
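The Turkish problem is visible even without locale support,
because most standard libraries implement the locale-independent
default mapping; a quick Python illustration (my own example, not
from the post):

```python
# Turkish distinguishes dotted and dotless i:
dotless_lower = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
dotted_upper = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Locale-independent case mapping (the Unicode default):
print("i".upper())            # I -- wrong for Turkish, which expects İ
print(dotless_lower.upper())  # I -- this one is correct for Turkish
```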
As an anecdote I can tell the story of the accession of Romania
and Bulgaria to the European Union in 2007. The issue was that
several letters used by Romanian and Bulgarian had been forgotten
by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B
and 2 Cyrillic letters that I do not remember). As replacements,
Romanians used Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163),
which look a little bit alike. When the Commission finally managed
to force Microsoft to correct the fonts to include them, we could
start to correct the data. The transition was finished in 2012,
and it was only possible because no other language we deal with
uses the "wrong" codepoints (Turkish does, but fortunately we only
have a handful of Turkish entries in our db's). So: 5 years of
ad hoc processing for the substitution of 4 codepoints.
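Once the correct codepoints are available in fonts, the actual
data fix is a straight codepoint substitution; a minimal sketch of
the mapping described above (the `fix_romanian` helper is my own
illustration, not the EC's actual tooling):

```python
# Map the cedilla stand-ins to the proper Romanian comma-below letters.
ROMANIAN_FIX = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

def fix_romanian(text: str) -> str:
    """Replace the 4 substitute codepoints with the correct ones.

    Caveat from the anecdote: this must NOT be applied to Turkish
    text, where the cedilla forms are the correct letters.
    """
    return text.translate(ROMANIAN_FIX)

print(fix_romanian("Bra\u015Fov, Timi\u015Foara"))  # Brașov, Timișoara
```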
BTW: using combining diacritics was out of the question at the
time, simply because Microsoft Word didn't support them and many
documents we encountered still only used codepages. One also has
to remember that in a big institution like the EC, the IT is
always several years behind the open market, which means that when
a product is at release X, the Institution might still be using
release X minus 5 years.
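For completeness: the combining-diacritic spellings and the
precomposed letters are canonically equivalent, so once tooling
supports it, Unicode normalization converts between them; a small
Python sketch (my own example):

```python
import unicodedata

# Ș (U+0218) is canonically equivalent to S + COMBINING COMMA BELOW (U+0326).
precomposed = "\u0218"
combining = "S\u0326"

print(precomposed == combining)  # False: different codepoint sequences
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```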