The Case Against Autodecode

Patrick Schluter via Digitalmars-d digitalmars-d at puremagic.com
Sat Jun 4 00:22:26 PDT 2016


On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:
>>
> Even the Greek sigma has two forms depending on whether it's at 
> the end of a word or not -- so should it be two code points or 
> one? If you say two, then you'd have a problem with how to 
> search for sigma in Greek text, and you'd have to search for 
> either medial sigma or final sigma. But if you say one, then 
> you'd have a problem with having two different letterforms for 
> a single codepoint.

In Unicode there are two different codepoints for lowercase 
sigma, ς U+03C2 and σ U+03C3, but only one for uppercase sigma, 
Σ U+03A3. Codepoint U+03A2 is unassigned. So your objection is 
not hypothetical; it is an actual issue for uppercase() and 
lowercase() functions.
Another difficulty, besides the dotless and dotted i of the 
Turkic languages, is the double letters used in the Latin 
transcription of Cyrillic text in eastern and southern Europe: 
dž, lj, nj and dz, which have an uppercase form (DŽ, LJ, NJ, DZ) 
and a titlecase form (Dž, Lj, Nj, Dz).
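
To make the two points above concrete, here is a minimal sketch 
in Python, chosen only because its str methods expose the Unicode 
default case mappings; it says nothing about how any D library 
behaves. The constant names and the round_trip() helper are mine, 
for illustration only.

```python
import unicodedata

FINAL_SIGMA = "\u03c2"    # ς
MEDIAL_SIGMA = "\u03c3"   # σ
CAPITAL_SIGMA = "\u03a3"  # Σ


def round_trip(text):
    """Uppercase then lowercase again (hypothetical helper)."""
    return text.upper().lower()


# Both lowercase sigmas share the single uppercase sigma ...
assert FINAL_SIGMA.upper() == CAPITAL_SIGMA
assert MEDIAL_SIGMA.upper() == CAPITAL_SIGMA

# ... so an isolated final sigma does not survive a round trip.
assert round_trip(FINAL_SIGMA) == MEDIAL_SIGMA

# U+03A2, the slot between rho and capital sigma, is unassigned
# (general category "Cn").
assert unicodedata.category("\u03a2") == "Cn"

# The single-codepoint digraph for "dz with caron" has three cases.
DZ_UPPER = "\u01c4"  # DŽ
DZ_TITLE = "\u01c5"  # Dž
DZ_LOWER = "\u01c6"  # dž

assert DZ_LOWER.upper() == DZ_UPPER
assert DZ_LOWER.title() == DZ_TITLE
assert DZ_TITLE.lower() == DZ_LOWER

# The titlecase form has its own general category, "Lt".
assert unicodedata.category(DZ_TITLE) == "Lt"
```

The upshot is that a case conversion API with only "upper" and 
"lower" cannot represent every cased letter, and lower(upper(x)) 
is not guaranteed to give back x.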

>
> Besides, that still doesn't solve the problem of what 
> "i".uppercase() should return. In most languages, it should 
> return "I", but in Turkish it should not.  And if we really 
> went the route of encoding Cyrillic letters the same as their 
> Latin lookalikes, we'd have a problem with what "m".uppercase() 
> should return, because now it depends on which font is in 
> effect (if it's a Cyrillic cursive font, the correct answer is 
> "Т", if it's a Latin font, the correct answer is "M" -- the 
> other combinations: who knows).  That sounds far worse than 
> what we have today.

As an anecdote I can tell the story of the accession of Romania 
and Bulgaria to the European Union in 2007. The issue was that 
3 letters used by Romanian and Bulgarian had been forgotten by 
the Unicode Consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B 
and 2 Cyrillic letters that I do not remember). The Romanians 
used Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163) as 
replacements, which look somewhat similar. When the Commission 
finally managed to force Microsoft to correct the fonts to 
include them, we could start to correct the data. The transition 
was finished in 2012 and was only possible because no other 
language we deal with uses the "wrong" codepoints (Turkish does, 
but fortunately we only have a handful of Turkish documents in 
our databases). So 5 years of ad hoc processing for the 
substitution of 4 codepoints.
BTW: using combining diacritics was out of the question at that 
time, simply because Microsoft Word didn't support them and many 
documents we encountered still used only codepages (one also has 
to remember that in a big institution like the EC, the IT is 
always several years behind the open market, which means that 
when a product is at release X, the institution might still be 
using the release from 5 years earlier).
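
For illustration, the 4-codepoint substitution described above 
can be sketched as a plain translation table. This is a minimal 
sketch in Python, not the processing we actually ran; the names 
CEDILLA_TO_COMMA_BELOW and fix_romanian() are hypothetical.

```python
import unicodedata

# Cedilla forms (the "wrong" codepoints) mapped to the
# comma-below forms that Romanian actually uses.
CEDILLA_TO_COMMA_BELOW = str.maketrans({
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
})


def fix_romanian(text):
    """Replace the cedilla look-alikes with comma-below letters.

    Only safe on text known to be Romanian: Turkish legitimately
    uses U+015E and U+015F and would be corrupted by this.
    """
    return text.translate(CEDILLA_TO_COMMA_BELOW)


# "ştiinţă" written with cedillas becomes "știință".
assert fix_romanian("\u015ftiin\u0163\u0103") == "\u0219tiin\u021b\u0103"

# The combining-diacritic alternative: under NFD the comma-below
# letter is "s" + U+0326, the cedilla letter is "s" + U+0327.
assert unicodedata.normalize("NFD", "\u0219") == "s\u0326"
assert unicodedata.normalize("NFD", "\u015f") == "s\u0327"
```

The table itself is trivial; the hard part was knowing which 
documents it could be applied to, since the mapping is only 
correct when the language of the text is known.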



