Fix Phobos dependencies on autodecoding

Patrick Schluter Patrick.Schluter at bbox.fr
Fri Aug 16 09:20:54 UTC 2019


On Friday, 16 August 2019 at 06:28:30 UTC, Walter Bright wrote:
> On 8/15/2019 3:56 PM, H. S. Teoh wrote:
>> And now that you agree that character encoding should be based on
>> "symbol" rather than "glyph", the next step is the realization that,
>> in the wide world of international languages out there, there exist
>> multiple "symbols" that are rendered with the *same* glyph. This is
>> a hard fact of reality, and no matter how you wish it to be
>> otherwise, it simply ain't so. Your ideal of "character == glyph"
>> simply doesn't work in real life.
>
> Splitting semantic hares is pointless, as the fact remains it 
> worked just fine in real life before Unicode, it's called 
> "printing" on paper.

Sorry, no, it didn't work in reality before Unicode. Multi-language 
systems were a mess.
My job is on the biggest translation memory in the world, the Euramis 
system of the European Union, and when I started there in 2002 the 
system supported only 11 languages. The data in the Oracle database 
was already in Unicode, but the entire supporting translation chain 
was codepage-based. It was a catastrophe, and the amount of crap, 
especially in the Greek data, was staggering. The issues H. S. Teoh 
described above were indeed a real pain point. In Greek text it was 
very frequent to find Latin characters mixed in with Greek characters 
from codepage 1253. Was that A an alpha or a \x41? This crap drove a 
lot of the algorithms used downstream from the database (CAT tools, 
automatic translation, etc.) completely bonkers.
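The ambiguity can be illustrated with a short Python sketch (hypothetical, not Euramis code): the Latin capital A (U+0041) and the Greek capital alpha (U+0391) render with the same glyph but are distinct Unicode code points, so mixed-script words can be detected mechanically once the data is in Unicode.

```python
import unicodedata

def scripts_in(word):
    """Return the set of scripts (Latin/Greek) used by a word's letters."""
    scripts = set()
    for ch in word:
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("Latin")
        elif name.startswith("GREEK"):
            scripts.add("Greek")
    return scripts

# Hypothetical example: a Latin A followed by Greek lambda, phi, alpha
# looks identical to the all-Greek word, but is mixed-script.
mixed = "A\u039b\u03a6\u0391"       # Latin A + Greek letters
pure  = "\u0391\u039b\u03a6\u0391"  # all Greek

print(scripts_in(mixed))  # {'Latin', 'Greek'}
print(scripts_in(pure))   # {'Greek'}
```

In codepage 1253 no such check is possible, because the Latin range is shared byte-for-byte with ASCII and the confusion is invisible to software.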
For the 2004 enlargement of the EU we had to support one more alphabet 
(Cyrillic, for Bulgarian) and four more codepages (CP-1250 Latin-2 
Extended-A, CP-1251 Cyrillic, CP-1257 Baltic and ISO-8859-3 Maltese). 
It would have been such a mess that we decided to convert everything 
to Unicode.
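The core of the problem is easy to demonstrate (a minimal sketch, not the actual Euramis tooling): the very same byte means a different letter in each of those codepages, so its meaning depends on out-of-band knowledge that is easily lost. Decoding to Unicode makes the meaning explicit once and for all.

```python
# The single byte 0xC1 under three of the codepages mentioned above:
raw = b"\xc1"

print(raw.decode("cp1253"))  # Greek capital alpha
print(raw.decode("cp1251"))  # Cyrillic capital Be
print(raw.decode("cp1250"))  # Latin capital A with acute
```

After conversion, each of these is a distinct code point (U+0391, U+0411, U+00C1) and no downstream tool has to guess which codepage a string came from.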
We don't have that crap data anymore. Our code is not perfect, far 
from it, but adopting Unicode through and through and dropping all 
support for the old codepage crap simplified our lives tremendously.
When, in 2010, we got the request from the EEAS (European External 
Action Service) to also support languages other than the 24 official 
EU languages, namely Russian, Arabic and Chinese, we didn't break a 
sweat implementing it, thanks to Unicode.

>
> As for not working in real life, that's Unicode.

Unicode works much, much better than anything that existed before. 
The issue is that not a lot of people work in a multi-language 
environment, so they have no clue what an unholy mess it was before.





More information about the Digitalmars-d mailing list