Fix Phobos dependencies on autodecoding
Patrick Schluter
Patrick.Schluter at bbox.fr
Fri Aug 16 09:20:54 UTC 2019
On Friday, 16 August 2019 at 06:28:30 UTC, Walter Bright wrote:
> On 8/15/2019 3:56 PM, H. S. Teoh wrote:
>> And now that you agree that character encoding should be based on
>> "symbol" rather than "glyph", the next step is the realization that,
>> in the wide world of international languages out there, there exist
>> multiple "symbols" that are rendered with the *same* glyph. This is
>> a hard fact of reality, and no matter how you wish it to be
>> otherwise, it simply ain't so. Your ideal of "character == glyph"
>> simply doesn't work in real life.
>
> Splitting semantic hares is pointless, as the fact remains it
> worked just fine in real life before Unicode, it's called
> "printing" on paper.
Sorry, no, it didn't work in reality before Unicode. Multi-language
systems were a mess.
I work on the biggest translation memory in the world, the Euramis
system of the European Union. When I started there in 2002, the
system supported only 11 languages. The data in the Oracle database
was already in Unicode, but the whole supporting translation chain
was codepage-based. It was a catastrophe, and the amount of crap,
especially in the Greek data, was staggering. The issues H. S. Teoh
described above were indeed a real pain point. In Greek text it was
very common to find Latin characters mixed in with Greek characters
from codepage 1253: was that A a Greek alpha or a Latin \x41? This
garbage made a lot of the algorithms used downstream of the database
(CAT tools, automatic translation, etc.) go completely bonkers.
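To make that ambiguity concrete, here is a minimal D sketch (an
illustration, not Euramis code): the Latin capital A and the Greek
capital alpha print with the same glyph, yet they are two distinct
characters, which is exactly the distinction that got lost once Latin
bytes leaked into CP-1253 Greek text.

    import std.stdio;

    void main()
    {
        // Same glyph on paper or on screen, two different characters.
        dchar latinA     = '\u0041'; // LATIN CAPITAL LETTER A
        dchar greekAlpha = '\u0391'; // GREEK CAPITAL LETTER ALPHA

        writeln(latinA == greekAlpha);                      // false
        writefln("U+%04X vs U+%04X",
                 cast(uint) latinA, cast(uint) greekAlpha); // U+0041 vs U+0391
    }

With everything in Unicode the two are at least distinguishable by
code point; in mixed codepage data you only found out when a
downstream tool choked.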
For the 2004 enlargement of the EU we had to support one more
alphabet (Cyrillic, for Bulgarian) and four more codepages (CP-1250
Latin-2/Extended-A, CP-1251 Cyrillic, CP-1257 Baltic and ISO-8859-3
Maltese). It would have been such a mess that we decided to convert
everything to Unicode.
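The conversion itself is conceptually the easy part. A rough sketch
in D using Phobos's std.encoding (which ships Latin-1 rather than the
CP-1253/1257 tables mentioned above, so take Latin-1 as a stand-in
for any legacy codepage):

    import std.encoding;
    import std.stdio;

    void main()
    {
        // Raw legacy bytes as they might sit in an old database dump;
        // 0xE9 is 'é' in ISO-8859-1 (Latin-1).
        immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9]; // "café"
        auto legacy = cast(Latin1String) raw;

        string utf8;
        transcode(legacy, utf8); // convert once, store UTF-8 from then on
        writeln(utf8);           // café
    }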
We don't have that crap data anymore. Our code is not perfect, far
from it, but adopting Unicode through and through and dropping all
support for the old codepage crap simplified our lives tremendously.
When, in 2010, we got the request from the EEAS (European External
Action Service) to also support languages other than the 24 official
EU languages, namely Russian, Arabic and Chinese, we didn't break a
sweat implementing it, thanks to Unicode.
>
> As for not working in real life, that's Unicode.
Unicode works much, much better than anything that existed before.
The issue is that not many people work in a multi-language
environment, and they have no clue what an unholy mess it was before.