Fix Phobos dependencies on autodecoding
Patrick Schluter
Patrick.Schluter at bbox.fr
Sat Aug 17 11:03:21 UTC 2019
On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
> On 8/16/2019 9:32 AM, xenon325 wrote:
>> On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright
>> wrote:
>>> And yet somehow people manage to read printed material
>>> without all these problems.
>>
>> If same glyphs had same codes, what will you do with these:
>>
>> 1) Sort string.
>>
>> In my phone's contact lists there are entries in russian, in
>> english and mixed.
>> Now they are sorted as:
>> A (latin), B (latin), C, А (ru), Б, В (ru).
>> Wich is pretty easy to search/navigate.
>
> Except that there's no guarantee that whoever entered the data
> used the right code point.
From my experience, that was an issue we encountered often before
Unicode: the uppercase letters in Greek texts were a mix of ASCII
(A, 0x41) and Greek (Α, 0xC1 in CP-1253). It was so bad that the
Greek translation department didn't use Euramis for a significant
amount of time. It was only when we got completely rid of this
crap (and also of the RTF file format) and embraced Unicode that
we got rid of this issue of mis-used encodings.
While I get that Unicode is (over-)complicated and in some
aspects silly, it nonetheless has 2 essential virtues that no
other encoding scheme was ever able to achieve:
- it is a norm that is widely used, almost universal.
- it is a norm that is widely used, almost universal.
Yeah, I'm lame, I repeated it twice :-)
The fact that it is widely adopted even in the Far East makes it
really something essential. Could they have defined things
differently or more simply? Maybe, but I doubt it, as the
complexity of Unicode comes from the complexity of languages
themselves.
>
> The pragmatic solution, again, is to use context. I.e. if a
> glyphy is surrounded by russian characters, it's likely a
> russian glyph. If it is surrounded by characters that form a
> common russian word, it's likely a russian glyph.
No, that doesn't work for panaché (mixed-language) documents;
we've been there, we had that, and it sucks. UTF was such a
relief.
Here's a little example from our configuration. The regular
expression used to detect a document reference in a text as a
replaceable:
0:UN:EC_N:((№|č.|nr.|št.|αριθ.|No|nr|N:o|Uimh.|br.|n.|Nr.|Nru|[Nn][º°o]|[Nn].[º°o])[ ][0-9]+/[0-9]+/(EC|ES|EF|EG|EK|EΚ|CE|EÜ|EY|CE|EZ|EB|KE|WE))
What is the context here? Btw, the EC is Cyrillic and the first
EΚ is Greek.
And here are their substitution expressions:
T:BG:EC_N:№\2/ЕС
T:CS:EC_N:č.\2/ES
T:DA:EC_N:nr.\2/EF
T:DE:EC_N:Nr.\2/EG
T:EL:EC_N:αριθ.\2/EΚ
T:EN:EC_N:No\2/EC
T:ES:EC_N:nº\2/CE
T:ET:EC_N:nr\2/EÜ
T:FI:EC_N:N:o\2/EY
T:FR:EC_N:nº\2/CE
T:GA:EC_N:Uimh.\2/CE
T:HR:EC_N:br.\2/EZ
T:IT:EC_N:n.\2/CE
T:LT:EC_N:Nr.\2/EB
T:LV:EC_N:Nr.\2/EK
T:MT:EC_N:Nru\2/KE
T:NL:EC_N:nr.\2/EG
T:PL:EC_N:nr\2/WE
T:PT:EC_N:n.º\2/CE
T:RO:EC_N:nr.\2/CE
T:SK:EC_N:č.\2/ES
T:SL:EC_N:št.\2/ES
T:SV:EC_N:nr\2/EG
And as said before, such a number can appear in a citation, in
the language of the citation, not in the language of the
document.
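The point of the configuration above can be shown in a few lines: because "EC" exists in Latin, Cyrillic and Greek lookalike forms, the alternation has to enumerate each script explicitly, and a pattern written only with Latin letters silently misses the others. A minimal sketch, heavily simplified from the real configuration (the pattern and sample strings are made up for illustration):

```python
import re

# Simplified detector: "No 123/2019/EC" style references, where the
# trailing abbreviation may be Latin "EC" or the visually identical
# Cyrillic "\u0415\u0421" (as in the Bulgarian substitution above).
pattern = re.compile(r"[Nn]o ?(\d+/\d+)/(EC|\u0415\u0421)")

latin_ec = "No 123/2019/EC"
cyrillic_ec = "No 123/2019/\u0415\u0421"  # renders identically on screen

print(bool(pattern.search(latin_ec)))     # True
print(bool(pattern.search(cyrillic_ec)))  # True

# A naive pattern using only the Latin letters misses the Cyrillic form:
naive = re.compile(r"[Nn]o ?(\d+/\d+)/EC")
print(bool(naive.search(cyrillic_ec)))    # False
```

No amount of surrounding context disambiguates the two strings here; only the code points themselves do, which is why the configuration spells out every script variant.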
>
> Of course it isn't perfect, but I bet using context will work
> better than expecting the code points to have been entered
> correctly.
>
> I note that you had to tag В with (ru), because otherwise no
> human reader or OCR would know what it was. This is exactly the
> problem I'm talking about.
Yeah, but what you propose makes it even worse, not better.
>
> Writing software that relies on invisible semantic information
> is never going to work.
Invisible to your eyes, but not invisible to the machines; that's
the whole point. Why do we need to annotate all the functions in
D with these annoying attributes if the compiler can detect them
automagically via context? Because in general it can't: the
semantic information must be provided somehow.