Fix Phobos dependencies on autodecoding

Patrick Schluter Patrick.Schluter at bbox.fr
Sat Aug 17 11:03:21 UTC 2019


On Friday, 16 August 2019 at 21:05:44 UTC, Walter Bright wrote:
> On 8/16/2019 9:32 AM, xenon325 wrote:
>> On Thursday, 15 August 2019 at 22:23:13 UTC, Walter Bright 
>> wrote:
>>> And yet somehow people manage to read printed material 
>>> without all these problems.
>> 
>> If same glyphs had same codes, what will you do with these:
>> 
>> 1) Sort string.
>> 
>> In my phone's contact lists there are entries in russian, in 
>> english and mixed.
>> Now they are sorted as:
>> A (latin), B (latin), C, А (ru), Б, В (ru).
>> Which is pretty easy to search/navigate.
>
> Except that there's no guarantee that whoever entered the data 
> used the right code point.

 From my experience, that was an issue we encountered often 
before Unicode: the uppercase letters in Greek texts were 
mixes of ASCII (A 0x41) and Greek (Α 0xC1 in CP-1253). It was so 
bad that the Greek translation department didn't use Euramis for 
a significant amount of time. It was only when we got completely 
rid of this crap (and also the RTF file format) and embraced 
Unicode that we got rid of this issue of mis-used encodings.
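The confusion is easy to reproduce: the glyphs render identically, but the code points (and the old code-page bytes) differ. A minimal Python sketch, just to illustrate the point:

```python
import unicodedata

# Latin, Greek and Cyrillic capital "A" look the same on screen
# but are three distinct Unicode code points.
latin = "A"      # U+0041
greek = "Α"      # U+0391
cyrillic = "А"   # U+0410

print(latin == greek, greek == cyrillic)   # False False
for ch in (latin, greek, cyrillic):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# In the pre-Unicode CP-1253 world, Greek Α was the byte 0xC1 while
# ASCII A was 0x41 -- mixing them in one text was invisible on screen.
print(greek.encode("cp1253").hex())  # c1
```

Run it and the three "identical" letters report three different code points, which is exactly the ambiguity that mixed-encoding texts hid.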
While I get that Unicode is (over-)complicated and in some 
aspects silly, it nonetheless has two essential virtues that no 
other encoding scheme was ever able to achieve:
- it is a norm that is widely used, almost universal.
- it is a norm that is widely used, almost universal.

Yeah, I'm lame, I repeated it twice :-)

The fact that it is widely adopted even in the Far East makes it 
really something essential. Could they have defined things 
differently or more simply? Maybe, but I doubt it, as the 
complexity of Unicode comes from the complexity of languages 
themselves.


>
> The pragmatic solution, again, is to use context. I.e. if a 
> glyphy is surrounded by russian characters, it's likely a 
> russian glyph. If it is surrounded by characters that form a 
> common russian word, it's likely a russian glyph.

No, that doesn't work for mixed-language ("panaché") documents; 
we've been there, we had that, and it sucked. UTF was such a 
relief. Here is a little example from our configuration: the 
regular expression used to detect a document reference in a text 
as a replaceable:

0:UN:EC_N:((№|č.|nr.|št.|αριθ.|No|nr|N:o|Uimh.|br.|n.|Nr.|Nru|[Nn][º°o]|[Nn].[º°o])[  ][0-9]+/[0-9]+/(EC|ES|EF|EG|EK|EΚ|CE|EÜ|EY|CE|EZ|EB|KE|WE))

What is the context here? By the way, the EC is Cyrillic and the 
first EK is Greek.
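Those Latin, Cyrillic and Greek two-letter codes render identically but never compare equal, so no amount of "surrounding context" in the text itself tells a program which one was typed. A Python illustration (the exact strings below are my reconstruction of the lookalikes, not copied from the config):

```python
latin_ec = "EC"      # U+0045 U+0043 (Latin)
cyrillic_ec = "ЕС"   # U+0415 U+0421 (Cyrillic, as in Bulgarian ЕС)
greek_ek = "ΕΚ"      # U+0395 U+039A (Greek, as in EΚ)

# Visually indistinguishable, but different code points throughout.
print(latin_ec == cyrillic_ec)  # False
for s in (latin_ec, cyrillic_ec, greek_ek):
    print(s, [f"U+{ord(c):04X}" for c in s])
```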

and here are their substitution expressions:
T:BG:EC_N:№\2/ЕС
T:CS:EC_N:č.\2/ES
T:DA:EC_N:nr.\2/EF
T:DE:EC_N:Nr.\2/EG
T:EL:EC_N:αριθ.\2/EΚ
T:EN:EC_N:No\2/EC
T:ES:EC_N:nº\2/CE
T:ET:EC_N:nr\2/EÜ
T:FI:EC_N:N:o\2/EY
T:FR:EC_N:nº\2/CE
T:GA:EC_N:Uimh.\2/CE
T:HR:EC_N:br.\2/EZ
T:IT:EC_N:n.\2/CE
T:LT:EC_N:Nr.\2/EB
T:LV:EC_N:Nr.\2/EK
T:MT:EC_N:Nru\2/KE
T:NL:EC_N:nr.\2/EG
T:PL:EC_N:nr\2/WE
T:PT:EC_N:n.º\2/CE
T:RO:EC_N:nr.\2/CE
T:SK:EC_N:č.\2/ES
T:SL:EC_N:št.\2/ES
T:SV:EC_N:nr\2/EG

And as said before, such a number can appear inside a citation, 
in the language of the citation rather than the language of the 
document.
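To make the mechanism concrete, here is a rough Python sketch of how such a detection/substitution pair could work. The pattern, the group numbering and the template names are simplified by me (the real Euramis syntax above presumably numbers its groups differently), so treat them as illustrative assumptions, not the actual configuration:

```python
import re

# Simplified detector (Latin-only subset for brevity): a
# language-specific prefix, a document number, and a
# language-specific suffix.
pattern = re.compile(r"(No|Nr\.|nr\.|n\.)\s?([0-9]+/[0-9]+)/(EC|EG|EF|CE)")

# Per-target-language substitution templates, modeled on the T: lines.
templates = {
    "EN": r"No \2/EC",
    "DE": r"Nr. \2/EG",
    "FR": r"nº \2/CE",
}

text = "see Regulation No 123/2006/EC for details"
print(pattern.sub(templates["DE"], text))
# -> see Regulation Nr. 123/2006/EG for details
```

The document number (group 2) is carried over unchanged while the language-specific prefix and suffix are swapped, which is what the `\2` in the T: lines above does.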

>
> Of course it isn't perfect, but I bet using context will work 
> better than expecting the code points to have been entered 
> correctly.
>
> I note that you had to tag В with (ru), because otherwise no 
> human reader or OCR would know what it was. This is exactly the 
> problem I'm talking about.

Yeah, but what you propose makes it even worse, not better.

>
> Writing software that relies on invisible semantic information 
> is never going to work.

Invisible to your eyes, but not to the machine; that's the whole 
point. Why do we need to annotate all the functions in D with 
these annoying attributes if the compiler could detect them 
automagically via context? Because in general it can't: the 
semantic information must be provided somehow.






More information about the Digitalmars-d mailing list