Today's programming challenge - How's your Range-Fu ?

Shachar Shemesh via Digitalmars-d digitalmars-d at puremagic.com
Sun Apr 19 23:55:01 PDT 2015


On 19/04/15 22:58, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
>
> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf
>

This is not a very accurate depiction of Unicode.

For example:
"And, moreover, BOM is meaningless without mentioning of encoding. So we 
have to specify encoding anyway."

No. The BOM is what lets you auto-detect the encoding. If you know the 
text is UTF-8, UTF-16 or UTF-32 but not which one, the BOM will tell 
you which it is. That is its entire purpose, in fact.
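
To make the auto-detection concrete, here is a minimal sketch in Python. 
The helper names sniff_bom and decode_with_bom are made up for this 
illustration; only the BOM constants come from the standard codecs 
module. It assumes the data is one of UTF-8, UTF-16 or UTF-32, as in the 
scenario above.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so the order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM; fall back to `default` when there is none."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

One known ambiguity: UTF-16 LE text whose first character after the BOM 
is U+0000 looks exactly like a UTF-32 LE BOM, so this sketch would guess 
UTF-32 LE for it.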

There, pretty much, goes point #1.

And then:
"Unicode contains at least “writing direction” control symbols (LTR is 
U+200E and RTL is U+200F) which role is IDENTICAL to the role of 
codepage-switching symbols with the associated disadvantages."

That's just ignorance of how the UBA (the Unicode Bidirectional 
Algorithm, TR#9) works. LRM and RLM are mere invisible characters with 
defined directionality. Cutting them away from a substring would not 
invalidate your text more than cutting away actual text would under the 
same conditions. In any case, unlike codepage-switching symbols, it 
would only affect your display, not your understanding of the text.
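
A small Python sketch of that claim, using the standard unicodedata 
module. The strip_marks helper and the sample string are made up for 
this illustration. It shows that LRM and RLM are ordinary format 
characters with a fixed bidirectional class, and that dropping them 
leaves every letter exactly as it was.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Both marks are invisible format characters (category Cf) whose only
# property of interest is their bidirectional class.
lrm_info = (unicodedata.category(LRM), unicodedata.bidirectional(LRM))
rlm_info = (unicodedata.category(RLM), unicodedata.bidirectional(RLM))


def strip_marks(text: str) -> str:
    """Remove LRM and RLM; every other code point is left untouched."""
    return text.replace(LRM, "").replace(RLM, "")


# Hypothetical sample: Latin text, an RLM, three Hebrew letters, an LRM.
marked = "abc" + RLM + " \u05d0\u05d1\u05d2" + LRM + "!"
plain = strip_marks(marked)
```

After stripping, the string still contains the same letters in the same 
logical order. Only the way a bidi-aware renderer lays it out may change.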

So point #2 is out.

He has a valid argument under point #3, but also lots of !(@#&$ 
nonsense. He is right, I think, that denoting units with separate code 
points makes no sense, but the rest of his arguments seem completely 
off. For example, asking Latin and Cyrillic to share the same region 
merely because some letters look alike makes no sense, implementation-wise.
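
One possible reading of "implementation-wise", sketched in Python (the 
choice of letters is mine, not taken from the article): Latin "B" and 
Cyrillic "В" look alike in upper case, yet their lower-case forms 
differ, so a single shared code point could not carry one case mapping.

```python
import unicodedata

latin_b = "B"            # U+0042 LATIN CAPITAL LETTER B
cyrillic_ve = "\u0412"   # U+0412 CYRILLIC CAPITAL LETTER VE

# The two capitals are usually drawn with the same glyph, but they
# lower-case to different letters: "b" versus "в" (U+0432).
lower_pair = (latin_b.lower(), cyrillic_ve.lower())

# Their character names show they belong to different scripts.
names = (unicodedata.name(latin_b), unicodedata.name(cyrillic_ve))
```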


Points #4, #5, #6 and #7 are the same point. The main objection I have 
there is his assumption that the situation is, somehow, worse than it 
was. Yes, if you knew your encoding was Windows-1255, you could assume 
the text is Hebrew.

Or Yiddish.

And this, I think, is one of the encodings with the fewest languages 
riding on it. Windows-1256 has Arabic, Persian, Urdu and others. 
Windows-1252 covers essentially all of the Western European languages. 
As pointed out elsewhere in this thread, Spanish and French treat case 
folding of accented letters differently.
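
As a rough illustration in Python (the sample words are mine), a single 
code page such as Windows-1252, the Western European one, happily 
encodes text in several languages, so knowing the code page does not 
pin down the language. Conversely, the same byte means a different 
letter under each code page.

```python
# Hypothetical sample words in three languages, all representable in
# the one Western European code page.
samples = {
    "French": "d\u00e9j\u00e0 vu",
    "Spanish": "ma\u00f1ana",
    "German": "Gr\u00f6\u00dfe",
}

# Every sample round-trips through cp1252 unchanged.
round_trips = {
    lang: text.encode("cp1252").decode("cp1252") == text
    for lang, text in samples.items()
}

# The same single byte decoded under three different code pages:
# Cyrillic small a, Latin a with grave, Hebrew alef.
byte = b"\xe0"
views = {cp: byte.decode(cp) for cp in ("cp1251", "cp1252", "cp1255")}
```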

Also, we see that the solution he thinks would work better actually 
doesn't. People living in France don't switch to a QWERTY keyboard when 
they want to type English. They type English with their AZERTY keyboard. 
There simply is no automatic way to tell what language something is 
typed in without a human telling you (or applying content-based 
heuristics).

Microsoft Word stores, for each letter, the keyboard language it was 
typed with. This causes great problems when copying to other editors, 
performing searches, or simply trying to get bidirectional text to 
appear correctly. The problem is so bad that phone numbers where the 
prefix appears after the actual number are not considered bad form or 
unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is 
when they need to switch to a language with a different alphabet. No 
Russian speaker will type English using the Russian layout, even if what 
she has to say happens to use letters with the same glyphs. You simply 
do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant 
on the Internet doesn't mean it's correct. When someone says something 
is broken, always ask them what they suggest instead.

Shachar

