Today's programming challenge - How's your Range-Fu ?
Shachar Shemesh via Digitalmars-d
digitalmars-d at puremagic.com
Sun Apr 19 23:55:01 PDT 2015
On 19/04/15 22:58, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:
>
> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf
>
This is not a very accurate depiction of Unicode.
For example:
    And, moreover, BOM is meaningless without mentioning of encoding. So we
    have to specify encoding anyway.
No. The BOM is what lets you auto-detect the encoding. If you know the
text is UTF-8, UTF-16 or UTF-32 but not which one (or which byte order),
the BOM will tell you. That is its entire purpose, in fact.
There, pretty much, goes point #1.
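
To make the auto-detection concrete, here is a minimal sketch in Python. The helper name `sniff_bom` is made up for illustration; only the BOM constants come from the standard `codecs` module. It assumes the input really is one of the UTF encodings, and it cannot tell UTF-32 LE apart from UTF-16 LE text that happens to start with U+0000.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Hypothetical helper: return (encoding, payload without BOM).

    Returns (None, data) when no known BOM is present.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


if __name__ == "__main__":
    raw = codecs.BOM_UTF16_BE + "range-fu".encode("utf-16-be")
    encoding, payload = sniff_bom(raw)
    print(encoding, payload.decode(encoding))
```

No encoding is passed in anywhere; the first few bytes are enough to pick the decoder.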
And then:
    Unicode contains at least “writing direction” control symbols (LTR is
    U+200E and RTL is U+200F) which role is IDENTICAL to the role of
    codepage-switching symbols with the associated disadvantages.
That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere
invisible characters with defined directionality. Cutting them away from
a substring would not invalidate your text more than cutting away actual
text would under the same conditions. In any case, unlike codepage-switching
symbols, it would only affect your display, not your understanding of
the text.
So point #2 is out.
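
A quick way to check this is to ask the Unicode character database what these two code points are. The sketch below uses Python's standard `unicodedata` module; the helper `strip_marks` and the sample string are invented for the illustration.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Both are ordinary format characters (category Cf) that merely carry a
# strong bidi class: "L" for LRM, "R" for RLM. They switch nothing.
for mark in (LRM, RLM):
    print(
        unicodedata.name(mark),
        unicodedata.category(mark),
        unicodedata.bidirectional(mark),
    )


def strip_marks(text: str) -> str:
    """Hypothetical helper: drop LRM/RLM, leaving every other code point."""
    return text.replace(LRM, "").replace(RLM, "")


# Illustrative sample: Latin text, an RLM, then three Hebrew letters.
sample = "abc" + RLM + " \u05d0\u05d1\u05d2"

# Removing the mark leaves all letters with their original identity;
# only the visual ordering chosen by the bidi algorithm can change.
print(strip_marks(sample) == "abc \u05d0\u05d1\u05d2")
```

Contrast this with a codepage switch, where losing the switch changes which characters the following bytes stand for.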
He has some valid argument under point #3, but also lots of !(@#&$
nonsense. He is right, I think, that denoting units with separate code
points makes no sense, but the rest of his arguments seem completely
off. For example, asking Latin and Cyrillic to share the same region
merely because some letters look alike makes no sense, implementation-wise.
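
One implementation-level reason, sketched in Python with the standard `unicodedata` module (the choice of this particular letter pair is mine, not the article's): Latin H and Cyrillic Н look the same in most fonts, yet they lowercase to different letters, so a single shared code point could not have one case mapping.

```python
import unicodedata

latin_h = "\u0048"      # LATIN CAPITAL LETTER H
cyrillic_en = "\u041d"  # CYRILLIC CAPITAL LETTER EN, same shape in most fonts

for ch in (latin_h, cyrillic_en):
    lower = ch.lower()
    print(
        f"U+{ord(ch):04X}", unicodedata.name(ch),
        "->",
        f"U+{ord(lower):04X}", unicodedata.name(lower),
    )

# The capitals look alike, but the lowercase forms do not.
print(latin_h.lower() == cyrillic_en.lower())
```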
Points #4, #5, #6 and #7 are the same point. The main objection I have
there is his assumption that the situation is, somehow, worse than it
was. Yes, if you knew your encoding was Windows-1255, you could assume
the text is Hebrew.
Or Yiddish.
And this, I think, is one of the encodings with the least number of
languages riding on it. Windows-1256 has Arabic, Persian, Urdu and
others. Windows-1252 covers the Western European languages. As pointed
out elsewhere in this thread, Spanish and French treat case folding of
accented letters differently.
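
As a small illustration that a legacy code page does not pin down the language, the Python sketch below round-trips words from three languages through the same code page. The word list and the language labels are my own examples; `cp1252` is Python's name for the Windows-1252 codec.

```python
# Example words (my own picks), one per language, all with accented letters.
samples = {
    "French": "été",
    "Spanish": "niño",
    "German": "grün",
}

CODEPAGE = "cp1252"  # Python's codec name for Windows-1252

for language, word in samples.items():
    raw = word.encode(CODEPAGE)
    # One byte per character, and the bytes alone carry no hint of
    # which language they were written in.
    print(language, raw, raw.decode(CODEPAGE) == word)
```

Knowing the code page narrows down the script, not the language, so language-specific rules still need outside information.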
Also, we see that the solution he thinks would work better actually
doesn't. People living in France don't switch to a QWERTY keyboard when
they want to type English. They type English with their AZERTY keyboard.
There simply is no automatic way to tell what language something is
typed in without a human telling you (or applying content-based heuristics).
Microsoft Word stores, for each letter, the keyboard language it was
typed with. This causes great problems when copying to other editors,
performing searches, or simply trying to get bidirectional text to
appear correctly. The problem is so bad that phone numbers where the
prefix appears after the actual number are not considered bad form or
unusual, even in official PR material or when sending resumes.
In fact, the only time you can count on someone to switch keyboards is
when they need to switch to a language with a different alphabet. No
Russian speaker will type English using the Russian layout, even if what
she has to say happens to use letters with the same glyphs. You simply
do not plan that much ahead.
The point I'm driving at is that just because someone posted some rant on
the Internet doesn't mean it's correct. When someone says something is
broken, always ask them what they suggest instead.
Shachar