Today's programming challenge - How's your Range-Fu ?

Shachar Shemesh via Digitalmars-d digitalmars-d at puremagic.com
Sat Apr 18 19:20:00 PDT 2015


On 18/04/15 21:40, Walter Bright wrote:
>
> I'm not arguing against the existence of the Unicode standard, I'm
> saying I can't figure any justification for standardizing different
> encodings of the same thing.
>

A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points, é (U+00E9, Latin 
small letter E with acute) comes from the Latin-1 Supplement block, which 
is designed to follow ISO-8859-1. U+0301 (Combining acute accent) comes 
from the "Combining Diacritical Marks" block.

The way I understand things, Unicode would really prefer to use 
U+0065 + U+0301 rather than U+00E9. Because of legacy systems, and because 
they would rather have the ISO-8859 code pages be 1:1 mappings, rather 
than 1:n mappings, they introduced code points they really would rather 
do without.
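
To make the two encodings of é concrete, here is a minimal sketch using 
Python's standard unicodedata module. The variable names and the 
code_points helper are illustrative only. It shows that the two spellings 
compare unequal as raw strings, and that normalization (NFC composes, NFD 
decomposes) maps one onto the other.

```python
import unicodedata

# Two spellings of the same user-perceived character.
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return ["U+%04X" % ord(c) for c in s]


# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(precomposed == decomposed)   # raw comparison of the two strings
print(code_points(nfc))
print(code_points(nfd))
```

Comparing strings only after normalizing both sides to the same form is 
the usual way to treat the two spellings as equal.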

This also explains the "presentation forms" code pages (e.g. 
http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be 
glyphs, rather than code points. Due to legacy reasons, it was not 
possible to simply discard them. They received code points, with a 
warning not to use these code points directly.
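
As a rough illustration of the "presentation forms" point, the sketch 
below looks at U+FB01 (the "fi" ligature from the chart linked above) with 
Python's unicodedata module. The choice of U+FB01 as the example is mine. 
The Unicode database tags it with a compatibility decomposition, and the 
compatibility normalization forms (NFKC/NFKD) replace it with the plain 
letters, while NFC leaves it alone.

```python
import unicodedata

ligature = "\ufb01"  # a character from the Alphabetic Presentation Forms chart

# The character database records a <compat> decomposition for it.
lig_name = unicodedata.name(ligature)
lig_decomposition = unicodedata.decomposition(ligature)

# Canonical normalization keeps the ligature; compatibility normalization
# replaces it with the ordinary letters it stands for.
kept = unicodedata.normalize("NFC", ligature)
folded = unicodedata.normalize("NFKC", ligature)

print(lig_name)
print(lig_decomposition)
print(kept == ligature, folded)
```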

Also, notice that some letters can only be achieved using multiple code 
points. Hebrew diacritics, for example, do not, typically, have a 
composite form. My name fully spelled (which you rarely would do), שַׁחַר, 
cannot be represented with fewer than 6 code points, despite having only 
three letters.

The previous paragraph isn't strictly true. You can use U+FB2A + U+05B7 for 
the first letter instead of U+05E9 + U+05C1 + U+05B7. You would be using 
the presentation form which, as pointed out above, is only there for legacy.
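
The code-point counts can be checked with a short sketch, again using 
Python's unicodedata module. The strings are written as escapes, and the 
spelling of the name (shin, patah, shin dot, het, patah, resh) is my 
assumption about the intended text. The sketch prints the database name of 
each code point so the choice of marks can be verified. It also shows that 
normalization expands the presentation form back into letter plus mark, so 
the 5-code-point spelling does not survive NFC.

```python
import unicodedata

# Assumed spelling: shin + patah + shin dot, het + patah, resh.
full = "\u05e9\u05b7\u05c1\u05d7\u05b7\u05e8"

# Same word with the first letter written as a presentation form
# (shin with shin dot) followed by patah.
short = "\ufb2a\u05b7\u05d7\u05b7\u05e8"

for ch in full + "\ufb2a":
    print("U+%04X" % ord(ch), unicodedata.name(ch))

# Strip the combining marks to count the base letters.
letters = [c for c in full if unicodedata.combining(c) == 0]

# The presentation form is excluded from composition, so NFC yields the
# fully decomposed spelling rather than keeping the shorter one.
short_nfc = unicodedata.normalize("NFC", short)

print(len(letters), len(full), len(short), len(short_nfc))
```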

Shachar
or shall I say
שחר

