Today's programming challenge - How's your Range-Fu ?

Panke via Digitalmars-d digitalmars-d at puremagic.com
Mon Apr 20 10:48:16 PDT 2015


> This can lead to subtle bugs, cf. length of random and e_one. 
> You have to convert everything to dstring to get the "expected" 
> result. However, this is not always desirable.

There are three things that you need to be aware of when handling 
Unicode: code units, code points and graphemes.

In general, the length measured in one of these guarantees nothing 
about the length measured in another, except for UTF-32, which is 
a 1:1 mapping between code units and code points.
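
A minimal sketch of that difference, assuming a reasonably recent 
Phobos. The helper name codePoints is made up for illustration; 
walkLength on a narrow string decodes as it iterates, so it counts 
code points rather than code units.

```d
import std.range : walkLength;

// Hypothetical helper: number of code points in a UTF-8 string.
size_t codePoints(string s)
{
    return s.walkLength;
}

void main()
{
    // U+1F600 lies outside the Basic Multilingual Plane.
    string  u8  = "\U0001F600";
    wstring u16 = "\U0001F600"w;
    dstring u32 = "\U0001F600"d;

    assert(u8.length  == 4); // four UTF-8 code units
    assert(u16.length == 2); // one surrogate pair in UTF-16
    assert(u32.length == 1); // UTF-32: one code unit per code point

    assert(codePoints(u8) == 1);
    assert(codePoints("h\u00E4user") == 6); // .length would be 7
}
```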

In this thread, we were discussing the relationship between code 
points and graphemes. Your examples, however, apply to the 
relationship between code units and code points.

To measure the columns needed to print a string, you'll need the 
number of graphemes. (w|d)?string.length only gives you the number 
of code units.
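
As a rough sketch of the three counts side by side, using 
std.uni.byGrapheme (the helper name graphemes is made up for 
illustration). Note that even the grapheme count is only an 
approximation of the column width, e.g. for full-width characters.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of grapheme clusters in a string.
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.length == 3);     // code units (1 + 2 in UTF-8)
    assert(s.walkLength == 2); // code points
    assert(graphemes(s) == 1); // what the reader sees as one character
}
```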

If you normalize a string (in the sequence of 
characters/codepoints sense, not object.string) to NFC, it will 
decompose every precomposed character in the string (like é as a 
single code point), establish a defined order between the 
combining characters and then recompose a selected few graphemes 
(like é). This way é always ends up as a single code point in NFC. 
There are dozens of other combinations where you'll still have an 
n:1 mapping between code points and graphemes left after 
normalization.

Example given already in this thread: putting an arrow over a 
Latin letter is typical in math and always takes more than one 
code point.
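
A small sketch of both cases with std.uni.normalize. The helper 
name codePointsAfterNfc is made up for illustration, and U+20D7 
(COMBINING RIGHT ARROW ABOVE) is my assumption for the arrow meant 
above.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: code point count after NFC normalization.
size_t codePointsAfterNfc(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    // 'e' + combining acute recomposes to the precomposed U+00E9.
    string e = normalize!NFC("e\u0301");
    assert(e == "\u00E9");
    assert(e.walkLength == 1); // one code point
    assert(e.length == 2);     // still two UTF-8 code units

    // 'v' + combining arrow has no precomposed form.
    string v = normalize!NFC("v\u20D7");
    assert(v.walkLength == 2);            // two code points remain
    assert(v.byGrapheme.walkLength == 1); // but only one grapheme

    assert(codePointsAfterNfc("e\u0301") == 1);
    assert(codePointsAfterNfc("v\u20D7") == 2);
}
```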


