Today's programming challenge - How's your Range-Fu ?

Mon Apr 20 11:39:54 PDT 2015

On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
>> This can lead to subtle bugs, cf. length of random and e_one. 
>> You have to convert everything to dstring to get the 
>> "expected" result. However, this is not always desirable.
>
> There are three things that you need to be aware of when 
> handling Unicode: code units, code points and graphemes.

This is why I use a helper function that uses byCodePoint and 
byGrapheme. At least for my use cases it returns the correct 
length. However, I might think about an alternative version based 
on the discussion here.
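
Roughly, such a helper could look like the sketch below. This is not my actual code: the name graphemeCount is made up for illustration, and it only uses byGrapheme from std.uni together with walkLength. The asserts show how the three counts differ for an 'e' followed by a combining acute accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper (name chosen for this sketch): counts graphemes,
/// i.e. user-perceived characters, instead of code units.
size_t graphemeCount(const(char)[] s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.length == 3);          // UTF-8 code units: 1 for 'e', 2 for U+0301
    assert(s.walkLength == 2);      // code points (string is auto-decoded to dchar)
    assert(graphemeCount(s) == 1);  // one grapheme

    // In UTF-32, code units and code points coincide.
    assert("e\u0301"d.length == 2);
}
```

Note that a grapheme count is still only an approximation of the column width in a terminal, since some characters occupy two columns.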

> In general the length of one guarantees nothing about the 
> length of the other, except for UTF-32, which is a 1:1 mapping 
> between code units and code points.
>
> In this thread, we were discussing the relationship between 
> code points and graphemes. Your examples however apply to the 
> relationship between code units and code points.
>
> To measure the columns needed to print a string, you'll need 
> the number of graphemes. (d|w)?string.length gives you the 
> number of code units.
>
> If you normalize a string (in the sequence of 
> characters/code points sense, not object.string) to NFC, it will 
> decompose every precomposed character in the string (like é, a 
> single code point), establish a defined order between the 
> combining characters and then recompose a selected few 
> graphemes (like é). This way é always ends up as a single code 
> point in NFC. There are dozens of other combinations where 
> you'll still have an n:1 mapping between code points and 
> graphemes left after normalization.
>
> Example given already in this thread: putting an arrow over a 
> Latin letter is typical in math and always more than one 
> code point.
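
To make the normalization point concrete, here is a small sketch using normalize!NFC from std.uni. The example strings are my own choice: 'e' plus a combining acute accent, which has a precomposed form, and 'a' plus U+20D7 COMBINING RIGHT ARROW ABOVE, which as far as I know does not.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'e' + U+0301 recomposes to the precomposed U+00E9 under NFC.
    string composed = normalize!NFC("e\u0301");
    assert(composed == "\u00E9");
    assert(composed.walkLength == 1);   // one code point
    assert(composed.length == 2);       // but still two UTF-8 code units

    // 'a' + U+20D7 has no precomposed form, so NFC leaves
    // two code points that together form a single grapheme.
    string vec = normalize!NFC("a\u20D7");
    assert(vec.walkLength == 2);
    assert(vec.byGrapheme.walkLength == 1);
}
```

So even after NFC, counting code points only matches the grapheme count for the combinations that happen to have a precomposed character.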