How to detect start of Unicode symbol and count amount of graphemes

Mon Oct 6 10:48:11 PDT 2014

On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> >
> >Have a look here [1]. For example, if you have a code point that is
> >between U+0080 and U+07FF, you know that you need two bytes to encode
> >that whole code point.
> >
> >[1] http://en.wikipedia.org/wiki/UTF-8#Description
> 
> Thanks. I already solved it myself for the UTF-8 encoding. I chose an
> approach using a bitmask. Maybe it is not the most efficient, but
> it works)
> 
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
> 
> If it is true, it means that the first byte of a sequence has been found
> and I can count them. Am I right that this equals the number of
> graphemes, or are there some exceptions to this rule?
> 
> For UTF-32 the number of code units is just equal to the number of
> graphemes. And what about UTF-16? Is it possible to detect the first
> code unit of an encoding sequence?

This looks wrong to me. Are you sure this finds *all* possible
graphemes? The test matches the first byte of each code point, so it
counts code points, not graphemes. Keep in mind that a combining
diacritic sequence is treated as a single grapheme; for example, the
sequence 'A' U+0301 U+0302 U+0303 is four code points but one grapheme.
Several different code point ranges have the combining diacritic
property, and handling them is definitely more complicated than what
you have here.
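
To make the difference concrete, here is a minimal D sketch, assuming a
Phobos version that provides std.uni.byGrapheme. The helper names
countCodePoints and countGraphemes are made up for illustration. It
compares code units, code points, and graphemes for the sequence above:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: walking a string decodes it one code point at a time.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: byGrapheme groups code points into grapheme clusters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'A' followed by combining acute, circumflex, and tilde.
    string s = "A\u0301\u0302\u0303";

    assert(s.length == 7);           // UTF-8 code units: 1 + 3 * 2
    assert(countCodePoints(s) == 4); // what a lead-byte test would count
    assert(countGraphemes(s) == 1);  // one user-perceived character
}
```

A lead-byte test like the one quoted would report 4 for this string,
which is the code point count, not the grapheme count.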

Furthermore, there are more complicated cases, like the Devanagari
sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), which your code
does not appear to handle correctly.

As somebody else has said, it's generally a bad idea to work with
Unicode byte sequences yourself, because Unicode is complicated, and
many apparently simple concepts actually require a lot of care to get
right.
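
As a rough illustration of leaning on the library instead, the sketch
below uses std.utf.count and std.utf.stride from Phobos; the sample
string is an arbitrary choice. It counts code points in UTF-8, UTF-16,
and UTF-32 strings without any hand-written bit tests. Note that this
still counts code points, not graphemes:

```d
import std.utf : count, stride;

void main()
{
    // U+1F600 lies outside the BMP: 4 UTF-8 code units, 2 UTF-16 code units.
    string  s8  = "a\U0001F600";
    wstring s16 = "a\U0001F600"w;
    dstring s32 = "a\U0001F600"d;

    // The number of code units differs per encoding...
    assert(s8.length == 5);
    assert(s16.length == 3);
    assert(s32.length == 2);

    // ...but count reports the same number of code points for all three.
    assert(count(s8) == 2);
    assert(count(s16) == 2);
    assert(count(s32) == 2);

    // stride gives the length, in code units, of the sequence that
    // starts at the given index.
    assert(stride(s8, 1) == 4);  // 4-byte UTF-8 sequence
    assert(stride(s16, 1) == 2); // UTF-16 surrogate pair
}
```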

T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall