How to detect the start of a Unicode symbol and count the number of graphemes

Uranuz via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Oct 6 10:28:43 PDT 2014


>
> Have a look here [1]. For example, if you have a code point that 
> is between U+0080 and U+07FF, you know that you need two bytes to 
> encode that whole code point.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description

Thanks. I already solved it myself for UTF-8 encoding. I chose an 
approach there using a bitmask. It may not be the most efficient, 
but it works:

( str[index] & 0b10000000 ) == 0          || // 0xxxxxxx: single-byte (ASCII) code point
( str[index] & 0b11100000 ) == 0b11000000 || // 110xxxxx: lead byte of a 2-byte sequence
( str[index] & 0b11110000 ) == 0b11100000 || // 1110xxxx: lead byte of a 3-byte sequence
( str[index] & 0b11111000 ) == 0b11110000    // 11110xxx: lead byte of a 4-byte sequence

If this is true, it means I have found the first byte of a sequence, 
and I can count them. Am I right that this count equals the number 
of graphemes, or are there exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of 
graphemes. And what about UTF-16? Is it possible to detect the first 
code unit of an encoding sequence?
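For reference, in UTF-16 a code unit in 0xD800-0xDBFF is a high 
(leading) surrogate, 0xDC00-0xDFFF is a low (trailing) surrogate, and 
everything else encodes a code point on its own, so a "first" code 
unit is anything that is not a trailing surrogate. A small sketch 
(isUtf16Lead is an assumed name):

// True when w is the first code unit of a UTF-16 sequence,
// i.e. anything that is not a low (trailing) surrogate.
bool isUtf16Lead(wchar w)
{
    return w < 0xDC00 || w > 0xDFFF;
}

unittest
{
    assert(isUtf16Lead('A'));     // single-unit code point
    assert(isUtf16Lead(0xD83D));  // high surrogate (start of a pair)
    assert(!isUtf16Lead(0xDC00)); // low surrogate (never first)
}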

