How to detect start of Unicode symbol and count amount of graphemes
Uranuz via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Oct 6 10:28:43 PDT 2014
>
> Have a look here [1]. For example, if you have a code point
> between U+0080 and U+07FF, you know that you need two bytes to
> encode it.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I already solved it myself for UTF-8 encoding. I chose
an approach using bit masks. It may not be the most efficient,
but it works:
( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000
If this is true, it means that the first byte of a sequence has
been found, and I can count them. Am I right that this equals the
number of graphemes, or are there exceptions to this rule?
For UTF-32, the number of code units is simply equal to the
number of code points. And what about UTF-16? Is it possible to
detect the first code unit of an encoding sequence?
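It is: in UTF-16, a code unit begins a new code point unless it is a low (trailing) surrogate in the range 0xDC00-0xDFFF. A hedged C sketch of the same counting idea (the name `countCodePoints16` is mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Count UTF-16 code points: every code unit starts a new code
   point except a low (trailing) surrogate, which matches the
   pattern 110111xxxxxxxxxx, i.e. (u & 0xFC00) == 0xDC00. */
size_t countCodePoints16(const uint16_t *str, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; ++i) {
        if ((str[i] & 0xFC00) != 0xDC00)
            ++count;
    }
    return count;
}
```

For example, U+1F600 is encoded as the surrogate pair 0xD83D 0xDE00: the high surrogate 0xD83D starts the sequence and is counted, the low surrogate 0xDE00 is skipped.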