How to detect the start of a Unicode symbol and count the number of graphemes

anonymous via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Oct 6 11:09:35 PDT 2014


On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
>
> If it is true, it means that the first byte of a sequence was found
> and I can count them. Am I right that this equals the number of
> graphemes, or are there some exceptions to this rule?
>
> For UTF-32, the number of code units is just equal to the number
> of graphemes. And what about UTF-16? Is it possible to detect the
> first code unit of an encoding sequence?

I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the
same for all UTF encodings.
A code point is made up of one or more code units. UTF8: between 1
and 4, UTF16: 1 or 2, UTF32: always 1.
A code unit is made up of a fixed number of bytes. UTF8: 1,
UTF16: 2, UTF32: 4.
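
To make that concrete, here is a small D sketch (using byGrapheme
from std.uni and walkLength from std.range; the combining-accent
string is just an illustrative example) comparing the three counts
for the same text in all three encodings:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by a combining acute accent (U+0301):
    // one grapheme made of two code points.
    string  u8  = "e\u0301";   // UTF-8
    wstring u16 = "e\u0301"w;  // UTF-16
    dstring u32 = "e\u0301"d;  // UTF-32

    writeln(u8.length);                // 3 code units (1 + 2 bytes)
    writeln(u16.length);               // 2 code units
    writeln(u32.length);               // 2 code units == 2 code points
    writeln(u8.walkLength);            // 2 code points
    writeln(u8.byGrapheme.walkLength); // 1 grapheme
}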

So, the number of UTF8 bytes in a sequence has no relation to
graphemes. For a multi-byte sequence, the number of leading ones in
the start byte is equal to the total number of bytes in that
sequence (a start byte beginning with 0 means a one-byte sequence).
I.e. when you see a byte of the form 0b1110_xxxx, the following two
bytes should be continuation bytes (0b10xx_xxxx), and the three of
them together encode a *code point*.
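
In other words, counting start bytes gives you code points. A
minimal sketch of that counting (countCodePoints is just an
illustrative name; it uses representation from std.string to walk
the raw bytes):

import std.stdio : writeln;
import std.string : representation;

// Counts UTF-8 code points by counting every byte that is *not* a
// continuation byte (continuation bytes have the form 0b10xx_xxxx).
// This mirrors the start-byte test quoted above, but the result is
// the number of code points, not graphemes.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (b; s.representation)
    {
        if ((b & 0b1100_0000) != 0b1000_0000)
            ++n;
    }
    return n;
}

void main()
{
    writeln(countCodePoints("aü€"));     // 3 code points in 6 bytes
    writeln(countCodePoints("e\u0301")); // 2 code points, but 1 grapheme
}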

And in UTF32, the number of code units is equal to the number of
*code points*, not graphemes.

