Accented Characters and Counting Syllables
H. S. Teoh via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Sat Dec 6 15:09:43 PST 2014
On Sat, Dec 06, 2014 at 10:37:17PM +0000, "Nordlöw" via Digitalmars-d-learn wrote:
> Given the fact that
>
> static assert("é".length == 2);
>
> I was surprised that
>
> static assert("é".byCodeUnit.length == 2);
> static assert("é".byCodePoint.length == 2);
>
> Isn't there a way to iterate over accented characters (in my case
> UTF-8) in D? Or is this an inherent problem in Unicode? I need this in
> a syllable counting algorithm that needs to distinguish accented and
> non-accented variants of vowels. For example café (2 syllables)
> compared to babe (one syllable).
This is a Unicode issue. What you want is neither byCodeUnit nor
byCodePoint, but byGrapheme (in std.uni). A grapheme is the Unicode
equivalent of what lay people would call a "character". A Unicode
character (or more precisely, a "code point") is not necessarily a
complete grapheme, as your example above shows; it's just a numerical
value that uniquely identifies an entry in the Unicode character
database. For example, é can be written either as the single code point
U+00E9 or as a plain e followed by the combining acute accent U+0301,
and both forms are one grapheme.
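
Here is a minimal sketch of counting graphemes, assuming the input is
valid UTF-8. The helper name graphemeCount is made up for illustration
and is not part of Phobos. The range returned by byGrapheme is walked
with walkLength to count its elements. The literals use \u escapes so
that the precomposed and decomposed forms of é are unambiguous
regardless of how the source file is saved.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts graphemes (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    // "café" with a precomposed é (U+00E9).
    string composed = "caf\u00E9";
    // "café" with e followed by a combining acute accent (U+0301).
    string decomposed = "cafe\u0301";

    // .length on a string counts UTF-8 code units, so the forms differ.
    assert(composed.length == 5);
    assert(decomposed.length == 6);

    // Both forms are four graphemes.
    assert(graphemeCount(composed) == 4);
    assert(graphemeCount(decomposed) == 4);
}
```

Something like "dmd -unittest -main -run file.d" should run the checks.
For syllable counting you would then examine each grapheme rather than
each code unit, so an accented vowel is seen as one unit whichever way
it happens to be encoded.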
T
--
There are 10 kinds of people in the world: those who can count in binary, and those who can't.