Accented Characters and Counting Syllables

Sat Dec 6 15:09:43 PST 2014

On Sat, Dec 06, 2014 at 10:37:17PM +0000, "Nordlöw" via Digitalmars-d-learn wrote:
> Given the fact that
> 
>     static assert("é".length == 2);
> 
> I was surprised that
> 
>     static assert("é".byCodeUnit.length == 2);
>     static assert("é".byCodePoint.length == 2);
> 
> Isn't there a way to iterate over accented characters (in my case
> UTF-8) in D? Or is this an inherent problem in Unicode? I need this in
> a syllable counting algorithm that needs to distinguish accented and
> non-accented variants of vowels. For example, café (two syllables)
> compared to babe (one syllable).

This is a Unicode issue. What you want is neither byCodeUnit nor
byCodePoint, but byGrapheme (in std.uni). A grapheme is the Unicode
equivalent of what laypeople would call a "character". A Unicode
character (or more precisely, a "code point") is not necessarily a
complete grapheme, as your example above shows; it's just a numerical
value that uniquely identifies an entry in the Unicode character
database. A single grapheme such as "é" may be encoded either as one
precomposed code point or as a base letter followed by a combining
accent.
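
A minimal sketch of the difference, assuming a reasonably recent Phobos
with std.uni.byGrapheme. The helper name graphemeCount is made up for
illustration, and the two spellings of "é" are written with explicit
escapes so the encoding does not depend on how an editor saved the file:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: counts user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    // byGrapheme is not a random-access range, so walk it to count.
    return s.byGrapheme.walkLength;
}

void main()
{
    // "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "e\u0301";
    // "é" as the single precomposed code point U+00E9.
    string precomposed = "\u00E9";

    assert(decomposed.length == 3);         // UTF-8 code units
    assert(decomposed.walkLength == 2);     // code points (auto-decoded)
    assert(graphemeCount(decomposed) == 1); // one grapheme

    assert(precomposed.length == 2);        // UTF-8 code units
    assert(precomposed.walkLength == 1);    // code points
    assert(graphemeCount(precomposed) == 1);
}
```

Either spelling counts as one grapheme, so a syllable counter that
iterates with byGrapheme sees "é" as a single unit and can compare it
against its own list of accented vowels.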

T

-- 
There are 10 kinds of people in the world: those who can count in binary, and those who can't.