Major performance problem with std.array.front()

Michel Fortin michel.fortin at michelf.ca
Sat Mar 8 18:15:58 PST 2014


On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

>> Graphemes do not appear to have a 1:1 mapping with dchars, and any
>> attempt to do so would likely be a giant mistake.
> 
> I think they may be comparable to dchar.

Dchars, aka code points, are much more clearly defined than graphemes. A 
quick search shows me there's more than one way to segment a string 
into graphemes. There are the legacy and extended boundary algorithms for 
general processing, and then there are some tailored algorithms that 
can segment code points differently depending on the locale, or other 
considerations.
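
For illustration, here is a minimal sketch of the gap between code points 
and graphemes, written in Python rather than D purely for brevity. The 
helper approx_grapheme_count is hypothetical: it only attaches combining 
marks to the preceding code point. It is not the UAX #29 algorithm and 
ignores Hangul jamo, emoji sequences, and any locale tailoring.

```python
import unicodedata

def approx_grapheme_count(text):
    """Count code points that are not combining marks.

    Crude approximation for illustration only, NOT UAX #29 segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

s = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

code_points = len(s)                         # Python str length counts code points
approx_graphemes = approx_grapheme_count(s)  # base letter plus its accent count once
```

Here code_points is 2 while approx_graphemes is 1, so even this simplest 
case shows the two counts disagreeing.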

Reference:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

There are three examples of locale-specific graphemes in the table in 
the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch 
is a digraph in the Latin script. It is treated as a letter of its own 
in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish, 
Breton and Belarusian Łacinka alphabets."
https://en.wikipedia.org/wiki/Ch_(digraph)
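
A rough sketch of what such tailoring means for counting, again in 
Python. The DIGRAPHS table, the locale tags, and count_letters are all 
hypothetical names made up for this example; only the fact that "ch" is a 
single letter in Czech and Slovak comes from the quote above. A real 
implementation would take its tailoring data from something like CLDR.

```python
# Hypothetical tailoring table: locale tag -> digraphs treated as one letter.
# Only the "ch" entries for Czech and Slovak are taken from the text above.
DIGRAPHS = {"cs": ("ch",), "sk": ("ch",)}

def count_letters(word, locale=None):
    """Count letters, treating locale-specific digraphs as single letters."""
    digraphs = DIGRAPHS.get(locale, ())
    lowered = word.lower()
    count = 0
    i = 0
    while i < len(lowered):
        # Consume a whole digraph if one starts here, else one code point.
        match = next((d for d in digraphs if lowered.startswith(d, i)), None)
        i += len(match) if match else 1
        count += 1
    return count

untailored = count_letters("chata")     # no locale: c, h, a, t, a
czech = count_letters("chata", "cs")    # Czech tailoring: ch, a, t, a
```

The same string gives 5 without tailoring and 4 with the Czech rule, so 
the caller has to say which answer it wants.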

Also, there are some code points that represent ligatures (such as “fl”), 
which are in theory two graphemes. I'm not sure what the general 
algorithm does with that, but depending on what you're doing 
(counting characters? spell checking?) you might want to split it in 
two.
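
A small Python sketch of the ligature case, assuming the code point meant 
above is U+FB02. It does not show what the general segmentation algorithm 
does; it only shows that the ligature is one code point as stored, and 
that compatibility normalization (NFKC) is one way to split it in two 
when that is what the task calls for.

```python
import unicodedata

ligature = "\ufb02"  # assumed to be the ligature meant above

name = unicodedata.name(ligature)                # official Unicode character name
as_stored = len(ligature)                        # a single code point
split = unicodedata.normalize("NFKC", ligature)  # compatibility form: two letters
```

Here name is "LATIN SMALL LIGATURE FL", as_stored is 1, and split is the 
two-letter string "fl". Whether to normalize first is a decision for the 
caller, for example yes for spell checking and perhaps no for preserving 
the original text.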

So basically you just can't make an algorithm capable of counting 
letters/graphemes/characters in a universal fashion. There's no such 
thing as a universal grapheme segmentation algorithm, even though there 
is a general one. It'd be wise for any API to expose this subtlety 
whenever segmenting graphemes.

Text is an interesting topic for never-ending discussions.

-- 
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca



More information about the Digitalmars-d mailing list