Major performance problem with std.array.front()
Michel Fortin
michel.fortin at michelf.ca
Sat Mar 8 18:15:58 PST 2014
On 2014-03-08 23:50:43 +0000, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
>> Graphemes do not appear to have a 1:1 mapping with dchars, and any
>> attempt to do so would likely be a giant mistake.
>
> I think they may be comparable to dchar.
Dchars, aka code points, are much more clearly defined than graphemes. A
quick search shows me there's more than one way to segment a string
into graphemes. There's the legacy and extended boundary algorithms for
general processing, and then there are some tailored algorithms that
can segment code points differently depending on the locale, or other
considerations.
Reference:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
There are three examples of locale-specific graphemes in the table in
the section linked above. "Ch" is one of them. Quoting Wikipedia: "Ch
is a digraph in the Latin script. It is treated as a letter of its own
in Chamorro, Czech, Slovak, Igbo, Quechua, Guarani, Welsh, Cornish,
Breton and Belarusian Łacinka alphabets."
https://en.wikipedia.org/wiki/Ch_(digraph)
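To make the difference concrete, here is a minimal Python sketch, not the UAX #29 algorithm. It assumes a deliberately simplified rule (a cluster is a base code point plus any combining marks that follow it) and adds a hypothetical `digraphs` parameter to stand in for locale tailoring. The function name `segment` and the sample word are illustrative only.

```python
import unicodedata


def segment(text, digraphs=()):
    """Split text into approximate user-perceived characters.

    Simplified for illustration: a cluster is a base code point plus any
    combining marks that follow it. This is NOT the full UAX #29 algorithm.
    `digraphs` is a hypothetical tailoring hook: lowercase letter pairs
    that the caller's locale treats as a single letter.
    """
    clusters = []
    i = 0
    while i < len(text):
        # Tailored rule: a locale digraph such as "ch" counts as one unit.
        pair = text[i:i + 2]
        if len(pair) == 2 and pair.lower() in digraphs:
            cluster = pair
        else:
            cluster = text[i]
        i += len(cluster)
        # General rule (simplified): absorb trailing combining marks.
        while i < len(text) and unicodedata.combining(text[i]):
            cluster += text[i]
            i += 1
        clusters.append(cluster)
    return clusters


accented = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
word = "chata"         # a word that begins with the "ch" digraph

print(len(accented), len(segment(accented)))   # code points vs. clusters
print(segment(word))                           # untailored segmentation
print(segment(word, digraphs={"ch"}))          # with "ch" as one letter
```

The same five code points yield five units without tailoring and four with it, so the count depends on a choice the caller has to make.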
Also, there are some code points that represent ligatures (such as “ﬂ”,
U+FB02), which are in theory two graphemes. I'm not sure what the general
algorithm does with that, but depending on what you're doing
(counting characters? spell checking?) you might want to split it in
two.
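One way to split such a ligature, sketched here in Python, is compatibility normalization (NFKC), which expands U+FB02 into the two letters "f" and "l". Whether that is appropriate depends on the task; this only illustrates that the letter count changes with the chosen treatment, and the sample string is made up for the example.

```python
import unicodedata

ligature = "\ufb02"            # LATIN SMALL LIGATURE FL, a single code point
text = ligature + "oat"        # displays like "float"

# NFC leaves the ligature alone; NFKC applies its compatibility
# decomposition and yields the separate letters "f" and "l".
kept = unicodedata.normalize("NFC", text)
split = unicodedata.normalize("NFKC", text)

print(len(kept), len(split))   # code point counts before and after splitting
```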
So basically you just can't make an algorithm capable of counting
letters/graphemes/characters in a universal fashion. There's no such
thing as a universal grapheme segmentation algorithm, even though there
is a general one. It'd be wise for any API to expose this subtlety
whenever segmenting graphemes.
Text is an interesting topic for never-ending discussions.
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
More information about the Digitalmars-d
mailing list