Major performance problem with std.array.front()
Peter Alexander
peter.alexander.au at gmail.com
Sat Mar 8 08:30:42 PST 2014
On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev
wrote:
> On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu
> wrote:
>> Why? Couldn't the grapheme 'compare true' with the character?
>> I.e. the byGrapheme iteration normalizes on the fly.
>
> Grapheme segmentation and normalization are distinct Unicode
> algorithms:
>
> http://www.unicode.org/reports/tr15/
> http://www.unicode.org/reports/tr29/
>
> There are also several normalization algorithms.
>
> http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

How about this?

    s.normalize!NFKD

To return a range of normalized code points?

Clearly, no definition of string can handle this natively. As you
say, there are multiple algorithms, so there is no one 'right'
answer. byGrapheme is useful, but doesn't and cannot solve the
normalization issue.
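
To make the distinction concrete, here is a minimal sketch using what std.uni offers today. Note the assumption: as far as I know the existing std.uni.normalize returns a newly allocated string rather than the lazy range suggested above, so this only illustrates the semantics, not the proposed API. The helper names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC, NFD, NFKD;

// Hypothetical helper: number of grapheme clusters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: equality after normalizing both sides to NFC.
bool equalAfterNFC(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "\u00E9";   // e-acute as one code point
    string decomposed = "e\u0301";  // 'e' followed by a combining acute accent

    // Segmentation sees one grapheme in both cases...
    assert(graphemeCount(composed) == 1);
    assert(graphemeCount(decomposed) == 1);

    // ...but the contents still differ until they are normalized.
    assert(composed != decomposed);
    assert(normalize!NFD(composed) == decomposed);
    assert(equalAfterNFC(composed, decomposed));

    // Different forms give different answers: NFKD also applies
    // compatibility mappings, e.g. the "fi" ligature U+FB01.
    assert(normalize!NFD("\uFB01") == "\uFB01");
    assert(normalize!NFKD("\uFB01") == "fi");
}
```

So byGrapheme tells you where the boundaries are, and the choice of normalization form decides what counts as equal.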

I feel this discussion is tangential to the main debate: whether
strings should be ranges of code points or code units. By code
unit is faster by default, and simpler to implement in Phobos (no
more special-case code for strings). By code point works better
when searching for individual code points, but as you rightly
point out this might not be useful in practice, as you rarely
search for individual non-ASCII code points, and it isn't a
complete solution anyway because of normalization.
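
For anyone following along, the two views can be seen with plain foreach, which decodes or not depending on the declared element type. This is only a sketch of the difference being debated, with hypothetical helper names; it says nothing about which default is right.

```d
import std.range : walkLength;

// Counts UTF-8 code units: what a by-code-unit range would iterate.
size_t codeUnitCount(string s)
{
    size_t n;
    foreach (char c; s)   // no decoding, one step per code unit
        ++n;
    return n;
}

// Counts code points: what the current range interface iterates.
size_t codePointCount(string s)
{
    size_t n;
    foreach (dchar c; s)  // UTF-8 is decoded on the fly
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo";           // e-acute takes two UTF-8 code units

    assert(s.length == 6);             // .length is always in code units
    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);
    assert(s.walkLength == 5);         // Phobos treats string as a range of dchar
}
```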

There are a few problems with by code unit:

1. Searching a string/wstring for a dchar fails silently, because
the needle is compared against individual code units. You have
suggested making this a compilation error, but Andrei argues this
would break lots of code. You counter that people may rarely
search for a dchar anyway, so it may not matter.

2. It's a fundamental change. Regardless of which is better, we
need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, which
means they will be accepted by algorithms such as sort, which
will just produce garbage strings. Maybe this isn't an issue.
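
Points 1 and 3 can be sketched with today's library by working on the code units by hand. The helpers below are hypothetical and only simulate what a by-code-unit string would do; they are not a proposed Phobos API.

```d
import std.algorithm : sort;
import std.utf : validate, UTFException;

// Hypothetical helper: compares every UTF-8 code unit against a dchar needle.
size_t countCodeUnitMatches(string haystack, dchar needle)
{
    size_t n;
    foreach (char c; haystack)
        if (c == needle) ++n;
    return n;
}

// Hypothetical helper: the same search, but over decoded code points.
size_t countCodePointMatches(string haystack, dchar needle)
{
    size_t n;
    foreach (dchar c; haystack)
        if (c == needle) ++n;
    return n;
}

// Sorts the code units of s and reports whether the result is still valid UTF-8.
bool sortedUnitsAreValidUtf8(string s)
{
    auto units = cast(ubyte[]) s.dup;
    sort(units);
    try
    {
        validate(cast(char[]) units);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "h\u00E9llo";  // e-acute is encoded as 0xC3 0xA9

    // Point 1: the needle is missed without any error...
    assert(countCodeUnitMatches(s, '\u00E9') == 0);
    assert(countCodePointMatches(s, '\u00E9') == 1);
    // ...and U+00C3 even matches the lead byte of e-acute.
    assert(countCodeUnitMatches(s, '\u00C3') == 1);
    assert(countCodePointMatches(s, '\u00C3') == 0);

    // Point 3: sorting code units is harmless for ASCII, garbage otherwise.
    assert(sortedUnitsAreValidUtf8("hello"));
    assert(!sortedUnitsAreValidUtf8(s));
}
```

The false positive on U+00C3 is arguably worse than the miss, since both go unreported.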