Major performance problem with std.array.front()

Peter Alexander peter.alexander.au at gmail.com
Sat Mar 8 08:30:42 PST 2014


On Saturday, 8 March 2014 at 16:00:38 UTC, Vladimir Panteleev 
wrote:
> On Saturday, 8 March 2014 at 15:33:34 UTC, Andrei Alexandrescu 
> wrote:
>> Why? Couldn't the grapheme compare true with the character? 
>> I.e. the byGrapheme iteration normalizes on the fly.
>
> Grapheme segmentation and normalization are distinct Unicode 
> algorithms:
>
> http://www.unicode.org/reports/tr15/
> http://www.unicode.org/reports/tr29/
>
> There are also several normalization algorithms.
>
> http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

How about this?

s.normalize!NFKD

To return a range of normalized code points?

Clearly, no definition of string can handle this natively. As you 
say, there are multiple algorithms, so there is no one 'right' 
answer. byGrapheme is useful, but doesn't and cannot solve the 
normalization issue.
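
To make the normalization point concrete, here is a minimal sketch using std.uni as it exists in Phobos today, where normalize returns a new array rather than the lazy range floated above. The helper name canonicallyEqual is made up for illustration, and NFC is an arbitrary choice of form.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: compare two strings after bringing both to the
// same normalization form (NFC here; the choice of form is an assumption).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00E9";    // precomposed U+00E9
    string decomposed = "e\u0301"; // 'e' followed by combining acute U+0301

    // byGrapheme sees one user-perceived character in each string...
    assert(composed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // ...but segmentation does not normalize, so the strings still differ.
    assert(composed != decomposed);

    // Only an explicit normalization step makes them compare equal.
    assert(canonicallyEqual(composed, decomposed));
}
```

Picking a compatibility form such as NFKD instead would fold more strings together, which is why no single default can be the right one.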

I feel this discussion is tangential to the main debate: whether 
strings should be ranges of code points or code units. By code 
unit is faster by default, and simpler to implement in Phobos (no 
more special code). By code point works better when searching for 
individual code points, but as you rightly point out this might 
not be useful in practice as you rarely search for individual 
non-ASCII code points, and it isn't a complete solution anyway 
because of normalization.
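
For reference, a small sketch of the two views as Phobos exposes them today: indexing and .length work on code units, while the range primitives decode to code points. The example string is my own choice.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.range.primitives : ElementType;

void main()
{
    string s = "caf\u00E9"; // U+00E9 takes two UTF-8 code units

    // Indexing and .length operate on code units.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 5);

    // The range primitives decode, so the element type is dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 4);

    // Searching for a non-ASCII code point works in the by-code-point view.
    dchar eacute = '\u00E9';
    assert(s.canFind(eacute));
}
```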

There are a few problems with by code unit:

1. Searching string/wstring for dchar fails silently. You have 
suggested making this a compilation error, but Andrei argues this 
would break lots of code. You counter that it's possible that 
people rarely search for dchar anyway, so it may not matter.

2. It's a fundamental change. Regardless of which is better, we 
need to consider the impact of such a change.

3. Ranges of code units are random access and sliceable, which 
means they will be accepted by algorithms such as sort, which 
will just produce garbage strings. Maybe this isn't an issue.
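
Problems 1 and 3 can be sketched as follows. The helper codeUnitsContain is hypothetical: it is a hand-rolled stand-in for what a by-code-unit search would do, not an existing Phobos function. The sort is run on the raw ubyte[] representation to imitate a random-access range of code units.

```d
import std.algorithm.searching : canFind;
import std.algorithm.sorting : sort;
import std.exception : assertThrown;
import std.range.primitives : isRandomAccessRange;
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: compares every UTF-8 code unit against a dchar,
// which is what a search over code units would amount to.
bool codeUnitsContain(string s, dchar needle)
{
    foreach (char unit; s)
        if (unit == needle)
            return true;
    return false;
}

void main()
{
    string s = "caf\u00E9";
    dchar eacute = '\u00E9';

    // Problem 1: the code point is present, but no single code unit
    // equals it, so the search quietly reports "not found".
    assert(s.canFind(eacute));
    assert(!codeUnitsContain(s, eacute));

    // Problem 3: today char[] is not a random-access range, so sort
    // rejects it, whereas a range of raw code units would be accepted.
    static assert(!isRandomAccessRange!(char[]));
    static assert(isRandomAccessRange!(ubyte[]));

    // Sorting the code units compiles and runs, but the result is no
    // longer valid UTF-8.
    ubyte[] units = s.dup.representation;
    sort(units);
    assertThrown!UTFException(validate(cast(char[]) units));
}
```

Whether the sort case matters in practice is the open question; the sketch only shows that nothing stops it once code units are random access.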

