Major performance problem with std.array.front()
Peter Alexander
peter.alexander.au at gmail.com
Sun Mar 9 07:57:30 PDT 2014
On Sunday, 9 March 2014 at 13:00:46 UTC, monarch_dodra wrote:
> IMO, the "normalization" argument is overrated. I've yet to
> encounter a real-world case of normalization: only hand-written
> counter-examples. Not saying it doesn't exist, just that:
> 1. It occurs only in special cases that the program should be
> aware of beforehand.
> 2. It can arguably be taken care of eagerly, or in a special pass.
>
> As for "the belief that iterating by code point has utility." I
> have to strongly disagree. Unicode is composed of codepoints,
> and that is what we handle. The fact that it can be encoded
> and stored as UTF is an implementation detail.
We don't "handle" code points (when have you ever wanted to
handle a combining character separate to the character it
combines with?)
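To make that concrete, here's a minimal sketch (U+0301 is
COMBINING ACUTE ACCENT):

    import std.stdio : writefln, writeln;
    import std.uni : byGrapheme;
    import std.range : walkLength;

    void main()
    {
        // "é" spelled as 'e' followed by a combining accent
        string s = "e\u0301";

        // By code point, the accent arrives on its own:
        foreach (dchar c; s)
            writefln("U+%04X", c);  // U+0065, then U+0301

        // By grapheme, it's one user-perceived character:
        writeln(s.byGrapheme.walkLength);  // 1
    }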
You are just thinking of a subset of languages and locales.
Normalization is an issue any time you have a user enter text
into your program and you then want to search for that text. I
hope we can agree this isn't a rare occurrence.
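A sketch of the kind of mismatch I mean, assuming the document
stores the decomposed form while the user types the precomposed
one:

    import std.algorithm : canFind;
    import std.uni : normalize, NFC;

    void main()
    {
        // The same word, stored two ways: decomposed ('e' plus
        // U+0301) versus precomposed (U+00E9).
        string stored = "re\u0301sume\u0301";
        string typed  = "r\u00E9sum\u00E9";

        // A straight code point (or code unit) search misses:
        assert(!stored.canFind(typed));

        // Normalizing both sides first finds the match:
        assert(normalize!NFC(stored).canFind(normalize!NFC(typed)));
    }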
>> AFAIK, there is only one exception, stuff like s.all!(c => c
>> == 'é'), but as Vladimir correctly points out: (a) by code
>> point, this is still broken in the face of normalization, and
>> (b) are there any real applications that search a string for a
>> specific non-ASCII character?
>
> But *what* other kinds of algorithms are there? AFAIK, the
> *only* type of algorithm that doesn't need decoding is
> searching, and you know what? std.algorithm.find does it
> perfectly well. This trickles into most other algorithms too:
> split, splitter, and findAmong don't decode if they don't have
> to.
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
I can't think of a single use case for searching for a non-ASCII
code point. You can search for strings, but searching by code
unit is just as good (and fast by default).
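For instance, a minimal sketch of substring search over raw code
units:

    import std.algorithm : find;
    import std.string : representation;

    void main()
    {
        string s = "hello wörld";

        // UTF-8 is self-synchronizing: a byte-level match can never
        // start or end mid-character, so searching code units gives
        // the same answer as searching decoded code points.
        auto hit = s.representation.find("wörld".representation);
        assert(hit.length == "wörld".length);  // found, no decoding
    }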
> AFAIK, the most common algorithm "case insensitive search"
> *must* decode.
But it must also normalize and take locales into account, so by
code point is insufficient (unless you are willing to ignore
languages like Turkish). See the Turkish I:
http://en.wikipedia.org/wiki/Turkish_I
Sure, if you just want to ignore normalization and several
languages, then by code point is just fine... but that's the
point: by code point is incorrect in general.
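To illustrate with what Phobos itself does today (a sketch;
std.uni.toLower applies Unicode's locale-free default mapping):

    import std.uni : toLower;

    void main()
    {
        // The default, locale-free case folding:
        assert("I".toLower == "i");

        // In Turkish, 'I' lowercases to 'ı' (U+0131, dotless i) and
        // 'İ' (U+0130) lowercases to 'i'. Folding code point by code
        // point can't know what language it's looking at, so the
        // assertion above is simply the wrong answer for Turkish.
    }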
> There may still be cases where it does not work as intended
> in the face of normalization, but it is still leaps and bounds
> better than what we get iterating by code unit.
>
> To turn it the other way around, *what* are you guys doing
> that doesn't require decoding, and where performance is such a
> killer?
Searching, equality testing, copying, sorting, hashing,
splitting, joining...
The performance thing can be fixed in the library, but my concern
is that (a) it takes a significant amount of code to do so, and
(b) it complicates implementations. There are many, many
algorithms in Phobos that are special-cased for strings, and I
don't think it needs to be that way.
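Here's a hypothetical sketch of the pattern I mean (firstSpace is
invented for illustration, not actual Phobos source):

    import std.algorithm : countUntil;
    import std.string : representation;
    import std.traits : isSomeString;

    // The generic path auto-decodes, so a separate string branch
    // gets written just to avoid that overhead.
    ptrdiff_t firstSpace(R)(R r)
    {
        static if (isSomeString!R)
            return r.representation.countUntil(' '); // code-unit path
        else
            return r.countUntil(' ');                // generic path
    }

    void main()
    {
        assert(firstSpace("héllo world") == 6); // 'é' is 2 code units
    }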
>> To those that think the status quo is better, can you give an
>> example of a real-life use case that demonstrates this?
>
> I do not know of a single bug report with regard to buggy
> Phobos code that used front/popFront. Not_a_single_one (AFAIK).
>
> On the other hand, there are plenty of bugs caused by
> attempting not to decode strings, or by decoding them
> incorrectly. They are being corrected on a continuous basis.
Can you provide a link to a bug?
Also, you haven't answered the question :-) Can you give a
real-life example of a case where code point decoding was
necessary and code units wouldn't have sufficed?
You have mentioned case-insensitive searching, but I think I've
adequately demonstrated that this doesn't work in general by code
point: you need to normalize and take locales into account.