Major performance problem with std.array.front()

monarch_dodra monarchdodra at gmail.com
Sun Mar 9 06:00:45 PDT 2014


On Sunday, 9 March 2014 at 11:34:31 UTC, Peter Alexander wrote:
> On Sunday, 9 March 2014 at 08:32:09 UTC, monarch_dodra wrote:
>> On topic, I think D's implicit default decode to dchar is 
>> *infinity* times better than C++'s char-based strings. While 
>> imperfect in terms of grapheme, it was still a design decision 
>> made of win.
>>
>> I'd be tempted to not ask "how do we back out", but rather, 
>> "how can we take this further"? I'd love to ditch the whole 
>> "char"/"dchar" thing altogether, and work with graphemes. But 
>> that would be massive involvement.
>
> Why do you think it is better?
>
> Let's be clear here: if you are searching/iterating/comparing 
> by code point then your program is either not correct, or no 
> better than doing so by code unit. Graphemes don't really fix 
> this either.
>
> I think this is the main confusion: the belief that iterating 
> by code point has utility.
>
> If you care about normalization then neither by code unit, by 
> code point, nor by grapheme are correct (except in certain 
> language subsets).
>
> If you don't care about normalization then by code unit is just 
> as good as by code point, but you don't need to specialise 
> everywhere in Phobos.

IMO, the "normalization" argument is overrated. I've yet to 
encounter a real-world case where normalization mattered: only 
hand-written counter-examples. I'm not saying it doesn't exist, 
just that:
1. It occurs only in special cases that the program should be 
aware of beforehand.
2. Arguably, it should be taken care of eagerly, or in a special 
pass.

As for "the belief that iterating by code point has utility", I 
have to strongly disagree. Unicode is composed of code points, and 
that is what we handle. The fact that it can be encoded and 
stored as UTF is an implementation detail.
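
To make the code unit / code point distinction concrete, here is a 
minimal sketch. It assumes a Phobos where narrow strings still 
auto-decode; the helper names codeUnitCount and codePointCount are 
made up for illustration.

```d
import std.range.primitives : front, walkLength;
import std.stdio : writeln;

/// Number of UTF-8 code units (bytes) in `s`. Hypothetical helper.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Number of code points in `s`; walkLength decodes as it walks.
/// Hypothetical helper.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    // U+00E9 (precomposed 'é') takes two code units in UTF-8.
    string s = "h\u00E9llo";

    // front on a narrow string decodes and yields a dchar.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'h');

    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);

    // foreach lets the caller pick the level explicitly.
    size_t units, points;
    foreach (char c; s) ++units;
    foreach (dchar c; s) ++points;
    writeln(units, " code units, ", points, " code points");
}
```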

As for the grapheme thing, I'm not actually so sure about it 
myself, so don't take it too seriously.

> AFAIK, there is only one exception, stuff like s.all!(c => c == 
> 'é'), but as Vladimir correctly points out: (a) by code point, 
> this is still broken in the face of normalization, and (b) are 
> there any real applications that search a string for a specific 
> non-ASCII character?

But *what* other kinds of algorithms are there? AFAIK, the *only* 
type of algorithm that doesn't need decoding is searching, and 
you know what? std.algorithm.find does it perfectly well. This 
trickles into most other algorithms too: split, splitter or 
findAmong don't decode if they don't have to.
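
As a rough illustration from the caller's side, searching and 
splitting on plain strings return slices of the original string, 
with non-ASCII content intact. This sketch only shows the results; 
whether a given overload skips decoding internally is a Phobos 
implementation detail it does not prove. tailFrom is a made-up 
helper name.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.algorithm.searching : find;

/// Slice of `haystack` starting at the first match of `needle`,
/// or an empty slice if there is none. Hypothetical helper.
string tailFrom(string haystack, string needle)
{
    return haystack.find(needle);
}

void main()
{
    string s = "cr\u00E8me br\u00FBl\u00E9e";

    // The result is a slice of the original UTF-8 string.
    assert(tailFrom(s, "br\u00FB") == "br\u00FBl\u00E9e");
    assert(tailFrom(s, "xyz").length == 0);

    // Splitting on an ASCII separator leaves multi-byte characters whole.
    assert("a,\u00E9,b".splitter(',').equal(["a", "\u00E9", "b"]));
}
```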

AFAIK, the most common algorithm, case-insensitive search, *must* 
decode.
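
A small sketch of why: std.uni.icmp compares with case folding 
applied per code point, while a byte-wise ASCII lowering never 
touches the bytes of a multi-byte character. Both helper names 
below are made up for illustration.

```d
import std.ascii : toLower;
import std.uni : icmp;

/// Case-insensitive equality via std.uni.icmp, which decodes.
/// Hypothetical helper.
bool equalIgnoringCase(string a, string b)
{
    return icmp(a, b) == 0;
}

/// Case-insensitive equality on raw code units, ASCII rules only.
/// Hypothetical helper, shown as the non-decoding alternative.
bool asciiOnlyEqualIgnoringCase(string a, string b)
{
    if (a.length != b.length)
        return false;
    foreach (i; 0 .. a.length)
    {
        if (toLower(a[i]) != toLower(b[i]))
            return false;
    }
    return true;
}

void main()
{
    string upper = "\u00C9COLE"; // starts with U+00C9
    string lower = "\u00E9cole"; // starts with U+00E9

    assert(equalIgnoringCase(upper, lower));

    // The code units of U+00C9 and U+00E9 differ and are not ASCII
    // letters, so ASCII lowering cannot make them match.
    assert(!asciiOnlyEqualIgnoringCase(upper, lower));

    // For pure ASCII input the two approaches agree.
    assert(asciiOnlyEqualIgnoringCase("ABC", "abc"));
}
```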

There may still be cases where it is still not working as 
intended in the face of normalization, but it is still leaps and 
bounds better than what we get iterating with codeunits.

To turn it the other way around, *what* are you guys doing, that 
doesn't require decoding, and where performance is such a killer?

> To those that think the status quo is better, can you give an 
> example of a real-life use case that demonstrates this?

I do not know of a single bug report about buggy Phobos code 
that used front/popFront. Not_a_single_one (AFAIK).

On the other hand, there are plenty of bugs caused by attempting 
to not decode strings, or by decoding them incorrectly. They are 
being corrected on a continuous basis.

Seriously, Bearophile suggested "ABCD".sort(), and it took about 
6 pages (!) for someone to point out this would be wrong. Even 
Walter pointed out that such code should work. *Maybe* it is 
still wrong in regards to graphemes and normalization, but at 
*least*, the result is not a corrupted UTF-8 stream.
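
To show what "corrupted" means here, a minimal sketch that sorts 
the same text once by code point and once by raw code unit. The 
helper names are made up, and this is not the code from the 
original "ABCD".sort() discussion.

```d
import std.algorithm.sorting : sort;
import std.conv : to;
import std.exception : assertThrown;
import std.string : representation;
import std.utf : UTFException, validate;

/// Sorts by code point, so every character stays intact.
/// Hypothetical helper.
string sortedByCodePoint(string s)
{
    dchar[] points = s.to!dstring.dup;
    sort(points);
    return points.to!string;
}

/// Sorts the raw UTF-8 code units, which can tear multi-byte
/// sequences apart. Hypothetical helper.
ubyte[] sortedByCodeUnit(string s)
{
    ubyte[] units = s.representation.dup;
    sort(units);
    return units;
}

void main()
{
    string s = "\u00E9a"; // bytes: C3 A9 61

    assert(sortedByCodePoint(s) == "a\u00E9");

    ubyte[] expected = [0x61, 0xA9, 0xC3];
    auto units = sortedByCodeUnit(s);
    assert(units == expected);

    // A continuation byte (A9) now comes before its lead byte (C3),
    // so the result is no longer valid UTF-8.
    assertThrown!UTFException(validate(cast(const(char)[]) units));
}
```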

Walter keeps grinding on about "myCharArray.put('é')" not 
working, but I'm not sure he realizes how dangerous it would 
actually be to allow such a thing to work.

In particular, in all these cases, a simple call to 
"representation" will deactivate the feature, giving you the 
tools you want.
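
A minimal sketch of that opt-out, assuming current Phobos: 
std.string.representation reinterprets the string as an array of 
ubyte, so range algorithms see code units and nothing is decoded. 
byteCountOf is a made-up helper name.

```d
import std.algorithm.searching : count;
import std.range.primitives : ElementType, walkLength;
import std.string : representation;

/// Counts occurrences of an ASCII code unit without decoding.
/// Hypothetical helper.
size_t byteCountOf(string s, char needle)
{
    return s.representation.count(needle);
}

void main()
{
    string s = "na\u00EFve caf\u00E9";
    auto units = s.representation;

    // Same memory, different element type: bytes instead of dchar.
    static assert(is(typeof(units) == immutable(ubyte)[]));
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!(typeof(units)) == immutable(ubyte)));

    assert(s.walkLength == 10);     // code points, decoded
    assert(units.walkLength == 12); // code units, no decoding

    // Only the plain ASCII 'e' in "naïve" matches; U+00E9 does not.
    assert(byteCountOf(s, 'e') == 1);
}
```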

> I do think it's probably too late to change this, but I think 
> there is value in at least getting everyone on the same page.

Me too. I do see the value in being able to do decode-less 
iteration. I just think the *default* behavior has the advantage 
of being correct *most* of the time, and definitely much more 
correct than without decoding.

I think opt-out of decoding is just a much much much saner 
approach to string handling.

