Major performance problem with std.array.front()

Sat Mar 8 18:14:35 PST 2014

On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
> On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
>> On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
>>> My only claim is that recognizing and iterating strings by code point
>>> is better than doing things by the octet.
>>
>> Considering or disregarding the disadvantages of this choice?
>
> Doing my best to weigh everything with the right measures.

I think it would be good to get a comparison of the two 
approaches, and list the arguments presented so far. I'll look 
into starting a Wiki page.

> Okay, though when you opened with "devastating" I was hoping 
> for nothing short of death and dismemberment.

In proportion. To the best of my knowledge, no one here writes 
software for military or industrial robots in D. Security issues 
rank as the worst kind of bugs in software on my scale.

> Anyhow the fix is obvious per this brief tutorial: 
> http://www.youtube.com/watch?v=hkDD03yeLnU

I don't get it.

>> I'm quite sure that std.range and std.algorithm will lose a LOT of
>> weight if they were fixed to not treat strings specially.
>
> I'm not so sure. Most of the string-specific optimizations 
> simply detect certain string cases and forward them to array 
> algorithms that need be written anyway. You would, indeed, save 
> a fair amount of isSomeString conditionals and stuff (thus 
> simplifying on scaffolding), but probably not a lot of code. 
> That's not useless work - it'd go somewhere in any design.

One way to find out.

>>> Besides if you want to do Unicode you gotta crack some eggs.
>>
>> No, I can't see how this justifies the choice. An explicit decoding
>> range would have simplified things greatly while offering much of the
>> same advantages.
>
> My point there is that there's no useless or duplicated code 
> that would be thrown away. A better design would indeed make 
> for better modular separation - would be great if the 
> string-related optimizations in std.algorithm went elsewhere. 
> They wouldn't disappear.

Why? Isn't the whole issue that std.range presents strings as 
dchar ranges, and std.algorithm needs to detect dchar ranges and 
then treat them as char arrays? As opposed to std.algorithm just 
detecting arrays and treating them all as arrays (which it should 
be doing now anyway)?
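
To make the mismatch concrete, here is a minimal sketch of the two views of the same string. It assumes a Phobos where narrow strings are auto-decoded by the range primitives; the sample string is just something I picked for illustration.

```d
import std.range : ElementType, front, walkLength;
import std.string : representation;

void main()
{
    // U+00E9 takes two UTF-8 code units, so this is
    // 5 code points stored in 6 code units.
    string s = "h\u00E9llo";

    // Range view: the range primitives decode, so the element type is dchar.
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(s.front) == dchar));
    assert(s.walkLength == 5);

    // Array view: .length and indexing work on code units.
    assert(s.length == 6);

    // representation exposes the same memory as immutable(ubyte)[],
    // which the range primitives treat as a plain array.
    auto bytes = s.representation;
    static assert(is(ElementType!(typeof(bytes)) == immutable(ubyte)));
    assert(bytes.walkLength == 6);
}
```

An algorithm that wants the array behaviour has to notice that it was handed a string and step around the dchar range view, which is the detection work I mean.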

>>>> 3. Hidden, difficult-to-detect performance problems. The reason why
>>>> this thread was started. I've had to deal with them in several
>>>> places myself.
>>>
>>> I disagree with "hidden, difficult to detect".
>>
>> Why? You can only find out that an algorithm is slower than it needs
>> to be via either profiling (at which point you're wondering why the
>> @#$% the thing is so slow), or feeding it invalid UTF. If you had made
>> a different choice for Unicode in D, this problem would not exist
>> altogether.
>
> Disagree.

Could you please elaborate? This is the second uninformative 
reply to this argument.
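
For the invalid-UTF half of that argument, a minimal sketch of what I mean. It assumes the default Phobos behaviour where decoding throws UTFException on malformed input; the byte values are made up for the example.

```d
import std.exception : assertThrown;
import std.range : front;
import std.string : representation;
import std.utf : UTFException;

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable ubyte[] raw = [0xFF, 0x61];
    string s = cast(string) raw;

    // Nothing at the call site says "decode", yet front decodes and throws.
    assertThrown!UTFException(s.front);

    // Working on code units never decodes, so the same data is fine.
    assert(s[0] == 0xFF);
    assert(s.representation.front == 0xFF);
}
```

The decoding step is invisible in `s.front`, which is why the cost and the failure mode tend to show up only under a profiler or with bad input.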

>> Except we already do. Arguments have already been presented in this
>> thread that demonstrate correctness problems with the current
>> approach. I don't think that these can stand up to the problems that
>> the simpler by-char iteration approach would have.
>
> Sure there are, and you yourself illustrated a misuse of the 
> APIs.

If UTF decoding were explicit, the problem would stand out. I 
don't think this is a valid argument.

> My point is: code point is better than code unit

This was debated... people should not be looking at individual 
code points, unless they really know what they're doing.

> Grapheme is better than code point but a lot slower.

We are going in circles. People should have very good reasons for 
looking at individual graphemes as well.

> It seems we're quite in a sweet spot here wrt 
> performance/correctness.

This does not seem like an objective summary of this thread's 
arguments so far.

I guess I'll get working on that wiki page to organize the 
arguments. This discussion is starting to feel like a quicksand 
roundabout.

> With what has been put forward so far, that's not even close to 
> justifying a breaking change. If that great better design is 
> just get back to code unit iteration, the change will not 
> happen while I work on D. It is possible, however, that a much 
> better idea comes forward, and I'd be looking forward to such.

Actually, could you post some examples of real-world code that 
would be broken by a hypothetical sudden switch? I think I would 
be hard-pressed to find any in my own code, but I'd need to 
check to be sure.

> 2. Add byChar that returns a random-access range iterating a 
> string by character. Add byWchar that does on-the-fly 
> transcoding to UTF16. Add byDchar that accepts any range of 
> char and does decoding. And such stuff. Then whenever one wants 
> to go through a string by code point can just use str.byChar.

This is confusing. Did you mean to say that byChar iterates a 
string by code unit (not character / code point)?
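
To pin down the terminology, here is the distinction using only the built-in foreach rather than the proposed byChar/byDchar, which I'm deliberately not guessing the semantics of. The helper names countCodeUnits and countCodePoints are made up for this sketch.

```d
// Hypothetical helper: iterating as char visits UTF-8 code units,
// with no decoding.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Hypothetical helper: iterating as dchar makes foreach decode,
// so each step is one code point.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "\u00E9"; // one code point, two UTF-8 code units
    assert(countCodeUnits(s) == 2);
    assert(countCodePoints(s) == 1);
}
```

If byChar is meant to behave like the first loop, then "by code point" in the proposal should read "by code unit".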