Major performance problem with std.array.front()

Vladimir Panteleev vladimir at thecybershadow.net
Sat Mar 8 16:42:13 PST 2014


On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu 
wrote:
> My only claim is that recognizing and iterating strings by code 
> point is better than doing things by the octet.

Considering or disregarding the disadvantages of this choice?

>> 1. Eliminating dangerous constructs, such as s.countUntil and
>> s.indexOf both returning integers, yet possibly having different
>> values in circumstances that the developer may not foresee.
>
> I disagree there's any danger. They deal in code points, end of 
> story.

Perhaps I did not explain clearly enough.

auto pos = s.countUntil(sub);
writeln(s[pos..$]);

This will compile, and it will work for English text. For someone 
without complete knowledge of Phobos functions and how D handles 
Unicode, it is not obvious that this code is actually wrong: 
countUntil counts code points, while slicing indexes code units. 
In certain situations, this can have devastating effects. 
Consider, for example, code that extracts a slice from a string 
that elsewhere contains sensitive data (e.g. a configuration file 
containing, among other data, a password). An attacker could 
supply a Unicode string where the developer did not expect one, 
causing "pos" to have a smaller value than the corresponding 
indexOf result, and so revealing a slice of "s" which was not 
intended to be visible. A developer currently needs to tread very 
carefully wherever they slice strings, so as not to accidentally 
use indices obtained from functions that count code points.
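
To make the mismatch concrete, here is a minimal sketch. The 
sample strings and the two helper names (codePointOffset, 
codeUnitOffset) are made up for illustration; countUntil, indexOf 
and std.utf.toUTFindex are the Phobos functions, and the printed 
values assume the auto-decoding behaviour discussed in this 
thread.

```d
import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;
import std.utf : toUTFindex;

// Hypothetical helper: offset of sub in s, counted in decoded code points.
ptrdiff_t codePointOffset(string s, string sub)
{
    return s.countUntil(sub);
}

// Hypothetical helper: offset of sub in s, counted in UTF-8 code units.
// This is the unit that slicing expects.
ptrdiff_t codeUnitOffset(string s, string sub)
{
    return s.indexOf(sub);
}

void main()
{
    string s = "héllo wörld";   // 'é' and 'ö' take two code units each
    string sub = "wörld";

    auto cp = codePointOffset(s, sub);
    auto cu = codeUnitOffset(s, sub);

    writeln(cp);            // 6
    writeln(cu);            // 7
    writeln(s[cu .. $]);    // "wörld"
    writeln(s[cp .. $]);    // " wörld": starts one code unit too early

    // A code point count can be converted back to a slice index.
    writeln(s[s.toUTFindex(cp) .. $]);   // "wörld"
}
```

With more non-ASCII characters ahead of the match, the gap 
between the two values grows, and the slice taken with the code 
point count exposes correspondingly more of the string.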

>> 2. Very high complexity of implementations (the ElementEncodingType
>> problem previously mentioned).
>
> I disagree with "very high".

I'm quite sure that std.range and std.algorithm would lose a LOT 
of weight if they were fixed to not treat strings specially.

> Besides if you want to do Unicode you gotta crack some eggs.

No, I can't see how this justifies the choice. An explicit 
decoding range would have simplified things greatly while 
offering many of the same advantages. Whether the fact that it is 
there "by default" is an advantage of the current approach at all 
is debatable.
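
As a rough sketch of what opting in explicitly could look like: 
the example below assumes a Phobos release that provides 
std.utf.byCodeUnit and std.utf.byDchar. Whether those adapters 
match the design argued for here is an assumption, and the helper 
names (countCodeUnits, countCodePoints) are invented for the 
example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper: iterate the raw UTF-8 code units, no decoding.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

// Hypothetical helper: decode to code points because the caller asked for it.
size_t countCodePoints(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string s = "héllo";             // 'é' takes two UTF-8 code units

    writeln(s.length);              // 6
    writeln(countCodeUnits(s));     // 6
    writeln(countCodePoints(s));    // 5
    writeln(s.walkLength);          // 5, decoded implicitly by the range primitives
}
```

The point of the sketch is only that the call site states which 
unit it iterates by, rather than the choice being made implicitly 
by the range primitives for strings.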

>> 3. Hidden, difficult-to-detect performance problems. The reason why
>> this thread was started. I've had to deal with them in several
>> places myself.
>
> I disagree with "hidden, difficult to detect".

Why? You can only find out that an algorithm is slower than it 
needs to be either by profiling (at which point you're wondering 
why the @#$% the thing is so slow) or by feeding it invalid UTF. 
If you had made a different choice for Unicode in D, this problem 
would not exist at all.

> Also I'd add that I'd rather not have hidden, difficult to 
> detect correctness problems.

Except we already do. Arguments have already been presented in 
this thread that demonstrate correctness problems with the 
current approach. I don't think that these can stand up to the 
problems that the simpler by-char iteration approach would have.

>> 4. Encourage D programmers to write Unicode-capable code that is
>> correct in the full sense of the word.
>
> I disagree we are presently discouraging them.

I did not say we are. The problem is that we aren't encouraging 
them either - we are instead setting an example of how to do it 
in a wrong (incomplete) way.

> I do agree a change would make certain things clearer.

I have an issue with all the counter-arguments presented in this 
thread being shoved behind the one word "clearer".

> But not enough to nearly make up for the breakage.

I would still like to go ahead with my suggestion to attempt some 
possible changes without releasing them. I'm going to try them 
with my own programs first to see how much breaks. I believe that 
you are too eagerly dismissing all proposals without even 
evaluating them.

>> I think the above list has enough weight to merit at least
>> considering *some* breaking changes.
>
> I think a better approach is to figure what to add.

This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.)
- better documentation

