Major performance problem with std.array.front()
Vladimir Panteleev
vladimir at thecybershadow.net
Sat Mar 8 16:42:13 PST 2014
On Saturday, 8 March 2014 at 23:59:15 UTC, Andrei Alexandrescu wrote:
> My only claim is that recognizing and iterating strings by code
> point is better than doing things by the octet.
Considering or disregarding the disadvantages of this choice?
>> 1. Eliminating dangerous constructs, such as s.countUntil and
>> s.indexOf both returning integers, yet possibly having different
>> values in circumstances that the developer may not foresee.
>
> I disagree there's any danger. They deal in code points, end of
> story.
Perhaps I did not explain clearly enough.
auto pos = s.countUntil(sub);
writeln(s[pos..$]);
This will compile, and work for English text. For someone without
complete knowledge of Phobos functions and how D handles Unicode,
it is not obvious that this code is actually wrong. In certain
situations, this can have devastating effects: consider, for
example, if this code is extracting a slice from a string that
elsewhere contains sensitive data (e.g. a configuration file
containing, among other data, a password). An attacker could
supply a Unicode string where the developer did not expect it,
causing "pos" to have a smaller value than the corresponding
indexOf result and so revealing a slice of "s" which was not
intended to be visible. A developer therefore currently needs to
tread very carefully wherever he is slicing strings, so as to not
accidentally use indices obtained from functions that count code
points.
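To make the discrepancy concrete, here is a minimal sketch. The
string contents are made up for illustration, and it assumes the
auto-decoding Phobos behaviour discussed in this thread, where
countUntil on a string counts code points and indexOf returns an
offset in UTF-8 code units.

```d
import std.algorithm : countUntil;
import std.stdio : writeln;
import std.string : indexOf;

void main()
{
    // Hypothetical input: 'é' takes two UTF-8 code units, so every
    // code-unit offset after it is one larger than the code-point count.
    string s = "héllo=secret;key=value";
    string sub = "key";

    auto byCodePoint = s.countUntil(sub); // code points before the match
    auto byCodeUnit = s.indexOf(sub);     // code units before the match

    writeln(byCodePoint, " vs ", byCodeUnit);

    // Slicing always works in code units, so only the indexOf result
    // is a valid slice index here.
    writeln(s[byCodeUnit .. $]);

    // Using the code-point count as a slice index starts too early and
    // exposes text that precedes the match.
    writeln(s[byCodePoint .. $]);
}
```

The more non-ASCII characters precede the match, the further the two
values drift apart, and the more of the preceding text the wrong
slice exposes.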
>> 2. Very high complexity of implementations (the
>> ElementEncodingType problem previously mentioned).
>
> I disagree with "very high".
I'm quite sure that std.range and std.algorithm would lose a LOT
of weight if they were fixed to not treat strings specially.
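As a small illustration of the special-casing in question, the
sketch below shows the two element types that generic code has to
juggle for a string. It assumes the auto-decoding range primitives
in std.range as discussed in this thread; the variable name is
arbitrary.

```d
import std.range : ElementEncodingType, ElementType, front;

void main()
{
    string s = "abc";

    // As a range, a string yields decoded code points...
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(s.front) == dchar));

    // ...but as an array, it stores and indexes UTF-8 code units.
    static assert(is(ElementEncodingType!string == immutable(char)));
    static assert(is(typeof(s[0]) == immutable(char)));
}
```

Any algorithm that wants to work on the stored representation has
to check for this mismatch explicitly, which is where the extra
branches in generic code come from.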
> Besides if you want to do Unicode you gotta crack some eggs.
No, I can't see how this justifies the choice. An explicit
decoding range would have simplified things greatly while
offering many of the same advantages. Whether the fact that it is
there "by default" is an advantage of the current approach at all
is debatable.
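For illustration only, this is roughly what opting in explicitly
could look like. The sketch assumes a Phobos release that provides
std.utf.byCodeUnit and std.utf.byDchar; these adapters were added
after this message was written and are not part of the proposal
made here.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "héllo"; // 'é' is two UTF-8 code units

    // Iterate the stored code units; no decoding takes place.
    assert(s.byCodeUnit.walkLength == 6);

    // Ask for decoding explicitly where code points are wanted.
    assert(s.byDchar.walkLength == 5);
}
```

With adapters like these, the caller states at each use which unit
of iteration is intended instead of inheriting one implicitly.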
>> 3. Hidden, difficult-to-detect performance problems. The reason
>> why this thread was started. I've had to deal with them in
>> several places myself.
>
> I disagree with "hidden, difficult to detect".
Why? You can only find out that an algorithm is slower than it
needs to be via either profiling (at which point you're wondering
why the @#$% the thing is so slow), or feeding it invalid UTF. If
you had made a different choice for Unicode in D, this problem
would not exist at all.
> Also I'd add that I'd rather not have hidden, difficult to
> detect correctness problems.
Except we already do. Arguments have already been presented in
this thread that demonstrate correctness problems with the
current approach. I don't think that these can stand up to the
problems that the simpler by-char iteration approach would have.
>> 4. Encourage D programmers to write Unicode-capable code that is
>> correct in the full sense of the word.
>
> I disagree we are presently discouraging them.
I did not say we are. The problem is that we aren't encouraging
them either - we are instead setting an example of how to do it
in a wrong (incomplete) way.
> I do agree a change would make certain things clearer.
I have an issue with all the counter-arguments presented in this
thread being shoved behind the one word "clearer".
> But not enough to nearly make up for the breakage.
I would still like to go ahead with my suggestion to attempt some
possible changes without releasing them. I'm going to try them
with my own programs first to see how much it will break. I
believe that you are too eagerly dismissing all proposals without
even evaluating them.
>> I think the above list has enough weight to merit at least
>> considering *some* breaking changes.
>
> I think a better approach is to figure what to add.
This is obvious:
- more Unicode algorithms (normalization, segmentation, etc.)
- better documentation