Major performance problem with std.array.front()
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sat Mar 8 19:26:43 PST 2014
On 3/8/14, 6:14 PM, Vladimir Panteleev wrote:
> On Sunday, 9 March 2014 at 01:23:27 UTC, Andrei Alexandrescu wrote:
>> On 3/8/14, 4:42 PM, Vladimir Panteleev wrote:
>> My point there is that there's no useless or duplicated code that
>> would be thrown away. A better design would indeed make for better
>> modular separation - would be great if the string-related
>> optimizations in std.algorithm went elsewhere. They wouldn't disappear.
>
> Why? Isn't the whole issue that std.range presents strings as dchar
> ranges, and std.algorithm needs to detect dchar ranges and then treat
> them as char arrays? As opposed to std.algorithm just detecting arrays
> and treating them all as arrays (which it should be doing now anyway)?
That's scaffolding, not actual executable code.
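To make "scaffolding" concrete, here is a minimal sketch of the kind of compile-time branching under discussion. The function `countAscii` is a hypothetical helper written for this illustration, not a Phobos function; the only library pieces assumed are `std.traits.isNarrowString`, `std.range.isInputRange`, and `std.string.representation`. The `static if` selects a code path at compile time and emits no run-time dispatch.

```d
import std.range : isInputRange;
import std.string : representation;
import std.traits : isNarrowString;

// Hypothetical helper (not part of Phobos): counts occurrences of an
// ASCII character in any input range of characters.
size_t countAscii(R)(R r, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");
    size_t n;
    static if (isNarrowString!R)
    {
        // Compile-time branch: for char[]/wchar[] strings, skip decoding
        // and scan the raw code units. This is safe for an ASCII needle
        // because values below 0x80 never occur inside a multi-unit
        // UTF-8 or UTF-16 sequence.
        foreach (u; r.representation)
            if (u == needle) ++n;
    }
    else
    {
        // Generic path: iterate the range element by element.
        foreach (e; r)
            if (e == needle) ++n;
    }
    return n;
}

void main()
{
    assert(countAscii("a,b,\u00e9,c", ',') == 3);  // narrow-string branch
    assert(countAscii("a,b"d, ',') == 1);          // generic branch (dstring)
}
```

Both branches return the same count; only the narrow-string branch avoids decoding.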
>>> Why? You can only find out that an algorithm is slower than it needs to
>>> be via either profiling (at which point you're wondering why the @#$%
>>> the thing is so slow), or feeding it invalid UTF. If you had made a
>>> different choice for Unicode in D, this problem would not exist
>>> altogether.
>>
>> Disagree.
>
> Could you please elaborate? This is the second uninformative reply to
> this argument.
What can I say? The answer is obvious. It's not hard for me to figure out.
Performance of D's UTF strings has never been a mystery to me. From
where I stand all this "hidden, difficult-to-detect performance
problems" drama is just posturing. We'd do well to wean such out of the
discussion.
No myriad of bug reports "D strings are awfully slow" on Bugzilla.
No long threads "Why are D strings so slow" on Stack Overflow.
No trolling on Reddit or Hacker News "D? Just look at their strings. How
could anyone think that's a good idea lol."
And it's not like people aren't talking. In contrast, D has been (and
often rightly) criticized in the past for things like floating point
performance and garbage collection. No evidence we are having an acute
performance problem with UTF strings.
>> Sure there are, and you yourself illustrated a misuse of the APIs.
>
> If UTF decoding was explicit, the problem would stand out. I don't think
> this is a valid argument.
Yours? Indeed it isn't, if what you want is to iterate by code unit (=
meaningless for all but ASCII strings) by default.
>> My point is: code point is better than code unit
>
> This was debated... people should not be looking at individual code
> points, unless they really know what they're doing.
Should they be looking at code units instead?
>> Grapheme is better than code point but a lot slower.
>
> We are going in circles. People should have very good reasons for
> looking at individual graphemes as well.
And it's good we have increasing support for graphemes. I don't think
they should be the default.
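For readers following the three levels being compared, here is a minimal sketch that counts the same string by code unit, by code point, and by grapheme. The function names are hypothetical and exist only for this illustration; `walkLength` (std.range) and `byGrapheme` (std.uni) are the Phobos facilities assumed.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers for illustration only.
size_t codeUnitCount(string s)  { return s.length; }                // UTF-8 code units
size_t codePointCount(string s) { return s.walkLength; }            // decoded code points
size_t graphemeCount(string s)  { return s.byGrapheme.walkLength; } // grapheme clusters

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character.
    string s = "e\u0301";

    assert(codeUnitCount(s) == 3);   // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePointCount(s) == 2);  // the string range decodes to dchar
    assert(graphemeCount(s) == 1);   // both code points form one cluster
}
```

Each level costs more than the one before it: `length` is a field read, `walkLength` decodes the whole string, and `byGrapheme` additionally applies cluster segmentation.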
>> It seems we're quite in a sweet spot here wrt performance/correctness.
>
> This does not seem like an objective summary of this thread's arguments
> so far.
What is an objective summary? Those who want to inflict massive breakage
are not even done arguing we have a better design.
> I guess I'll get working on that wiki page to organize the arguments.
> This discussion is starting to feel like a quicksand roundabout.
That's great. Yes, we're exchanging jabs right now, which is not our best
use of time. Also in the interest of time, please understand you'd need
to show the second coming if you want to break backward compatibility.
Additions are a much better path.
>> With what has been put forward so far, that's not even close to
>> justifying a breaking change. If that great better design is just get
>> back to code unit iteration, the change will not happen while I work
>> on D. It is possible, however, that a much better idea comes forward,
>> and I'd be looking forward to such.
>
> Actually, could you post some examples of real-world code that would be
> broken by a hypothetical sudden switch? I think I would be hard-pressed
> to find some in my own code, but I'd need to check for sure to find out.
I'm afraid the burden of proof is on you. As far as I'm concerned, every
breakage of string processing is unacceptable or at least very undesirable.
>> 2. Add byChar that returns a random-access range iterating a string by
>> character. Add byWchar that does on-the-fly transcoding to UTF16. Add
>> byDchar that accepts any range of char and does decoding. And such
>> stuff. Then whenever one wants to go through a string by code point
>> can just use str.byChar.
>
> This is confusing. Did you mean to say that byChar iterates a string by
> code unit (not character / code point)?
Unit. s.byChar.front is a (possibly ref, possibly qualified) char.
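A minimal sketch of what that looks like in code. It assumes a Phobos release that ships `std.utf.byCodeUnit`, `std.utf.byChar`, and `std.utf.byDchar`; these were added after this message, so their exact behavior may differ in detail from the proposal quoted above.

```d
import std.range : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byChar, byCodeUnit, byDchar;

void main()
{
    string s = "h\u00e9";  // 'h' + U+00E9: 3 UTF-8 code units, 2 code points

    // Default range view of a string: elements are decoded code points.
    static assert(is(ElementType!string == dchar));

    // Explicit code-unit view: elements are (qualified) char, and the
    // range offers length and indexing.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.length == 3);
    assert(units[1] == 0xC3);  // first byte of U+00E9 encoded in UTF-8

    // byChar also yields char elements, with no decoding to dchar.
    static assert(is(Unqual!(ElementType!(typeof(s.byChar))) == char));
    assert(s.byChar.walkLength == 3);

    // byDchar makes the decoding explicit and yields code points.
    static assert(is(ElementType!(typeof(s.byDchar)) == dchar));
    assert(s.byDchar.walkLength == 2);
}
```

With these adapters, the choice between code units and code points is visible at the call site instead of being implied by the string type.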
Andrei