Major performance problem with std.array.front()
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sat Mar 8 12:05:42 PST 2014
On 3/8/14, 9:33 AM, Sean Kelly wrote:
> On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
>> Andrei suggests that this change would destroy D by breaking too much
>> existing code. He might be right. Can we afford the risk that he is
>> right?
>
> Perhaps not. But I think the current approach is totally broken; it
> just also happens to be what people have coded to.
I think that's an exaggeration poorly supported by evidence. My
definition of "totally broken" would be "essentially unusable" and I
think we're well past the point we need to prove that. Virtually all
applications need to deal with strings to some extent, and I myself
wrote a couple of relatively string-heavy ones. D strings work well.
Even the most ardent detractors of D on e.g. reddit.com admit by
omission that string processing is one of its strengths. Though they'll
probably pick up on this thread soon :o).
> Andrei used
> algorithms operating on a code point level as an example of what would
> break if this change were made, and in that he's absolutely correct.
> But what bothers me is whether it's appropriate to perform this sort of
> character-based operation on Unicode strings in the first place.
Searching for characters in strings would be difficult to deem
inappropriate.
When I designed std.algorithm I recall I put the following options on
the table:
1. All algorithms would by default operate on strings at char/wchar
level (i.e. code unit). That would cause the usual issues and confusions
I was aware of from C++. Certain algorithms would require specialization
and/or the user using byDchar for correctness. At some point I swear
I've had a byDchar definition somewhere; I've searched the recent git
history for it, to no avail.
2. All algorithms would by default operate at code point level. That way
correctness would be achieved by default, and certain algorithms would
require specialization for efficiency. (Back then I didn't know about
graphemes and normalization. I'm not sure how that would have affected
the final decision.)
3. Change the alias string, wstring etc. to be some type that requires
explicit access for code units/code points etc. instead of implicitly
mixing the two.
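
To make the three levels in the options above concrete, here is a minimal sketch, not part of the original message. It assumes a Phobos release that provides std.utf.byCodeUnit and std.uni.byGrapheme; the helper names codeUnits, codePoints and graphemes are hypothetical and exist only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, one per level of abstraction.

// Option (1): count UTF-8 code units (same as s.length for a string).
size_t codeUnits(string s) { return s.byCodeUnit.walkLength; }

// Option (2): count code points; walking a string decodes it by default.
size_t codePoints(string s) { return s.walkLength; }

// Graphemes: user-perceived characters, one level above code points.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string precomposed = "no\u00EBl";  // U+00EB, one code point, two code units
    string decomposed  = "noe\u0308l"; // 'e' followed by combining U+0308

    assert(codeUnits(precomposed) == 5);
    assert(codePoints(precomposed) == 4);
    assert(graphemes(precomposed) == 4);

    assert(codeUnits(decomposed) == 6);
    assert(codePoints(decomposed) == 5);
    assert(graphemes(decomposed) == 4);
}
```

The two strings render identically yet agree only at the grapheme level, which is the normalization caveat mentioned under option (2).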
My fave was (3). And not mine only - several people suggested
alternative definitions of the "default" string type. Back then however
we were in the middle of the D1/D2 transition and one more aftershock
didn't seem like a good idea at all. Walter opposed such a change, and
didn't really have to convince me.
From experience with C++ I knew (1) had a bad track record, and (2)
"generically conservative, specialize for speed" was a successful pattern.
What would you have chosen given that context?
> The current approach is a cut above treating strings as arrays of bytes
> for some languages, and still utterly broken for others. If I'm
> operating on a right to left language like Hebrew, what would I expect
> the result to be from something like countUntil?
The entire string processing paraphernalia is left to right. I figure
RTL languages are under-supported, but s.retro.countUntil comes to mind.
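
A rough sketch of what that looks like, not part of the original message: the helper names fromFront and fromBack are hypothetical, and the example assumes only std.algorithm.countUntil and std.range.retro. Note that the string is stored in logical order, so the "front" is the first letter read, which is displayed on the right for Hebrew.

```d
import std.algorithm : countUntil;
import std.range : retro;

// Hypothetical helpers: count code points before the first match,
// scanning from the logical start or from the logical end.
ptrdiff_t fromFront(string s, dchar c) { return s.countUntil(c); }
ptrdiff_t fromBack(string s, dchar c)  { return s.retro.countUntil(c); }

void main()
{
    // "shalom" in logical order: shin, lamed, vav, final mem.
    string shalom = "\u05E9\u05DC\u05D5\u05DD";

    assert(fromFront(shalom, '\u05E9') == 0); // shin is read first
    assert(fromFront(shalom, '\u05DD') == 3); // final mem is read last
    assert(fromBack(shalom, '\u05DD') == 0);  // reversed view starts at final mem
    assert(fromBack(shalom, '\u05E9') == 3);
}
```

Both counts are in code points, not bytes and not visual positions; whether either number is the useful one for RTL text is exactly the open question.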
> And how useful would
> such a result be?
I don't know.
> I'm inclined to say that the correct approach is to
> state that algorithms operate explicitly on a T.sizeof basis and that if
> the data contained in a particular range has some multi-element encoding
> then separate, specialized routines should be used where the T.sizeof
> behavior will not produce the desired result.
That sounds quite like C++ plus ICU. It doesn't strike me as the gold
standard for Unicode integration.
> So the problem to me is that we're stuck not fixing something that's
> horribly broken just because it's broken in a way that people presumably
> now expect.
Clearly I'm being subjective here but again I'd find it difficult to get
convinced we have something horribly broken from the evidence I gathered
inside and outside Facebook.
> I'd personally like to see this fixed and I think the new behavior is
> preferable overall, but I do share Andrei's concern that such a big
> change might hurt the language anyway.
I've said this once and I'm saying it again: the best way to convert
this discussion into something useful is to devise ideas for useful
non-breaking additions.
Andrei