Major performance problem with std.array.front()

Sat Mar 8 12:05:42 PST 2014

On 3/8/14, 9:33 AM, Sean Kelly wrote:
> On Saturday, 8 March 2014 at 00:22:05 UTC, Walter Bright wrote:
>> Andrei suggests that this change would destroy D by breaking too much
>> existing code. He might be right. Can we afford the risk that he is
>> right?
>
> Perhaps not.  But I think the current approach is totally broken; it
> just also happens to be what people have coded to.

I think that's an exaggeration poorly supported by evidence. My 
definition of "totally broken" would be "essentially unusable" and I 
think we're well past the point where we need to prove that. Virtually all 
applications need to deal with strings to some extent, and I myself 
wrote a couple of relatively string-heavy ones. D strings work well. 
Even the most ardent detractors of D on e.g. reddit.com admit by 
omission that string processing is one of its strengths. Though they'll 
probably pick up on this thread soon :o).

> Andrei used
> algorithms operating on a code point level as an example of what would
> break if this change were made, and in that he's absolutely correct.
> But what bothers me is whether it's appropriate to perform this sort of
> character-based operation on Unicode strings in the first place.

Searching for characters in strings would be difficult to deem 
inappropriate.

When I designed std.algorithm I recall I put the following options on 
the table:

1. All algorithms would by default operate on strings at char/wchar 
level (i.e. code unit). That would cause the usual issues and confusions 
I was aware of from C++. Certain algorithms would require specialization 
and/or the user using byDchar for correctness. I could swear I had a 
byDchar definition somewhere at some point; I've searched the recent git 
history for it, to no avail.

2. All algorithms would by default operate at code point level. That way 
correctness would be achieved by default, and certain algorithms would 
require specialization for efficiency. (Back then I didn't know about 
graphemes and normalization. I'm not sure how that would have affected 
the final decision.)

3. Change the alias string, wstring etc. to be some type that requires 
explicit access for code units/code points etc. instead of implicitly 
mixing the two.
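
To make the code unit versus code point distinction in the options above concrete, here is a minimal sketch. The helper names codeUnitCount and codePointCount are made up for illustration and are not Phobos functions; the sketch assumes a Phobos in which range primitives on narrow strings decode to dchar.

```d
import std.range : walkLength;
import std.range.primitives : front;
import std.string : representation;

/// Hypothetical helper: number of UTF-8 code units in the string.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Hypothetical helper: number of code points, found by walking the
/// string as a range (which decodes it).
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    // U+00E9 takes two code units in UTF-8.
    string s = "h\u00E9llo";

    assert(codeUnitCount(s) == 6);
    assert(codePointCount(s) == 5);

    // The range primitive front yields a decoded code point...
    static assert(is(typeof(s.front) == dchar));
    // ...while indexing yields a single code unit.
    static assert(is(typeof(s[0]) == immutable(char)));

    // An explicit code-unit view of the same data.
    assert(s.representation.length == 6);
}
```

The same variable answers "how long is it?" in two different ways depending on how it is accessed, which is the implicit mixing that option 3 would have ruled out.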

My fave was (3). And not mine only - several people suggested 
alternative definitions of the "default" string type. Back then however 
we were in the middle of the D1/D2 transition and one more aftershock 
didn't seem like a good idea at all. Walter opposed such a change, and 
didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) 
"generically conservative, specialize for speed" was a successful pattern.
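
A rough sketch of what "generically conservative, specialize for speed" can look like in code. The function countOf is hypothetical and written only for this illustration; it is not how any Phobos algorithm is actually implemented.

```d
import std.range.primitives : isInputRange, empty, front, popFront;
import std.string : representation;
import std.traits : isNarrowString;

/// Hypothetical helper: counts how many elements of r equal needle.
size_t countOf(R)(R r, dchar needle)
    if (isInputRange!R)
{
    static if (isNarrowString!R)
    {
        // Specialization for speed: an ASCII needle is always exactly one
        // code unit in UTF-8 and UTF-16 and never occurs inside a longer
        // sequence, so raw code units can be compared without decoding.
        if (needle < 0x80)
        {
            size_t fast;
            foreach (unit; r.representation)
                if (unit == needle) ++fast;
            return fast;
        }
    }

    // Conservative default: go element by element, which for strings
    // means decoded code points.
    size_t n;
    for (; !r.empty; r.popFront())
        if (r.front == needle) ++n;
    return n;
}

void main()
{
    assert(countOf("h\u00E9llo", 'l') == 2);       // fast path
    assert(countOf("h\u00E9llo", '\u00E9') == 1);  // decoding path
}
```

Callers see one function with one behavior; the fast path is an implementation detail that is taken only when it cannot change the result.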

What would you have chosen given that context?

> The current approach is a cut above treating strings as arrays of bytes
> for some languages, and still utterly broken for others. If I'm
> operating on a right to left language like Hebrew, what would I expect
> the result to be from something like countUntil?

The entire string processing paraphernalia is left to right. I figure 
RTL languages are under-supported, but s.retro.countUntil comes to mind.
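
For reference, a small sketch of what s.retro.countUntil does. The wrapper countFromEnd is a made-up name for this example. Note that retro only reverses the order in which code points are visited; it knows nothing about bidirectional display.

```d
import std.algorithm.searching : countUntil;
import std.range : retro;

/// Hypothetical helper: number of code points between the end of s and
/// the last occurrence of needle, or -1 if needle is absent.
ptrdiff_t countFromEnd(string s, dchar needle)
{
    return s.retro.countUntil(needle);
}

void main()
{
    // Forward: 'h', 'e', then the first 'l'.
    assert("hello".countUntil('l') == 2);

    // Backward: 'o', then the last 'l'.
    assert(countFromEnd("hello", 'l') == 1);

    // Not found.
    assert(countFromEnd("hello", 'z') == -1);
}
```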

> And how useful would
> such a result be?

I don't know.

> I'm inclined to say that the correct approach is to
> state that algorithms operate explicitly on a T.sizeof basis and that if
> the data contained in a particular range has some multi-element encoding
> then separate, specialized routines should be used where the T.sizeof
> behavior will not produce the desired result.

That sounds quite like C++ plus ICU. It doesn't strike me as the gold 
standard for Unicode integration.

> So the problem to me is that we're stuck not fixing something that's
> horribly broken just because it's broken in a way that people presumably
> now expect.

Clearly I'm being subjective here, but again I'd find it difficult to be 
convinced we have something horribly broken from the evidence I gathered 
inside and outside Facebook.

> I'd personally like to see this fixed and I think the new behavior is
> preferable overall, but I do share Andrei's concern that such a big
> change might hurt the language anyway.

I've said this once and I'm saying it again: the best way to convert 
this discussion into something useful is to devise ideas for useful 
non-breaking additions.

Andrei