Major performance problem with std.array.front()

Fri Mar 7 16:44:41 PST 2014

On 3/7/14, 12:26 PM, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
>> s.canFind('é') currently works as expected. Proposed: fails silently.
>
> The problem is that the current implementation of this correct behaviour
> leaves a lot to be desired in terms of performance. Ideally, you should
> not need to decode every single character in s just to see if it happens
> to contain é. Rather, canFind, et al should convert the dchar literal
> 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
> instead. Decoding every character in s, while correct, is also
> needlessly inefficient.

That's an optimization that fits the current design and goes in the 
library transparently, i.e. the good stuff.
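
To make the idea concrete, here is a minimal sketch of that optimization. 
It is not Phobos code: the name canFindCodePoint is made up for 
illustration, and it assumes the haystack is valid UTF-8. The needle is 
encoded once and the search then runs over raw code units, which is safe 
because UTF-8 is self-synchronizing.

```d
import std.algorithm : canFind;
import std.string : representation;
import std.utf : encode;

/// Hypothetical helper (not part of Phobos): look for a code point
/// without decoding the haystack.
bool canFindCodePoint(string haystack, dchar needle)
{
    char[4] buf;
    // Encode the needle to UTF-8 once; len is the number of code units.
    immutable len = encode(buf, needle);
    // representation() exposes the code units as ubyte, so canFind
    // does a plain substring search with no auto-decoding.
    return haystack.representation.canFind(buf[0 .. len].representation);
}
```

With this, canFindCodePoint("café", 'é') is true and 
canFindCodePoint("cafe", 'é') is false, same as s.canFind('é') today.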

>> 5.
>>
>> s.count() currently works as expected. Proposed: fails silently.
>
> Wrong. The current behaviour of s.count() does not work as expected, it
> only gives an illusion that it does.

Depends on what one expects :o).

> Its return value is misleading when
> combining diacritics and other such Unicode "niceness" are involved.
> Arguably, such things should be prohibited altogether, and more
> semantically transparent algorithms used, namely s.countCodePoints,
> s.countGraphemes, etc.

I think s.byGrapheme.count is the right way instead of specializing a 
bunch of algorithms to work with graphemes.
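
A small sketch of the three levels of counting, to show what I mean. The 
helper name graphemeCount is hypothetical and only there for 
illustration; the point is that the generic count composes with 
byGrapheme.

```d
import std.algorithm : count;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.count;
}

// "e" followed by U+0301 (combining acute accent) renders as one character:
//   "e\u0301".length         is 3 (UTF-8 code units)
//   "e\u0301".count          is 2 (code points)
//   graphemeCount("e\u0301") is 1 (graphemes)
```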

>> s.endsWith('é') currently works as expected. Proposed: fails silently.
>
> Arguable, because it imposes a performance hit by needless decoding.
> Ideally, you should have 3 overloads:
>
> 	bool endsWith(string s, char asciiChar);
> 	bool endsWith(string s, wchar wideChar);
> 	bool endsWith(string s, dchar codepoint);

Nice idea. Fits current design. Then interesting complications arise 
with things like bool endsWith(string, wstring) etc.
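
For the single-character case, something along these lines would do. 
This is a rough sketch under the assumption that s is valid UTF-8; 
endsWithFast is a made-up name, not an existing or proposed Phobos 
function.

```d
import std.utf : encode;

/// Hypothetical sketch: endsWith for one code point, with no decoding
/// of the haystack.
bool endsWithFast(string s, dchar c)
{
    // ASCII is a single code unit and never occurs inside a
    // multi-byte sequence, so comparing the last code unit is enough.
    if (c < 0x80)
        return s.length > 0 && s[$ - 1] == c;

    // Otherwise encode the code point once and compare the tail
    // code unit by code unit.
    char[4] buf;
    immutable len = encode(buf, c);
    return s.length >= len && s[$ - len .. $] == buf[0 .. len];
}
```

The mixed-width cases (string against wstring and so on) need 
transcoding of one side and are not covered by this sketch.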

> [...]
>> I designed the range behavior of strings after much thinking and
>> consideration back in the day when I designed std.algorithm. It was
>> painfully obvious (but it seems to have been forgotten now that it's
>> working so well) that approaching strings as arrays of char would
>> break almost every single algorithm leaving us essentially in the
>> pre-UTF C++aveman era.
>
> I agree, but it is also painfully obvious that the current
> implementation is lackluster in terms of performance.

It's not painfully obvious to me at all. What is obvious to me is that 
people are happy campers with the way D's strings work, including UTF 
support and performance. I don't remember anyone, in the forums or here 
at Facebook, saying "yeah, just look at the crappy way they handle 
strings..." Silent approval is easy to forget about.

Walter has been working on an application in which anything slower than 
2x baseline would have been a failure. In that app (which I know very 
well) the right option from day 1 would have been ubyte[], which he 
discovered the hard way. His incomplete understanding of how D strings 
work is the single largest problem there, and indicates an issue with 
the documentation.

He discovered that, was surprised, and overreacted. No need to amplify 
that into mass hysteria. There are improvements that can be made, in the 
form of additions, not breaking changes that would inflict massive 
breakage on the community. This is the way in which this discussion can 
have a positive outcome. (I've shared in fact a few ideas with Walter.)

>> Clearly one might argue that their app has no business dealing with
>> diacriticals or Asian characters. But that's the typical provincial
>> view that marred many languages' approach to UTF and
>> internationalization. If you know your string is ASCII, the remedy
>> is simple - don't use char[] and friends. From day 1, the type
>> "char" was meant to mean "code unit of UTF characters".
>
> Yes, but currently Phobos support for non-UTF strings is rather poor,
> and requires many explicit casts to/from ubyte[].

Non-UTF strings are currently modeled as ubyte[], so I don't see what 
you'd be casting to and fro. You have absolutely no business 
representing anything non-UTF with char and char[] etc.

>> So please ponder the above before performing surgery that's going to
>> kill the patient.
> [...]
>
> Yeah I was surprised Walter was actually seriously going to pursue this.
> It's a change of a far vaster magnitude than many of the other DIPs and
> other proposals that have been rejected because they were deemed to
> cause too much breakage of existing code.

Compared with what's going on now with D at Facebook, this agitation is 
but a little side show. We have way bigger fish to fry.

Andrei