Major performance problem with std.array.front()
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Mar 7 16:44:41 PST 2014
On 3/7/14, 12:26 PM, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
>> s.canFind('é') currently works as expected. Proposed: fails silently.
>
> The problem is that the current implementation of this correct behaviour
> leaves a lot to be desired in terms of performance. Ideally, you should
> not need to decode every single character in s just to see if it happens
> to contain é. Rather, canFind, et al should convert the dchar literal
> 'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
> instead. Decoding every character in s, while correct, is also
> needlessly inefficient.
That's an optimization that fits the current design and goes in the
library transparently, i.e. the good stuff.
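A minimal sketch of that optimization, assuming a hypothetical helper named containsCodePoint (not a Phobos function): encode the needle once into UTF-8, then search the code units directly so the haystack is never decoded.

```d
import std.algorithm : canFind;
import std.string : representation;
import std.utf : encode;

/// Hypothetical helper for illustration only; not how Phobos implements canFind.
bool containsCodePoint(const(char)[] haystack, dchar needle)
{
    char[4] buf;
    // encode writes the UTF-8 form of needle into buf and returns
    // the number of code units it used (1 to 4).
    immutable len = encode(buf, needle);

    // representation reinterprets the char data as ubyte, so the search
    // below compares raw code units and does no decoding.
    return canFind(haystack.representation, buf[0 .. len].representation);
}

unittest
{
    assert(containsCodePoint("café", 'é'));
    assert(!containsCodePoint("cafe", 'é'));
}
```

This is safe for valid UTF-8 because the encoding of one code point never appears inside the encoding of another.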
>> 5.
>>
>> s.count() currently works as expected. Proposed: fails silently.
>
> Wrong. The current behaviour of s.count() does not work as expected, it
> only gives an illusion that it does.
Depends on what one expects :o).
> Its return value is misleading when
> combining diacritics and other such Unicode "niceness" are involved.
> Arguably, such things should be prohibited altogether, and more
> semantically transparent algorithms used, namely s.countCodePoints,
> s.countGraphemes, etc..
I think s.byGrapheme.count is the right way instead of specializing a
bunch of algorithms to work with graphemes.
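A small sketch of the three counts for one string, assuming a literal that spells "é" as 'e' followed by the combining acute accent U+0301. It uses walkLength, which counts the elements of a range:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "cafe" plus U+0301 (combining acute accent), displayed as "café".
    string s = "cafe\u0301";

    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (strings decode as ranges)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}
```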
>> s.endsWith('é') currently works as expected. Proposed: fails silently.
>
> Arguable, because it imposes a performance hit by needless decoding.
> Ideally, you should have 3 overloads:
>
> bool endsWith(string s, char asciiChar);
> bool endsWith(string s, wchar wideChar);
> bool endsWith(string s, dchar codepoint);
Nice idea. Fits current design. Then interesting complications arise
with things like bool endsWith(string, wstring) etc.
> [...]
>> I designed the range behavior of strings after much thinking and
>> consideration back in the day when I designed std.algorithm. It was
>> painfully obvious (but it seems to have been forgotten now that it's
>> working so well) that approaching strings as arrays of char would
>> break almost every single algorithm leaving us essentially in the
>> pre-UTF C++aveman era.
>
> I agree, but it is also painfully obvious that the current
> implementation is lackluster in terms of performance.
It's not painfully obvious to me at all. What is obvious to me is people
are happy campers with the way D's strings work, including UTF support
and performance. I don't remember people in forums or here at Facebook
saying "yeah, just look at the crappy way they handle strings..."
Silent approval is easy to forget about.
Walter has been working on an application in which anything slower than
2x baseline would have been a failure. In that app (which I know very
well) the right option from day 1 would have been ubyte[], which he
discovered the hard way. His incomplete understanding of how D strings
work is the single largest problem there, and indicates an issue with
the documentation.
He discovered that, was surprised, and overreacted. No need to amplify
that into mass hysteria. There are improvements that can be made, in the
form of additions, not breaking changes that would inflict massive
breakage on the community. This is the way in which this discussion can
have a positive outcome. (I've shared in fact a few ideas with Walter.)
>> Clearly one might argue that their app has no business dealing with
>> diacriticals or Asian characters. But that's the typical provincial
>> view that marred many languages' approach to UTF and
>> internationalization. If you know your string is ASCII, the remedy
>> is simple - don't use char[] and friends. From day 1, the type
>> "char" was meant to mean "code unit of UTF characters".
>
> Yes, but currently Phobos support for non-UTF strings is rather poor,
> and requires many explicit casts to/from ubyte[].
Non-UTF strings are currently modeled as ubyte[], so I don't see what
you'd be casting to and fro. You have absolutely no business
representing anything non-UTF with char and char[] etc.
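To illustrate the difference (a sketch only; the string and variable names are made up for the example): as ranges, string yields decoded code points, while ubyte[] yields raw bytes with no decoding step.

```d
import std.range : ElementType, front;
import std.string : representation;

// The range element type of a string is a decoded code point...
static assert(is(ElementType!string == dchar));
// ...while a ubyte[] is just a range of bytes.
static assert(is(ElementType!(ubyte[]) == ubyte));

unittest
{
    string s = "é";               // two UTF-8 code units: 0xC3 0xA9
    assert(s.front == 'é');       // front decodes a whole code point

    auto raw = s.representation;  // immutable(ubyte)[] view of the same data
    assert(raw.length == 2);
    assert(raw.front == 0xC3);    // front is the first byte, nothing decoded
}
```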
>> So please ponder the above before going to do surgery on the patient
>> that's going to kill him.
> [...]
>
> Yeah I was surprised Walter was actually seriously going to pursue this.
> It's a change of a far vaster magnitude than many of the other DIPs and
> other proposals that have been rejected because they were deemed to
> cause too much breakage of existing code.
Compared with what's going on now with D at Facebook, this agitation is
but a little side show. We have way bigger fish to fry.
Andrei