Major performance problem with std.array.front()

Fri Mar 7 12:26:00 PST 2014

On Fri, Mar 07, 2014 at 11:57:23AM -0800, Andrei Alexandrescu wrote:
> On 3/6/14, 6:37 PM, Walter Bright wrote:
> >In "Lots of low hanging fruit in Phobos" the issue came up about the
> >automatic encoding and decoding of char ranges.
> [snip]
> >Is there any hope of fixing this?
> 
> There's nothing to fix.

:D  I knew this was going to happen.

> Allow me to enumerate the functions of std.algorithm and how they
> work today and how they'd work with the proposed change. Let s be a
> variable of some string type.
> 
> 1.
> 
> s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.
> 
> 2.
> 
> s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.
> 
> 3.
> 
> s.canFind!(x => x == 'é') currently works as expected. Proposed:
> fails silently.
> 
> 4.
> 
> s.canFind('é') currently works as expected. Proposed: fails silently.

The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should
not need to decode every single character in s just to see if it happens
to contain é. Rather, canFind, et al should convert the dchar literal
'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
instead. Decoding every character in s, while correct, is also
needlessly inefficient.

> 5.
> 
> s.count() currently works as expected. Proposed: fails silently.

Wrong. The current behaviour of s.count() does not work as expected, it
only gives an illusion that it does. Its return value is misleading when
combining diacritics and other such Unicode "niceness" are involved.
Arguably, such things should be prohibited altogether, and more
semantically transparent algorithms used, namely s.countCodePoints,
s.countGraphemes, etc..

> 6.
> 
> s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
> currently works as expected (with the known issues of lowercase
> conversion). Proposed: fails silently.

Again, I don't like this. It sweeps the issues of comparing unicode
strings under the carpet and gives the programmer a false sense of code
correctness. Users instead should be encouraged to use proper Unicode
collation functions that are actually correct, instead of giving an
illusion of correctness.

> 7.
> 
> s.count('é') currently works as expected. Proposed: fails silently.

This is a repetition of #5. :)

> 8.
> 
> s.countUntil("a") currently work as expected. Proposed: fails
> silently. This applies to all variations of countUntil.

Whether this is correct or not depends on what the intention is. If
you're looking to slice a string, this most definitely does NOT work as
expected. If you're looking to count graphemes, this doesn't work as
expected either. This only works if you just so happen to be counting
code points. The correct approach, IMO, is to help the user make a
conscious choice between these different semantics:

	s.indexOf("a");			// for slicing
	s.byCodepoint.countUntil("a");	// count code points
	s.byGrapheme.countUntil("a");	// count graphemes

Things like s.countUntil("a") are misleading and lead to subtle Unicode
bugs.

> 9.
> 
> s.endsWith('é') currently works as expected. Proposed: fails silently.

Arguable, because it imposes a performance hit by needless decoding.
Ideally, you should have 3 overloads:

	bool endsWith(string s, char asciiChar);
	bool endsWith(string s, wchar wideChar);
	bool endsWith(string s, dchar codepoint);

In the wchar and dchar overloads you'd do substring search. There is no
need to decode.

> 10.
> 
> s.find('é') currently works as expected. Proposed: fails silently.
> This applies to other variations of find that include custom
> predicates.

Not necessarily. Arguably we should be overloading on needle type to
eliminate needless decoding:

	string find(string s, char c); // ubyte search
	string find(string s, wchar c); // substring search with char[2]
	string find(string s, dchar c); // substring search with char[4]

This makes sense to me because string is immutable(char)[], so from the
point of view of being an array, searching for wchar is not something
that is obvious (how do you search for a value of type T in an array of
elements of type U?), so explicit overloads for handling those cases
make sense.

Decoding every single character in s is a lot of needless work.

[...]
> I designed the range behavior of strings after much thinking and
> consideration back in the day when I designed std.algorithm. It was
> painfully obvious (but it seems to have been forgotten now that it's
> working so well) that approaching strings as arrays of char[] would
> break almost every single algorithm leaving us essentially in the
> pre-UTF C++aveman era.

I agree, but it is also painfully obvious that the current
implementation is lackluster in terms of performance.

> Making strings bidirectional ranges has been a very good choice
> within the constraints. There was already a string type, and that
> was immutable(char)[], and a bunch of code depended on that
> definition.
> 
> Clearly one might argue that their app has no business dealing with
> diacriticals or Asian characters. But that's the typical provincial
> view that marred many languages' approach to UTF and
> internationalization. If you know your string is ASCII, the remedy
> is simple - don't use char[] and friends. From day 1, the type
> "char" was meant to mean "code unit of UTF characters".

Yes, but currently Phobos support for non-UTF strings is rather poor,
and requires many explicit casts to/from ubyte[].

> So please ponder the above before going to do surgery on the patient
> that's going to kill him.
[...]

Yeah I was surprised Walter was actually seriously going to pursue this.
It's a change of a far vaster magnitude than many of the other DIPs and
other proposals that have been rejected because they were deemed to
cause too much breakage of existing code.

T

-- 
Having a smoking section in a restaurant is like having a peeing section
in a swimming pool. -- Edward Burr