Major performance problem with std.array.front()

Wed Mar 12 23:08:59 PDT 2014

On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
> Is there any hope of fixing this?

I agree with Andrei. I don't think that there's really anything to fix. The 
problem is that there are roughly three levels at which string operations can 
be done:

1. By code unit
2. By code point
3. By grapheme

and which is correct depends on what you're trying to do. Phobos attempts to 
go for correctness by default without seriously impacting performance, so it 
treats all strings as ranges of dchar (so, level #2). If we went with #1, then 
pretty much any algorithm which operated on individual characters would be 
broken, as unless your strings are ASCII-only, code units are very much the 
wrong level to be operating on if you're trying to deal with characters. If we 
went with #3, then we'd have full correctness, but we'd tank performance. With 
#2, we're far more correct than is typically the case with C++ while still 
being reasonably performant. And those who want full performance can use 
immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme 
support in std.uni.
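
To make the three levels concrete, here is a minimal sketch. It assumes a 
Phobos that has the grapheme support in std.uni (byGrapheme) and that decodes 
narrow strings to dchar in range functions, as described above. The helper 
names codeUnits, codePoints, and graphemes are made up for illustration; they 
are not Phobos functions.

```d
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

// Hypothetical helpers, named for illustration only.

// Level 1: count code units by viewing the string as immutable(ubyte)[].
size_t codeUnits(string s) { return s.representation.length; }

// Level 2: count code points; range functions see a string as a range of dchar.
size_t codePoints(string s) { return s.walkLength; }

// Level 3: count graphemes (user-perceived characters).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one visible character.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePoints(s) == 2); // 'e' and the combining accent
    assert(graphemes(s) == 1);  // a single accented letter
}
```

The same string gives three different answers, and which one is "right" 
depends entirely on what the caller is trying to do.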

We've gone to great lengths in Phobos to specialize on narrow strings in order 
to make it more efficient while still maintaining correctness, and anyone who 
really wants performance can do the same. But by operating on the code point 
level, we at least get a reasonable level of Unicode correctness by default. 
With your suggestion, I'd fully expect most D programs to be wrong with 
regard to Unicode, because most programmers don't know or care about how 
Unicode works. And changing what we're doing now would be code breakage of 
astronomical proportions. It will essentially break all uses of range-based 
string code. Certainly, it would be the largest code breakage that D has seen 
in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we 
need to be darn sure that what we change to is significantly better and worth 
the huge amount of code breakage that it will cause.
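
As a rough illustration of the kind of narrow-string specialization mentioned 
above, here is a sketch of a generic function that drops down to code units 
when that is safe. It is not code from Phobos; countAscii is a hypothetical 
name, and the approach relies on the fact that in UTF-8 and UTF-16 a code unit 
below 0x80 always stands for that ASCII character and never appears inside a 
longer sequence.

```d
import std.range : isInputRange;
import std.string : representation;
import std.traits : isNarrowString;

// Hypothetical example, not a Phobos function: counts occurrences of an
// ASCII character in any input range of characters.
size_t countAscii(R)(R r, char c)
    if (isInputRange!R)
{
    assert(c < 0x80, "c must be ASCII");
    size_t n;

    static if (isNarrowString!R)
    {
        // Narrow string: scan raw code units and skip decoding entirely.
        foreach (unit; r.representation)
            if (unit == c)
                ++n;
    }
    else
    {
        // Anything else (dstring, other ranges): compare element by element.
        foreach (e; r)
            if (e == c)
                ++n;
    }
    return n;
}

void main()
{
    // "\u00E9" is a non-ASCII letter, so the UTF-8 input is not ASCII-only.
    assert(countAscii("caf\u00E9 au lait", 'a') == 3);
    assert(countAscii("caf\u00E9 au lait"w, 'a') == 3);
    assert(countAscii("caf\u00E9 au lait"d, 'a') == 3);
}
```

The caller gets the same result for every string type, and only the 
implementation needs to know that the narrow-string branch avoids decoding.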

I really don't think that there's any way to get this right. Regardless of 
which level you operate at by default - be it code unit, code point, or 
grapheme - it will be wrong a good chunk of the time. So, it becomes a 
question of which of the three has the best tradeoffs, and I think that our 
current solution of operating on code points by default does that. If there 
are things that we can do to better support operating on code units or 
graphemes for those who want it, then great. And it's great if we can find 
ways to make operating at the code point level more efficient or less prone to 
bugs due to not operating at the grapheme level. But I think that operating on 
the code point level like we currently do is by far the best approach.

If anything, the bigger concern IMHO is that the language itself doesn't 
operate on code points by default - the main place where that's an issue being 
that foreach iterates by code unit by default. But I don't know of a good way 
to solve that other than treating all arrays of char, wchar, and dchar 
specially and disabling their array operations, as the range functions already 
do, so that you have to convert them via the representation function in order 
to operate on them as code units - which Andrei has suggested a number of 
times before, but you've shot him down each time. If that were fixed, then at 
least we'd be consistent, and inconsistency is usually the biggest complaint 
with regard to how D treats strings. But I really don't think that there's a 
magical fix for range-based string operations, and I think that our current 
approach is a good one.
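
For anyone who hasn't run into the inconsistency, here is a small sketch of 
it. The helper names unitsViaForeach and pointsViaForeach are made up for 
illustration.

```d
import std.range : ElementType, front;

// Hypothetical helper: a plain foreach over a string yields char (code units).
size_t unitsViaForeach(string s)
{
    size_t n;
    foreach (c; s)
        ++n;
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode code points.
size_t pointsViaForeach(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "\u00E9"; // one code point, two UTF-8 code units

    assert(unitsViaForeach(s) == 2);  // the language iterates by code unit
    assert(pointsViaForeach(s) == 1); // unless dchar is requested explicitly

    // The range primitives, by contrast, treat string as a range of dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.front == '\u00E9');
}
```

So the same string is a sequence of char to foreach and a sequence of dchar to 
the range functions, and that mismatch is what people complain about.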

- Jonathan M Davis