Major performance problem with std.array.front()
Jonathan M Davis
jmdavisProg at gmx.com
Wed Mar 12 23:08:59 PDT 2014
On Thursday, March 06, 2014 18:37:13 Walter Bright wrote:
> Is there any hope of fixing this?
I agree with Andrei. I don't think that there's really anything to fix. The
problem is that there are roughly three levels at which string operations can
be done:
1. By code unit
2. By code point
3. By grapheme
and which is correct depends on what you're trying to do. Phobos attempts to
go for correctness by default without seriously impacting performance, so it
treats all strings as ranges of dchar (so, level #2). If we went with #1, then
pretty much any algorithm which operated on individual characters would be
broken, as unless your strings are ASCII-only, code units are very much the
wrong level to be operating on if you're trying to deal with characters. If we
went with #3, then we'd have full correctness, but we'd tank performance. With
#2, we're far more correct than is typically the case with C++ while still
being reasonably performant. And those who want full performance can use
immutable(ubyte)[] to get #1, and those who want #3 can use the grapheme
support in std.uni.
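
To make the three levels concrete, here is a minimal sketch (not code from
Phobos itself) that counts the same string at each level. The helper names
codeUnits, codePoints, and graphemes are made up for illustration; the
Phobos calls used are std.string.representation, std.range.walkLength, and
std.uni.byGrapheme.

```d
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level.
size_t codeUnits(string s)  { return s.representation.length; } // level #1
size_t codePoints(string s) { return s.walkLength; }            // level #2
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // level #3

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a single user-perceived character.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(codePoints(s) == 2); // walkLength decodes narrow strings
    assert(graphemes(s) == 1);  // the accent attaches to the 'e'
}
```

For ASCII-only text all three counts agree, which is why code that is only
correct at level #1 tends to go unnoticed until non-ASCII input shows up.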
We've gone to great lengths in Phobos to specialize on narrow strings in order
to make it more efficient while still maintaining correctness, and anyone who
really wants performance can do the same. But by operating on the code point
level, we at least get a reasonable level of Unicode correctness by default.
With your suggestion, I'd fully expect most D programs to be wrong with
regards to Unicode, because most programmers don't know or care about how
Unicode works. And changing what we're doing now would be code breakage of
astronomical proportions. It will essentially break all uses of range-based
string code. Certainly, it would be the largest code breakage that D has seen
in years, if not ever. So, it's almost certainly a bad idea, but if it isn't, we
need to be darn sure that what we change to is significantly better and worth
the huge amount of code breakage that it will cause.
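
As an illustration of what specializing on narrow strings can look like, here
is a rough sketch in the same spirit (countSpaces is a hypothetical helper,
not a Phobos function). It assumes that the thing being searched for is ASCII:
in UTF-8 and UTF-16, the code unit 0x20 only ever encodes U+0020, so the narrow
string branch can skip decoding and still give the right answer.

```d
import std.range : isInputRange;
import std.string : representation;
import std.traits : isNarrowString;

// Hypothetical helper: counts ASCII spaces in any range of characters.
size_t countSpaces(R)(R r)
    if (isInputRange!R)
{
    size_t n;
    static if (isNarrowString!R)
    {
        // Fast path: scan raw code units, no decoding.
        // Safe only because ' ' is ASCII.
        foreach (cu; r.representation)
            if (cu == ' ') ++n;
    }
    else
    {
        // Generic path: iterate the range element by element.
        foreach (c; r)
            if (c == ' ') ++n;
    }
    return n;
}

void main()
{
    assert(countSpaces("a b c") == 2);             // string, fast path
    assert(countSpaces("\u00E9 \u00E9") == 1);     // non-ASCII neighbors
    assert(countSpaces("\u00E9 \u00E9 x"d) == 2);  // dstring, generic path
}
```

The caller never sees the difference; the result is the same as decoding every
code point, just without paying for the decoding.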
I really don't think that there's any way to get this right. Regardless of
which level you operate at by default - be it code unit, code point, or
grapheme - it will be wrong a good chunk of the time. So, it becomes a
question of which of the three has the best tradeoffs, and I think that our
current solution of operating on code points by default does that. If there
are things that we can do to better support operating on code units or
graphemes for those who want it, then great. And it's great if we can find
ways to make operating at the code point level more efficient or less prone to
bugs due to not operating at the grapheme level. But I think that operating on
the code point level like we currently do is by far the best approach.
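
A small sketch of one way the code point level is still wrong some of the time,
assuming the usual example of a precomposed versus a decomposed accented
letter. The helper names are made up; std.algorithm.equal and
std.uni.normalize are the real Phobos functions being used.

```d
import std.algorithm : equal;
import std.uni : NFC, normalize;

// Hypothetical helpers for the comparison.
bool sameCodePoints(string a, string b)
{
    // equal compares element by element, i.e. dchar by dchar for strings.
    return equal(a, b);
}

bool sameAfterNFC(string a, string b)
{
    // Normalize both sides to NFC before comparing.
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "\u00E9";  // precomposed e with acute
    string decomposed = "e\u0301"; // 'e' + combining acute accent

    assert(!sameCodePoints(composed, decomposed)); // 1 code point vs 2
    assert(sameAfterNFC(composed, decomposed));    // same text to a reader
}
```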
If anything, the bigger concern IMHO is that the language itself doesn't
operate on code points - the main place where that's an issue being that
foreach iterates by code unit by default. But I don't know of a good way to
solve that other than treating all arrays of char, wchar, and dchar specially
and disabling their array operations like ranges do, so that you have to
convert them to code units via the representation function in order to operate
on them as code units - which Andrei has suggested a number of times before,
but you've shot him down each time. If that were fixed, then at least we'd be
consistent, which is usually the biggest complaint with regards to how D
treats strings. But I really don't think that there's a magical fix for
range-based string operations, and I think that our current approach is a good
one.
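
For anyone who hasn't run into the inconsistency, here is a minimal sketch of
it (the two counting helpers are hypothetical names). The language's foreach
and indexing see code units, while the range primitives from Phobos see code
points.

```d
import std.range : ElementType, front;

// As a range, a string has dchar elements.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: plain foreach, element type inferred as char.
size_t countByForeach(string s)
{
    size_t n;
    foreach (c; s) ++n;       // iterates code units
    return n;
}

// Hypothetical helper: asking for dchar makes foreach decode.
size_t countByDchar(string s)
{
    size_t n;
    foreach (dchar c; s) ++n; // iterates code points
    return n;
}

void main()
{
    string s = "\u00E9"; // encoded in UTF-8 as 0xC3 0xA9

    // As an array, a string has char elements.
    static assert(is(typeof(s[0]) == immutable(char)));

    assert(countByForeach(s) == 2);
    assert(countByDchar(s) == 1);

    assert(s[0] == 0xC3);    // indexing yields the first code unit
    assert(s.front == 0xE9); // front decodes the first code point
}
```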
- Jonathan M Davis