Major performance problem with std.array.front()
w0rp
devw0rp at gmail.com
Sun Mar 9 04:47:31 PDT 2014
On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
>
> I'm leaning the same way too. But I also think Andrei is right
> that, at this point in time, it'd be a terrible move to change
> things so that "by code unit" is default. For better or worse,
> that ship has sailed.
>
> Perhaps we *can* deal with the auto-decoding problem not by
> killing auto-decoding, but by marginalizing it in an additive
> way:
>
> Convincing arguments have been made that any string-processing
> code which *isn't* done entirely with the official Unicode
> algorithms is likely wrong *regardless* of whether
> std.algorithm defaults to per-code-unit or per-code-point.
>
> So...How's this?: We add any of these Unicode algorithms we may
> be missing, encourage their use for strings, discourage use of
> std.algorithm for string processing, and in the meantime, just
> do our best to reduce unnecessary decoding wherever possible.
> Then we call it a day and all be happy :)
I've been watching this discussion for the last few days, and I'm
kind of a nobody jumping in pretty late, but after thinking about
the problem for a while, I would agree on a solution along the
lines of what you have suggested.
I think Vladimir is definitely right when he says that when you
have algorithms that deal with natural languages, simply working
on the basis of a code unit isn't enough. I think it is also true
that you need to select a particular algorithm for dealing with
strings of characters, as there are many different algorithms you
can use for different languages which behave differently, perhaps
several within a single language. I also think Andrei is right
when he says we need to minimise code breakage, and that decoding
and encoding strings by default isn't the biggest of performance
problems.
I think our best option is to offer a function in std.array which
returns a range over the raw character data, without decoding to
code points.

myArray.someAlgorithm; // std.array's front is used today, with decode calls
myArray.rawData.someAlgorithm; // New range which doesn't decode.
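
A minimal sketch of what rawData might look like, assuming it
simply walks the string's UTF-8 code units (rawData and this exact
layout are my own invention, not an existing Phobos API):

import std.algorithm : count;

// Hypothetical rawData: expose a string's code units directly, so
// generic algorithms see chars and never auto-decode to dchar.
struct RawData
{
    immutable(char)[] data;
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

auto rawData(string s) { return RawData(s); }

void main()
{
    string s = "hëllo";           // 'ë' is two UTF-8 code units
    assert(s.rawData.count == 6); // counts code units, no decoding
    assert(s.count == 5);         // today's default: decoded dchars
}

For what it's worth, I believe std.string.representation already
gets you most of the way there today, by handing back the
underlying immutable(ubyte)[].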
Then we could look at creating algorithms for string processing
which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere;
// Range of strings, maybe a range of ranges of characters, not dchars
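
byNaturalSymbol is hypothetical, but Phobos's existing
std.uni.byGrapheme gives a feel for the kind of range such an
algorithm could return: one element per user-perceived character,
rather than per code point or code unit:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by a combining acute accent: one user-perceived
    // character, two code points, three UTF-8 code units.
    string s = "e\u0301";
    assert(s.length == 3);                // code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // grapheme clusters
}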
Or even specialise the new algorithm so it looks for arrays and
turns them into the ranges for you via the transformation myArray
-> myArray.rawData.
myArray.byNaturalSymbol!SomeIndianEncodingHere;
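
Here's a rough sketch of how that specialisation could look, using
std.string.representation as a stand-in for rawData (byNaturalSymbol
and SomeIndianEncodingHere are hypothetical, as above):

import std.string : representation;
import std.traits : isSomeString;

// Hypothetical core implementation, defined on ranges of code units.
auto byNaturalSymbol(alias Encoding, Range)(Range r)
    if (!isSomeString!Range)
{
    // Real segmentation logic for Encoding would go here; the
    // sketch just returns the range unchanged.
    return r;
}

// Hypothetical convenience overload: accept a string directly and
// strip the dchar abstraction before running the range version.
auto byNaturalSymbol(alias Encoding, S)(S s)
    if (isSomeString!S)
{
    return s.representation.byNaturalSymbol!Encoding;
}

struct SomeIndianEncodingHere {} // placeholder encoding policy

void main()
{
    auto symbols = "hello".byNaturalSymbol!SomeIndianEncodingHere;
}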
Honestly, I'd leave the details of such an algorithm to Vladimir
rather than to myself, because he's spent far more time looking
into Unicode processing than I have. My knowledge of Unicode
pretty much just comes from having to deal with foreign-language
customers and discovering the problems with the code unit
abstraction most languages seem to use. (Java and Python suffer
from similar issues, but they don't really have algorithms in the
way that we do.)
This new set of algorithms, taking settings for different
encodings, could first be implemented in a third-party library,
tested there, and eventually submitted to Phobos, probably in
std.string.
There's my input; I'll duck before I'm beheaded.