Major performance problem with std.array.front()

w0rp devw0rp at gmail.com
Sun Mar 9 04:47:31 PDT 2014


On Sunday, 9 March 2014 at 09:24:02 UTC, Nick Sabalausky wrote:
>
> I'm leaning the same way too. But I also think Andrei is right 
> that, at this point in time, it'd be a terrible move to change 
> things so that "by code unit" is default. For better or worse, 
> that ship has sailed.
>
> Perhaps we *can* deal with the auto-decoding problem not by 
> killing auto-decoding, but by marginalizing it in an additive 
> way:
>
> Convincing arguments have been made that any string-processing 
> code which *isn't* done entirely with the official Unicode 
> algorithms is likely wrong *regardless* of whether 
> std.algorithm defaults to per-code-unit or per-code-point.
>
> So...How's this?: We add any of these Unicode algorithms we may 
> be missing, encourage their use for strings, discourage use of 
> std.algorithm for string processing, and in the meantime, just 
> do our best to reduce unnecessary decoding wherever possible. 
> Then we call it a day and all be happy :)

I've been watching this discussion for the last few days. I'm kind of
a nobody jumping in pretty late, but after thinking about the problem
for a while, I think I'd agree with a solution along the lines of what
you have suggested.

I think Vladimir is definitely right that when you have algorithms
that deal with natural languages, simply working on the basis of a
code unit isn't enough. It is also true that you need to select a
particular algorithm for dealing with strings of characters, as there
are many different algorithms for different languages which behave
differently, perhaps several within a single language. I also think
Andrei is right that we need to minimise code breakage, and that
decoding and encoding strings by default isn't the biggest of the
performance problems.

I think our best option is to offer a function in std.array which
creates a range over the raw character data, without decoding to code
points.

myArray.someAlgorithm; // std.array.front used today, with decode calls
myArray.rawData.someAlgorithm; // New range which doesn't decode.
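
As a rough sketch of what rawData could look like (the name is just
lifted from the example above; std.string.representation already does
the underlying reinterpretation in Phobos, so the wrapper is thin):

import std.stdio;
import std.string : representation; // immutable(ubyte)[] for string

// Hypothetical rawData: expose a string's code units as a plain
// range, so the auto-decoding front from std.array never kicks in.
@property auto rawData(S)(S s) if (is(S : const(char)[]))
{
    // Reinterpret the characters as raw bytes; range algorithms
    // then iterate code units instead of decoded dchars.
    return s.representation;
}

void main()
{
    string myArray = "héllo";
    writeln(myArray.rawData.length); // 6 code units, nothing decoded
}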

Then we could look at creating algorithms for string processing 
which don't use the existing dchar abstraction.

myArray.rawData.byNaturalSymbol!SomeIndianEncodingHere;
// Range of strings, maybe a range of ranges of characters, not dchars.
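
For what it's worth, Phobos already has one algorithm of this shape in
std.uni.byGrapheme, which yields user-perceived characters (Unicode's
default grapheme clusters) instead of dchars; a byNaturalSymbol would
presumably look similar, just with pluggable language- or
encoding-specific segmentation rules:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' plus a combining acute accent: two code points which
    // render as a single user-perceived character.
    auto s = "e\u0301";
    writeln(s.walkLength);            // 2 decoded code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme cluster
}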

Or we could even specialise the new algorithm so it looks for arrays
and turns them into ranges for you, via the transformation myArray ->
myArray.rawData.

myArray.byNaturalSymbol!SomeIndianEncodingHere;
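
Here's a minimal sketch of that dispatch. Every name in it
(byNaturalSymbol, the Latin tag) is hypothetical, and the actual
segmentation is stubbed out as a pass-through, so it only shows the
array -> rawData forwarding:

import std.range : isInputRange;
import std.stdio;
import std.string : representation;
import std.traits : isNarrowString;

struct Latin {} // stand-in tag for an encoding/segmentation scheme

// General overload: real per-encoding segmentation would go here;
// a pass-through keeps the sketch compilable.
auto byNaturalSymbol(Encoding, R)(R range)
    if (isInputRange!R && !isNarrowString!R)
{
    return range;
}

// Array specialisation: performs the myArray -> myArray.rawData
// step on the caller's behalf.
auto byNaturalSymbol(Encoding, S)(S str) if (isNarrowString!S)
{
    return byNaturalSymbol!Encoding(str.representation);
}

void main()
{
    string myArray = "hello";
    auto symbols = myArray.byNaturalSymbol!Latin; // no manual .rawData
    writeln(symbols); // raw code units: [104, 101, 108, 108, 111]
}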

Honestly, I'd leave the details of such an algorithm to Vladimir
rather than myself, because he's spent far more time looking into
Unicode processing than I have. My knowledge of Unicode pretty much
just comes from having to deal with foreign-language customers and
discovering the problems with the code unit abstraction most languages
seem to use. (Java and Python suffer from similar issues, but they
don't really have algorithms in the way that we do.)

This new set of algorithms, parameterised over different encodings,
could first be implemented in a third-party library, tested there, and
eventually submitted to Phobos, probably in std.string.

There's my input; I'll duck before I'm beheaded.

