Range of chars (narrow string ranges)

Wed Apr 29 08:13:13 PDT 2015

On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
> This sounds like a good starting point for a transition plan. 
> One important thing, though, would be to do some benchmarking 
> with and without autodecoding, to see if it really boosts 
> performance in a way that would justify the transition.

Well, personally, I think that it's worth it even if the 
performance is identical (and it's guaranteed to be better 
without autodecoding, since there's less work to do - it's just a 
question of how much better). Simply operating at the code point 
level like we do now is the worst of all worlds in terms of 
flexibility and correctness. As long as the Unicode is 
normalized, operating at the code unit level is the most 
efficient, and decoding is often unnecessary for correctness. And 
if you do need to decode, then you really need to go up to the 
grapheme level in order to be operating on the full character, 
meaning that operating on code points has the same correctness 
problems as operating on code units. So, operating on code points 
is less performant without actually being correct. It just gives 
the illusion of correctness.
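
To make the three levels concrete, here's a minimal sketch. The 
helper names are made up for illustration; byCodeUnit and byDchar 
come from std.utf, and byGrapheme comes from std.uni. It counts the 
same string at each level, decoding explicitly with byDchar so 
that the result doesn't depend on autodecoding:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helpers, named only for this example.
size_t countCodeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t countCodePoints(string s) { return s.byDchar.walkLength; }
size_t countGraphemes(string s)  { return s.byDchar.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one character on screen, but not one code point.
    string s = "e\u0301";

    assert(countCodeUnits(s) == 3);  // 1 byte for 'e' + 2 bytes for U+0301
    assert(countCodePoints(s) == 2); // decoding still splits the character
    assert(countGraphemes(s) == 1);  // only the grapheme level sees one unit
}
```

Decoding to code points gets you from 3 to 2, which is still not 
the 1 that you'd want if you were counting characters.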

By treating strings as ranges of code units, you don't take a 
performance hit when you don't need to, and it forces you to 
actually consider something like byDchar or byGrapheme if you 
want to operate on full Unicode characters. It's similar to how 
operating on UTF-16 code units as if they were characters (as 
Java and C# generally do) frequently gives the incorrect 
impression that you're handling Unicode correctly, because you 
have to work harder at coming up with characters that can't fit 
in a single code unit, whereas with UTF-8, anything but ASCII 
breaks right away if you treat code units as code points. Treating code 
points as if they were full characters like we're doing now in 
Phobos with ranges just makes it that much harder to notice that 
you're not handling Unicode correctly.
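
As a rough illustration of the UTF-16 vs. UTF-8 point (the 
specific characters are just ones that I picked for the example), 
a precomposed accented letter already breaks the "one code unit 
per character" assumption in UTF-8, whereas in UTF-16 you have to 
go outside the Basic Multilingual Plane before it breaks:

```d
void main()
{
    // U+00E9 (precomposed e with acute accent)
    string  accent8  = "\u00E9";
    wstring accent16 = "\u00E9"w;
    assert(accent8.length == 2);  // UTF-8: already two code units
    assert(accent16.length == 1); // UTF-16: still looks like "one char"

    // U+1F600 (an emoji outside the Basic Multilingual Plane)
    string  emoji8  = "\U0001F600";
    wstring emoji16 = "\U0001F600"w;
    assert(emoji8.length == 4);   // UTF-8: four code units
    assert(emoji16.length == 2);  // UTF-16: a surrogate pair
}
```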

Also, treating strings as ranges of code units makes it so that 
they're not so special and actually are treated like every other 
type of array, which eliminates a lot of the special casing that 
we're forced to do right now, and it eliminates all of the 
confusion that folks keep running into when string doesn't work 
with many functions, because it's not a random-access range or 
doesn't have length, or because the resulting range isn't the 
same type (copy would be a prime example of a function that 
doesn't work with char[] when it should). By leaving in 
autodecoding, we're basically leaving in technical debt in D 
permanently. We'll forever have to be explaining it to folks and 
forever have to be working around it in order to achieve either 
performance or correctness.
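
The special casing shows up directly in the range traits. This is 
a small sketch of how things look with autodecoding in place (the 
alias name is mine), compared with an ordinary array and with the 
range that byCodeUnit gives you:

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// With autodecoding, a string is a range of dchar with no length
// and no random access, even though it's an array underneath.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Any other array type behaves the way that you'd expect.
static assert(is(ElementType!(int[]) == int));
static assert(hasLength!(int[]));
static assert(isRandomAccessRange!(int[]));

// byCodeUnit gives back array-like behavior for strings.
alias CodeUnitRange = typeof("".byCodeUnit); // alias name is illustrative
static assert(is(Unqual!(ElementType!CodeUnitRange) == char));
static assert(hasLength!CodeUnitRange);
static assert(isRandomAccessRange!CodeUnitRange);

void main() {}
```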

What we have now isn't performant, correct, or flexible, and 
we'll be forever paying for that if we don't get rid of 
autodecoding.

I don't criticize Andrei in the least for coming up with it, 
since if you don't take graphemes into account (and he didn't 
know about them at the time), it seems like a great idea and 
allows us to be correct by default and performant if we put some 
effort into it, but after having seen how it's worked out, how much 
code has to be special-cased, how much confusion there is over 
it, and how it's not actually correct anyway, I think that it's 
quite clear that autodecoding was a mistake. And at this point, 
it's mainly a question of how we can get rid of it without being 
too disruptive and whether we can convince Andrei that it makes 
sense to make the change, since he seems to still think that 
autodecoding is fine in spite of the fact that it's neither 
performant nor correct.

It may be that the decision will be that it's too disruptive to 
remove autodecoding, but I think that that's really a question of 
whether we can find a way to do it that doesn't break tons of 
code rather than whether it's worth the performance or 
correctness gain.

- Jonathan M Davis