Range of chars (narrow string ranges)

Chris via Digitalmars-d <digitalmars-d at puremagic.com>
Wed Apr 29 09:01:42 PDT 2015


On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis 
wrote:
> On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
>> This sounds like a good starting point for a transition plan. 
>> One important thing, though, would be to do some benchmarking 
>> with and without autodecoding, to see if it really boosts 
>> performance in a way that would justify the transition.
>
> Well, personally, I think that it's worth it even if the 
> performance is identical (and it's guaranteed to be better 
> without autodecoding - the only question is how much better - 
> since there's simply less work to do without autodecoding). 
> Simply operating at the code point level like we 
> do now is the worst of all worlds in terms of flexibility and 
> correctness. As long as the Unicode text is normalized, operating at 
> the code unit level is the most efficient, and decoding is 
> often unnecessary for correctness, and if you need to decode, 
> then you really need to go up to the grapheme level in order to 
> be operating on the full character, meaning that operating on 
> code points really has the same problems as operating on code 
> units as far as correctness goes. So, it's less performant 
> without actually being correct. It just gives the illusion of 
> correctness.
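The code point vs. grapheme distinction described above is easy to check in any Unicode-aware language; here is a minimal sketch of the counts in Python (used only because the Unicode arithmetic is language-independent):

```python
import unicodedata

s = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme on screen
print(len(s))                  # 2 code points
print(len(s.encode("utf-8")))  # 3 UTF-8 code units
# NFC normalization collapses the pair into the single code point U+00E9
print(len(unicodedata.normalize("NFC", s)))  # 1 code point
```

A range of code points still splits this character in two, which is why slicing or reversing at the code point level can corrupt text; in Phobos, std.uni.byGrapheme is the level at which such characters stay whole.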
>
> By treating strings as ranges of code units, you don't take a 
> performance hit when you don't need to, and it forces you to 
> actually consider something like byDchar or byGrapheme if you 
> want to operate on full, Unicode characters. It's similar to 
> how operating on UTF-16 code units as if they were characters 
> (as Java and C# generally do) frequently gives the incorrect 
> impression that you're handling Unicode correctly, because you 
> have to work harder at coming up with characters that can't fit 
> in a single code unit, whereas with UTF-8, anything but ASCII 
> is screwed if you treat code units as code points. Treating 
> code points as if they were full characters like we're doing 
> now in Phobos with ranges just makes it that much harder to 
> notice that you're not handling Unicode correctly.
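The UTF-16 pitfall mentioned above (in Java and C#) comes from surrogate pairs; a short sketch, again in Python, of the code unit counts for a character outside the Basic Multilingual Plane:

```python
s = "\U0001F600"  # an emoji beyond U+FFFF
print(len(s))                           # 1 code point
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
print(len(s.encode("utf-8")))           # 4 UTF-8 code units
```

Because most common text fits in a single UTF-16 code unit, the breakage only surfaces with characters like this one, whereas in UTF-8 any non-ASCII character immediately spans multiple code units.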
>
> Also, treating strings as ranges of code units makes it so that 
> they're not so special and actually are treated like every 
> other type of array, which eliminates a lot of the special 
> casing that we're forced to do right now, and it eliminates all 
> of the confusion that folks keep running into when string 
> doesn't work with many functions, because it's not a 
> random-access range or doesn't have length, or because the 
> resulting range isn't the same type (copy would be a prime 
> example of a function that doesn't work with char[] when it 
> should). By leaving in autodecoding, we're basically leaving in 
> technical debt in D permanently. We'll forever have to be 
> explaining it to folks and forever have to be working around it 
> in order to achieve either performance or correctness.
>
> What we have now isn't performant, correct, or flexible, and 
> we'll be forever paying for that if we don't get rid of 
> autodecoding.
>
> I don't criticize Andrei in the least for coming up with it, 
> since if you don't take graphemes into account (and he didn't 
> know about them at the time), it seems like a great idea and 
> allows us to be correct by default and performant if we put 
> some effort into it, but after having seen how it's worked out, 
> how much code has to be special-cased, how much confusion there 
> is over it, and how it's not actually correct anyway, I think 
> that it's quite clear that autodecoding was a mistake. And at 
> this point, it's mainly a question of how we can get rid of it 
> without being too disruptive and whether we can convince Andrei 
> that it makes sense to make the change, since he seems to still 
> think that autodecoding is fine in spite of the fact that it's 
> neither performant nor correct.
>
> It may be that the decision will be that it's too disruptive to 
> remove autodecoding, but I think that that's really a question 
> of whether we can find a way to do it that doesn't break tons 
> of code rather than whether it's worth the performance or 
> correctness gain.
>
> - Jonathan M Davis

Ok, I see. Well, if we don't want to repeat C++'s mistakes, we 
should fix it before it's too late. Since I'm dealing a lot with 
non-ASCII strings and depend on Unicode (and correctness!), I 
would be more than happy to test any changes to Phobos with my 
programs to see if it screws up anything.
