Range of chars (narrow string ranges)
Jonathan M Davis via Digitalmars-d
digitalmars-d at puremagic.com
Wed Apr 29 08:13:13 PDT 2015
On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
> This sounds like a good starting point for a transition plan.
> One important thing, though, would be to do some benchmarking
> with and without autodecoding, to see if it really boosts
> performance in a way that would justify the transition.
Well, personally, I think that it's worth it even if the
performance is identical (and it's a guarantee that it's going to
be better without autodecoding - it's just a question of how much
better - since it's going to have less work to do without
autodecoding). Simply operating at the code point level like we
do now is the worst of all worlds in terms of flexibility and
correctness. As long as the Unicode is normalized, operating at
the code unit level is the most efficient, and decoding is often
unnecessary for correctness, and if you need to decode, then you
really need to go up to the grapheme level in order to be
operating on the full character, meaning that operating on code
points really has the same problems as operating on code units as
far as correctness goes. So, operating on code points is less
performant without actually being correct. It just gives the
illusion of correctness.
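
To make the three levels concrete, here is a minimal sketch, assuming
a current Phobos where autodecoding is still in place. The names
Counts and countLevels are hypothetical and exist only for this
illustration. It counts the same string at the code unit, code point,
and grapheme level:

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

/// Hypothetical helper type: how long a string is at each level.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

/// Hypothetical helper: count a UTF-8 string at all three levels.
Counts countLevels(string s)
{
    return Counts(
        s.byCodeUnit.walkLength, // UTF-8 code units (same as s.length)
        s.walkLength,            // code points, via autodecoding
        s.byGrapheme.walkLength  // graphemes (user-perceived characters)
    );
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    auto c = countLevels("e\u0301");
    assert(c.codeUnits == 3);  // 1 byte for 'e', 2 bytes for U+0301
    assert(c.codePoints == 2); // still not one "character"
    assert(c.graphemes == 1);
    writefln("%s code units, %s code points, %s graphemes",
             c.codeUnits, c.codePoints, c.graphemes);
}
```

The code point count is the one that autodecoding gives by default,
and it matches neither the storage size nor the number of characters
that a user would see.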
By treating strings as ranges of code units, you don't take a
performance hit when you don't need to, and it forces you to
actually consider something like byDchar or byGrapheme if you
want to operate on full Unicode characters. It's similar to how
operating on UTF-16 code units as if they were characters (as
Java and C# generally do) frequently gives the incorrect
impression that you're handling Unicode correctly, because you
have to work harder at coming up with characters that can't fit
in a single code unit, whereas with UTF-8, anything but ASCII is
screwed if you treat code units as code points. Treating code
points as if they were full characters like we're doing now in
Phobos with ranges just makes it that much harder to notice that
you're not handling Unicode correctly.
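
A small sketch of why UTF-16 hides the problem longer than UTF-8
does. It only uses std.utf.codeLength; the helper fitsInOneUnit is a
hypothetical name made up for this example:

```d
import std.stdio : writefln;
import std.utf : codeLength;

/// Hypothetical helper: does code point `c` fit in a single code
/// unit of type `C` (char for UTF-8, wchar for UTF-16)?
bool fitsInOneUnit(C)(dchar c)
{
    return codeLength!C(c) == 1;
}

void main()
{
    // ASCII fits in one code unit in both encodings.
    assert(fitsInOneUnit!char('a'));
    assert(fitsInOneUnit!wchar('a'));

    // U+00E9 (e with acute) already breaks the UTF-8 assumption,
    // but still looks fine in UTF-16.
    assert(!fitsInOneUnit!char('\u00E9'));
    assert(fitsInOneUnit!wchar('\u00E9'));

    // U+1F600 is outside the BMP, so UTF-16 needs a surrogate pair.
    assert(!fitsInOneUnit!wchar('\U0001F600'));

    foreach (dchar c; "a\u00E9\U0001F600"d)
        writefln("U+%04X: %s UTF-8 unit(s), %s UTF-16 unit(s)",
                 cast(uint) c, codeLength!char(c), codeLength!wchar(c));
}
```

Test data has to include something outside the BMP before the UTF-16
assumption fails, whereas with UTF-8 any non-ASCII text exposes it.
Code points have the same weakness one level up: it takes combining
characters to expose it.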
Also, treating strings as ranges of code units makes it so that
they're not so special and actually are treated like every other
type of array, which eliminates a lot of the special casing that
we're forced to do right now, and it eliminates all of the
confusion that folks keep running into when string doesn't work
with many functions, because it's not a random-access range or
doesn't have length, or because the resulting range isn't the
same type (copy would be a prime example of a function that
doesn't work with char[] when it should). By leaving in
autodecoding, we're basically leaving in technical debt in D
permanently. We'll forever have to be explaining it to folks and
forever have to be working around it in order to achieve either
performance or correctness.
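
The special-casing shows up directly in the range traits. The
following is a sketch against a current Phobos with autodecoding
enabled (the alias CodeUnits is just a name chosen for this example).
It compares string with an ordinary array and with the explicit code
unit view from std.utf.byCodeUnit:

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.utf : byCodeUnit;

// As a range, string yields decoded code points rather than its own
// char elements, so it loses length and random access.
static assert(is(ElementType!string == dchar));
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// Any other array behaves the way you would expect.
static assert(is(ElementType!(int[]) == int));
static assert(hasLength!(int[]));
static assert(isRandomAccessRange!(int[]));

// Asking for code units explicitly gives the array-like behavior back.
alias CodeUnits = typeof("".byCodeUnit);
static assert(hasLength!CodeUnits);
static assert(isRandomAccessRange!CodeUnits);
static assert(is(immutable(ElementType!CodeUnits) == immutable(char)));

void main()
{
    auto r = "hello".byCodeUnit;
    assert(r.length == 5); // O(1) length
    assert(r[1] == 'e');   // O(1) indexing
}
```

Generic code that requires length or random access rejects string
unless it special-cases narrow strings, which is exactly the sort of
workaround that would go away if strings were ranges of code units.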
What we have now isn't performant, correct, or flexible, and
we'll be forever paying for that if we don't get rid of
autodecoding.
I don't criticize Andrei in the least for coming up with it,
since if you don't take graphemes into account (and he didn't
know about them at the time), it seems like a great idea and
allows us to be correct by default and performant if we put some
effort into it, but after having seen how it's worked out, how much
code has to be special-cased, how much confusion there is over
it, and how it's not actually correct anyway, I think that it's
quite clear that autodecoding was a mistake. And at this point,
it's mainly a question of how we can get rid of it without being
too disruptive and whether we can convince Andrei that it makes
sense to make the change, since he seems to still think that
autodecoding is fine in spite of the fact that it's neither
performant nor correct.
It may be that the decision will be that it's too disruptive to
remove autodecoding, but I think that that's really a question of
whether we can find a way to do it that doesn't break tons of
code rather than whether it's worth the performance or
correctness gain.
- Jonathan M Davis