Range of chars (narrow string ranges)
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Wed Apr 29 09:01:42 PDT 2015
On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis
wrote:
> On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
>> This sounds like a good starting point for a transition plan.
>> One important thing, though, would be to do some benchmarking
>> with and without autodecoding, to see if it really boosts
>> performance in a way that would justify the transition.
>
> Well, personally, I think that it's worth it even if the
> performance were identical - and it's guaranteed to be better
> without autodecoding, since there's simply less work to do;
> it's just a question of how much better. Operating at the code
> point level, as we do now, is the worst of all worlds in terms
> of flexibility and correctness. As long as the text is
> normalized Unicode, operating at the code unit level is the
> most efficient, and decoding is often unnecessary for
> correctness. And if you do need to decode, then you really
> need to go up to the grapheme level in order to operate on the
> full character, meaning that operating on code points has the
> same correctness problems as operating on code units. So, it's
> less performant without actually being correct. It just gives
> the illusion of correctness.
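[To make the code unit / code point / grapheme distinction concrete, here is a small illustration in Python rather than D; the combining-mark example is the editor's, not from the post:]

```python
# A user-perceived character ("e" + combining acute accent) that
# spans two code points. Operating per code point splits it apart,
# just as operating per code unit splits a multi-byte sequence.
s = "e\u0301"              # "é" in decomposed (NFD) form

# Code unit level (UTF-8 bytes): 3 units for 1 visible character.
assert len(s.encode("utf-8")) == 3

# Code point level: 2 code points for 1 visible character, so
# reversing per code point puts the accent on the wrong letter side.
assert len(s) == 2
assert s[::-1] == "\u0301e"    # accent now precedes the base letter

# The grapheme level is what matches the user's idea of a
# "character"; Python's stdlib doesn't segment graphemes, which is
# exactly why decoding to code points alone doesn't buy correctness.
```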
>
> By treating strings as ranges of code units, you don't take a
> performance hit when you don't need to, and it forces you to
> actually consider something like byDchar or byGrapheme if you
> want to operate on full Unicode characters. It's similar to
> how operating on UTF-16 code units as if they were characters
> (as Java and C# generally do) frequently gives the false
> impression that you're handling Unicode correctly: you have to
> work harder to come up with characters that don't fit in a
> single UTF-16 code unit, whereas with UTF-8, anything but
> ASCII breaks if you treat code units as code points. Treating
> code points as if they were full characters, as we're doing
> now in Phobos with ranges, just makes it that much harder to
> notice that you're not handling Unicode correctly.
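[The UTF-16 vs. UTF-8 point can be shown numerically; a short Python sketch from the editor, not from the post:]

```python
# UTF-16 hides the problem: most characters fit in one code unit,
# so code-unit-as-character code *seems* correct until an astral
# character (e.g. an emoji) needs a surrogate pair.
emoji = "\U0001F600"                    # one code point, one grapheme
utf16_units = len(emoji.encode("utf-16-le")) // 2
assert utf16_units == 2                 # surrogate pair: the bug surfaces late

# UTF-8 surfaces it immediately: any non-ASCII character is multi-byte.
word = "naïve"                          # 5 characters
assert len(word) == 5
assert len(word.encode("utf-8")) == 6   # but 6 UTF-8 code units
```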
>
> Also, treating strings as ranges of code units makes it so
> that they're not so special and are treated like every other
> type of array. That eliminates a lot of the special casing
> that we're forced to do right now, and it eliminates the
> confusion that folks keep running into when string doesn't
> work with many functions - because it's not a random-access
> range, or doesn't have length, or because the resulting range
> isn't the same type (copy is a prime example of a function
> that doesn't work with char[] when it should). By leaving
> autodecoding in, we're leaving permanent technical debt in D.
> We'll forever have to explain it to folks and forever have to
> work around it in order to achieve either performance or
> correctness.
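[As a loose analogy (the editor's, not the post's), Python 3's split between bytes and str works the way the post wants D ranges to: each level of iteration is an explicit opt-in rather than a silent default:]

```python
text = "héllo"

# Code unit level (roughly like D's byCodeUnit): raw UTF-8 bytes,
# no hidden decoding and no performance hit.
units = text.encode("utf-8")
assert len(units) == 6           # 5 characters, 6 code units

# Code point level (roughly like byDchar): an explicit, separate
# view, chosen only when decoding is actually wanted.
codepoints = [ord(c) for c in text]
assert len(codepoints) == 5
assert codepoints[1] == 0xE9     # 'é'
```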
>
> What we have now isn't performant, correct, or flexible, and
> we'll be forever paying for that if we don't get rid of
> autodecoding.
>
> I don't criticize Andrei in the least for coming up with it.
> If you don't take graphemes into account (and he didn't know
> about them at the time), it seems like a great idea: it allows
> us to be correct by default and performant if we put some
> effort into it. But after having seen how it's worked out -
> how much code has to be special-cased, how much confusion
> there is over it, and how it's not actually correct anyway - I
> think it's quite clear that autodecoding was a mistake. At
> this point, it's mainly a question of how we can get rid of it
> without being too disruptive, and whether we can convince
> Andrei that the change makes sense, since he still seems to
> think that autodecoding is fine despite the fact that it's
> neither performant nor correct.
>
> It may be that the decision will be that it's too disruptive
> to remove autodecoding, but I think that's really a question
> of whether we can find a way to do it that doesn't break tons
> of code, rather than of whether the performance and
> correctness gains are worth it.
>
> - Jonathan M Davis
Ok, I see. Well, if we don't want to repeat C++'s mistakes, we
should fix it before it's too late. Since I deal a lot with
non-ASCII strings and depend on Unicode (and correctness!), I
would be more than happy to test any changes to Phobos with my
programs to see if they screw anything up.