Dicebot on leaving D: It is anarchy driven development in all its glory.
Jonathan M Davis
newsgroup.d at jmdavisprog.com
Sat Sep 8 20:00:32 UTC 2018
On Saturday, September 8, 2018 8:05:04 AM MDT Laeeth Isharc via Digitalmars-d
wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
> >> D makes the code-point case default and hence that becomes the
> >> simplest to use. But unfortunately, the only thing I can think of
> >> that requires code point representations is when dealing
> >> specifically with unicode algorithms (normalization, etc). Here's
> >> a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> >>
> >> tl;dr: application logic does not need or want to deal with
> >> code points. For speed units work, and for correctness,
> >> graphemes work.
> >
> > I think that it's pretty clear that code points are objectively
> > the worst level to be the default. Unfortunately, changing it
> > to _anything_ else is not going to be an easy feat at this
> > point. But if we can first ensure that Phobos in general
> > doesn't rely on it (i.e. in general, it can deal with ranges of
> > char, wchar, dchar, or graphemes correctly rather than assuming
> > that all ranges of characters are ranges of dchar), then maybe
> > we can figure something out. Unfortunately, while some work has
> > been done towards that, what's mostly happened is that folks
> > have complained about auto-decoding without doing much to
> > improve the current situation. There's a lot more to this than
> > simply ripping out auto-decoding even if every D user on the
> > planet agreed that outright breaking almost every existing D
> > program to get rid of auto-decoding was worth it. But as with
> > too many things around here, there's a lot more talking than
> > working. And actually, as such, I should probably stop
> > discussing this and go do something useful.
>
> A tutorial page linked from the front page with some examples
> would go a long way to making it easier for people. If I had
> time and understood strings enough to explain to others I would
> try to make a start, but unfortunately neither are true.

Writing up an article on proper Unicode handling in D is on my todo list,
but my todo list of things to do for D is long enough that I don't know when
I'm going to get to it.

> And if we are doing things right with RCString, then isn't it
> easier to make the change with that first - which is new so can't
> break code - and in some years when people are used to working
> that way update Phobos (compiler switch in beginning and have big
> transition a few years after that).

Well, I'm not actually convinced that what we have for RCString right now
_is_ doing the right thing, but even if it is, that doesn't fix the issue
that string doesn't do the right thing, and code needs to take that into
account - especially if it's generic code. The better job we do at making
Phobos code work with arbitrary ranges of characters, the less of an issue
that is, but you're still pretty much forced to deal with it in a number of
cases if you want your code to be efficient or if you want a function to be
able to accept a string and return a string rather than a wrapper range.
Using RCString in your code would reduce how much you had to worry about it,
but it doesn't completely solve the problem. And if you're doing stuff like
writing a library for other people to use, then you definitely can't just
ignore the issue. So, an RCString that handles Unicode sanely will
definitely help, but it's not really a fix. And plenty of code is still
going to be written to use strings (especially when -betterC is involved).
RCString is going to be another option, but it's not going to replace
string. Even if RCString became the most common string type to use (which I
doubt is ever going to happen), dynamic arrays of char, wchar, etc. are
still going to exist in the language and are still going to have to be
handled correctly.
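
To make the three levels concrete, here is a minimal sketch. It assumes a
Phobos release that still auto-decodes, and the sample string is only an
illustration. The same string has a different length depending on whether you
look at code units, code points, or graphemes:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a single user-perceived character built from two code points.
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units (1 + 2)
    assert(s.byCodeUnit.walkLength == 3); // explicit code-unit range, no decoding
    assert(s.walkLength == 2);            // range primitives auto-decode to code points
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```

Code that only uses s.length and indexing sees code units, while code that
goes through the range primitives sees code points, which is why generic code
has to care about which one it is getting.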
Phobos won't be able to assume that all of the code out there is using
RCString and not string. The combination of improving Phobos so that it
works properly with ranges of characters in general (and not just strings or
ranges of dchar) and having an alternate string type that does the right
thing will definitely help and need to be done if we have any hope of
actually removing auto-decoding, but even with all of that, I don't see how
it would be possible to really deprecate the old behavior. We _might_ be
able to do something if we're willing to deprecate std.algorithm and
std.range (since std.range gives you the current definitions of the range
primitives for arrays, and std.algorithm publicly imports std.range), but
you still then have the problem of two different definitions of the range
primitives for arrays and all of the problems that that causes (even if it's
only for the deprecation period). So, strings would end up behaving
drastically differently with range-based functions depending on which module
you imported. I don't know that that problem is insurmountable, but it's not
at all clear that there is a path to fixing auto-decoding that doesn't
outright break old code. If we're willing to break old code, then we could
definitely do it, but if we don't want to risk serious problems, we really
need a way to have a more gradual transition, and that's the big problem
that no one has a clean solution for.
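
As a small sketch of what the current definitions do (again assuming a Phobos
release with auto-decoding in place), the range primitives for arrays come
from std.range.primitives, which std.range publicly imports, and for narrow
strings they decode:

```d
import std.range.primitives : ElementEncodingType, ElementType, front, hasLength;
import std.utf : byCodeUnit;

// As a range, a string is a range of dchar, even though it is an
// array of immutable(char).
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));

// Narrow strings do not count as having length when treated as ranges.
static assert(!hasLength!string);

// Wrapping the string with byCodeUnit gives a range of code units instead.
static assert(is(ElementType!(typeof("".byCodeUnit())) == immutable(char)));

void main()
{
    string s = "\u00E9";       // U+00E9, two UTF-8 code units
    assert(s.length == 2);
    assert(s[0] == 0xC3);      // indexing yields a code unit
    auto c = s.front;          // front decodes a whole code point
    static assert(is(typeof(c) == dchar));
    assert(c == '\u00E9');
}
```

Any replacement set of primitives would have to disagree with these
assertions for string, which is where the two-definitions problem comes from.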
> Isn't this one of the challenges created by the tension between D
> being both a high-level and low-level language? The higher the
> aim, the more problems you will encounter getting there. That's
> okay.
>
> And isn't the obstacle to breaking auto-decoding because it seems
> to be a monolithic challenge of overwhelming magnitude, whereas
> if we could figure out some steps to eat the elephant one
> mouthful at a time (which might mean start with RCString) then it
> will seem less intimidating? It will take years anyway perhaps -
> but so what?

Well, I think that it's clear at this point that before we can even consider
getting rid of auto-decoding, we need to make sure that Phobos in general
works with arbitrary ranges of code units, code points, and graphemes. With
that done, we would have a standard library that could work with strings as
ranges of code units if that's what they were. So, in theory, at that point,
the only issue would be how on earth to make strings work as ranges of code
units without just pulling the rug out from under everyone. I'm not at all
convinced that that's possible, but I am very much convinced that unless we
first improve Phobos so that it's fully correct in spite of the
auto-decoding issues, we definitely can't remove auto-decoding. And as a
group, we haven't done a good enough job with that. Most of us agree that
auto-decoding was a huge mistake, but there hasn't been enough work done
towards fixing what we have, and there's plenty of work there that needs to
be done whether we later try to remove auto-decoding or not.
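
As an illustration of the kind of generic code meant here, below is a sketch
of a function that accepts any input range of characters rather than assuming
a range of dchar. countSpaces is a hypothetical helper made up for this
example, not anything in Phobos:

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.traits : isSomeChar;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: counts ASCII spaces in any input range whose
/// elements are char, wchar, or dchar.
size_t countSpaces(R)(R r)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    for (; !r.empty; r.popFront())
    {
        // An ASCII code unit never occurs inside a multi-unit UTF-8 or
        // UTF-16 sequence, so this comparison is safe at the code-unit level.
        if (r.front == ' ')
            ++n;
    }
    return n;
}

void main()
{
    string s = "a b c";
    assert(countSpaces(s) == 2);                   // auto-decoded: range of dchar
    assert(countSpaces(s.byCodeUnit) == 2);        // range of char
    assert(countSpaces("a b c"w.byCodeUnit) == 2); // range of wchar
    assert(countSpaces(s.byDchar) == 2);           // explicit decoding to dchar
}
```

Because the function never assumes a particular element type, it would keep
working unchanged if strings themselves became ranges of code units.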
- Jonathan M Davis