Dicebot on leaving D: It is anarchy driven development in all its glory.

Jonathan M Davis newsgroup.d at jmdavisprog.com
Sat Sep 8 19:34:32 UTC 2018


On Thursday, September 6, 2018 3:15:59 PM MDT aliak via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d
> > wrote:
> >> D makes the code-point case the default and hence that becomes the
> >> simplest to use. But unfortunately, the only thing I can think of
> >> that requires code point representations is when dealing
> >> specifically with Unicode algorithms (normalization, etc.). Here's
> >> a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
> >>
> >> tl;dr: application logic does not need or want to deal with code
> >> points. For speed, code units work, and for correctness, graphemes
> >> work.
> >
> > I think that it's pretty clear that code points are objectively
> > the worst level to be the default. Unfortunately, changing it
> > to _anything_ else is not going to be an easy feat at this
> > point. But if we can first ensure that Phobos in general
> > doesn't rely on it (i.e. in general, it can deal with ranges of
> > char, wchar, dchar, or graphemes correctly rather than assuming
> > that all ranges of characters are ranges of dchar), then maybe
> > we can figure something out. Unfortunately, while some work has
> > been done towards that, what's mostly happened is that folks
> > have complained about auto-decoding without doing much to
> > improve the current situation. There's a lot more to this than
> > simply ripping out auto-decoding even if every D user on the
> > planet agreed that outright breaking almost every existing D
> > program to get rid of auto-decoding was worth it. But as with
> > too many things around here, there's a lot more talking than
> > working. And actually, as such, I should probably stop
> > discussing this and go do something useful.
> >
> > - Jonathan M Davis
>
> Is there a unittest somewhere in Phobos you know of that one can be
> pointed to that shows the handling of these 4 variations you say
> should be dealt with first? Or maybe a PR that did some of this
> work that one could investigate?
>
> I ask so I can see in code what it means to make something not
> rely on auto-decoding and deal with ranges of char, wchar, dchar,
> or graphemes.
>
> Or a current "easy" bugzilla issue maybe that one could try a
> hand at?

Not really. The handling of this has generally been too ad hoc. There are
plenty of examples of handling different string types, and a few of handling
different ranges of character types, but there's a distinct lack of tests
involving graphemes. And the correct behavior for each function is going to
depend on what exactly that function does. For example, almost certainly,
the correct thing for filter to do is nothing special for ranges of
characters at all: it should just filter on the element type of the range
(even though it would almost always be incorrect to filter a range of char
unless it's known to be all ASCII). find, on the other hand, is clearly
designed to handle different encodings, so it needs to be able to find a
dchar or grapheme in a range of char. And of course, there's the issue of
how normalization should be handled (if at all).
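
To make the filter/find distinction concrete, here's a small sketch (my
illustration, not existing Phobos code):

import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.algorithm.searching : find;
import std.utf : byCodeUnit;

@safe unittest
{
    // filter operates purely on the element type. With byCodeUnit, that's
    // char, so the predicate sees raw UTF-8 code units. Both code units of
    // 'ö' are >= 0x80, so this predicate silently strips the character.
    auto stripped = "höfn".byCodeUnit.filter!(c => c < 0x80);
    assert(equal(stripped, "hfn".byCodeUnit));

    // find, on the other hand, bridges encodings: given a string (which
    // auto-decodes to dchar), it can locate a dchar needle.
    assert(find("höfn", 'ö') == "öfn");
}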

A number of the tests in std.utf and std.string do a good job of testing
Unicode strings of varying encodings, and std.utf does a good job overall of
testing ranges of char, wchar, and dchar that aren't strings. But I'm not
sure that anything in Phobos outside of std.uni currently does anything with
ranges of graphemes.
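
For reference, here's what a range of graphemes looks like in practice (a
minimal sketch of mine, not an existing Phobos test):

import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "é" as 'e' plus a combining acute accent: two code points, one grapheme.
    auto s = "e\u0301";
    assert(s.walkLength == 2);            // auto-decoded: two code points
    assert(s.byGrapheme.walkLength == 1); // a single grapheme
}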

std.conv.to does have some tests for ranges of char, wchar, and dchar due to
a bug fix, e.g.:

// bugzilla 15800: to!int must accept arbitrary ranges of char, wchar, and
// dchar, not just string types, both with and without an explicit radix.
@safe unittest
{
    import std.utf : byCodeUnit, byChar, byWchar, byDchar;

    assert(to!int(byCodeUnit("10")) == 10);
    assert(to!int(byCodeUnit("10"), 10) == 10);
    assert(to!int(byCodeUnit("10"w)) == 10);
    assert(to!int(byCodeUnit("10"w), 10) == 10);

    assert(to!int(byChar("10")) == 10);
    assert(to!int(byChar("10"), 10) == 10);
    assert(to!int(byWchar("10")) == 10);
    assert(to!int(byWchar("10"), 10) == 10);
    assert(to!int(byDchar("10")) == 10);
    assert(to!int(byDchar("10"), 10) == 10);
}

but there are no grapheme tests, and no Unicode characters are involved
(though I'm not sure that much in std.conv really needs to worry about
Unicode characters).
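
A grapheme variant of that test doesn't exist, and it's not obvious that it
even should. One way to probe the gap (my sketch, not a Phobos test):

import std.conv : to;
import std.uni : byGrapheme;

unittest
{
    // Whether to!int should accept a range of graphemes at all is an open
    // design question. This only checks whether it compiles today; if it
    // doesn't, that's a data point rather than automatically a bug.
    enum acceptsGraphemes = __traits(compiles, to!int("10".byGrapheme));
    pragma(msg, "to!int accepts a grapheme range: ", acceptsGraphemes);
}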

So, there are tests scattered all over the place which do pieces of what
they need to be doing, but I'm not sure that there are currently any that
test the full range of character ranges that they really need to be testing.
As with testing reference type ranges, such tests have generally been added
only when fixing a specific bug, and there hasn't been a sufficient effort
to just go through all of the affected functions and add appropriate tests.

And unfortunately, unlike with reference type ranges, the correct behavior
of a function when faced with ranges of different character types is going
to be highly dependent on what that function does. Some functions shouldn't
do anything special when processing ranges of characters; some shouldn't do
anything special for arbitrary ranges of characters but still need to
special-case strings because of the efficiency issues caused by
auto-decoding; and yet others need to actually take Unicode into account and
operate on each range differently depending on whether it's a range of code
units, code points, or graphemes.
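
For what it's worth, the middle category, special-casing strings purely to
dodge auto-decoding, tends to look something like this hypothetical helper
(rawLength is my name for illustration, not a Phobos function):

import std.range.primitives : isInputRange, walkLength;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

// Hypothetical sketch: the generic path just walks the range, while narrow
// strings are rewrapped with byCodeUnit so that counting their elements
// doesn't pay for auto-decoding.
size_t rawLength(R)(R r)
if (isInputRange!R)
{
    static if (isNarrowString!R)
        return r.byCodeUnit.walkLength; // counts code units, no decoding
    else
        return r.walkLength;            // counts whatever the elements are
}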

So, completely aside from auto-decoding issues, it's a bit of a daunting
task. I keep meaning to take the time to work on it. I've done some of the
critical work for supporting arbitrary ranges of char, wchar, and dchar
rather than just string types (as have some other folks), but I haven't
spent the time to go through the functions one by one and add the
appropriate tests and fixes, and no one else has gone that far either. So, I
can't really point towards a specific set of tests and say "here, do what
these do." And even if I could, whether what those tests do would be correct
for another function would depend on what that function does. So, sorry that
I can't be more helpful.

Actually, if you're looking for something related to this to do, and you
don't feel that you know enough to just start adding tests, you could try
byCodeUnit, byDchar, and byGrapheme with various functions and see what
happens. If the function doesn't even compile (which will probably be the
case at least some of the time), then that's an easy bug report. If the
function does compile, then knowing whether it's doing the right thing
requires a greater understanding, but in at least some cases, it may be
obvious, and if the result is obviously wrong, you can create a bug report
for that.
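
For instance, something along these lines, using startsWith purely as an
example target (my sketch, not existing test code):

import std.algorithm.searching : startsWith;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // Probe whether the function accepts each flavor of character range.
    // A compilation failure is an easy bug report on its own; if it does
    // compile, the harder question is whether the result is correct.
    writeln(__traits(compiles, "foo".byCodeUnit.startsWith('f')));
    writeln(__traits(compiles, "foo".byDchar.startsWith('f')));
    writeln(__traits(compiles, "foo".byGrapheme.startsWith('f')));
}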

Ultimately though, a pretty solid understanding of ranges and Unicode is
going to be required to write a lot of these tests. And worse, a pretty
solid understanding of ranges and Unicode is going to be required to use any
of these functions correctly even if they all work correctly and have all of
the necessary tests to prove it. Unicode is just plain too complicated, and
trying to make things "just work" with it is frequently difficult,
especially if efficiency matters. Even when efficiency doesn't matter, it's
not always obvious how to make it "just work." :(

- Jonathan M Davis




