More fun with autodecoding

Mon Sep 10 08:35:08 UTC 2018

On Saturday, September 8, 2018 9:36:25 AM MDT Steven Schveighoffer via 
Digitalmars-d wrote:
> On 8/9/18 2:44 AM, Walter Bright wrote:
> > On 8/8/2018 2:01 PM, Steven Schveighoffer wrote:
> >> Here's where I'm struggling -- because a string provides indexing,
> >> slicing, length, etc. but Phobos ignores that. I can't make a new type
> >> that does the same thing. Not only that, but I'm finding the
> >> specializations of algorithms only work on the type "string", and
> >> nothing else.
> >
> > One of the worst things about autodecoding is it is special, it *only*
> > steps in for strings. Fortunately, however, that specialness enabled us
> > to save things with byCodePoint and byCodeUnit.
>
> So it turns out that technically the problem here, even though it seemed
> like an autodecoding problem, is a problem with splitter.
>
> splitter doesn't deal with encodings of character ranges at all.
>
> For instance, when you have this:
>
> "abc 123".byCodeUnit.splitter;
>
> What happens is splitter only has one overload that takes one parameter,
> and that requires a character *array*, not a range.
>
> So the byCodeUnit result is aliased-this to its original, and surprise!
> the elements from that splitter are string.
>
> Next, I tried to use a parameter:
>
> "abc 123".byCodeUnit.splitter(" ");
>
> Nope, still devolves to string. It turns out it can't figure out how to
> split character ranges using a character array as input.
>
> The only thing that does seem to work is this:
>
> "abc 123".byCodeUnit.splitter(" ".byCodeUnit);
>
> But this goes against most algorithms in Phobos that deal with character
> ranges -- generally you can use any width character range, and it just
> works. Having a drop-in replacement for string would require splitter to
> handle these transcodings (and I think in general, algorithms should be
> able to handle them as well). Not only that, but the specialized
> splitter that takes no separator can split on multiple spaces, a feature
> I want to have for my drop-in replacement.
>
> I'll work on adding some issues to the tracker, and potentially doing
> some PRs so they can be fixed.

Well, plenty of algorithms don't care one whit about strings specifically
and thus their behavior is really dependent on what the element type of the
range is (e.g. for byCodeUnit, filter would filter code units, and sort
would sort code units, and arguably, that's what they should do). However, a
big problem with a number of the functions in Phobos that specifically
operate on ranges of characters is that they tend to assume that a range of
characters means a range of dchar. Some of the functions in Phobos have been
fixed to be more flexible and operate on arbitrary ranges of char, wchar, or
dchar, but it's mostly happened because of a bug report about a particular
function not working with something like byCodeUnit, whereas what we really
need to happen is to have tests added for all of the functions in Phobos
which specifically operate on ranges of characters to ensure that they do
the correct thing when given a range of char, wchar, dchar - or graphemes
(much as we talk about graphemes being the correct level for some types of
string processing, nothing in Phobos outside of std.uni currently does
anything with byGrapheme, even in tests).
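
To make the element-type point concrete, here is a minimal sketch (not taken
from Phobos' own tests; the sample string is an assumption chosen for
illustration) of how the same text looks as a plain string, as code units, and
as ranges of wchar and dchar:

```d
import std.algorithm.iteration : filter;
import std.range.primitives : ElementType, walkLength;
import std.utf : byCodeUnit, byDchar, byWchar;

void main()
{
    // Sample input chosen for illustration: U+00E9 takes two UTF-8 code units.
    string s = "h\u00E9llo";

    // With auto-decoding, a plain string is treated as a range of dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 5);

    // byCodeUnit exposes the raw UTF-8 code units instead.
    static assert(is(ElementType!(typeof(s.byCodeUnit)) == immutable char));
    assert(s.byCodeUnit.walkLength == 6);

    // A generic algorithm such as filter just follows the element type.
    assert(s.filter!(c => c >= 0x80).walkLength == 1);            // one code point
    assert(s.byCodeUnit.filter!(c => c >= 0x80).walkLength == 2); // two code units

    // The same text as ranges of wchar and dchar.
    assert(s.byWchar.walkLength == 5);
    assert(s.byDchar.walkLength == 5);
}
```

Tests for the character-specific functions would presumably feed each of these
range types to the function under test and compare the results.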

And of course, with those tests, we'll inevitably find that a number of
those functions won't work correctly and will need to be fixed. But as
annoying as all of that is, it's work that needs to be done regardless of
the situation with auto-decoding, since these functions need to work with
arbitrary ranges of characters and not just ranges of dchar. Functions that
don't need to avoid auto-decoding should then not even care whether strings
are ranges of code units or code points, which should in turn reduce the
impact of auto-decoding. And actually, a lot of the
code that specializes on narrow strings to avoid auto-decoding would
probably work whether auto-decoding was there or not. So, once we've
actually managed to ensure that Phobos in general works with arbitrary
ranges of characters, the main breakage that would be caused by removing
auto-decoding (in Phobos at least) would be any code that used strings with
functions that weren't specifically written to do something special for
strings, and while I'm not at all convinced that we then have a path towards
removing auto-decoding, it would minimize auto-decoding's impact, and with
auto-decoding's impact minimized as much as possible, maybe at some point,
we'll actually manage to figure out how to remove it.
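
As a rough sketch of what "works with arbitrary ranges of characters" can mean
in practice, the hypothetical helper below (countAsciiSpaces is an invented
name, not a Phobos function) constrains on the element type rather than on
string itself. Because an ASCII space is a single code unit in UTF-8, UTF-16,
and UTF-32, it gives the same answer whether it sees code units or code points:

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;
import std.utf : byCodeUnit, byDchar, byWchar;

// Hypothetical helper for illustration only; not part of Phobos.
// Accepts any input range whose elements are char, wchar, or dchar.
size_t countAsciiSpaces(R)(R r)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    foreach (c; r)
    {
        if (c == ' ')
            ++n;
    }
    return n;
}

void main()
{
    // Plain strings of each width.
    assert(countAsciiSpaces("a b c") == 2);
    assert(countAsciiSpaces("a b c"w) == 2);
    assert(countAsciiSpaces("a b c"d) == 2);

    // Explicit code unit and transcoded ranges over the same text.
    assert(countAsciiSpaces("a b c".byCodeUnit) == 2);
    assert(countAsciiSpaces("a b c".byWchar) == 2);
    assert(countAsciiSpaces("a b c".byDchar) == 2);
}
```

A function written this way has no need to special-case narrow strings, so it
should not be affected by whether auto-decoding is present.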

But in any case, the issues that you're running into with splitter are a
symptom of a larger problem with how Phobos currently handles ranges of
characters. And when this sort of thing comes up, I'm reminded that I should
take the time to start adding the appropriate tests to Phobos, and then I
never get around to it - as with too many things. I really should fix that.
:|

- Jonathan M Davis