The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Thu Jun 2 16:21:38 PDT 2016


On Thursday, June 02, 2016 18:23:19 Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 06/02/2016 05:58 PM, Walter Bright wrote:
> > On 6/2/2016 1:27 PM, Andrei Alexandrescu wrote:
> >> The lambda returns bool. -- Andrei
> >
> > Yes, I was wrong about that. But the point still stands with:
> >  > * s.balancedParens('〈', '〉') works only with autodecoding.
> >  > * s.canFind('ö') works only with autodecoding. It returns always
> >  > false without.
> >
> > Can be made to work without autodecoding.
>
> By special casing? Perhaps. I seem to recall though that one major issue
> with autodecoding was that it special-cases certain algorithms. So you'd
> need to go through all of std.algorithm and make sure you can
> special-case your way out of situations that work today.

Yeah, I believe that you do have to do some special casing, though it would
be special casing on ranges of code units in general rather than on strings
specifically, and a lot of those functions are already special-cased on
string in an attempt to be efficient. In particular, with a function like
find or canFind, you'd take the needle and encode it to match the haystack
it was passed so that you can do the comparisons via code units. So, you
incur the encoding cost once for the needle rather than paying the decoding
cost for every code point or grapheme as you iterate over the haystack. You
end up with something that's both correct and efficient, and it's also much
friendlier to code that only operates on ASCII.
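
Something along these lines - a rough, untested sketch for the UTF-8 case
(canFindNoDecode is just a made-up name):

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit, encode;

// Sketch only: encode the dchar needle once, then search the haystack as
// raw code units, so nothing gets decoded while walking the haystack.
bool canFindNoDecode(string haystack, dchar needle)
{
    char[4] buf;                          // a code point is at most 4 UTF-8 code units
    immutable len = encode(buf, needle);  // pay the encoding cost once, up front
    return haystack.byCodeUnit.canFind(buf[0 .. len].byCodeUnit);
}

unittest
{
    assert(canFindNoDecode("blöd", 'ö'));
    assert(!canFindNoDecode("blob", 'ö'));
}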

The one issue that I'm not quite sure how we'd handle in that case is
normalization (which auto-decoding doesn't handle either), since you'd need
to normalize the needle to match the haystack - which also assumes that the
haystack was already normalized. It's the sort of thing that makes you wish
you were dealing with a string type that had normalization built into it
rather than an array or arbitrary range of code units. But maybe we could
assume NFC normalization, like std.uni.normalize does, and provide an
optional template argument for the normalization scheme.
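
Roughly this sort of thing (again just a sketch with a made-up name; it
assumes the haystack is already in the chosen normalization form):

import std.algorithm.searching : canFind;
import std.uni : normalize, NormalizationForm, NFC;

// Sketch only: normalize the needle to match the (assumed already
// normalized) haystack, with the form as an optional template argument.
bool canFindNormalized(NormalizationForm norm = NFC)(string haystack, string needle)
{
    return haystack.canFind(normalize!norm(needle));
}

unittest
{
    // precomposed ö in the haystack vs. 'o' + combining diaeresis in the needle
    assert(canFindNormalized("bl\u00F6d", "blo\u0308d"));
}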

In any case, while it's not entirely straightforward, it is quite possible
to write some algorithms in a way that works on arbitrary ranges of code
units and deals with Unicode correctly, without auto-decoding and without
requiring that the user convert to a range of code points or graphemes in
order to handle the full range of Unicode properly. And even if we keep
auto-decoding, we pretty much need to fix things so that std.algorithm and
friends are Unicode-aware in this manner, so that ranges of code units work
in general without requiring byGrapheme. So, this sort of thing could have a
large impact on RCStr, even if we keep auto-decoding for narrow strings.
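
The same needle-encoding trick generalizes to any range of code units, which
is what would make it work for something like RCStr. Another untested sketch
(canFindCU and its constraint are made up; raw strings would go through
byCodeUnit first so that nothing auto-decodes):

import std.algorithm.searching : canFind;
import std.range.primitives : ElementEncodingType, ElementType, isForwardRange;
import std.traits : isSomeChar, Unqual;
import std.utf : byCodeUnit, encode;

// Sketch only: accept any forward range that yields code units directly
// (string.byCodeUnit, a wchar[] wrapped the same way, an RCStr-style range)
// and encode the needle to whatever code-unit width that range uses.
bool canFindCU(R)(R haystack, dchar needle)
if (isForwardRange!R && isSomeChar!(ElementType!R) &&
    is(Unqual!(ElementType!R) == Unqual!(ElementEncodingType!R)))  // i.e. no auto-decoding
{
    alias CU = Unqual!(ElementType!R);

    static if (is(CU == dchar))
    {
        // UTF-32: the needle already is a single code unit.
        return haystack.canFind(needle);
    }
    else
    {
        static if (is(CU == char)) char[4] buf;   // UTF-8: at most 4 code units
        else                       wchar[2] buf;  // UTF-16: at most 2 code units
        immutable len = encode(buf, needle);      // encode once, in the haystack's width
        return haystack.canFind(buf[0 .. len].byCodeUnit);
    }
}

unittest
{
    assert("blöd".byCodeUnit.canFindCU('ö'));
    assert("blöd"w.byCodeUnit.canFindCU('ö'));
    assert(!"blod".byCodeUnit.canFindCU('ö'));
}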

Other algorithms, however, can't be made to work automatically with Unicode
- at least not with the current range paradigm. filter, for instance, really
needs to operate on graphemes in order to filter on characters, but with a
range of code units, that would mean treating groups of code units as a
single element, which you can't do with something like a range of char,
since that essentially becomes a range of ranges. It has to be wrapped in a
range that provides graphemes - and of course, if you know that you're
operating only on ASCII, then you wouldn't want to deal with graphemes
anyway, so automatically converting to graphemes would be undesirable. So,
for a function like filter, it really does have to be up to the programmer
to indicate what level of Unicode they want to operate at.
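
For example, once the programmer opts into graphemes explicitly, a combining
mark travels with its base character (another rough, untested sketch):

import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter, joiner, map;
import std.array : array;
import std.uni : byGrapheme;

unittest
{
    auto s = "noe\u0308l";   // "noël" with ë spelled as 'e' + combining diaeresis

    // Dropping the 'e' grapheme removes both of its code points as one
    // element, which a filter over code units or code points couldn't do.
    auto kept = s.byGrapheme
                 .filter!(g => g[0] != 'e')
                 .map!(g => g[].array)   // copy each grapheme's code points back out
                 .joiner;

    assert(kept.equal("nol"));
}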

But if we don't make functions Unicode-aware where possible, then we're
going to take a performance hit by essentially forcing everyone to use
explicit ranges of code points or graphemes even when they're unnecessary.
So, I think that we're stuck with some level of special casing, but it would
then be for ranges of code units and code points rather than for strings,
and so it would work efficiently for stuff like RCStr, which the current
scheme does not.

I think that the reality of the matter is that regardless of whether we keep
auto-decoding for narrow strings in place, we need to make Phobos operate on
arbitrary ranges of code units and code points. Even stuff like RCStr won't
work efficiently otherwise, and byCodeUnit won't be usable in as many cases
otherwise, because if a generic function isn't Unicode-aware, then in many
cases byCodeUnit will be very wrong, just like byCodePoint would be wrong.
So, as far as Phobos goes, I'm not sure that the question of auto-decoding
matters much for what we need to do at this point. If we do what we need to
do, then Phobos will work whether we have auto-decoding or not (working in a
Unicode-aware manner where possible and forcing the user to decide the
correct level of Unicode to work at where not), and then it just becomes a
question of whether we can or should deprecate auto-decoding once all that's
done.
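
To make that last point concrete, this is the kind of thing that goes wrong
today when a non-Unicode-aware algorithm is handed raw code units (it's
Andrei's canFind example from above):

import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

unittest
{
    assert("blöd".canFind('ö'));              // true: both sides are decoded to code points
    assert(!"blöd".byCodeUnit.canFind('ö'));  // false: 'ö' never equals a single UTF-8 code unit
}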

- Jonathan M Davis
