ranges of characters and the overabundance of traits that go with them

Tue Mar 14 19:21:05 PDT 2017

On Tuesday, March 14, 2017 11:38:21 H. S. Teoh via Digitalmars-d wrote:
> On Mon, Mar 13, 2017 at 04:00:57PM -0700, Jonathan M Davis via Digitalmars-d wrote:
> > On Monday, March 13, 2017 23:40:55 Dmitry Olshansky via Digitalmars-d wrote:
> [...]
>
> > > This is IMHO the right way forward. We (Phobos maintainers) created
> > > the mess, now it's time to clean up but not at the expense of the
> > > user.  All the isSomeString, isSomeOtherString can just be a
> > > remnant of the old days that is internal to the implementation.
> >
> > That makes sense for the APIs of public-facing functions. However, at
> > least some portion of programmers who are not writing Phobos functions
> > are going to need to use all of those same traits to write their own
> > code.
>
> In which case we need to clean up our act all the more.  I would dread
> user code starting to use isSomeString, isImplicitlyConvertibleToString,
> and all of those messy same-but-different templates, introducing subtle
> corner cases that the user may not be fully aware of.
>
> > In most cases, we can clean up the APIs so that they hide the
> > specializations, but anyone who needs to write specializations in
> > their own code will need the same traits that Phobos uses. So, we can
> > reduce the complications and negative impact that come from having to
> > deal with all of these traits in generic code, but we can't actually
> > eliminate it. Best case, the folks that aren't going to bother writing
> > any generic code of their own don't have to worry about it, but anyone
> > writing their own generic code - especially if it involves ranges of
> > characters - is still stuck with the problem in their own code.
>
> To be quite honest, in *my* own code I avoid Phobos templates of the
> isSomeString category like the plague.  Part of this whole mess came
> from autodecoding (and here we again see it rear its ugly head), without
> which many of these templates wouldn't need to exist in the first place.
> In user code, I'd say either work directly with char[] and its variants,
> or byGrapheme if you care about visual "characters", and avoid wstring
> and dstring like the plague.  At the most I'd just say:
>
>   auto myFunc(R)(R range)
>       if (isInputRange!R && is(ElementType!R : dchar))
>   ...
>
> and let it be.  This will handle any string, autodecoding or not,
> user-defined char/wchar/dchar ranges, and whatever else you throw at it.
> If it's truly generic code, it really shouldn't care what it is.  And on
> this point, I find Phobos code excessively convoluted -- arguably
> necessitated by the historical blunder of autodecoding -- supposedly
> generic code really *shouldn't* distinguish between a range of char that
> happens to be an array vs. a range of char that isn't.
>
> Actually, come to think of it, the only time isNarrowString, et al are
> *necessary* is in the autodecoding parts of Phobos.  I can't think of
> any other places where generic code should even care whether something
> is a narrow string, or an array-based char range or a non-array-based
> char range.  Didn't we agree to move Phobos away from autodecoding, and
> only those parts that are truly necessary for backward compatibility
> should retain the autodecoding code?  If so, I'd hardly say user code
> should even *know* about isNarrowString and friends.

The reality of the matter is that auto-decoding is here to stay, because we
simply do not have a way to remove it without breaking a large percentage of
the D programs currently in existence. We need to ensure that Phobos functions
work well with arbitrary ranges of characters rather than assuming that they're
going to be dealing with strings or assuming that any range of characters is
going to be a range of dchar. But we're still stuck with the fact that
narrow strings are auto-decoded by the range primitives.
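
To make that concrete, here's a minimal sketch of what auto-decoding means
for a plain string. The traits and functions used are from std.range.primitives
and std.utf; codePoints and codeUnits are names I made up for illustration.

```d
import std.range.primitives : ElementEncodingType, ElementType, walkLength;
import std.utf : byCodeUnit;

// The range primitives decode narrow strings, so the range element type is
// dchar even though the array stores char code units.
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));

// byCodeUnit wraps the string in a range whose elements are the code units.
static assert(is(ElementType!(typeof("".byCodeUnit)) == immutable(char)));

// Hypothetical helper: walks the string as a range, decoding as it goes.
size_t codePoints(string s)
{
    return s.walkLength;
}

// Hypothetical helper: the wrapper has a length, so nothing is decoded.
size_t codeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

unittest
{
    // U+00E9 takes two UTF-8 code units but is a single code point.
    assert(codePoints("h\u00E9llo") == 5);
    assert(codeUnits("h\u00E9llo") == 6);
}
```

Note that the byCodeUnit version hands back a different type than string,
which is the trade-off discussed below.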

In some cases, making a function work with arbitrary ranges of characters
means not caring about narrow strings, because it's going to have to decode
them anyway, but if the function doesn't need to do any decoding, and we
don't specialize it on narrow strings, then it's going to incur a
performance hit with strings, and string processing is a _very_ common
programming task. Having byCodeUnit helps, but that results in a new type,
which can be a problem, and it's a given that many folks are simply going to
be using strings without something like byCodeUnit or byGrapheme. So, if we
really care about efficiency (and we should with the standard library), then
we're often going to have to specialize on narrow strings because of that.
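
As a rough sketch of that kind of specialization (countSpaces is a made-up
example, not a Phobos function), the narrow string case can be handled with
static if inside the implementation, assuming the function only looks for an
ASCII character and so never needs to decode:

```d
import std.range.primitives : ElementEncodingType, empty, front,
    isInputRange, popFront;
import std.traits : isNarrowString, isSomeChar;
import std.utf : byCodeUnit;

// Hypothetical function: counts ASCII spaces in any range of characters.
size_t countSpaces(R)(R range)
if (isInputRange!R && isSomeChar!(ElementEncodingType!R))
{
    static if (isNarrowString!R)
    {
        // ' ' is a single code unit in UTF-8 and UTF-16 and never appears
        // inside a multi-unit sequence, so decoding would be wasted work.
        return countSpaces(range.byCodeUnit);
    }
    else
    {
        size_t count;
        for (; !range.empty; range.popFront())
        {
            if (range.front == ' ')
                ++count;
        }
        return count;
    }
}

unittest
{
    assert(countSpaces("a b c") == 2);            // narrow string, no decoding
    assert(countSpaces("a b c"d) == 2);           // dstring, generic path
    assert(countSpaces("a b c".byCodeUnit) == 2); // arbitrary range of char
}
```

The caller sees one function with one constraint, and the narrow string
handling stays an implementation detail.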

It's also a concern, because not specializing on narrow strings will
sometimes result in having a wrapper range rather than a slice of the
original string, which can incur a performance hit, depending on what
you do with the result (e.g. if you need a string, and you get a wrapper
range, you're forced to copy it into a string, and if specializing on
narrow strings would have avoided that, then _not_ doing that specialization
is causing a performance hit for code using the function). So, ignoring
narrow strings is going to hurt us.
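
Here's a small sketch of the wrapper-versus-slice difference. beforeComma is
a hypothetical function written just for this example; until is the real
function from std.algorithm.searching. It assumes the delimiter is ASCII, so
it's safe to search the code units directly.

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : until;
import std.range.primitives : ElementType, isInputRange;
import std.traits : isNarrowString, isSomeChar;

// Hypothetical function: returns everything before the first comma.
auto beforeComma(R)(R range)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    static if (isNarrowString!R)
    {
        // foreach over an array with an index visits code units, and ',' is
        // always a lone code unit, so the index is a valid slicing point.
        size_t end = range.length;
        foreach (i, c; range)
        {
            if (c == ',')
            {
                end = i;
                break;
            }
        }
        return range[0 .. end]; // same type as the argument, no copy
    }
    else
    {
        return range.until(','); // lazy wrapper range
    }
}

unittest
{
    // Narrow string in, slice of the same string out.
    static assert(is(typeof(beforeComma("key,value")) == string));
    assert(beforeComma("key,value") == "key");
    assert(beforeComma("novalue") == "novalue");

    // The generic path yields a wrapper type, so getting a dstring back
    // would require copying the result into a new array.
    static assert(!is(typeof(beforeComma("key,value"d)) == dstring));
    assert(beforeComma("key,value"d).equal("key"));
}
```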

And yes, this is less of a concern for many D programs than it is for
Phobos, because if you're just writing something to get something specific
done and not writing a library for others to use, you can take shortcuts -
either by not making the code particularly generic or by not caring about
some of the performance issues unless they end up being a large enough
performance hit for your particular program that you can't ignore them.
So, to some extent, D programs can just ignore narrow strings, and Phobos at
least will be efficient with them. But some programs _will_ care, and anyone
writing a library to distribute for others to use (e.g. on code.dlang.org)
is in exactly the same position as Phobos. If they don't take narrow
strings into account, then there's going to be a performance hit, and it
will affect other people's code.

By doing what we can to maximize how much the Phobos APIs work with
arbitrary ranges of characters and minimizing how much auto-decoding is
involved, we can reduce how much your average D program has to deal with
these problems - and much of this can be hidden inside of the implementation
with static if rather than put in the top-level constraints - but
ultimately, third party code has all of the same problems as Phobos does if
it cares about being generic or efficient. And we can't fix that or hide it.

> > So, we really can't make the various traits internal - at least not
> > without basically saying that everyone needs to rewrite them in their
> > own code, which is just going to create a different set of problems.
>
> [...]
>
> On the contrary, we need to make the *current* traits internal.
> Providing users with similar traits -- a much cleaner, orthogonal set of
> traits that has none of the current mess -- is a different question. In
> principle I agree with doing that, but I don't agree that users should
> be using any of the current almost-the-same-but-subtly-different traits.

I definitely think that isAutodecodableString needs to go, and if we can,
I'd like to see isConvertibleToString go - though without something like it,
you can't deprecate implicit conversions when templatizing a function that
took string like Jack Stouffer wants to do. Still, that only works in cases
where none of the original string was returned from the function, since
AFAIK, the only way to templatize the function safely otherwise is to
templatize on character type (in which case, deprecating it would mean
deprecating the function for strings as well, which obviously is
unacceptable). Overall though, I tend to favor just leaving the implicit
conversions in place for those functions (though not adding them for new
ones) and then just having the overload that templatizes on character type.
And if we do that, we can get rid of isConvertibleToString.

As for isSomeString, the only problem I see with it is that it accepts
enums. Otherwise, it seems like a perfectly sensible trait to have, and its
name fits in with other traits in std.traits such that I don't know what
we'd call it instead other than isExactSomeString (which std.conv does
internally, but it's not exactly an ideal name). Regardless, I completely
disagree about trying to get rid of it or hiding it. If it doesn't need to
be in a template constraint, then it shouldn't be, but it's still a useful
trait to have.

And as for isNarrowString, it's going to continue to be critical for any
code that wants to efficiently deal with strings. Unless we can actually get
rid of auto-decoding (and I don't see how without major breakage), as much
as it sucks, we're forever stuck dealing with it. Some code can just operate
on ranges of characters without caring about the efficiency cost - or even
operate on strings without caring about the efficiency cost - but there _is_
an efficiency cost as well as problems related to the type changing, because
you can't slice a narrow string in range-based code without specializing on
it. So, it really doesn't make sense to ignore narrow strings or to try and
get rid of or hide isNarrowString.
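
For illustration, this is roughly what the slicing problem looks like in
terms of the range traits. canSliceCheaply and dropCodeUnits are made-up
names, and the sketch assumes the caller passes an index that falls on a
code point boundary.

```d
import std.range.primitives : hasLength, hasSlicing, isRandomAccessRange;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

// As far as the range traits are concerned, a narrow string is not
// sliceable, has no length, and is not random-access.
static assert(!hasLength!string);
static assert(!hasSlicing!string);
static assert(!isRandomAccessRange!string);

// dstring and the byCodeUnit wrapper do not have that restriction.
static assert(hasSlicing!dstring);
static assert(hasSlicing!(typeof("".byCodeUnit)));

// Hypothetical trait: generic code that wants to slice strings has to
// check for narrow strings explicitly.
enum canSliceCheaply(R) = hasSlicing!R || isNarrowString!R;

// Hypothetical function: drops the first n elements (code units for narrow
// strings) by slicing. n must be on a code point boundary.
auto dropCodeUnits(R)(R range, size_t n)
if (canSliceCheaply!R)
{
    return range[n .. range.length];
}

unittest
{
    static assert(canSliceCheaply!string && canSliceCheaply!dstring);
    assert(dropCodeUnits("abc", 1) == "bc");
    assert(dropCodeUnits("abc"d, 1) == "bc"d);
}
```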

As such, beyond getting rid of isAutodecodableString and
isConvertibleToString, I really don't see anything that makes sense to
change with the string-related traits. It would be nice to fix isSomeString
so that it doesn't accept enums, but that would break code, and it's been
around for so long that adding a trait that didn't have that problem would
simply mean having yet another trait (which arguably makes the problem
worse). And isSomeString is just fine as it is in any code that doesn't
accept enums anyway.

So, I'm in favor of getting rid of isAutodecodableString and
isConvertibleToString, but I don't agree at all with the idea that we're
going to be able to act like narrow strings and auto-decoding are just a
Phobos problem or that isSomeString or isNarrowString is something that
should only be used in Phobos.

Now, if someone by some miracle can come up with a way to remove
auto-decoding without breaking tons of code, then I am all for it. And at
that point, isNarrowString becomes unnecessary, and we can have much cleaner
range-based code. But I don't see how that's possible. At best, we can
reduce the impact that that mistake has. Fixing it just seems impossible,
and IMHO, hiding the helper stuff that allows code to be efficient in spite
of it is a mistake, much as the necessity of that helper stuff really sucks.

- Jonathan M Davis