ranges of characters and the overabundance of traits that go with them

Wed Mar 15 10:17:41 PDT 2017

On Tue, Mar 14, 2017 at 07:21:05PM -0700, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, March 14, 2017 11:38:21 H. S. Teoh via Digitalmars-d wrote:
[...]
> > Actually, come to think of it, the only time isNarrowString, et al
> > are *necessary* is in the autodecoding parts of Phobos.  I can't
> > think of any other places where generic code should even care
> > whether something is a narrow string, or an array-based char range
> > or a non-array-based char range.  Didn't we agree to move Phobos
> > away from autodecoding, and only those parts that are truly
> > necessary for backward compatibility should retain the autodecoding
> > code?  If so, I'd hardly say user code should even *know* about
> > isNarrowString and friends.
> 
> The reality of the matter is that auto-decoding is here to stay,
> because we simply do not have a way to remove it without breaking a
> large percentage of the D programs currently in existence. We need to
> ensure that Phobos works well with arbitrary ranges of characters
> rather than assuming that they're going to be dealing with strings or
> assuming that any range of characters is going to be a range of dchar.
> But we're still stuck with the fact that narrow strings are
> auto-decoded by the range primitives.

None of that negates the fact that we don't need to expose traits
specific to autodecoding to the user.  Obviously the autodecoding will
still be there, but as far as the API is concerned, the user ought not
to be concerned about it: Phobos should handle it as transparently as
possible.  Internally, of course, we have to support it for the sake of
backward compatibility.  But that's no reason to expose unnecessary
details in the public API.
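
For anyone following along who hasn't been bitten by it yet, here's
roughly what "autodecoding" refers to. This is only a minimal sketch; it
uses nothing beyond std.range.primitives and std.traits, and the sample
string is made up:

```d
import std.range.primitives : ElementType, walkLength;
import std.traits : isNarrowString;

unittest
{
    string s = "h\u00e9llo"; // U+00E9 takes two UTF-8 code units

    // Viewed as an array, a string is a sequence of code units...
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 6);

    // ...but the range primitives decode it into code points (dchar).
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 5);

    // string and wstring are "narrow"; dstring is not.
    static assert(isNarrowString!string && isNarrowString!wstring);
    static assert(!isNarrowString!dstring);
}
```

That mismatch between the array view and the range view is the detail I'd
like to keep out of user-facing signatures.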

> In some cases, making a function work with arbitrary ranges of
> characters means not caring about narrow strings, because it's going
> to have to decode them anyway, but if the function doesn't need to do
> any decoding, and we don't specialize it on narrow strings, then it's
> going to incur a performance hit with strings, and string processing
> is a _very_ common programming task. Having byCodeUnit helps, but that
> results in a new type, which can be a problem, and it's a given that
> many folks are simply going to be using strings without something like
> byCodeUnit or byGrapheme. So, if we really care about efficiency (and
> we should with the standard library), then we're often going to have
> to specialize on narrow strings because of that.

Of course we will need to specialize on narrow strings in Phobos.  But
that does not mean we must expose all the dirty details on the public
API.  This specialization is an implementation detail that should not
affect the public API (to the extent it's possible).
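
To sketch what I mean (countSpaces is a made-up function, not anything in
Phobos, and I'm assuming a static if in the body is acceptable for the
function in question): the signature only talks about ranges of
characters, and the narrow-string fast path lives inside the body.

```d
import std.range.primitives : ElementEncodingType, isInputRange;
import std.traits : isNarrowString, isSomeChar;
import std.utf : byCodeUnit;

/// Hypothetical example: counts ASCII spaces in any range of characters.
size_t countSpaces(R)(R r)
if (isInputRange!R && isSomeChar!(ElementEncodingType!R))
{
    size_t n;
    static if (isNarrowString!R)
    {
        // Fast path: a space is a single code unit in UTF-8 and UTF-16
        // and never occurs inside a multi-unit sequence, so no decoding.
        foreach (c; r.byCodeUnit)
            if (c == ' ') ++n;
    }
    else
    {
        // Generic path: iterate whatever elements the range yields.
        foreach (c; r)
            if (c == ' ') ++n;
    }
    return n;
}

unittest
{
    assert(countSpaces("a b  c") == 3);          // narrow string
    assert(countSpaces("h\u00e9 llo"w) == 1);    // narrow wstring
    assert(countSpaces("a b c"d) == 2);          // dstring
    assert(countSpaces("a b".byCodeUnit) == 1);  // non-array char range
}
```

The docs for something like this would show one overload with one
constraint; whether every Phobos function can be written this way is a
separate question.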

> It's also a concern, because not specializing on narrow strings will
> sometimes result in having a wrapper range rather than a slice of the
> original string, which can incur a performance hit, depending on
> what you do with the result (e.g. if you need a string, and you get a
> wrapper range, you're forced to copy it into a string, and if
> specializing on narrow strings avoided that, then _not_ doing that
> specialization is causing a performance hit for code using the
> function). So, ignoring narrow strings is going to hurt us.

I think you misunderstood me.  I didn't mean that we should delete
isNarrowString altogether -- obviously we don't want any performance
regressions, or, for that matter, semantic regressions from no longer
distinguishing narrow strings from other ranges.  What I mean is that
isNarrowString should be a *private* symbol in Phobos.  It's part of
what I call "implementation details" that the user should not need to
worry about.  Especially so in documentation, because right now the
unreadable overloads of user-facing functions are partly caused by the
need to distinguish between isNarrowString and !isNarrowString, which
doubles the number of overloads. But from the user's POV, all they
really ought to know is that function F accepts a range of some kind.
How Phobos will handle different specific range types is none of their
business -- they don't *want* to know about it, and they shouldn't need
to.

> And yes, this is less of a concern for many D programs than it is for
> Phobos, because if you're just writing something to get something
> specific done and not writing a library for others to use, you can
> take shortcuts - either by not making the code particularly generic or
> by not caring about some of the performance issues unless they end up
> being a large enough performance hit for your particular program that
> you can't ignore them.  So, to some extent, D programs can just ignore
> narrow strings, and Phobos at least will be efficient with them. But
> some programs _will_ care, and anyone writing a library to distribute
> for others to use (e.g. on code.dlang.org) is in exactly the same
> position as Phobos. If they don't take narrow strings into account,
> then there's going to be a performance hit, and it will affect other
> people's code.

I'd argue that the same reasoning applies to all D libraries, not just
Phobos: implementation details like optimizing for narrow strings ought
to be hidden from the public API.  It's the library authors' problem to
figure out how to optimize their libraries -- Phobos code, after all, is
publicly available for perusal and its license terms permit copying
Phobos snippets for use in their own library code, if they find that
they need, e.g., some form of isNarrowString.  But none of this should
be a factor in the design of the user-facing API.  Phobos needs to set
the example here.

And to counter the argument that library authors might want to use
Phobos' version of isNarrowString: I'd say that it's better for them to
copy the definition of isNarrowString into their own code, because the
current mess of isNarrowString, isSomeString, et al, means that there's
a high likelihood we'd want to revise their definitions, which in turn
means that if they're being used in 3rd party library code, subtle
breakages may be introduced into said libraries because the definition
has changed from when the library author originally looked at it.

That's why, IMO, traits like isNarrowString that are only useful for
library implementation details ought NOT to be exported by Phobos.
Unless we are 100% sure that it has a rock-solid definition that's
well-designed and fits well with other standard traits -- which excludes
most (all?) of the current messy string-related traits.
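
As a rough illustration of a library owning its own trait: the
following is a hand-written sketch, NOT Phobos's definition of
isNarrowString, and the name isNarrowArray is hypothetical.  The point
is only that the library decides (and pins down) exactly what it
accepts:

```d
import std.traits : isDynamicArray, Unqual;

// Hypothetical library-local trait: true for dynamic arrays whose
// elements are (possibly qualified) char or wchar, false otherwise.
private template isNarrowArray(T)
{
    static if (isDynamicArray!T)
    {
        alias C = Unqual!(typeof(T.init[0]));
        enum isNarrowArray = is(C == char) || is(C == wchar);
    }
    else
        enum isNarrowArray = false;
}

unittest
{
    static assert( isNarrowArray!string);
    static assert( isNarrowArray!(const(wchar)[]));
    static assert(!isNarrowArray!dstring);
    static assert(!isNarrowArray!(char[4])); // static arrays excluded
    static assert(!isNarrowArray!(int[]));
}
```

If Phobos later revises its own string traits, a library using
something like this keeps the semantics its author actually tested.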

[...]
> > On the contrary, we need to make the *current* traits internal.
> > Providing users with similar traits -- a much cleaner, orthogonal
> > set of traits that has none of the current mess -- is a different
> > question. In principle I agree with doing that, but I don't agree
> > that users should be using any of the current
> > almost-the-same-but-subtly-different traits.
> 
> I definitely think that isAutodecodableString needs to go, and if we
> can, I'd like to see isConvertibleToString go - though without
> something like it, you can't deprecate implicit conversions when
> templatizing a function that took string like Jack Stouffer wants to
> do.
[...]

Is it possible to hide this stuff behind a "template wall"? I.e., rename
the current overload set to xxxImpl() with whatever sig constraints are
necessary to achieve what we want, and have the original xxx() be just a
generic template that forwards to said overloads?  E.g.:

Change:

	auto xxx(R)(R r) if (isConvertibleToString!R) { ... }
	auto xxx(R)(R r) if (isAutodecodableString!R) { ... }
	auto xxx(R)(R r) if (/* etc. */) { ... }

To:

	auto xxxImpl(R)(R r) if (isConvertibleToString!R) { ... }
	auto xxxImpl(R)(R r) if (isAutodecodableString!R) { ... }
	auto xxxImpl(R)(R r) if (/* etc. */) { ... }

	auto xxx(R)(R r) if (isInputRange!R) { return xxxImpl(r); }

Or does that extra level of syntactic indirection cause problems with
implicit conversions?
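
Here's a small compilable version of the idea to poke at.  All names
(countUnits, countUnitsImpl) are made up for illustration, and the
constraints are simplified stand-ins rather than the real Phobos ones:

```d
import std.range.primitives : ElementEncodingType, isInputRange;
import std.traits : isNarrowString, isSomeChar;

// Hypothetical implementation overloads, hidden from the docs.
private size_t countUnitsImpl(R)(R r)
if (isNarrowString!R)
{
    return r.length; // code units, no decoding
}

private size_t countUnitsImpl(R)(R r)
if (isInputRange!R && !isNarrowString!R)
{
    size_t n;
    foreach (e; r) ++n;
    return n;
}

/// The public "wall": one signature, one constraint.
size_t countUnits(R)(R r)
if (isInputRange!R && isSomeChar!(ElementEncodingType!R))
{
    return countUnitsImpl(r);
}

unittest
{
    assert(countUnits("h\u00e9llo") == 6);  // narrow-string overload
    assert(countUnits("h\u00e9llo"d) == 5); // generic overload

    // R is deduced as char[5], which is not an input range, so the
    // call is rejected at the wall before any Impl overload is tried.
    char[5] buf = "hello";
    static assert(!__traits(compiles, countUnits(buf)));
    assert(countUnits(buf[]) == 5);         // explicit slice works
}
```

At least in this sketch, the static array case suggests the outer
constraint has to admit everything the Impl overloads are meant to
accept, otherwise the wall rejects arguments that an overload taking a
plain string parameter would have accepted via implicit conversion.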

> As for isSomeString, the only problem I see with it is that it accepts
> enums. Otherwise, it seems like a perfectly sensible trait to have,
> and its name fits in with other traits in std.traits such that I don't
> know what we'd call it instead other than isExactSomeString (which
> std.conv does internally, but it's not exactly an ideal name).
> Regardless, I completely disagree about trying to get rid of it or
> hiding it. If it doesn't need to be in a template constraint, then it
> shouldn't be, but it's still a useful trait to have.

I don't like the name, because it's unclear what exactly it means and
what exactly it includes. The fact that it accepts enums where other
traits don't is a further code smell IMO.

I think what we really need to do is to sit down and work out a
well-designed, orthogonal set of traits that might be useful for user
code, independently of what the current traits are, and then (and only
then) worry about how to map the current mess to them.

If I were to look at the current traits from a user's POV, I'd be going,
"What, why does Phobos have isNarrowString, isSomeString,
isAutodecodableString...  How many different kinds of strings does D
have?! and how exactly do any of these traits map to string, wstring,
dstring? WTH does "Some" in "isSomeString" refer to? Is it different
from "isAnyString" and "isString"? What kind of crappy ad hoc trait
design is this anyway?!" Not the kind of reaction we'd want newcomers to
have with code that's supposed to set the standard for what D code
should look like.

> And as for isNarrowString, it's going to continue to be critical for
> any code that wants to efficiently deal with strings. Unless we can
> actually get rid of auto-decoding (and I don't see how without major
> breakage), as much as it sucks, we're forever stuck dealing with it.
> Some code can just operate on ranges of characters without caring
> about the efficiency cost - or even operate on strings without caring
> about the efficiency cost - but there _is_ an efficiency cost as well
> as problems related to the type changing, because you can't slice a
> narrow string in range-based code without specializing on it. So, it
> really doesn't make sense to ignore narrow strings or to try and get
> rid of or hide isNarrowString.

We can't get rid of isNarrowString, but I argue that we should "hide"
it. Or, at the very least, it should not be used in any user-facing
Phobos API (though it can, and currently needs to, be used internally
for all the reasons you stated).

[...]
> So, I'm in favor of getting rid of isAutodecodableString and
> isConvertibleToString, but I don't agree at all with the idea that
> we're going to be able to act like narrow strings and auto-decoding
> are just a Phobos problem or that isSomeString or isNarrowString is
> something that should only be used in Phobos.

The first step towards removing autodecoding (in the distant future) is
to reduce dependence of user code on its details.  I argue that
isNarrowString ought to be used only within Phobos itself, and should
not be exposed to user code.  If user code is concerned about
performance, there's already byCodeUnit to bypass autodecoding, and
byCodePoint when the user wants to ensure he's dealing with dchar.
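
A quick sketch of what opting out looks like from user code.  I'm using
std.utf.byCodeUnit plus std.utf.byDchar here; picking byDchar as the way
to get a range of dchar is an assumption of this sketch, and the sample
string is made up:

```d
import std.range.primitives : ElementType, walkLength;
import std.utf : byCodeUnit, byDchar;

unittest
{
    string s = "h\u00e9llo"; // 6 UTF-8 code units, 5 code points

    // byCodeUnit: a range over code units, no decoding involved.
    auto units = s.byCodeUnit;
    static assert(is(immutable(ElementType!(typeof(units)))
                     == immutable(char)));
    assert(units.length == 6);

    // byDchar: explicitly decoded, independent of autodecoding.
    auto points = s.byDchar;
    static assert(is(ElementType!(typeof(points)) == dchar));
    assert(points.walkLength == 5);
}
```

Note that user code written this way never needs to mention
isNarrowString at all.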

> Now, if someone by some miracle can come up with a way to remove
> auto-decoding without breaking tons of code, then I am all for it. And
> at that point, isNarrowString becomes unnecessary, and we can have
> much cleaner range-based code. But I don't see how that's possible. At
> best, we can reduce the impact that that mistake has. Fixing it just
> seems impossible, and IMHO, hiding the helper stuff that allows code
> to be efficient in spite of it is a mistake, much as the necessity of
> that helper stuff really sucks.
[...]

It's clear that autodecoding can't be *removed* outright, at least not
at this point. But we *can* reduce dependence on it by making it a
Phobos internal implementation detail rather than exposing it everywhere
on the public API with traits like isNarrowString.

Let's not conflate cleaning up Phobos' public API with removing Phobos
internal code or changing Phobos semantics (e.g., the "helper stuff").
The two ought to be more-or-less independent, and where they are not,
it's a sign of weakness in the design of the API, and is where we need
to change things to improve the situation.

T

-- 
Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn