Investigation: downsides of being generic and correct

Fri May 17 02:41:21 PDT 2013

On Friday, May 17, 2013 11:15:24 Dicebot wrote:
> On Thursday, 16 May 2013 at 19:15:57 UTC, Jonathan M Davis wrote:
> > 1. In general, if you want to operate on ASCII, and you want your code
> > to be fast, use immutable(ubyte)[], not immutable(char)[]. Obviously,
> > that's not going to work in this case, because the function is in
> > std.string, but maybe that's a reason for some std.string functions to
> > have ubyte overloads which are ASCII-specific.
> 
> I was thinking exactly about that. The only thing I'd like advice on:
> is it better to add those overloads to std.string, or is a separate
> module better from the point of view of self-documentation?

I'm not sure. My first inclination would be to simply put them in the same
module as overloads, but that probably merits some discussion. Likewise, while
I think that ubyte overloads for ASCII strings are something that we should at
least explore, the idea itself needs discussion, as we haven't really done much
with ASCII outside of std.ascii at this point (and std.ascii currently only
operates on characters, not strings). My first inclination is to handle ASCII
where necessary by accepting arrays of ubytes, but others here may have other
ideas about that (which may or may not be better).

As a side note, we might want to consider adding a function called
assumeASCII which casts from string to immutable(ubyte)[] (similar to
assumeUnique). I think that that might have been suggested before, but even if
it has, we've never actually added it.
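
To make the idea concrete, here is a minimal sketch of what such a helper
could look like. assumeASCII does not exist in Phobos; the name comes from the
paragraph above, and the signature and the debug-time check are assumptions
made for illustration. std.string.representation is the existing Phobos
function that does the actual reinterpretation.

```d
import std.string : representation;

/// Hypothetical helper (not part of Phobos): views a string as raw bytes
/// on the caller's promise that it contains only ASCII.
immutable(ubyte)[] assumeASCII(string s)
{
    // Iterating with an explicit char type walks code units, so nothing
    // is decoded here. The asserts are compiled out with -release.
    foreach (char c; s)
        assert(c < 0x80, "assumeASCII: input contains a non-ASCII code unit");

    // representation reinterprets the same memory; no copy is made.
    return s.representation;
}
```

Like assumeUnique, the point is that the caller makes the promise, so
ASCII-specific overloads taking immutable(ubyte)[] would not need to validate
or decode their input.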

> > 2. We actually discussed removing all of the pattern stuff completely
> > and replacing it with regexes.
> 
> Is this kind of pre-approved? I am willing to add this to my TODO
> list together with the needed benchmarks, but I had some doubts that
> std.string depending on std.regex would be tolerated.

AFAIK, there would be no problem with doing so. Maybe Dmitry would have 
something to say about it, since he's the regex guru, but IIRC, the last time 
it was discussed, it was pretty clear that we wanted those functions to be 
using std.regex instead of patterns. So, if you did the work and did it at the 
appropriate quality level, I expect that it would be merged in. And we might 
or might not deprecate the pattern functions at that point (that was 
originally my intention and is why I never fixed their names, but we're not 
deprecating much now, so I don't know if we'll want to in this case).

> I understand that. What I tried to bring attention to is how big a
> difference it may make for someone who just picks random functions
> and writes some simple code. It is very tempting to just say
> "Phobos (D) sucks" and not get into details. In other words, I
> consider it more of an informational/marketing issue than a
> technical one.

We need to do more to optimize Phobos, but given our stance of correctness by 
default, we're kind of stuck with string functions taking a performance hit in 
a number of common cases simply due to the necessary decoding of code points. 
We can do better at making them fast, and reduce problems like this, but 
ultimately, if you want fast ASCII-only operations, you almost certainly need 
to operate on something like ubyte[] rather than string, and that requires 
educating people. It's one of the costs of trying to be both correct and 
performant.
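
For anyone who hasn't run into this yet, here is a small sketch of where the
decoding comes from. The helper names are made up for the example; count,
walkLength, ElementType and representation are existing Phobos symbols.

```d
import std.algorithm : count;
import std.range : ElementType, walkLength;
import std.string : representation;

// Range primitives present a string as a range of dchar (code points),
// while a ubyte array is just a range of bytes.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!(immutable(ubyte)[]) == immutable(ubyte)));

/// Number of UTF-8 code units: a plain array length, no decoding.
size_t codeUnits(string s) { return s.length; }

/// Number of code points: has to walk and decode the whole string.
size_t codePoints(string s) { return s.walkLength; }

/// Counts spaces by iterating decoded code points.
size_t countSpacesDecoded(string s) { return s.count(' '); }

/// Counts spaces on the raw bytes; only correct as a character count
/// because ' ' is ASCII and ASCII bytes never occur inside multi-byte
/// UTF-8 sequences.
size_t countSpacesBytes(string s)
{
    return s.representation.count(cast(ubyte) ' ');
}
```

Both counting functions give the same answer for a needle like ' ', but only
the first one pays for decoding, which is the cost being discussed here.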

- Jonathan M Davis