Lots of low hanging fruit in Phobos

Thu Mar 6 17:37:58 PST 2014

On Thu, Mar 06, 2014 at 05:15:45PM -0800, Walter Bright wrote:
> On 3/6/2014 4:43 PM, H. S. Teoh wrote:
> >On Thu, Mar 06, 2014 at 01:26:46PM -0800, Walter Bright wrote:
> >>A major goal for D in the short term is to reduce reliance in Phobos
> >>on the GC. I was looking at std.string last night, and I noticed a
> >>couple things:
> >>
> >>1. The inputs are constrained to being strings. This is overly
> >>restrictive, the inputs should be InputRanges.
> >>
> >>2. The outputs should be a range, too. In fact, the string functions
> >>should become algorithms. Then they won't need to allocate any
> >>memory at all.
> >[...]
> >
> >What about using output ranges?
> 
> A great question. I tend to regard output ranges as suitable for a
> container to expose. An algorithm reads an InputRange, and presents
> its output as another InputRange. This is so that algorithms can be
> easily chained together by the user, and the last algorithm in the
> chain would be std.algorithm.copy(OutputRange).

After some thought, and also some recent experiences, I've come to the
same conclusion.  Output ranges are basically sinks, which means they
are no good for further chaining (attempting to do so is an exercise in
frustration). So pretty much everything that generates or transforms
data should return an input range, and output ranges should only be used
when you're going to stop processing the data at that point, e.g., save
it into a buffer, write to a file, etc..

Unfortunately, input ranges are somewhat tedious to write -- nice
foreach loops have to be broken up into .empty, .front, .popFront, which
is a lot of boilerplate code and not so nice in inner loops (though I
suppose the idea is to improve compiler inlining to handle these cases).
I wonder if a mid- to long-term goal for D would be to ease writing
input ranges by moving some of the boilerplate into the language, or
providing range builder facilities in Phobos. The most nagging part of
writing input ranges in the current language is the inability to use
foreach to generate the returned elements. (Well, you *can* do that to a
buffer array and then return the array, but that kinda defeats the
purpose of GC avoidance.) Some kind of built-in coroutine syntax with
'yield' would help things a LOT.

> For example, we could implement toStringz() as an algorithm that looks
> like:
> 
>     auto toStringz(S)(S s) {
>         return chain(s, repeat(0, 1))
>     }
> 
> and use it like this:
> 
>     string s = ...;
>     char[] buffer = ...;
>     s.toStringz.copy(buffer);
> 
> Note how the algorithm version of toStringz does not allocate any
> memory. buffer would be the output range, but note how what it
> actually is is, and how its memory is allocated, is selected by the
> user.

Yes, I like this. This is the right way to go.

[...]
> You might be tempted to think that these are just little leaks, who
> cares. But I recently wrote a D program that did tens of thousands
> of file operations, and those little leaks turned into a raging
> river.

Yeah, I'm currently working on a program that generates potentially huge
numbers of n-dimensional coordinates, and currently the code does this
using ranges that return array literals (ouch). This is pretty bad for
performance and also for GC pressure, so I'm thinking to restructure the
code some time in the near future to get rid of obligatory allocations,
much like how you describe it above.

T

-- 
Customer support: the art of getting your clients to pay for your own incompetence.