algorithms that take ranges by reference

Wed Dec 30 11:53:33 PST 2009

The grand STL tradition is to _always_ take iterators by value. If 
iterators are computed, they are returned by the algorithm.

My initial approach to defining std.algorithm was to continue that 
tradition for ranges: ranges are values. No algorithm currently takes a 
range by reference. There are a couple of simple functions that 
emphatically do take ref, namely popFrontN and popBackN in std.range.

It is becoming, however, increasingly clear that there are algorithms 
that could and should manipulate ranges by reference. They might take 
and return values, but it's just too messy to do so. (Cue music for the 
"improve built-in tuples choir.)

A concrete case is text processing. Many contemporary libraries use 
index-based processing, but that's difficult when handling multibyte 
characters. To address that, bidirectional ranges are one correct way to 
go. Phobos defines a ByCodeUnit range that spans a string correctly, one 
dchar at a time. With that, you can write:

auto fname = "some-utf8-file.html";
auto txt = byDchar(readText(fname));
// use txt.empty, txt.front, txt.back, txt.popFront, txt.popBack

The front and back methods have type dchar.

So far so good. But then I noticed that many std.algorithms make it 
difficult to play with the range comfortably. For example, say I want to 
match a tag, and special-case for a few particular tags:

while (!txt.empty) {
     auto c = txt.front;
     txt.popFront();
     if (c == '<') {
         if (startsWith(txt, "!--")) {
             // This is a comment
             popFrontN(txt, 3);
             txt = find(txt, "-->");
             popFrontN(txt, 3);
             ...
          } else if (startsWith(txt, "script")) {
          ...
          }
     }
     ...
}

This is already wasteful: startsWith scans a few characters off txt, and 
then popFrontN does it again. In this particular case it's not all that 
inefficient, but clearly the approach doesn't have elegance on its side 
either, so it's a lose-lose.

There's also the case issue with e.g. the "script" tag, but I hope the 
approach outlined in the post "one step towards unification of 
std.algorithm and std.string" will take care of that.

Looking at samples like the above, some needed algorithms look like this 
(simplified by e.g. giving up custom predicates etc.):

/**
If r1 starts with r2, advance r1 past it and return true. Otherwise, 
leave r1 unchanged and return false.
*/
bool skip(R1, R2)(ref R1 r1, R2 r2) {
     auto r = r1; // .save();
     while (!r2.empty && !r.empty && r.front == r2.front) {
         r.popFront();
         r2.popFront();
     }
     if (r2.empty) {
         r1 = r;
         return true;
     }
     return false;
}

/**
If r2 can be find in r1, advance r1 past r2 and return true. Otherwise, 
leave r1 unchanged and return false.
*/
bool findSkip(R1, R2)(ref R1 r1, R2 r2) {
     auto r = r1; // .save();
     while (!r.empty && !startsWith(r1, r2)) {
         r.popFront();
         r2.popFront();
     }
     if (r2.empty) {
         r1 = r;
         return true;
     }
     return false;
}

With such functions the code looks much cleaner:

while (!txt.empty) {
     auto c = txt.front;
     txt.popFront();
     if (c == '<') {
         if (skip(txt, "!--")) {
             // This is a comment
             enforce(findSkip(txt, "-->"));
             ...
          } else if (skip(txt, "script")) {
          ...
          }
     }
     ...
}

But they break the tradition because now an algorithm may alter or not a 
range, and client code must be aware of that - one more thing to worry 
about.

What do you think? Should we go with by-ref passing or not? Other ideas?

Andrei