Chunks, stream fusion, sortedness invariance

Sun Mar 10 08:33:29 PDT 2013

One of the best ways to see how the development of Phobos is 
going is to actually use it in some program.

There are many ways to write a program that generates text with a 
Markov chain. After trying various alternative implementations in 
D, I have produced this short version:

import std.stdio, std.file, std.array, std.string, std.range, 
std.algorithm,
        std.typecons, std.random;

void main() {
     enum K = 7; // K is order + 1.
     enum nGenerate = 1_000;

     const text = "moby_dick.txt"
                  .readText
                  .toLower
                  .replace("\n", " ")
                  .removechars("^a-z ")
                  .squeeze(" ");

     auto pieces = iota(text.length - K + 1)
                   .map!(i => text[i .. i + K])
                   .array
                   .sort
                   .group
                   .array
                   .assumeSorted!((a, b) => a[0][0..K-1] < 
b[0][0..K-1]);

     string precedent = pieces[pieces.map!q{a[1]}.dice][0];
     std.stdio.write(precedent);

     foreach (_; 0 .. nGenerate) {
         alias P = typeof(pieces[0]);
         auto selected = pieces.equalRange(P(precedent[1 .. $] ~ ' 
', 0));
         precedent = selected[selected.map!q{a[1]}.dice][0];
         std.stdio.write(precedent[$ - 1]);
     }
}

It processes about 1.2 MB of text and generates 1000 random chars 
in about 2 seconds. It's not the fastest program, but for my 
purposes it's fast enough. And it's short and not hard to read.

The first part loads the text, and filters it, producing a clean 
string.

The second part generates the overlapping string slices, sorts 
and groups them, finding their frequency. And then assumes the 
groups as sorted.

The assumeSorted is fed with a lambda that is not strictly 
necessary, but it should speed up the program a little, because 
the second field of the tuples is ignored.

Then we create the seed of the Markov chain, considering just the 
frequencies of all slices. We use dice for a biased extraction.

Then the loop uses the precedent slice to extract the correctly 
biased successive slice.

equalRange() performs a binary search on the pairs, and then a 
gallop to find the second end of the interval of the slices that 
will be used by the biased dice. This is very convenient and 
probably fast. I don't know of other languages (maybe beside C++) 
with a so nice equalRange().

This program is nice (and it shows that Phobos has come a long 
way compared to one or two years ago), but there are several 
things that could and should be improved, listed below.

- - - - - - - - - - - - - - -

The program uses this to generate the chunks:

auto pieces = iota(text.length - K + 1)
               .map!(i => text[i .. i + K])

In Phobos there is a function that gives the chunks:
http://dlang.org/phobos/std_range.html#.chunks

Chunks!(Source) chunks(Source)(Source source, size_t chunkSize);

But unfortunately it can't be used here because it lacks a very 
useful nOverlap optional field that tells it how much overlap the 
slices should have:

Chunks!(Source) chunks(Source)(Source source,
                                in size_t chunkSize,
                                in size_t nOverlap = 0);

With that extra argument the code becomes as it should be:

auto pieces = text
               .chunks(K, K - 1)

The relative request from 2011:
http://d.puremagic.com/issues/show_bug.cgi?id=6621

- - - - - - - - - - - - - - -

This shows just the first part of the program, with the text 
cleaning:

import std.file: readText;
import std.string: toLower, replace, removechars, squeeze;

void main() {
     const clean = "text.txt"
                   .readText
                   .toLower
                   .replace("\n", " ")
                   .removechars("^a-z ")
                   .squeeze(" ");

     // import std.stdio; writeln(clean);
}

It's not a fast part of the program. In D there are many ways to 
write it faster. One simple way is to use cast(read()) instead 
readText(), because this program assumes the input to be ASCII 
any way. Another way to speed up the code a little is to use 
inPlaceToLower. Such two ideas halve the creation time of the 
string 'clean'.

But D is a multi-level language, so you can go down to implement 
a finite state machine that uses the program counter and goto to 
jump between about 13 states (and I am not even talking about 
inline asm). If you don't want to go that low, using a simple 
switch is enough to create the string 'clean' about 10 or 12 
times faster than the original program:

import std.file: read;
import std.exception: assumeUnique;

void main() {
     auto txt = cast(char[])read("moby_dick.txt");

     int precedent = -1;
     size_t pos = 0;

     for (size_t i = 0; i < txt.length; i++) {
         immutable c = txt[i];

         switch (c) {
             case 'a': .. case 'z':
                 txt[pos] = c;
                 pos++;
                 precedent = c;
                 break;
             case 'A': .. case 'Z':
                 txt[pos] = cast(char)(c + (cast(char)'a' - 'A'));
                 precedent = txt[pos];
                 pos++;
                 break;
             case ' ':
                 if (precedent != ' ') {
                     txt[pos] = c;
                     pos++;
                 }
                 precedent = c;
                 break;
             case '\n', '\r':
                 if (precedent != ' ') {
                     txt[pos] = ' ';
                     pos++;
                 }
                 precedent = ' ';
                 break;
             default:
         }
     }

     const clean = assumeUnique(txt[0 .. pos]);
     // import std.stdio; writeln(clean);
}

This kind of code is fast, but it's long, and to be written it 
requires a bit of thinking. The original code with the chain of 
string functions is easy to write (if you don't miss a processing 
step!), but it's kind of slow. A modern language should offer a 
third way: simple enough code that is fast enough (even if it's 
not as fast as the low-level handwritten code).

The main problem of code like this is that it creates a new large 
string at each step:

"text.txt".readText.toLower.replace("\n", " ")
.removechars("^a-z ").squeeze(" ")

Lazy functional languages like Haskell show that there is a 
better way. In Phobos there are many lazy functions, but if you 
want to process a string, you have to go back to mostly eager 
functions. So I think there's a need for more lazy string 
processing functions.

But I think adding more lazy string processing functions to 
Phobos is not enough. If you write that code in D with lazy 
functions you risk having a program that's slower than the 
current string processing code. So I think to face this program D 
front end needs some deforestation (also named stream fusion, 
www.google.com/search?q=haskell+deforestation ) logic that works 
on pure lazy ranges. (Deforestation is possible even when there 
is no laziness, like in the original code. But purity is needed 
to simplify the analysis a lot).

- - - - - - - - - - - - - - -

Now let's take another look at the second part of the program:

auto pieces = iota(text.length - K + 1)
               .map!(i => text[i .. i + K])
               .array
               .sort
               .group
               .array
               .assumeSorted!((a, b) => a[0][0..K-1] < 
b[0][0..K-1]);

The missing argument for chunks is not the only visible problem.

We sort, group, convert to an array, and then we have to state 
again that the array of (slice,frequency) pairs is sorted again. 
But both group and array don't change the order of items, so the 
result of the array() is already sorted.

So in theory this should be enough for the successive binary 
searches of equalRange():

auto pieces = iota(text.length - K + 1)
               .map!(i => text[i .. i + K])
               .array
               .sort
               .group
               .array;

I have raised this topic already few days ago, "group sortedness 
invariance":

http://forum.dlang.org/thread/zmiqbifxevljazceowif@forum.dlang.org

I think group(SortedRange) should give a lazy (input range) 
SortedRange of pairs. This will be possible after this 
enhancement, that will allow SortedRange to be an input range too:

http://d.puremagic.com/issues/show_bug.cgi?id=9616

But that's not enough, because array() is going to convert the 
lazy SortedRange into a regular array, forgetting that it's 
sorted.

One simple solution is to introduce a function like sortedArray() 
that converts a lazy range into a random-access eager range that 
is also a SortedRange. But maybe this is a bit too much brute 
solution. Maybe there are more general solutions. Ideas welcome.

Bye,
bearophile