Chunks, stream fusion, sortedness invariance
bearophile
bearophileHUGS at lycos.com
Sun Mar 10 08:33:29 PDT 2013
One of the best ways to see how the development of Phobos is
going is to actually use it in some program.
There are many ways to write a program that generates text with a
Markov chain. After trying various alternative implementations in
D, I have produced this short version:
import std.stdio, std.file, std.array, std.string, std.range,
       std.algorithm, std.typecons, std.random;

void main() {
    enum K = 7; // K is order + 1.
    enum nGenerate = 1_000;

    // Load the text and clean it up.
    const text = "moby_dick.txt"
                 .readText
                 .toLower
                 .replace("\n", " ")
                 .removechars("^a-z ")
                 .squeeze(" ");

    // Generate the overlapping slices of length K, then sort and
    // group them to find the frequency of each distinct slice.
    auto pieces = iota(text.length - K + 1)
                  .map!(i => text[i .. i + K])
                  .array
                  .sort
                  .group
                  .array
                  .assumeSorted!((a, b) => a[0][0 .. K-1] < b[0][0 .. K-1]);

    // Pick a seed slice, biased by the slice frequencies.
    string precedent = pieces[pieces.map!q{a[1]}.dice][0];
    std.stdio.write(precedent);

    foreach (_; 0 .. nGenerate) {
        alias P = typeof(pieces[0]);
        // Binary search for all the slices whose first K-1 chars
        // equal the last K-1 chars of the precedent slice.
        auto selected = pieces.equalRange(P(precedent[1 .. $] ~ ' ', 0));
        precedent = selected[selected.map!q{a[1]}.dice][0];
        std.stdio.write(precedent[$ - 1]);
    }
}
It processes about 1.2 MB of text and generates 1000 random chars
in about 2 seconds. It's not the fastest program, but for my
purposes it's fast enough. And it's short and not hard to read.
The first part loads the text and filters it, producing a clean
string.
The second part generates the overlapping string slices, sorts
and groups them to find their frequencies, and then asserts that
the resulting groups are sorted.
assumeSorted is fed a lambda that is not strictly necessary, but
it should speed up the program a little, because it ignores the
second field of the tuples (the frequency).
Then we create the seed of the Markov chain, considering just the
frequencies of all the slices; dice() performs the biased
extraction. Then the loop uses the precedent slice to extract a
correctly biased successive slice.
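As an aside, here is a minimal self-contained example of the kind
of biased extraction that dice() performs (the weights here are
made up):

import std.random, std.stdio;

void main() {
    // dice() returns index i with probability proportional to the
    // i-th weight, so here 0 comes out about 70% of the time.
    auto weights = [70, 20, 10];
    writeln(weights.dice);
}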
equalRange() performs a binary search on the pairs, and then a
gallop to find the other end of the interval of slices that the
biased dice will use. This is very convenient and probably fast.
I don't know of other languages (maybe besides C++) with such a
nice equalRange().
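Here is a minimal self-contained example of equalRange() on a
SortedRange, for readers that have not used it:

import std.algorithm, std.range;

void main() {
    auto r = [1, 2, 2, 2, 3, 5].assumeSorted;
    // equalRange binary-searches both ends of the run of 2s.
    assert(equal(r.equalRange(2), [2, 2, 2]));
    assert(r.equalRange(4).empty);
}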
This program is nice (and it shows that Phobos has come a long
way compared to one or two years ago), but there are several
things that could and should be improved, listed below.
- - - - - - - - - - - - - - -
The program uses this to generate the chunks:
auto pieces = iota(text.length - K + 1)
              .map!(i => text[i .. i + K])
In Phobos there is a function that gives the chunks:
http://dlang.org/phobos/std_range.html#.chunks
Chunks!(Source) chunks(Source)(Source source, size_t chunkSize);
But unfortunately it can't be used here, because it lacks a very
useful optional nOverlap argument that tells it how much the
successive slices should overlap:

Chunks!(Source) chunks(Source)(Source source,
                               in size_t chunkSize,
                               in size_t nOverlap = 0);
With that extra argument the code becomes as it should be:
auto pieces = text
              .chunks(K, K - 1)
The related enhancement request, from 2011:
http://d.puremagic.com/issues/show_bug.cgi?id=6621
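Until something like that is available, it's not hard to write
the helper yourself. Here is a minimal sketch (overlappingChunks
is a hypothetical name; it assumes the source supports slicing
and has a known length):

import std.algorithm, std.range, std.stdio;

/// Hypothetical helper: lazily yields slices of length chunkSize,
/// each one overlapping the precedent slice by nOverlap items.
auto overlappingChunks(R)(R source, in size_t chunkSize,
                          in size_t nOverlap = 0) {
    assert(chunkSize > nOverlap && source.length >= chunkSize);
    return iota(0, source.length - chunkSize + 1, chunkSize - nOverlap)
           .map!(i => source[i .. i + chunkSize]);
}

void main() {
    "abcde".overlappingChunks(3, 2).writeln; // ["abc", "bcd", "cde"]
}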
- - - - - - - - - - - - - - -
This shows just the first part of the program, with the text
cleaning:
import std.file: readText;
import std.string: toLower, replace, removechars, squeeze;

void main() {
    const clean = "text.txt"
                  .readText
                  .toLower
                  .replace("\n", " ")
                  .removechars("^a-z ")
                  .squeeze(" ");
    // import std.stdio; writeln(clean);
}
It's not a fast part of the program, and in D there are many ways
to write it faster. One simple way is to use cast(string)read()
instead of readText(), because this program assumes the input to
be ASCII anyway. Another way to speed up the code a little is to
use std.string.toLowerInPlace. Together these two ideas halve the
creation time of the string 'clean'.
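Here is a minimal sketch of those two ideas combined (it assumes
the input file is plain ASCII, otherwise the cast is not correct):

import std.file: read;
import std.string: toLowerInPlace, replace, removechars, squeeze;
import std.exception: assumeUnique;

void main() {
    // No UTF validation: the input is assumed to be ASCII.
    auto txt = cast(char[])read("text.txt");
    toLowerInPlace(txt); // One less intermediate string.
    const clean = txt.assumeUnique
                     .replace("\n", " ")
                     .removechars("^a-z ")
                     .squeeze(" ");
    // import std.stdio; writeln(clean);
}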
But D is a multi-level language, so you can go down and implement
a finite state machine that uses the program counter and goto to
jump between about 13 states (and I am not even talking about
inline asm). If you don't want to go that low, a simple switch is
enough to create the string 'clean' about 10 or 12 times faster
than the original program:
import std.file: read;
import std.exception: assumeUnique;

void main() {
    auto txt = cast(char[])read("moby_dick.txt");
    int precedent = -1; // The last char written (-1 means none yet).
    size_t pos = 0;     // Next write position; txt is compacted in place.

    for (size_t i = 0; i < txt.length; i++) {
        immutable c = txt[i];
        switch (c) {
            case 'a': .. case 'z':
                txt[pos] = c;
                pos++;
                precedent = c;
                break;
            case 'A': .. case 'Z':
                txt[pos] = cast(char)(c + ('a' - 'A')); // To lower case.
                precedent = txt[pos];
                pos++;
                break;
            case ' ':
                if (precedent != ' ') { // Squeeze runs of spaces.
                    txt[pos] = c;
                    pos++;
                }
                precedent = c;
                break;
            case '\n', '\r':
                if (precedent != ' ') { // Newlines become single spaces.
                    txt[pos] = ' ';
                    pos++;
                }
                precedent = ' ';
                break;
            default: // Every other char is dropped.
                break;
        }
    }

    const clean = assumeUnique(txt[0 .. pos]);
    // import std.stdio; writeln(clean);
}
This kind of code is fast, but it's long, and writing it requires
a bit of thinking. The original chain of string functions is easy
to write (as long as you don't miss a processing step!), but it's
rather slow. A modern language should offer a third way: code
that's simple enough to write and fast enough to run (even if
it's not as fast as the low-level handwritten code).
The main problem of code like this is that it creates a new large
string at each step:
"text.txt".readText.toLower.replace("\n", " ")
.removechars("^a-z ").squeeze(" ")
Lazy functional languages like Haskell show that there is a
better way. Phobos contains many lazy functions, but if you want
to process a string, you mostly have to fall back to eager
functions. So I think there's a need for more lazy string
processing functions.
But I think adding more lazy string processing functions to
Phobos is not enough. If you write that code in D with lazy
functions, you risk getting a program that's slower than the
current string processing code. So I think that to solve this
problem the D front end needs some deforestation (also named
stream fusion, www.google.com/search?q=haskell+deforestation )
logic that works on pure lazy ranges. (Deforestation is possible
even where there is no laziness, as in the original code, but
purity simplifies the analysis a lot.)
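To show the style I mean, here is a rough lazy equivalent of the
cleaning chain, built from the lazy map/filter/uniq of
std.algorithm (it assumes ASCII input; without fusion, a pipeline
like this is not guaranteed to be faster than the eager chain):

import std.algorithm, std.conv, std.file;

void main() {
    // One lazy pipeline: each char flows through all the stages,
    // and no intermediate megabyte-sized strings get allocated.
    const clean = (cast(string)read("text.txt"))
        .map!(c => (c == '\n') ? ' ' : c)                  // replace("\n", " ")
        .map!(c => (c >= 'A' && c <= 'Z') ?                // toLower, for ASCII
                   cast(dchar)(c + 32) : c)
        .filter!(c => c == ' ' || (c >= 'a' && c <= 'z'))  // removechars("^a-z ")
        .uniq!((a, b) => a == ' ' && b == ' ')             // squeeze(" ")
        .to!string;
}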
- - - - - - - - - - - - - - -
Now let's take another look at the second part of the program:
auto pieces = iota(text.length - K + 1)
              .map!(i => text[i .. i + K])
              .array
              .sort
              .group
              .array
              .assumeSorted!((a, b) => a[0][0 .. K-1] < b[0][0 .. K-1]);
The missing argument of chunks is not the only visible problem.
We sort, group, convert to an array, and then we have to state
once more that the array of (slice, frequency) pairs is sorted.
But neither group nor array changes the order of the items, so
the result of array() is already sorted.
So in theory this should be enough for the successive binary
searches of equalRange():
auto pieces = iota(text.length - K + 1)
              .map!(i => text[i .. i + K])
              .array
              .sort
              .group
              .array;
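A quick sanity check that sort-then-group keeps the pairs in
order:

import std.algorithm, std.stdio;

void main() {
    // group on a sorted array yields (value, count) pairs whose
    // values are still in increasing order.
    auto pairs = [3, 1, 2, 1].sort().group;
    writeln(pairs); // The pairs (1,2), (2,1), (3,1), in order.
}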
I already raised this topic a few days ago, in "group sortedness
invariance":
http://forum.dlang.org/thread/zmiqbifxevljazceowif@forum.dlang.org
I think group(SortedRange) should give a lazy SortedRange (an
input range) of pairs. This will become possible after this
enhancement, which will allow a SortedRange to be an input range
too:
http://d.puremagic.com/issues/show_bug.cgi?id=9616
But that's not enough, because array() is going to convert the
lazy SortedRange into a regular array, forgetting that it's
sorted.
One simple solution is to introduce a function like sortedArray()
that converts a lazy sorted range into a random-access eager
range that is also a SortedRange. But maybe that's too blunt a
solution, and there are more general ones. Ideas welcome.
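A minimal sketch of that idea (sortedArray is a hypothetical
name):

import std.array, std.range;

/// Hypothetical: eagerly copies a lazy range that is already
/// known to be sorted according to pred into an array, and tags
/// the result as a SortedRange again, so the sortedness is not
/// forgotten across the eager conversion.
auto sortedArray(alias pred = "a < b", R)(R r) if (isInputRange!R) {
    return r.array.assumeSorted!pred;
}

With something like this, the second part of the program could
end with .sort.group.sortedArray and keep its sortedness without
restating it.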
Bye,
bearophile