Regex DNA benchmark

Sat Dec 1 04:19:06 PST 2007

(The following notes are relative to DMD V.1.024).
The Shooutout benchmarks are very interesting, they offer many ways to learn about the languages, how to write fast code, ecc. One of the many good sides of those benchmarks is that they compare different languages, so you can quickly see when/where your language is "abnormally" slow, helping you to improve that too.

This time I have tried the Regex DNA benchmark, it reads a simulated FASTA file, and processes it with REs. Despite being a toy, this program is related to real code bioinformatics do often. On the Shootout site there are two D versions, one of them uses std.regexp, the other PCRE. Even if the PCRE-based test is slower than the Python/Perl versions (but they use the trick of splitting the RE, so they can't be fully compared), it shows D will need a much much faster std.regexp (10 seconds with PCRE, versus 117 seconds with std.regexp).

I have tried to improve the std.regexp-based version, because I like D for allowing me to do medium-level programming. To create the D version I have followed the Python version: when the std library of D is well done, then the D sourcecode for this benchmark can be as short as the Python one. This gives what the D blurb offers: >D is a systems programming language. Its focus is on combining the power and high performance of C and C++ with the programmer productivity of modern languages like Ruby and Python.<

This is the D code (only 2-space indent), I have left some comments that show some D problems:

import std.stdio, std.string, std.cstream;
import std.regexp: RegExp, search, resplit = split;

void main() {
  string[] sseq;
  int n;
  char[1 << 15] cbuf;

  // auto seq = din.toString(); // SLOW
  while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0)
    // sseq ~= cbuf[0 .. n][]; // slow
    sseq ~= cbuf[0 .. n].dup;
  auto seq = sseq.join("");
  auto ilen = seq.length;

  //seq = sub(seq, ">.*\n|\n", "", "g"); // SLOW!!
  seq = resplit(seq, ">.*\n|\n").join("");
  int clen = seq.length;

  foreach(p; split("agggtaaa|tttaccct
                    [cgt]gggtaaa|tttaccc[acg]
                    a[act]ggtaaa|tttacc[agt]t
                    ag[act]gtaaa|tttac[agt]ct
                    agg[act]taaa|ttta[agt]cct
                    aggg[acg]aaa|ttt[cgt]ccct
                    agggt[cgt]aa|tt[acg]accct
                    agggta[cgt]a|t[acg]taccct
                    agggtaa[cgt]|[acg]ttaccct")) {
      int m = 0;
      foreach(_; RegExp(p).search(seq))
        m++;
      writefln(p, ' ', m);
  }

  foreach(el; split("B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c)
                     N(a|c|g|t) R(a|g) S(c|g) V(a|c|g) W(a|t) Y(c|t)"))
    // seq = RegExp(el[0..1], "g").replace(seq, el[1..$]); // Slow
    seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]);

  writefln("\n", ilen, "\n", clen, "\n", seq.length);
}

Note: if you don't belive my tests and my words you can try replacing the fast things with the slow commented-out ones. (The input data for this program is a FASTA file, you can generate it with the fasta benchmark on the Shootout site itself).

The din.toString() is very slow; in the code I have replaced it with that readBlock+join. If Walter wants to make din.toString() faster he may use something like:

string readAllStdin() {
  string[] seq;
  size_t len, n, len_tot;
  char[1 << 15] cbuf;

  std.gc.disable; // can be nested
  seq.length = 1; // Initial guess
  while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0) {
    if (len == seq.length)
      seq.length = seq.length * 2;
    seq[len] = cbuf[0 .. n].dup;
    len++;
    len_tot += n;
  }

  // Join seq
  string result;
  if (len_tot) {
    size_t pos;
    result.length = len_tot;

    foreach (word; seq) {
      result[pos .. pos + word.length] = word;
      pos += word.length;
    }
  }
  std.gc.enable;
  return result;
}

somestring.join("") when the joining sting is empty can become something faster like:

TyElem[] join(TyElem)(TyElem[][] words) {
  TyElem[] result;

  if (words.length) {
    size_t len, pos;
    foreach(word; words)
      len += word.length;

    result.length = len;

    foreach (word; words) {
      result[pos .. pos + word.length] = word;
      pos += word.length;
    }
  }
  return result;
}

>From my tests it seems that:
sseq ~= cbuf[0 .. n][];
is slower than:
sseq ~= cbuf[0 .. n].dup;

This is probably a bug of std.regexp, the sub() seems O(n^2):
seq = sub(seq, ">.*\n|\n", "", "g");
So I have used:
seq = resplit(seq, ">.*\n|\n").join("");

I have also used:
seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]);
because this seems a bit slower:
// seq = RegExp(el[0..1], "g").replace(seq, el[1..$]);

This is a Python version (as the fast Perl version it splits some REs to speed up, I haven't done it in the D code because it slows down the D code):

import re, sys

seq = sys.stdin.read()
ilen = len(seq)

seq = re.sub('>.*', '', seq)
seq = re.sub('\n', '', seq)
clen = len(seq)

for f in ('agggtaaa|tttaccct',
          '[cgt]gggtaaa|tttaccc[acg]',
          'a[act]ggtaaa|tttacc[agt]t',
          'ag[act]gtaaa|tttac[agt]ct',
          'agg[act]taaa|ttta[agt]cct',
          'aggg[acg]aaa|ttt[cgt]ccct',
          'agggt[cgt]aa|tt[acg]accct',
          'agggta[cgt]a|t[acg]taccct',
          'agggtaa[cgt]|[acg]ttaccct'):
    s, r = f.split('|')
    print f, len(re.findall(s, seq) + re.findall(r, seq))

sbs = "B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c) N(a|c|g|t) R(a|g) S(c|g) V(a|c|g) W(a|t) Y(c|t)"

for el in sbs.split():
    seq = re.sub(el[0], el[1:], seq)

print "\n%d\n%d\n%d" % (ilen, clen, len(seq))

The findall() is quite useful, and other tidbits show that the API of std.regexp (despite being good already) may be improved still a bit.

Bye,
bearophile