Speeding up text file parser (BLAST tabular format)

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Sep 14 11:27:50 PDT 2015


I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the
hotspot is in std.array.split, contrary to expectations. :-)  Here are
the first few lines of the gprof output:

-----snip-----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.77      0.43     0.43 76368364     0.00     0.00  nothrow @safe int std.array.split!(char[]).split(char[]).__foreachbody2(ref ulong, ref dchar)
 16.74      0.79     0.36                             nothrow void gc.gc.Gcx.mark(void*, void*, int)
 10.23      1.01     0.22                             _aApplycd2
  8.14      1.18     0.18                             _d_arrayappendcTX
  4.19      1.27     0.09                             nothrow ulong gc.gc.Gcx.fullcollect()
-----snip-----

As you can see, std.array.split takes up almost 20% of the total running
time.

The surprising (or not-so-surprising?) runner-up hotspot is the GC's
marking algorithm. I'm guessing this is because of the extensive use of
small GC-allocated arrays (the per-line arrays of substrings produced by
splitting, and the like).

The 3rd entry, _aApplycd2, is the druntime helper that a foreach over a
char[] with a dchar loop variable gets lowered to; it decodes the UTF-8
into dchars on the fly, so I'm not sure what could be done about it.
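
(For illustration, here's the kind of loop that ends up in that helper;
a standalone sketch, not code from the parser:)

-------snip-------
void main()
{
    char[] line = "field1\tfield2".dup;

    // An index plus a dchar loop variable: the compiler lowers this to
    // the runtime helper _aApplycd2, which decodes the UTF-8 bytes
    // into dchars on the fly.
    foreach (i, dchar c; line)
    {
        // ... per-character work ...
    }

    // A char loop variable iterates the raw bytes directly; no helper,
    // no decoding.
    foreach (i, char c; line)
    {
        // ... per-byte work ...
    }
}
-------snip-------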

Anyway, just for kicks, I tried various ways of reducing the cost of
std.array.split, but didn't get very far. Replacing it with
std.regex.split didn't help. Looking at std.array.split's
implementation: while it does allocate a new array on each call, the
substrings themselves are slices over the input, so it isn't as
inefficient as I first thought.
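
Just to sketch the direction one could push further (my own
illustration, not code from the thread): std.algorithm.splitter splits
lazily, yielding one slice at a time, so the per-line array of
substrings never has to be allocated at all. Something like this,
assuming tab-separated fields as in BLAST tabular output (parseNoAlloc
is a made-up name):

-------snip-------
import std.algorithm : splitter;
import std.stdio : File;

void parseNoAlloc(string filename)
{
    // Reused fixed-size buffer for the 12 BLAST tabular columns,
    // instead of a freshly GC-allocated array per line.
    const(char)[][12] fields;

    foreach (line; File(filename).byLine)
    {
        size_t n = 0;
        // splitter is lazy: each field is a slice over byLine's
        // buffer, so no allocation happens here.
        foreach (field; line.splitter('\t'))
        {
            if (n == fields.length) break;
            fields[n++] = field;
        }
        if (n < fields.length) continue; // skip short/malformed lines

        // fields[] is valid only until byLine advances to the next
        // line; anything kept longer must be copied (e.g. .idup).
    }
}
-------snip-------

The catch, of course, is that the slices point into byLine's reused
buffer, so anything stored past the current iteration still has to be
copied, which is where the small allocations would creep back in.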

Next I tried disabling the GC with core.memory.GC.disable().
Immediately, I got a 20% performance boost.  Of course, running with a
disabled GC will soon eat up all memory and crash, which isn't an option
in real-world usage, so the next best solution is to schedule GC
collection cycles manually, say by calling GC.collect() every n
iterations of the parsing loop, for some value of n. (GC.disable() only
suppresses the automatic collections triggered by allocation; an
explicit GC.collect() still runs.)

I tried implementing a crude version of this (see code below), and found
that manually calling GC.collect() even as frequently as once every
5,000 loop iterations (on a 500,000-line test input file) still gives
about a 15% performance improvement, versus the 20% from disabling the
GC outright.  Since most of the arrays involved here are pretty small,
the frequency could probably be reduced to once every 50,000 iterations;
that should recover essentially the full 20% boost while still not
running out of memory too quickly.

Note that all measurements were done with `gdc -O3`.  I did try the
original code with `dmd -O`, and found that it was 20% slower than the
gdc version, so I didn't look further.

Anyway, here's the code with the manual GC collection schedule. (I
modified main() slightly so that I could test the code with different
input files easily; the input filename is now passed as a command-line
argument.)

-------snip-------
#!/usr/bin/env rdmd
// Programmed in the D language
// Fredrik Boulund 2015-09-09
// Modified by H. S. Teoh 2015-09-14

import core.memory; // for GC control
import std.stdio;
import std.array;
import std.conv;
import std.algorithm;

struct Hit 
{
    string target;
    float pid;
    int matches, mismatches, gaps, qstart, qstop, tstart, tstop;
    double evalue, bitscore;
}

// Run a collection once every 5,000 parsing-loop iterations.
enum manualGcCount = 5_000;
ulong ticksToCollect = manualGcCount;

void gcTick()
{
    if (--ticksToCollect == 0)
    {
        GC.collect();
        ticksToCollect = manualGcCount;
    }
}

void custom_parse(string filename)
{
    float min_identity = 90.0;
    int min_matches = 10;
    Hit[][string] hitlists;

    // byLine reuses its internal buffer; split merely slices it, so the
    // to!string conversions below copy anything that must outlive the line.
    foreach (record; filename
                     .File
                     .byLine
                     .map!split
                     .filter!(l => l[2].to!float >= min_identity && 
                                   l[3].to!int >= min_matches))
    {
        hitlists[record[0].to!string] ~= Hit(record[1].to!string,
                                             record[2].to!float,
                                             record[3].to!int,
                                             record[4].to!int,
                                             record[5].to!int,
                                             record[6].to!int,
                                             record[7].to!int,
                                             record[8].to!int,
                                             record[9].to!int,
                                             record[10].to!double,
                                             record[11].to!double);
        gcTick();
    }

    foreach (query; hitlists.byKey)
    {
        // Seed with -float.max (the lowest finite float) to match pid's type.
        float max_pid = reduce!((a, b) => max(a, b.pid))(-float.max, hitlists[query]);
        float max_pid_diff = 5.00;
        hitlists[query] = hitlists[query].filter!(h => h.pid >= (max_pid - max_pid_diff)).array();
        writeln(query, ": ", hitlists[query].length);

        gcTick();
    }

}

void main(string[] args)
{
    auto filename = args[1];
    //auto filename = "blat.blast8";
    //auto filename = "./QE_150828_46.blast8";

    // GC.disable() only stops automatic collections; the explicit
    // GC.collect() calls in gcTick() above still run.
    GC.disable();
    custom_parse(filename);
}
-------snip-------
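
To reproduce the numbers above, compile with `gdc -O3` and pass the
input filename on the command line. (The rdmd shebang at the top also
lets you run the file directly as a script, but rdmd builds with dmd,
which was already 20% slower than gdc even with -O.)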


T

-- 
Stop staring at me like that! It's offens... no, you'll hurt your eyes!

