Speeding up text file parser (BLAST tabular format)

Fredrik Boulund via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Sep 14 05:30:19 PDT 2015


Hi,

This is my first post on Dlang forums and I don't have a lot of 
experience with D (yet). I mainly code bioinformatics-stuff in 
Python on my day-to-day job, but I've been toying with D for a 
couple of years now. I had this idea that it'd be fun to write a 
parser for a text-based tabular data format I tend to read a lot 
of in my programs, but I was a bit stumped to find that the D 
implementation I created was slower than my Python version. I
tried running `dmd -profile` on it but didn't really understand 
what I could do to make it go faster. I guess there are some 
unnecessary dynamic array extensions being made, but I can't 
figure out how to do without them; maybe someone can help me out?
I tried making the examples as small as possible.
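
To illustrate what I mean, std.array.appender is the kind of 
thing I've been looking at (just a sketch with a stand-in 
element type, not my actual code):

import std.array : appender;

void main()
{
    // What I suspect my code does with plain ~= appends:
    //   Record[] records;
    //   records ~= rec;    // may trigger repeated reallocations
    // An Appender amortizes the growth instead:
    auto buf = appender!(int[])();
    foreach (i; 0 .. 1_000_000)
        buf.put(i);        // appends without reallocating every time
    auto all = buf.data;   // the accumulated slice
}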

Here's the D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67

Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU 
E5-2670 with RAID6 SAS disks and 192 GB of RAM), the D version 
runs in about 20 seconds and the Python version in less than 16 
seconds. I've repeated each run at least three times. This holds 
true even when the D version is compiled with -O.

The file being parsed is the output of a DNA/protein sequence 
mapping algorithm called BLAT, but the tabular output format is 
originally known from the famous BLAST algorithm.
Here's a short example of what the input files look like: 
http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity, 
alignment_length, mismatches, gaps, query_start, query_end, 
target_start, target_end, e-value, bitscore.
In the example the output is sorted by query, but this cannot be 
assumed to hold true for all cases. The input file varies in 
range from several hundred megabytes to several gigabytes (10+ 
GiB).
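
To make the column layout concrete, I think of each row as a 
struct like this (the field names are my own, not from any spec):

// One row of BLAST/BLAT tabular output.
struct Record
{
    string query;
    string target;
    double percentIdentity;
    int    alignmentLength;
    int    mismatches;
    int    gaps;
    int    queryStart;
    int    queryEnd;
    int    targetStart;
    int    targetEnd;
    double eValue;
    double bitscore;
}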

A brief explanation of what the code does:
- Parse each line.
- Only accept records with percent_identity >= min_identity 
  (90.0) and alignment_length >= min_matches (10).
- Store all such records as tuples (in the D code this is a 
  struct) in an array in an associative array indexed by 'query'.
- For each query, remove any records whose percent_identity is 
  more than 5 percentage points below the highest value observed 
  for that query.
- Write the results to stdout (in my real code the data is 
  subject to further downstream processing).
A rough sketch of this flow follows below.
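
Put together, the flow looks roughly like this (a self-contained 
sketch, not my actual code; "input.tab" is a placeholder 
filename, and the struct keeps only the fields the filters need):

import std.algorithm : filter, map, maxElement;
import std.array : array, split;
import std.conv : to;
import std.stdio : File, stdout;

struct Hit { string target; double pid; int matches; }

enum minIdentity = 90.0;
enum minMatches  = 10;

void main()
{
    Hit[][string] hits;   // records grouped by query

    foreach (line; File("input.tab").byLine)
    {
        auto f = line.split("\t");
        auto h = Hit(f[1].idup, f[2].to!double, f[3].to!int);
        if (h.pid >= minIdentity && h.matches >= minMatches)
            hits[f[0].idup] ~= h;   // .idup because byLine reuses its buffer
    }

    // Per query, drop hits more than 5 percentage points below the best pid.
    foreach (query, ref list; hits)
    {
        const best = list.map!(h => h.pid).maxElement;
        list = list.filter!(h => h.pid >= best - 5.0).array;
    }

    // (The real code does further processing; just print counts here.)
    foreach (query, list; hits)
        stdout.writeln(query, "\t", list.length);
}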

This was all just for me learning to do some basic stuff in D, 
e.g. file handling, streaming data from disk, etc. I'm really 
curious what I can do to improve the D code. My original idea 
was that maybe I could rewrite the performance-critical parts of 
my Python codebase in D and call them with PyD or something, but 
now I'm not so sure any more. Help and suggestions appreciated!


