Speeding up text file parser (BLAST tabular format)
Fredrik Boulund via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Sep 14 05:30:19 PDT 2015
Hi,
This is my first post on Dlang forums and I don't have a lot of
experience with D (yet). I mainly code bioinformatics-stuff in
Python on my day-to-day job, but I've been toying with D for a
couple of years now. I had this idea that it'd be fun to write a
parser for a text-based tabular data format I tend to read a lot
of in my programs, but I was a bit stumped to find that the D
implementation I created was slower than my Python version. I
tried running `dmd -profile` on it but didn't really understand
what I can do to make it go faster. I guess there's some
unnecessary dynamic array extensions being made but I can't
figure out how to do without them, maybe someone can help me out?
I tried making the examples as small as possible.
Here's the D code: http://dpaste.com/2HP0ZVA
Here's my Python code for comparison: http://dpaste.com/0MPBK67
Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU
E5-2670 with RAID6 SAS disks and 192GB of RAM), the D version
runs in about 20 seconds and the Python version in less than 16
seconds. I've repeated runs at least thrice when testing. This
holds true even if the D version is compiled with -O.
The file being parsed is the output of a DNA/protein sequence
mapping algorithm called BLAT, but the tabular output format is
originally known from the famous BLAST algorithm.
Here's a short example of what the input files look like:
http://dpaste.com/017N58F
The format is TAB-delimited: query, target, percent_identity,
alignment_length, mismatches, gaps, query_start, query_end,
target_start, target_end, e-value, bitscore
In the example the output is sorted by query, but this cannot be
assumed to hold true for all cases. The input file varies in
range from several hundred megabytes to several gigabytes (10+
GiB).
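To make the format concrete, here is a minimal sketch (in Python, mirroring my comparison script) of how one TAB-delimited line maps onto the twelve fields listed above. The field names and the example line are my own illustration, not taken from the actual input file:

```python
from collections import namedtuple

# Field order as given in the BLAST tabular format description above.
Hit = namedtuple("Hit", "query target pident alen mismatches gaps "
                        "qstart qend tstart tend evalue bitscore")

def parse_line(line):
    """Split one TAB-delimited line into a typed record."""
    f = line.rstrip("\n").split("\t")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))

# Hypothetical example line with made-up values:
example = "q1\tt1\t97.5\t42\t1\t0\t1\t42\t10\t51\t1e-20\t80.3"
hit = parse_line(example)
```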
A brief explanation of what the code does:
1. Parse each line.
2. Only accept records with percent_identity >= min_identity (90.0)
   and alignment_length >= min_matches (10).
3. Store all such records as tuples (a struct in the D code) in an
   array in an associative array indexed by 'query'.
4. For each query, remove any records whose percent_identity is more
   than 5 percentage points below the highest value observed for
   that query.
5. Write the results to stdout (in my real code the data is subject
   to further downstream processing).
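The steps above can be sketched roughly like this (again in Python, with rows reduced to (query, percent_identity, alignment_length) triples for brevity; the function name and defaults are just illustrative, not my actual code):

```python
from collections import defaultdict

def filter_hits(rows, min_identity=90.0, min_matches=10):
    """Keep rows passing the absolute cutoffs, grouped by query; then,
    per query, drop rows whose percent identity is more than 5
    percentage points below that query's best."""
    by_query = defaultdict(list)
    # Pass 1: absolute cutoffs, grouped into an associative container.
    for row in rows:
        query, pident, alen = row[0], row[1], row[2]
        if pident >= min_identity and alen >= min_matches:
            by_query[query].append(row)
    # Pass 2: per-query relative cutoff against the best identity seen.
    for query, hits in by_query.items():
        best = max(h[1] for h in hits)
        by_query[query] = [h for h in hits if h[1] >= best - 5.0]
    return by_query

rows = [("q1", 99.0, 20), ("q1", 92.0, 20), ("q1", 95.0, 20),
        ("q1", 85.0, 20), ("q2", 91.0, 5)]
result = filter_hits(rows)
# q1 keeps 99.0 and 95.0; 92.0 is >5 points below 99.0, 85.0 fails
# the absolute cutoff; q2's only row fails min_matches.
```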
This was all just me learning to do some basic stuff in D,
e.g. file handling, streaming data from disk, etc. I'm really
curious what I can do to improve the D code. My original idea was
that maybe I should compile the performance critical parts of my
Python codebase to D and call them with PyD or something, but now
I'm not so sure anymore. Help and suggestions appreciated!