Speeding up text file parser (BLAST tabular format)

Rikki Cattermole via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Sep 14 09:33:21 PDT 2015


On 15/09/15 12:30 AM, Fredrik Boulund wrote:
> Hi,
>
> This is my first post on Dlang forums and I don't have a lot of
> experience with D (yet). I mainly code bioinformatics-stuff in Python on
> my day-to-day job, but I've been toying with D for a couple of years
> now. I had this idea that it'd be fun to write a parser for a text-based
> tabular data format I tend to read a lot of in my programs, but I was a
> bit stumped that the D implementation I created was slower than my
> Python-version. I tried running `dmd -profile` on it but didn't really
> understand what I can do to make it go faster. I guess there's some
> unnecessary dynamic array extensions being made but I can't figure out
> how to do without them, maybe someone can help me out? I tried making
> the examples as small as possible.
>
> Here's the code D code: http://dpaste.com/2HP0ZVA
> Here's my Python code for comparison: http://dpaste.com/0MPBK67
>
> Using a small test file (~550 MB) on my machine (2x Xeon(R) CPU E5-2670
> with RAID6 SAS disks and 192GB of RAM), the D version runs in about 20
> seconds and the Python version less than 16 seconds. I've repeated runs
> at least thrice when testing. This holds true even if the D version is
> compiled with -O.
>
> The file being parsed is the output of a DNA/protein sequence mapping
> algorithm called BLAT, but the tabular output format is originally known
> from the famous BLAST algorithm.
> Here's a short example of what the input files looks like:
> http://dpaste.com/017N58F
> The format is TAB-delimited: query, target, percent_identity,
> alignment_length, mismatches, gaps, query_start, query_end,
> target_start, target_end, e-value, bitscore
> In the example the output is sorted by query, but this cannot be assumed
> to hold true for all cases. The input file varies in range from several
> hundred megabytes to several gigabytes (10+ GiB).
>
> A brief explanation on what the code does:
> Parse each line,
> Only accept records with percent_identity >= min_identity (90.0) and
> alignment_length >= min_matches (10),
> Store all such records as tuples (in the D code this is a struct) in an
> array in an associative array indexed by 'query',
> For each query, remove any records with percent_id less than 5
> percentage points less than the highest value observed for that query,
> Write results to stdout (in my real code the data is subject to further
> downstream processing)
>
> This was all just for me learning to do some basic stuff in D, e.g. file
> handling, streaming data from disk, etc. I'm really curious what I can
> do to improve the D code. My original idea was that maybe I should
> compile the performance critical parts of my Python codebase to D and
> call them with PyD or something, but now I'm not so sure any more. Help
> and suggestions appreciated!
>

I believe a lot of this hasn't been covered yet.

http://dpaste.dzfl.pl/f7ab2915c3e1

1) You don't need to convert char[] to string via to!string; that 
allocates a fresh copy for every line. Cast it instead (sketch below).
2) You don't need byKey; use the foreach (key, value; aa) syntax. That 
way you won't go poking at the associative array more than necessary.
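To make both points concrete, a minimal sketch (the file name and the 
associative array are made up for illustration; remember the byLine 
buffer is reused, so only cast when the line is consumed within that 
same iteration):

import std.stdio : File;

struct Hit { string target; double pid; int length; }

void main()
{
    Hit[][string] hitsByQuery;   // hypothetical: hits grouped by query

    foreach (char[] line; File("blat.tab").byLine)
    {
        // 1) no allocation: `view` is only valid during this iteration
        auto view = cast(string) line;
        // auto copy = to!string(line);  // this would allocate per line

        // ... parse `view`, and .idup only the fields you keep ...
    }

    // 2) key and value in one go, no byKey plus an extra lookup
    foreach (query, hits; hitsByQuery)
    {
        // ... work on `hits` directly ...
    }
}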

Ok, I disabled the GC and reserved a bunch of memory up front. It 
probably won't help much, and it may even make the program fail on a 
machine with less memory, so keep that in mind.
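For reference, that tweak is just this (the amount reserved is a guess 
on my part):

import core.memory : GC;

void main()
{
    GC.disable();                    // no collections while parsing
    GC.reserve(1024 * 1024 * 1024);  // grab ~1 GiB up front instead of growing piecemeal

    // ... do the parsing ...

    GC.enable();
}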

Hmm, what else.

I'm worried about that first foreach. I don't think it needs to exist 
in its current form; an input range would be far better. Use a single 
buffer to store the Hits, with a slice of it per set of them.

If the first foreach becomes an input range, the second one gets 
slightly easier too: you can turn it into its own input range. That 
.array usage also concerns me (many an allocation there!), which is 
another reason the result should come back as an input range.
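Roughly what I mean, as a sketch (field order follows your format 
description; the names, thresholds and file handling are placeholders, 
not tested code):

import std.stdio : File;
import std.algorithm : splitter, map, filter;
import std.conv : to;
import std.typecons : tuple;

struct Hit { string target; double pid; int length; }

auto parseHits(string path, double minIdentity, int minMatches)
{
    return File(path)
        .byLine
        .map!((line) {
            auto fields = line.splitter('\t');
            auto query  = fields.front.idup;      fields.popFront();
            auto target = fields.front.idup;      fields.popFront();
            auto pid    = fields.front.to!double; fields.popFront();
            auto len    = fields.front.to!int;
            return tuple(query, Hit(target, pid, len));
        })
        .filter!(t => t[1].pid >= minIdentity && t[1].length >= minMatches);
}

The second stage can then consume this lazily, grouping by query, 
without ever materialising one big intermediate array via .array.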

Let's assume the last foreach is a dummy. Keep in mind that stdout is 
expensive here: do not write to it per record. If you must produce 
output, buffer it and write in large quantities.
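If you do have to print, something along these lines keeps the number 
of writes down (the buffer size and dummy data are picked arbitrarily):

import std.array : appender;
import std.stdio : stdout;

void main()
{
    auto results = ["q1\tt1\t97.5", "q2\tt2\t93.0"];  // stand-in for your filtered hits
    auto buf = appender!(char[])();

    foreach (record; results)
    {
        buf ~= record;
        buf ~= '\n';
        if (buf.data.length >= 64 * 1024)   // flush only in large chunks
        {
            stdout.rawWrite(buf.data);
            buf.clear();
        }
    }
    stdout.rawWrite(buf.data);              // final flush
}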


Based upon what I can see, you are definitely not using your CPUs to 
the max. There is no way they are the limiting factor here. Your usage 
of a single core may be, but not the CPUs themselves.

The thing is, you cannot use multiple threads on that first foreach 
loop to speed things up; the reading needs to happen all on one thread. 
Instead, that thread should push its results out to other threads.

Perhaps give each worker thread one lock (mutex) plus a buffer for 
hits, and go round robin over all the threads. If a mutex is still 
locked, you'll need to wait; in that situation a locked mutex means all 
your worker threads are busy, so you couldn't do anything more anyway.
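I haven't benchmarked it, but the shape I have in mind looks roughly 
like this, using std.concurrency message passing instead of a 
hand-rolled mutex ring (worker count, batch size and file name are all 
made up):

import std.concurrency : spawn, send, receiveOnly, Tid;
import std.stdio : File;

void worker()
{
    for (;;)
    {
        auto batch = receiveOnly!(immutable(string)[]);
        if (batch.length == 0)
            break;                           // an empty batch is the stop signal
        foreach (line; batch)
        {
            // parse, filter and collect hits here
        }
    }
}

void main()
{
    enum workerCount = 4;
    enum batchSize = 10_000;

    Tid[workerCount] workers;
    foreach (ref t; workers)
        t = spawn(&worker);

    immutable(string)[] batch;
    size_t next;

    foreach (line; File("blat.tab").byLine)  // reading stays on this one thread
    {
        batch ~= line.idup;
        if (batch.length == batchSize)
        {
            workers[next].send(batch);       // round robin over the workers
            next = (next + 1) % workerCount;
            batch = null;
        }
    }

    if (batch.length)
        workers[next].send(batch);           // leftover lines
    foreach (t; workers)
        t.send((immutable(string)[]).init);  // tell every worker to stop
}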

Of course, after all this the HDD may still be getting hit too hard. In 
that case I would recommend memory-mapping the file, which should allow 
the OS to handle reading it into memory more efficiently. But you'll 
need to rework the .byLine part for that.
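A starting point for that could be std.mmfile, splitting the mapping on 
newlines yourself (again just a sketch, untested against your data):

import std.mmfile : MmFile;
import std.algorithm : splitter;
import std.stdio : writeln;

void main()
{
    auto mm = new MmFile("blat.tab");       // read-only mapping of the whole file
    auto text = cast(const(char)[]) mm[];   // view the mapping as characters

    size_t count;
    foreach (line; text.splitter('\n'))     // takes the place of File.byLine
    {
        if (line.length == 0)
            continue;                       // skip the trailing empty slice
        ++count;
        // parse `line` here; it is a slice into the mapping, nothing is copied
    }
    writeln(count, " lines");
}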


Wow that was a lot at 4:30am! So don't take it too seriously. I'm sure 
somebody else will rip that to shreds!

