Speeding up text file parser (BLAST tabular format)

Edwin van Leeuwen via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Sep 14 06:10:49 PDT 2015


On Monday, 14 September 2015 at 12:50:03 UTC, Fredrik Boulund 
wrote:
> On Monday, 14 September 2015 at 12:44:22 UTC, Edwin van Leeuwen 
> wrote:
>> Sounds like this program is actually IO bound. In that case I 
>> would not really expect an improvement by using D. What is the 
>> CPU usage like when you run this program?
>>
>> Also which dmd version are you using. I think there were some 
>> performance improvements for file reading in the latest 
>> version (2.068)
>
> Hi Edwin, thanks for your quick reply!
>
> I'm using v2.068.1; I actually got inspired to try this out 
> after skimming the changelog :).
>
> Regarding whether it is IO-bound: I actually expected it would 
> be, but both the Python and the D version consume 100% CPU while 
> running, and just copying the file around only takes a few 
> seconds (cf 15-20 sec in runtime for the two programs). There's 
> bound to be some aggressive file caching going on, but I figure 
> that would rather normalize the runtimes downward after running 
> the programs a few times, and I see nothing indicating that.

Two things that you could try:

First, hitlists.byKey can be expensive (especially if hitlists is 
big), because each key still needs a separate lookup to get its 
value. Instead use:

foreach( key, value ; hitlists )

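A minimal sketch of the difference, assuming hitlists is an associative array (the key and element types here are hypothetical, stand-ins for the real BLAST hit data):

```d
void main()
{
    // Hypothetical hitlists: query ID -> list of hits.
    int[][string] hitlists = [
        "query1": [1, 2, 3],
        "query2": [4],
    ];

    // byKey iterates keys only, so fetching the value costs an
    // extra hash lookup on every iteration:
    size_t total1;
    foreach (key; hitlists.byKey)
        total1 += hitlists[key].length;  // second lookup per key

    // Iterating key and value together traverses the table once,
    // with no extra lookups:
    size_t total2;
    foreach (key, value; hitlists)
        total2 += value.length;

    assert(total1 == total2 && total2 == 4);
}
```

Both loops visit the same entries; the second just skips the redundant hash lookup per key.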
Also, the filter.array.length is quite expensive: .array allocates 
and fills a whole new array just so you can take its length. You 
could use count instead:
import std.algorithm : count;
value.count!(h => h.pid >= (max_pid - max_pid_diff));
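A self-contained sketch of the swap, with a hypothetical Hit struct standing in for the parsed BLAST record (the field and variable names mirror the snippet above but are assumptions about the real code):

```d
import std.algorithm : count, filter;
import std.array : array;

struct Hit { double pid; }  // hypothetical parsed BLAST hit

void main()
{
    auto value = [Hit(99.0), Hit(95.0), Hit(90.0)];
    double max_pid = 99.0;
    double max_pid_diff = 5.0;

    // Expensive: filter lazily, then materialize an array only to
    // ask for its length.
    auto n1 = value.filter!(h => h.pid >= (max_pid - max_pid_diff))
                   .array.length;

    // Cheaper: count walks the range once and allocates nothing.
    auto n2 = value.count!(h => h.pid >= (max_pid - max_pid_diff));

    assert(n1 == n2 && n2 == 2);
}
```

The two expressions give the same number; count just avoids the intermediate allocation.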
