Speeding up text file parser (BLAST tabular format)

Mon Sep 14 10:51:41 PDT 2015

On Monday, 14 September 2015 at 12:30:21 UTC, Fredrik Boulund 
wrote:
> Hi,
>
> This is my first post on Dlang forums and I don't have a lot of 
> experience with D (yet). I mainly code bioinformatics-stuff in 
> Python on my day-to-day job, but I've been toying with D for a 
> couple of years now. I had this idea that it'd be fun to write 
> a parser for a text-based tabular data format I tend to read a 
> lot of in my programs, but I was a bit stumped that the D 
> implementation I created was slower than my Python-version. I 
> tried running `dmd -profile` on it but didn't really understand 
> what I can do to make it go faster. I guess there's some 
> unnecessary dynamic array extensions being made but I can't 
> figure out how to do without them, maybe someone can help me 
> out? I tried making the examples as small as possible.
>
> Here's the D code: http://dpaste.com/2HP0ZVA
> Here's my Python code for comparison: http://dpaste.com/0MPBK67
>
> clip

I am going to go off the beaten path here. If you really want 
speed for a file like this, one way of getting it is to read the 
file in as a single large binary array of ubytes (or in blocks if 
it's too big) and parse the lines yourself. This should be fairly 
easy with D's array slicing.

I looked at the format and it appears that lines are quite simple 
and use a limited subset of the ASCII chars. If that is in fact 
true, then you should be able to speed up reading using this 
technique. If you can have UTF-8 chars in there, or if the format 
can be more complex than that shown in your example, then please 
ignore my suggestion.
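
To make the idea concrete, here is a rough sketch of what I mean. It is 
not based on your dpaste code; the names `eachLine`, `splitFields` and 
`maxFields` are made up for illustration, and I am assuming tab-separated 
ASCII lines with at most 12 columns (the usual default for BLAST tabular 
output, but adjust to your data). It reads the whole file in one go and 
only hands out slices of that buffer, so there is no per-line allocation 
and no UTF decoding.

```d
import std.file : read;
import std.stdio : writefln, writeln;

enum maxFields = 12; // assumption: at most 12 tab-separated columns per line

/// Calls `onLine` once per line of `data`; handles "\n" and "\r\n" endings.
/// Every line passed to the callback is a slice of `data`, not a copy.
void eachLine(const(ubyte)[] data,
              scope void delegate(const(ubyte)[]) onLine)
{
    size_t start = 0;
    foreach (i, b; data)
    {
        if (b == '\n')
        {
            auto line = data[start .. i];
            if (line.length && line[$ - 1] == '\r')
                line = line[0 .. $ - 1]; // drop the CR of a CRLF pair
            onLine(line);
            start = i + 1;
        }
    }
    if (start < data.length)
        onLine(data[start .. $]); // last line without a trailing newline
}

/// Splits `line` on tabs and stores slices of it in `fields`.
/// Returns the number of fields stored (extra columns are ignored).
size_t splitFields(const(ubyte)[] line,
                   ref const(ubyte)[][maxFields] fields)
{
    size_t n = 0;
    size_t start = 0;
    foreach (i, b; line)
    {
        if (b == '\t')
        {
            if (n < maxFields)
                fields[n++] = line[start .. i];
            start = i + 1;
        }
    }
    if (n < maxFields)
        fields[n++] = line[start .. $];
    return n;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: parser <blast-tabular-file>");
        return;
    }
    // Read the whole file as raw bytes (read in blocks if it is too big).
    auto data = cast(const(ubyte)[]) read(args[1]);

    size_t rows;
    const(ubyte)[][maxFields] fields;
    eachLine(data, (const(ubyte)[] line) {
        if (line.length == 0)
            return; // skip empty lines
        if (splitFields(line, fields) == maxFields)
            ++rows; // only count lines with the expected column count
    });
    writefln("%s rows with %s fields", rows, maxFields);
}
```

Since the input is assumed to be plain ASCII, a field can be viewed as 
text with `cast(const(char)[]) fields[0]` without copying it. Whether this 
actually beats `byLine` plus `split` on your files is something you would 
have to measure.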