Speeding up text file parser (BLAST tabular format)
Fredrik Boulund via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Sep 14 06:55:47 PDT 2015
On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen wrote:
> Two things that you could try:
>
> First hitlists.byKey can be expensive (especially if hitlists
> is big). Instead use:
>
> foreach( key, value ; hitlists )
>
> Also the filter.array.length is quite expensive. You could use
> count instead.
> import std.algorithm : count;
> value.count!(h => h.pid >= (max_pid - max_pid_diff));
I didn't know that hitlists.byKey could be that expensive; that's
exactly the kind of feedback I was hoping for. I've just been
grasping at straws in the online documentation when I want to get
things done, and with my Python background it feels as if I can
still end up with code that works that way.
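To make the difference concrete, here is a minimal sketch of the two iteration styles over a toy associative array (the data is made up; the real hitlists maps queries to lists of hits). With byKey, fetching the value costs an extra hashtable lookup per key, which foreach (key, value; ...) avoids:

```d
import std.stdio : writeln;

void main()
{
    // Toy stand-in for the real hitlists associative array.
    int[string] hitlists = ["query1": 3, "query2": 5];

    // byKey yields keys only, so reading the value needs
    // a second lookup into the table:
    foreach (key; hitlists.byKey)
        writeln(key, ": ", hitlists[key]);

    // foreach (key, value; aa) yields both at once,
    // with no extra lookup per element:
    foreach (key, value; hitlists)
        writeln(key, ": ", value);
}
```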
I realize the filter.array.length construct is indeed expensive. I
find it especially horrendous that the code I've written needs to
allocate a big dynamic array that will most likely be cut down
quite drastically in this step. Unfortunately, I haven't found a
good way to avoid storing the intermediate results: because the
input file might not be sorted, I can never know whether another
hit for an already-encountered "query" is still to come.
But the main reason I didn't just count the values as you suggest
is that I need the filtered hits themselves in later downstream
analysis: the filtered hits for each query are used as input to a
lowest common ancestor algorithm on the taxonomic tree (of life).
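One way to keep the filtered hits available downstream without the intermediate allocation is to leave the filter range lazy and only materialize it when the consumer actually needs an array. A sketch, assuming a hypothetical Hit struct with the pid field used in the quoted snippet:

```d
import std.algorithm : filter;
import std.stdio : writeln;

// Assumed shape of a parsed BLAST hit; the real struct
// presumably carries more fields.
struct Hit
{
    string query;
    double pid;  // percent identity
}

void main()
{
    immutable maxPid = 98.0;
    immutable maxPidDiff = 5.0;

    Hit[] value = [Hit("q1", 98.0), Hit("q1", 90.0), Hit("q1", 95.5)];

    // filter is lazy: nothing is allocated here. The range can be
    // passed straight to the LCA step and is only walked when
    // iterated, so hits below the threshold are never stored.
    auto filtered = value.filter!(h => h.pid >= (maxPid - maxPidDiff));

    foreach (h; filtered)
        writeln(h.query, " ", h.pid);
}
```

Whether this helps depends on how the LCA step consumes its input; if it needs random access or multiple passes, calling .array once at that point is still cheaper than allocating per query up front.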