Speeding up text file parser (BLAST tabular format)
Laeeth Isharc via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Sep 14 07:15:24 PDT 2015
On Monday, 14 September 2015 at 13:55:50 UTC, Fredrik Boulund
wrote:
> On Monday, 14 September 2015 at 13:10:50 UTC, Edwin van Leeuwen
> wrote:
>> Two things that you could try:
>>
>> First, hitlists.byKey can be expensive (especially if hitlists
>> is big). Instead use:
>>
>> foreach( key, value ; hitlists )
>>
>> Also the filter.array.length is quite expensive. You could use
>> count instead.
>> import std.algorithm : count;
>> value.count!(h => h.pid >= (max_pid - max_pid_diff));
>
> I didn't know that hitlists.byKey was that expensive; that's
> just the kind of feedback I was hoping for. I'm just grasping
> at straws in the online documentation when I want to do
> things. With my Python background it feels as if I can still
> get things working that way.
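Put together, Edwin's two suggestions look roughly like this. This is a minimal sketch, not your actual code: Hit, hitlists, and max_pid_diff are hypothetical stand-ins for the types in the thread.

```d
import std.algorithm : count, map, reduce, max;
import std.stdio : writeln;

struct Hit { double pid; }

void main()
{
    // Hypothetical stand-in for hitlists: query name -> array of hits.
    Hit[][string] hitlists = [
        "q1": [Hit(98.0), Hit(90.0), Hit(97.5)],
        "q2": [Hit(85.0)],
    ];
    enum max_pid_diff = 5.0;

    // Iterate keys and values in one pass, instead of hitlists.byKey
    // followed by a separate lookup per key.
    foreach (query, hits; hitlists)
    {
        immutable max_pid = hits.map!(h => h.pid).reduce!max;
        // count walks the range once; no intermediate array is
        // allocated just to take its length.
        auto n = hits.count!(h => h.pid >= max_pid - max_pid_diff);
        writeln(query, ": ", n, " hits within cutoff");
    }
}
```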
I started learning D maybe a couple of years ago. I found Ali's
book, Andrei's book, GitHub source code (including Phobos'), and
asking here to be the best resources. The docs make perfect
sense once you have got to a certain level (or perhaps if you
have a computer-sciencey background), but can be tough before
that (though they are getting better).
You should definitely take a look at the dlangscience project
organized by John Colvin and others.
If you like IPython/Jupyter, also see his pydmagic, which lets
you write D inline in a notebook.
You may find this series of posts interesting too - another
bioinformatics guy migrating from Python:
http://forum.dlang.org/post/akzdstfiwwzfeoudhshg@forum.dlang.org
> I realize the filter.array.length thing is indeed expensive. I
> find it especially horrendous that the code I've written needs
> to allocate a big dynamic array that will most likely be cut
> down quite drastically in this step. Unfortunately I haven't
> figured out a good way to do this without storing the
> intermediate results, since I cannot know whether there might
> be yet another hit for any encountered "query": the input file
> might not be sorted. But the main reason I didn't just count
> the values as you suggest is that I actually need the filtered
> hits in later downstream analysis. The filtered hits for each
> query are used as input to a lowest common ancestor algorithm
> on the taxonomic tree (of life).
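One thought on that: filter is lazy, so if the downstream lowest-common-ancestor step can accept any input range, you may not need the .array at all. You still have to store the raw hits per query (since the file may be unsorted), but the filtered view can stay lazy. A rough sketch, where lowestCommonAncestor is a hypothetical stand-in for your real taxonomic-tree code:

```d
import std.algorithm : filter, map, reduce, max, min;
import std.stdio : writeln;

struct Hit { double pid; int taxid; }

// Hypothetical downstream step: accepts any input range of hits.
// Real code would walk the taxonomic tree; this just takes the
// minimum taxid as a placeholder.
int lowestCommonAncestor(R)(R hits)
{
    return hits.map!(h => h.taxid).reduce!min;
}

void main()
{
    // All hits for one query, stored because the input is unsorted.
    Hit[] hits = [Hit(98.0, 561), Hit(97.5, 562), Hit(80.0, 2)];
    enum max_pid_diff = 5.0;
    immutable max_pid = hits.map!(h => h.pid).reduce!max;

    // filter is lazy: no intermediate array is allocated; the
    // predicate runs only as the LCA step consumes the range.
    auto good = hits.filter!(h => h.pid >= max_pid - max_pid_diff);
    writeln(lowestCommonAncestor(good));
}
```

The design point is that the allocation was only ever needed to hand a concrete array to the next stage; templating the next stage on the range type removes that requirement.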
Unfortunately I haven't time to read your code, and others will
do better. But do you use .reserve()? Also, this is a nice, fast
container library based on Andrei Alexandrescu's allocator:
https://github.com/economicmodeling/containers
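For completeness, a minimal sketch of what .reserve() buys you on a plain dynamic array (the 10_000 figure is just an illustrative guess at the expected number of appends):

```d
void main()
{
    int[] hits;
    // Pre-allocate capacity for the appends we expect, so that
    // repeated ~= does not trigger a series of reallocations and
    // copies as the array grows.
    hits.reserve(10_000);
    foreach (i; 0 .. 10_000)
        hits ~= i;
    assert(hits.length == 10_000);
    assert(hits.capacity >= 10_000);
}
```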