How to read fastly files ( I/O operation)

monarch_dodra monarchdodra at gmail.com
Wed Feb 6 08:06:19 PST 2013


On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics 
wrote:
> It seems that, in any case, it is not easy to parse a file quickly in D

I don't think that's true. D provides the same "FILE" primitive 
you'd get in C, so there is no reason for it to be slower than C.

It is the "range" approach that, as convenient as it is, is not 
well adapted for certain things.

As I said, I tried writing my own program. In it, I devised a 
range that, instead of exposing the input character by character, 
parses an entire record (a ... "genome" ... maybe? I called them 
"Q" in my program) into an object in one go. I decided to build 
it on the very simple "byLine" primitive.

From there, you can query each object for its 
name/sequence/quality. The irony is that, despite "parsing twice" 
(once for the I/O read, once for the actual processing of the 
text) and allocating each object individually, it runs twice as 
fast as my original, already improved implementation. Not only is 
it faster, it is also more convenient: you can extract an entire 
Q object at once and then operate on it as you please. Separation 
of algorithm and parsing.

It correctly takes into account that a sequence can span multiple 
lines. It does not strip whitespace because, according to 
http://maq.sourceforge.net/fastq.shtml, whitespace is not a legal 
character.
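
To make the idea concrete, here is a minimal sketch of such a 
record-at-a-time range. It is not the code behind the link further 
down, just an illustration: the names Q, QRange and byQ are made up 
for this example, and it assumes the usual FASTQ layout (an '@' 
header line, sequence lines, a '+' separator line, then as many 
quality characters as there are sequence characters).

```d
import std.array : appender;
import std.exception : enforce;
import std.range; // empty/front/popFront, so plain arrays of lines work too

/// One FASTQ record. Each field is its own freshly allocated string.
struct Q
{
    string name;
    string sequence;
    string quality;
}

/// Input range yielding one whole Q per step from a range of lines.
/// (Hypothetical helper for illustration.)
struct QRange(Lines)
{
    private Lines lines;
    private Q current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        popFront(); // parse the first record eagerly
    }

    @property bool empty() const { return done; }
    @property Q front() { return current; }

    void popFront()
    {
        // Tolerate blank lines between records or at the end of the input.
        while (!lines.empty && lines.front.length == 0)
            lines.popFront();
        if (lines.empty) { done = true; return; }

        enforce(lines.front[0] == '@', "expected an '@' header line");
        // byLine reuses its buffer, so copy before advancing.
        current.name = lines.front[1 .. $].idup;
        lines.popFront();

        // The sequence may span several lines, up to the '+' separator.
        auto seq = appender!string();
        while (!lines.empty && !(lines.front.length > 0 && lines.front[0] == '+'))
        {
            seq.put(lines.front);
            lines.popFront();
        }
        enforce(!lines.empty, "missing '+' separator line");
        lines.popFront();

        // Quality lines may start with '@' or '+', so stop by length instead.
        auto qual = appender!string();
        while (!lines.empty && qual.data.length < seq.data.length)
        {
            qual.put(lines.front);
            lines.popFront();
        }
        enforce(qual.data.length == seq.data.length,
                "quality and sequence lengths differ");

        current.sequence = seq.data;
        current.quality = qual.data;
    }
}

/// Convenience wrapper so the range can be written lines.byQ
auto byQ(Lines)(Lines lines) { return QRange!Lines(lines); }

unittest
{
    auto lines = ["@r1", "ACGT", "TTGA", "+", "IIII", "@+II",
                  "@r2", "GG", "+r2", "##"];
    Q[] qs;
    foreach (q; lines.byQ)
        qs ~= q;
    assert(qs.length == 2);
    assert(qs[0].sequence == "ACGTTTGA" && qs[0].quality == "IIII@+II");
    assert(qs[1].name == "r2" && qs[1].quality == "##");
}
```

With a real file this would be used as 
foreach (q; File("reads.fastq").byLine.byQ) { ... } after importing 
std.stdio (the file name is only a placeholder). Since every field 
is copied out of byLine's buffer, each Q stays valid after the range 
moves on.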

Now, keep in mind that this approach allocates three new strings 
for each Q. You could *try* an approach with a pre-allocated, 
reusable buffer. That would mean you can only operate on one Q at 
a time, but you'd probably iterate over them faster.
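
A rough sketch of what the reusable-buffer variant might look like 
is below. I have not measured it, so take the "faster" part as a 
guess. QBuffer and readQ are hypothetical names, and the same FASTQ 
layout as before is assumed. The three Appenders keep their storage 
between records, so after the first few records there should be 
little or no allocation.

```d
import std.array : Appender;
import std.exception : enforce;
import std.range; // empty/front/popFront, so plain arrays of lines work too

/// Reusable storage for a single record. The contents are overwritten
/// by the next call to readQ, so only one Q is valid at a time.
struct QBuffer
{
    Appender!(char[]) name;
    Appender!(char[]) sequence;
    Appender!(char[]) quality;
}

/// Reads the next record from `lines` into `buf`.
/// Returns false once the input is exhausted. (Hypothetical helper.)
bool readQ(Lines)(ref Lines lines, ref QBuffer buf)
{
    // Tolerate blank lines between records or at the end of the input.
    while (!lines.empty && lines.front.length == 0)
        lines.popFront();
    if (lines.empty)
        return false;

    enforce(lines.front[0] == '@', "expected an '@' header line");

    // clear() resets the length but keeps the allocated capacity.
    buf.name.clear();
    buf.sequence.clear();
    buf.quality.clear();

    buf.name.put(lines.front[1 .. $]);
    lines.popFront();

    // The sequence may span several lines, up to the '+' separator.
    while (!lines.empty && !(lines.front.length > 0 && lines.front[0] == '+'))
    {
        buf.sequence.put(lines.front);
        lines.popFront();
    }
    enforce(!lines.empty, "missing '+' separator line");
    lines.popFront();

    // Read quality lines until they cover the whole sequence.
    while (!lines.empty && buf.quality.data.length < buf.sequence.data.length)
    {
        buf.quality.put(lines.front);
        lines.popFront();
    }
    enforce(buf.quality.data.length == buf.sequence.data.length,
            "quality and sequence lengths differ");
    return true;
}

unittest
{
    auto lines = ["@r1", "ACGT", "+", "IIII", "@r2", "GG", "+", "##"];
    QBuffer buf;
    size_t count;
    while (readQ(lines, buf))
        ++count;
    assert(count == 2);
    assert(buf.name.data == "r2" && buf.sequence.data == "GG");
}
```

The trade-off is the one described above: buf.sequence.data and 
friends are only valid until the next readQ call, so anything you 
want to keep has to be copied (.idup) first.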

In any case, you can try it out:
http://dpaste.dzfl.pl/8bdd0c84



More information about the Digitalmars-d-learn mailing list