How to read fastly files ( I/O operation)

monarch_dodra monarchdodra at gmail.com
Wed Feb 6 12:43:44 PST 2013


On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
> On 2013-02-04 15:04, bioinfornatics wrote:
>> I am looking to parse huge files efficiently, but I think D is 
>> lacking for this purpose.
>> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written 
>> in C++) needs 2 min.
>
> Haven't compared to fastxtoolkit, but I have some code for you.
> I have processed the file SRR077487_1.filt.fastq from
> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
> and expect this syntax (no multiline sequences or whitespace).
> The file takes up almost 6 GB; processing took 1m45s, twice as 
> fast as the fastest D solution so far.

Do you mean my solution above? I tried your solution with dmd, 
with -release -O -inline, and both gave about the same result 
(69s yours, 67s mine).

> Data contains both sequence letter and associated quality 
> information.
> Sequence ID and comment are slices of the buffer, so they have 
> valid info
> until you move to the next sequence (and the number increments).

Hum. Mine allocates new slices, so they are never invalidated :)
Mine also takes into account newlines and lowercase sequences.
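
To make the trade-off concrete, here is a minimal sketch of the two 
behaviours. It is not either poster's actual code: idSlice is a 
hypothetical helper, and the char[] buffer merely stands in for a read 
buffer that gets overwritten by the next record.

```d
import std.stdio : writeln;

// Hypothetical helper: the ID is whatever follows '@' up to the first space.
// Assumes `header` is non-empty and starts with '@'.
const(char)[] idSlice(const(char)[] header)
{
    assert(header.length > 0 && header[0] == '@');
    size_t end = 1;
    while (end < header.length && header[end] != ' ')
        ++end;
    return header[1 .. end]; // a view into `header`, no allocation
}

void main()
{
    char[] buffer = "@SEQ_1 first read".dup; // stand-in for a reused read buffer
    const(char)[] view = idSlice(buffer);     // zero-copy slice
    string copy = idSlice(buffer).idup;       // owned, immutable copy

    buffer[0 .. 6] = "@SEQ_2";                // the "next record" overwrites the buffer

    writeln(view); // now shows the overwritten contents
    writeln(copy); // still holds the original ID
}
```

The slice costs nothing but is only valid until the buffer is reused; the 
.idup copy costs one allocation per record and stays valid forever.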

Still, it seems you and I took different approaches. I had 
mentioned using a reusable buffer. I'm going to try to borrow 
some of your code to see if I can improve my implementation.
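
For readers unfamiliar with the reusable-buffer idea, a rough sketch 
follows. It only counts newlines, and countNewlines is a made-up name; 
the point is that File.rawRead fills the same caller-supplied array on 
every iteration, so the loop itself allocates nothing.

```d
import std.stdio : File, writeln;

// Counts newline bytes by streaming `f` through one fixed-size buffer.
// `buffer` must be non-empty.
size_t countNewlines(File f, ubyte[] buffer)
{
    size_t n = 0;
    for (;;)
    {
        ubyte[] chunk = f.rawRead(buffer); // slice of `buffer` that was actually filled
        if (chunk.length == 0)
            break;                         // end of file
        foreach (b; chunk)
            if (b == '\n')
                ++n;
    }
    return n;
}

void main()
{
    // A temporary file keeps the example self-contained.
    auto f = File.tmpfile();
    f.write("@id1\nACGT\n+\nIIII\n@id2\nTTGA\n+\nIIII\n");
    f.rewind();

    auto buffer = new ubyte[](16); // tiny on purpose; a real reader would use far more
    // Four lines per record, assuming no wrapped sequences.
    writeln(countNewlines(f, buffer) / 4, " records");
}
```

A real parser also has to handle records that straddle two chunks, which 
this sketch deliberately sidesteps.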

@bioinfornatics

I'm getting really interested in the subject. I'm going to try to 
write an actual library/framework for working with fastq files in 
a D environment.

This means I'll try to write robust and usable code, with both 
stability and performance in mind, as opposed to the proofs of 
concept posted so far.

For now, I'd like to keep it simple: Would something that only 
knows how to parse/write Sanger FASTQ files be of help to you?
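
As a point of reference for what "Sanger only" could mean, here is a 
small sketch, not a proposal for the library's API. Record and 
parseSanger are hypothetical names. It assumes strict four-line records 
with no wrapped sequences, and decodes qualities with the Sanger offset 
of 33.

```d
import std.exception : enforce;
import std.stdio : writeln;
import std.string : splitLines;

struct Record // hypothetical type; field names are illustrative
{
    string id;
    string sequence;
    ubyte[] quality; // Phred scores, decoded from ASCII with offset 33
}

Record[] parseSanger(string text)
{
    Record[] records;
    auto lines = text.splitLines();
    // Assumes strict four-line records; trailing incomplete records are ignored.
    for (size_t i = 0; i + 3 < lines.length; i += 4)
    {
        enforce(lines[i].length > 0 && lines[i][0] == '@', "expected '@' header");
        enforce(lines[i + 2].length > 0 && lines[i + 2][0] == '+', "expected '+' separator");
        enforce(lines[i + 1].length == lines[i + 3].length,
                "sequence and quality lengths differ");

        Record r;
        r.id = lines[i][1 .. $];
        r.sequence = lines[i + 1];
        foreach (char c; lines[i + 3])
            r.quality ~= cast(ubyte)(c - 33); // Sanger encoding: Phred + 33
        records ~= r;
    }
    return records;
}

void main()
{
    auto recs = parseSanger("@read1\nACGT\n+\nII5!\n");
    writeln(recs[0].id, " ", recs[0].sequence, " ", recs[0].quality);
}
```

This version allocates per record and reads the whole input as one 
string, so it shows the format rather than a fast way to parse it.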

If I write something, can I have you review it?


More information about the Digitalmars-d-learn mailing list