How to read fastly files ( I/O operation)

FG home at fgda.pl
Wed Feb 6 11:20:37 PST 2013


On 2013-02-04 15:04, bioinfornatics wrote:
> I am looking to parse a huge file efficiently, but I think D is lacking for this purpose.
> To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min.

Haven't compared to fastxtoolkit, but I have some code for you.
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and the code expects this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB and processing took 1m45s, twice as fast as the
fastest D solution so far (all compiled with gdc -O3).
I bet your computer has better specs than mine.

The program uses a buffer that should be twice the size of the largest sequence
record (counting id, comment and quality data). A chunk of the file is read,
then records are scanned in the buffer until the record start pointer passes
the middle of the buffer. At that point memcpy is used to move the unread rest
to the beginning of the buffer, and the remaining space at the end is filled
with another chunk read from the file.
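
For illustration, here is a minimal sketch of that buffering scheme in D. It is
not the code from the paste: the name countRecords, the bufSize parameter and
the record scanner (four newline-terminated lines per record) are made up for
this example, and it only counts records instead of exposing their fields.

```d
import core.stdc.string : memcpy;
import std.stdio : File;

/// Hypothetical helper: counts 4-line FASTQ records with a sliding buffer.
/// Assumes bufSize is at least twice the largest record and that every
/// record ends with a newline; otherwise counting stops early.
size_t countRecords(string path, size_t bufSize)
{
    auto file = File(path, "rb");
    auto buf = new ubyte[bufSize];
    size_t start = 0;                       // start of the next unparsed record
    size_t end = file.rawRead(buf).length;  // end of the valid data
    size_t count = 0;

    for (;;)
    {
        // Once the record start passes the middle, move the unread tail to
        // the front and top the buffer up from the file. The tail is shorter
        // than half the buffer here, so source and destination cannot
        // overlap and memcpy is safe.
        if (start > buf.length / 2)
        {
            immutable rest = end - start;
            memcpy(buf.ptr, buf.ptr + start, rest);
            start = 0;
            end = rest;
            end += file.rawRead(buf[end .. $]).length;
        }

        // Scan one record: id line, sequence, '+' line, quality.
        size_t pos = start;
        int lines = 0;
        while (pos < end && lines < 4)
            if (buf[pos++] == '\n')
                ++lines;

        if (lines < 4)
            break;          // end of file (or a record that does not fit)

        ++count;
        start = pos;        // buf[start .. pos] was the complete record
    }
    return count;
}
```

Calling countRecords("reads.fastq", 1 << 20) would scan a file with a 1 MB
buffer; the file name and buffer size here are placeholders, not measured
settings.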

The data contains both the sequence letters and the associated quality
information. The sequence ID and comment are slices of the buffer, so they stay
valid only until you move to the next sequence (at which point the sequence
number increments).
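
To show what that means in practice, here is a small self-contained D sketch.
It does not use the reader from the paste; the function sliceDemo and the
sample strings are invented for this example. A slice of a reused buffer sees
whatever is written there next, so anything needed later has to be copied,
for instance with .idup.

```d
/// Illustrative only: a slice of a reused buffer changes when the buffer
/// is overwritten, while a copy made with .idup does not.
string[] sliceDemo()
{
    char[] buf = "@read1 first".dup;   // stands in for the reader's buffer
    const(char)[] id = buf[1 .. 6];    // slice of the buffer, no copy made
    string saved = id.idup;            // independent, immutable copy

    buf[0 .. 12] = "@read2 other";     // the "reader" moves to the next record

    // The slice now shows the new contents; the copy kept the old ones.
    return [id.idup, saved];
}
```

Slicing avoids an allocation per record, which helps on files of this size;
the cost is that the caller has to copy whatever must outlive the current
record.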

This is the code: http://dpaste.1azy.net/8424d4ac
Tell me what timings you can get now.

