How to read files fast (I/O operation)

FG home at fgda.pl
Wed Feb 6 14:55:02 PST 2013


On 2013-02-06 21:43, monarch_dodra wrote:
> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from
>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>> and expect this syntax (no multiline sequences or whitespace).
>> File takes up almost 6 GB processing took 1m45s - twice as fast as the
>> fastest D solution so far
>
> Do you mean my solution above? I tried your solution with dmd, with -release -O
> -inline, and both gave about the same result (69s yours, 67s mine).

Yes. Maybe the CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64, compiled with the same flags, I got:
MD: 4m30s / FG: 1m55s - both using 100% of one core.
The results with GDC64 were quite similar.

Did you time the same file, SRR077487_1.filt.fastq, at 67s?


> I'm getting real interested on the subject. I'm going to try to write an actual
> library/framework for working with fastq files in a D environment.

Those fastq files are contagious. ;)

> This means I'll try to write robust and useable code, with both stability and
> performance in mind, as opposed to the "proofs of concepts in so far".

Yeah, but the big deal was that D is 5.5x slower than C++.

You mentioned using byLine. I would gladly have used it instead of searching
for line ends myself and copying data around with memcpy.
The problem is that while fgets(char *buf, int bufSize, FILE *f) in fastx
reads a file line by line quickly, file.readln(buf) is unpredictable. :)
In DMD it's only a bit slower than file.rawRead(buf), but in GDC it
can be several times slower. For example, just reading in a loop:

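     import std.stdio;
     enum uint bufferSize = 4096 - 16;
     void main(string[] args) {
         // both the file name and the read mode are required
         if (args.length < 3) {
             writeln("Use parameters: <filename> raw|readln");
             return;
         }
         char[] tmp, buf = new char[bufferSize];
         size_t cnt;
         auto f = File(args[1], "r");
         switch(args[2]) {
             case "raw":
                 // read fixed-size chunks until rawRead returns an empty slice
                 do tmp = f.rawRead(buf); while (tmp.length);
                 break;

             case "readln":
                 // read line by line until readln returns 0 at end of file
                 do cnt = f.readln(buf); while (cnt);
                 break;

             default: writeln("Use parameters: <filename> raw|readln");
         }
     }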

Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3:                 raw 94ms / readln 6.76s

Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln  1m55s
GDC64 -O3:                 raw 1m48s / readln 14m16s

Why is there such a big difference between DMD and GDC (on Windows)?
Or have I missed some switch in GDC?
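
For reference, the manual approach mentioned above (rawRead plus searching for
line ends) can be sketched roughly as below. This is only an illustration, not
the code I timed: the name forEachLine, the delegate-based interface and the
64 KB default buffer are all made up for this example, and it appends to a
carry buffer instead of using memcpy.

```d
import std.stdio : File;

/// Hypothetical helper: reads `path` in fixed-size chunks with rawRead and
/// calls `sink` once per line, without the trailing '\n'.
/// The slice passed to `sink` is only valid during the call.
void forEachLine(string path,
                 scope void delegate(const(char)[] line) sink,
                 size_t bufferSize = 64 * 1024)
{
    auto f = File(path, "rb");
    auto buf = new char[bufferSize];
    char[] carry;                      // partial line left over from the previous chunk

    for (;;)
    {
        auto chunk = f.rawRead(buf);   // slice of buf that was actually filled
        if (chunk.length == 0)
            break;                     // end of file

        size_t start = 0;
        foreach (i, c; chunk)          // iterates raw chars, no UTF decoding
        {
            if (c != '\n')
                continue;
            if (carry.length)
            {
                // the line started in an earlier chunk: finish it first
                carry ~= chunk[start .. i];
                sink(carry);
                carry.length = 0;
            }
            else
            {
                sink(chunk[start .. i]);
            }
            start = i + 1;
        }
        carry ~= chunk[start .. $];    // keep the unfinished tail for the next chunk
    }

    if (carry.length)
        sink(carry);                   // last line without a terminating '\n'
}
```

It assumes '\n' line ends only, so with Windows line ends the '\r' stays at the
end of each line. Counting lines would then be something like
forEachLine(name, (const(char)[] l) { ++n; }).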


