How to read fastly files ( I/O operation)
FG
home at fgda.pl
Wed Feb 6 14:55:02 PST 2013
On 2013-02-06 21:43, monarch_dodra wrote:
> On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from
>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>> and expect this syntax (no multiline sequences or whitespace).
>> The file takes up almost 6 GB; processing took 1m45s, twice as fast as the
>> fastest D solution so far.
>
> Do you mean my solution above? I tried your solution with dmd, with -release -O
> -inline, and both gave about the same result (69s yours, 67s mine).
Yes. Maybe CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64 compiled with same flags I got:
MD: 4m30s / FG: 1m55s - both using 100% of one core.
Quite similar results with GDC64.
You have timed the same file SRR077487_1.filt.fastq at 67s?
> I'm getting really interested in the subject. I'm going to try to write an actual
> library/framework for working with fastq files in a D environment.
Those fastq are contagious. ;)
> This means I'll try to write robust and usable code, with both stability and
> performance in mind, as opposed to the "proofs of concept" so far.
Yeah, but the big deal was that D is 5.5x slower than C++.
You have mentioned something about using byLine. Well, I would have gladly used
it instead of looking for line ends myself and pushing stuff with memcpy.
But the thing is that while the fgets(char *buf, int bufSize, FILE *f) in fastx
is fast at reading a file line by line, using file.readln(buf) is unpredictable. :)
I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC it
can be several times slower. For example, just reading in a loop:
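    import std.stdio;

    enum uint bufferSize = 4096 - 16;

    void main(string[] args) {
        // Guard against missing arguments before indexing args[1] and args[2].
        if (args.length < 3) {
            writeln("Use parameters: <filename> raw|readln");
            return;
        }
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch (args[2]) {
            case "raw":     // read fixed-size chunks until EOF
                do tmp = f.rawRead(buf); while (tmp.length);
                break;
            case "readln":  // read line by line into buf until EOF
                do cnt = f.readln(buf); while (cnt);
                break;
            default:
                writeln("Use parameters: <filename> raw|readln");
        }
    }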
Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3: raw 94ms / readln 6.76s
Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln 1m55s
GDC64 -O3: raw 1m48s / readln 14m16s
Why such a big difference between DMD and GDC (on Windows)?
(Or have I missed some switch in GDC?)
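
For reference, here is a minimal sketch of the two line-handling approaches discussed above: looking for line ends by hand in chunks returned by rawRead, versus letting byLine do the splitting. It only counts lines, so it is not the FASTQ parser from either benchmarked solution. The function names countLinesRaw and countLinesByLine are made up for illustration, and the buffer size is simply the one from the test program.

```d
import std.stdio;

// Hypothetical helper: count lines by scanning rawRead chunks for '\n'.
size_t countLinesRaw(string path, size_t bufferSize = 4096 - 16)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[bufferSize];
    size_t lines;
    bool endsWithNewline = true;   // so that an empty file yields zero lines
    for (;;)
    {
        auto chunk = f.rawRead(buf);   // slice of buf actually filled
        if (chunk.length == 0) break;  // EOF
        foreach (b; chunk)
            if (b == '\n') ++lines;
        endsWithNewline = chunk[$ - 1] == '\n';
    }
    // A final line without a trailing newline still counts as a line.
    if (!endsWithNewline) ++lines;
    return lines;
}

// Hypothetical helper: count lines with byLine, for comparison.
size_t countLinesByLine(string path)
{
    size_t lines;
    foreach (line; File(path, "r").byLine())
        ++lines;
    return lines;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("Use parameters: <filename>");
        return;
    }
    writeln("raw:    ", countLinesRaw(args[1]));
    writeln("byLine: ", countLinesByLine(args[1]));
}
```

Both functions should report the same count for the same file, which makes the pair a convenient sanity check before timing either approach under DMD and GDC.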