How to read files quickly (I/O operation)

monarch_dodra monarchdodra at gmail.com
Wed Feb 13 09:39:10 PST 2013


On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra wrote:
> On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics 
> wrote:
>>
>> Sometimes fastq files are compressed to gz, bz2, or xz, as 
>> they are often huge.
>> Maybe we need to keep this in mind early in development and 
>> use std.zlib.
>
> While working on making the parser multi-thread compatible, I 
> was able to separate the part that feeds data from the part 
> that parses data.
>
> Long story short, the parser operates on an input range of 
> ubyte[]: it is no longer responsible for acquiring data.
>
> The range can be a simple (wrapped) File, a byChunk, an 
> asynchronous file reader, a zip decompressor, or just stdin, I 
> guess. The range can be transient.
>
> However, now that you mention it, I'll make sure it is 
> correctly supported.
>
> I'll *try* to show you what I have so far tomorrow (in about 
> 18 hours).
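For what it's worth, the decoupling described above can be 
sketched as a parser templated on any input range of ubyte[] 
chunks (the names here are illustrative, not the actual code):

```d
import std.range.primitives : ElementType, isInputRange;
import std.stdio : File;

// Hypothetical parser entry point: it accepts *any* input range
// whose elements are ubyte[] chunks, so data acquisition (file,
// stdin, decompressor...) is entirely decoupled from parsing.
void parseFastq(Range)(Range chunks)
    if (isInputRange!Range && is(ElementType!Range : const(ubyte)[]))
{
    foreach (chunk; chunks)
    {
        // ... copy the transient chunk into a local buffer
        // and parse records out of it ...
    }
}

void main()
{
    // Feed the parser from a plain file, 64 KiB at a time.
    parseFastq(File("reads.fastq").byChunk(64 * 1024));
}
```

Because byChunk reuses its internal buffer, the parser must 
treat each chunk as transient and copy what it needs to keep.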

Yeah... I played around too much, and the file is dirtier than 
ever.

The good news is that I was able to test what I was telling 
you about: accepting any range works fine.

I used your ZFile range to plug it into my parser: I can now 
parse zipped files directly.
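I don't have ZFile in front of me, but the idea can be 
sketched with std.zlib: a range that wraps File.byChunk and 
inflates each compressed chunk, yielding plain byte slices for 
the parser (the type and field names below are my own guesses, 
not ZFile's actual layout):

```d
import std.stdio : File;
import std.zlib : HeaderFormat, UnCompress;

// Sketch of a ZFile-style range: reads compressed chunks from
// disk and yields decompressed byte chunks. Illustrative only.
struct GzipChunks
{
    private typeof(File.init.byChunk(1)) raw;
    private UnCompress inflater;
    const(ubyte)[] front;

    this(string path, size_t chunkSize = 64 * 1024)
    {
        raw = File(path).byChunk(chunkSize);
        inflater = new UnCompress(HeaderFormat.gzip);
        popFront(); // prime the first decompressed chunk
    }

    bool empty() { return front is null && raw.empty; }

    void popFront()
    {
        if (raw.empty) { front = null; return; }
        // byChunk reuses its buffer, so copy before inflating.
        front = cast(const(ubyte)[]) inflater.uncompress(raw.front.dup);
        raw.popFront();
    }
}
```

Since this satisfies the input-range interface, it plugs 
straight into a parser written against ranges of byte chunks.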

More good news: I'm not bottlenecked by I/O anymore! The bad 
news is that I'm now bottlenecked by CPU decompression. But 
since I'm using dmd, you may get better results with LDC or 
GDC.

In any case, I am now parsing the 6 GB packed into 1.5 GB in 
about 53 seconds (down from 61). I also tried a dual-threaded 
approach (one thread to unzip, one thread to parse), but 
again, the actual *parse* phase is so ridiculously fast that 
it changes *nothing* in the final result.
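The dual-threaded attempt looked roughly like this with 
std.concurrency (a from-memory sketch, not the actual 
benchmark code):

```d
import std.concurrency : receiveOnly, send, spawn;

// One thread parses chunks sent by the owner; an empty chunk
// marks end-of-stream. The decompression loop is elided.
void parserThread()
{
    for (;;)
    {
        auto chunk = receiveOnly!(immutable(ubyte)[])();
        if (chunk.length == 0)
            break; // end-of-stream marker
        // ... parse chunk ...
    }
}

void main()
{
    auto parser = spawn(&parserThread);
    // foreach (chunk; decompressedChunks)  // unzip in this thread
    //     send(parser, chunk.idup);        // immutable copy crosses threads
    send(parser, (immutable(ubyte)[]).init); // signal completion
}
```

The per-chunk copy (idup) is the price of crossing threads, 
which is part of why the split buys nothing here: parsing is 
already cheaper than copying.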

Long story short: 99% of the time is spent acquiring data. The 
last 1% is just copying it into local buffers.

The last piece of good news is that a CPU bottleneck is always 
better than an I/O bottleneck. If you have multiple cores, you 
should be able to run multiple *instances* (not threads) and 
process several files at once, multiplying your throughput.
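Concretely, that can be as simple as backgrounding one process 
per file from the shell ("parser" and the file names are 
placeholders):

```shell
# One independent process per file: each gets its own core for
# decompression, so several files decompress and parse in parallel.
./parser reads1.fastq.gz &
./parser reads2.fastq.gz &
./parser reads3.fastq.gz &
wait   # block until all instances finish
```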


More information about the Digitalmars-d-learn mailing list