How to read files fast (I/O operations)
Jay Norwood
jayn at prismnet.com
Wed Dec 18 14:46:28 PST 2013
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra
wrote:
> On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra
> wrote:
>> On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics
>> wrote:
>>>
>>> Sometimes FASTQ files are compressed with gz, bz2, or xz,
>>> since they are often huge. Maybe we should keep this in mind
>>> early in development and use std.zlib.
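>>>
>>> A minimal sketch of streaming gzip decompression with
>>> std.zlib's UnCompress (note std.zlib only handles gz/zlib
>>> streams, not bz2 or xz; the file name and chunk size are
>>> placeholders):
>>>
>>>     import std.stdio : File;
>>>     import std.zlib : HeaderFormat, UnCompress;
>>>
>>>     void main()
>>>     {
>>>         auto input = File("reads.fastq.gz", "rb");
>>>         auto gunzip = new UnCompress(HeaderFormat.gzip);
>>>         foreach (chunk; input.byChunk(64 * 1024))
>>>         {
>>>             // .dup because UnCompress may keep a reference
>>>             auto plain = cast(const(ubyte)[])
>>>                 gunzip.uncompress(chunk.dup);
>>>             // ... feed `plain` to the parser
>>>         }
>>>         // drain whatever is still buffered internally
>>>         auto tail = cast(const(ubyte)[]) gunzip.flush();
>>>     }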
>>
>> While working on making the parser multi-thread compatible, I
>> was able to separate the part that feeds data from the part
>> that parses data.
>>
>> Long story short, the parser operates on an input range of
>> ubyte[]: it is no longer responsible for acquiring the data.
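>>
>> Roughly, the entry point looks like this (the names are
>> illustrative, not the actual code):
>>
>>     import std.range : ElementType, isInputRange;
>>
>>     // parsing is fully decoupled from data acquisition
>>     void parseFastq(Range)(Range chunks)
>>         if (isInputRange!Range &&
>>             is(ElementType!Range : const(ubyte)[]))
>>     {
>>         foreach (chunk; chunks)
>>         {
>>             // copy what's needed into local buffers, then
>>             // split the accumulated bytes into records
>>         }
>>     }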
>>
>> The range can be a simple (wrapped) File, a byChunk, an
>> asynchronous file reader, a zip decompressor, or just stdin, I
>> guess. The range can be transient.
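>>
>> For example, any of these should plug straight in (using the
>> parseFastq sketch above):
>>
>>     import std.stdio : File, stdin;
>>
>>     parseFastq(File("reads.fastq").byChunk(64 * 1024));
>>     parseFastq(stdin.byChunk(64 * 1024)); // piped input
>>
>> byChunk itself is transient (it recycles its buffer), which is
>> exactly why the parser copies into local buffers.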
>>
>> However, now that you mention it, I'll make sure it is
>> correctly supported.
>>
>> I'll *try* to show you what I have so far tomorrow (in about
>> 18h).
>
> Yeah... I played around too much, and the file is dirtier than
> ever.
>
> The good news is that I was able to test out what I was telling
> you about: accepting any range works.
>
> I plugged your ZFile range into my parser: I can now parse
> zipped files directly.
>
> The good news is that now I'm not bottlenecked by IO anymore!
> The bad news is that I'm now bottlenecked by the CPU doing the
> decompression. But since I'm using dmd, you may get better
> results with LDC or GDC.
>
> In any case, I am now parsing the 6 GB packed into 1.5 GB in
> about 53 seconds (down from 61). I also tried a dual-threaded
> approach (1 thread to unzip, 1 thread to parse), but again, the
> actual *parse* phase is so ridiculously fast that it changes
> *nothing* in the final result.
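>
> For reference, std.parallelism's asyncBuf gives that two-thread
> split almost for free. A sketch (the .dup is needed because
> byChunk recycles its buffer):
>
>     import std.algorithm : map;
>     import std.parallelism : taskPool;
>     import std.stdio : File;
>
>     void main()
>     {
>         auto chunks = File("reads.fastq")
>                       .byChunk(64 * 1024)
>                       .map!(c => c.dup);
>         // a worker thread reads ahead while this one parses
>         foreach (chunk; taskPool.asyncBuf(chunks))
>         {
>             // parse chunk ...
>         }
>     }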
>
> Long story short: 99% of the time is spent acquiring data. The
> last 1% is just copying it into local buffers.
>
> The last bit of good news is that a CPU bottleneck is always
> better than an IO bottleneck. If you have multiple cores, you
> should be able to run multiple *instances* (not threads) and
> process several files at once, multiplying your throughput.
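>
> A driver along these lines could do that (the binary name is a
> placeholder):
>
>     import std.process : Pid, spawnProcess, wait;
>
>     void main(string[] args)
>     {
>         // one independent parser process per file; the OS
>         // spreads them across the cores
>         Pid[] pids;
>         foreach (file; args[1 .. $])
>             pids ~= spawnProcess(["./fastq_parser", file]);
>         foreach (pid; pids)
>             wait(pid);
>     }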
I modified the library unzip to make a parallel unzip a while
back (at the link below). The execution time scaled very well
with the number of CPUs for the test case I was using: a 2 GB
unzipped distribution containing many small files and
subdirectories. The parallel operations were per file. I think
the execution time gains on SSD drives came from having multiple
cores scheduling the writes to separate files in parallel.
https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.d
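
The core idea, much simplified (the real code at the link also
handles directories, permissions, and buffering; expanding
distinct members from multiple tasks is assumed safe here):

    import std.algorithm.searching : endsWith;
    import std.file : mkdirRecurse, read, write;
    import std.parallelism : parallel;
    import std.path : buildPath, dirName;
    import std.zip : ZipArchive;

    void main()
    {
        auto zip = new ZipArchive(read("dist.zip"));
        // one parallel task per archive member
        foreach (name; parallel(zip.directory.keys))
        {
            if (name.endsWith("/"))
                continue;                  // directory entry
            auto data = zip.expand(zip.directory[name]);
            auto dest = buildPath("out", name);
            mkdirRecurse(dest.dirName);
            write(dest, data);             // per-file, per-task
        }
    }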