How to read files fast (I/O operations)
Jay Norwood
jayn at prismnet.com
Wed Dec 18 14:46:28 PST 2013
On Wednesday, 13 February 2013 at 17:39:11 UTC, monarch_dodra
wrote:
> On Tuesday, 12 February 2013 at 22:06:48 UTC, monarch_dodra
> wrote:
>> On Tuesday, 12 February 2013 at 21:41:14 UTC, bioinfornatics
>> wrote:
>>>
>>> Sometimes FASTQ files are compressed with gz, bz2, or xz,
>>> since they are often huge. Maybe we should keep this in mind
>>> early in development and use std.zlib.
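>>>
>>> A minimal sketch of streaming gzip decompression with
>>> std.zlib's UnCompress (note std.zlib only handles gz/zlib
>>> streams, not bz2 or xz; the file name and chunk size are
>>> placeholders):
>>>
>>>     import std.stdio : File;
>>>     import std.zlib : HeaderFormat, UnCompress;
>>>
>>>     void main()
>>>     {
>>>         auto input = File("reads.fastq.gz", "rb");
>>>         auto gunzip = new UnCompress(HeaderFormat.gzip);
>>>         foreach (chunk; input.byChunk(64 * 1024))
>>>         {
>>>             // .dup because UnCompress may keep a reference
>>>             auto plain = cast(const(ubyte)[])
>>>                 gunzip.uncompress(chunk.dup);
>>>             // ... feed `plain` to the parser
>>>         }
>>>         // drain whatever is still buffered internally
>>>         auto tail = cast(const(ubyte)[]) gunzip.flush();
>>>     }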
>>
>> While working on making the parser multi-thread compatible, I
>> was able to separate the part that feeds data from the part
>> that parses data.
>>
>> Long story short, the parser operates on an input range of
>> ubyte[]: it is no longer responsible for acquiring the data.
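>>
>> Roughly, the entry point looks like this (the names are
>> illustrative, not the actual code):
>>
>>     import std.range : ElementType, isInputRange;
>>
>>     // parsing is fully decoupled from data acquisition
>>     void parseFastq(Range)(Range chunks)
>>         if (isInputRange!Range &&
>>             is(ElementType!Range : const(ubyte)[]))
>>     {
>>         foreach (chunk; chunks)
>>         {
>>             // copy what's needed into local buffers, then
>>             // split the accumulated bytes into records
>>         }
>>     }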
>>
>> The range can be a simple (wrapped) File, a byChunk, an
>> asynchronous file reader, a zip decompressor, or just stdin, I
>> guess. The range can be transient.
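>>
>> For example, any of these should plug straight in (using the
>> parseFastq sketch above):
>>
>>     import std.stdio : File, stdin;
>>
>>     parseFastq(File("reads.fastq").byChunk(64 * 1024));
>>     parseFastq(stdin.byChunk(64 * 1024)); // piped input
>>
>> byChunk itself is transient (it recycles its buffer), which is
>> exactly why the parser copies into local buffers.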
>>
>> However, now that you mention it, I'll make sure it is
>> correctly supported.
>>
>> I'll *try* to show you what I have so far tomorrow (in about
>> 18h).
>
> Yeah... I played around too much, and the file is dirtier than
> ever.
>
> The good news is that I was able to test out what I was telling
> you about: accepting any range works.
>
> I plugged your ZFile range into my parser: I can now parse
> zipped files directly.
>
> The good news is that now I'm not bottlenecked by IO anymore!
> The bad news is that I'm now bottlenecked by the CPU doing the
> decompression. But since I'm using dmd, you may get better
> results with LDC or GDC.
>
> In any case, I am now parsing the 6 GB packed into 1.5 GB in
> about 53 seconds (down from 61). I also tried a dual-threaded
> approach (1 thread to unzip, 1 thread to parse), but again, the
> actual *parse* phase is so ridiculously fast that it changes
> *nothing* in the final result.
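>
> For reference, std.parallelism's asyncBuf gives that two-thread
> split almost for free. A sketch (the .dup is needed because
> byChunk recycles its buffer):
>
>     import std.algorithm : map;
>     import std.parallelism : taskPool;
>     import std.stdio : File;
>
>     void main()
>     {
>         auto chunks = File("reads.fastq")
>                       .byChunk(64 * 1024)
>                       .map!(c => c.dup);
>         // a worker thread reads ahead while this one parses
>         foreach (chunk; taskPool.asyncBuf(chunks))
>         {
>             // parse chunk ...
>         }
>     }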
>
> Long story short: 99% of the time is spent acquiring data. The
> last 1% is just copying it into local buffers.
>
> The last bit of good news is that a CPU bottleneck is always
> better than an IO bottleneck. If you have multiple cores, you
> should be able to run multiple *instances* (not threads) and
> process several files at once, multiplying your throughput.
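>
> A driver along these lines could do that (the binary name is a
> placeholder):
>
>     import std.process : Pid, spawnProcess, wait;
>
>     void main(string[] args)
>     {
>         // one independent parser process per file; the OS
>         // spreads them across the cores
>         Pid[] pids;
>         foreach (file; args[1 .. $])
>             pids ~= spawnProcess(["./fastq_parser", file]);
>         foreach (pid; pids)
>             wait(pid);
>     }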
I modified the library unzip to make a parallel unzip a while
back (at the link below). The execution time scaled very well
with the number of CPUs for the test case I was using: a 2 GB
unzipped distribution containing many small files and
subdirectories. The parallel operations were per file. I think
the execution time gains on SSD drives came from having multiple
cores scheduling the writes to separate files in parallel.
https://github.com/jnorwood/file_parallel/blob/master/unzip_parallel.d
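
The core idea, much simplified (the real code at the link also
handles directories, permissions, and buffering; expanding
distinct members from multiple tasks is assumed safe here):

    import std.algorithm.searching : endsWith;
    import std.file : mkdirRecurse, read, write;
    import std.parallelism : parallel;
    import std.path : buildPath, dirName;
    import std.zip : ZipArchive;

    void main()
    {
        auto zip = new ZipArchive(read("dist.zip"));
        // one parallel task per archive member
        foreach (name; parallel(zip.directory.keys))
        {
            if (name.endsWith("/"))
                continue;                  // directory entry
            auto data = zip.expand(zip.directory[name]);
            auto dest = buildPath("out", name);
            mkdirRecurse(dest.dirName);
            write(dest, data);             // per-file, per-task
        }
    }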