How to read files fast (I/O operation)

monarch_dodra monarchdodra at gmail.com
Tue Feb 12 08:45:33 PST 2013


On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics 
wrote:
> On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra 
> wrote:
>> On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics 
>> wrote:
>>> Instead of using memcpy, I tried slicing, around line 136:
>>> _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. 
>>> moveSize + _bufPosition];
>>>
>>> I get the same perf.
>>
>> I think I figured out why I'm getting different results than 
>> you guys are on my Windows machine.
>>
>> AFAIK, file reads on Windows are natively asynchronous.
>>
>> I wrote a multi-threaded version of the parser, with a thread 
>> dedicated to reading the file, while the main thread parses 
>> the read buffers.
>>
>> I'm getting EXACTLY 0% performance improvement. Not better, 
>> not worse, just 0%.
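>>
>> In rough shape, the reader-thread version looks something like 
>> this (just a sketch: parseBuffer() is a stand-in for the real 
>> parsing, and it launches one background read per chunk rather 
>> than a dedicated thread, but the idea is the same):
>> //----
>> import std.parallelism, std.stdio;
>> import std.algorithm : swap;
>>
>> enum BufferSize = 1024 * 1024;
>>
>> // reads the next chunk; runs in a background thread
>> ubyte[] readChunk(File* f, ubyte[] buf)
>> {
>>     return f.rawRead(buf);
>> }
>>
>> void main(string[] args)
>> {
>>     auto input = File(args[1], "rb");
>>     auto bufA = new ubyte[](BufferSize);
>>     auto bufB = new ubyte[](BufferSize);
>>
>>     auto chunk = input.rawRead(bufA);
>>     while (chunk.length)
>>     {
>>         // start reading the next chunk in the background...
>>         auto next = task(&readChunk, &input, bufB);
>>         next.executeInNewThread();
>>
>>         // ...while this thread parses the chunk we already have
>>         //parseBuffer(chunk);
>>
>>         chunk = next.yieldForce;
>>         swap(bufA, bufB); // the parsed buffer is free for the next read
>>     }
>> }
>> //----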
>>
>> I'd have to try again on my SSD. Right now, I'm parsing the 
>> 6 Gig file in 60 seconds, which is the limit of my HDD. As a 
>> matter of fact, just *reading* the file takes the EXACT same 
>> amount of time as parsing it...
>>
>> This takes 60 seconds.
>> //----
>>    auto input = File(args[1], "rb");
>>    ubyte[] buffer = new ubyte[](BufferSize);
>>    do{
>>        buffer = input.rawRead(buffer);
>>    }while(buffer.length);
>> //----
>>
>> This takes 60 seconds too.
>> //----
>>    Parser parser = new Parser(args[1]);
>>    foreach(q; parser)
>>        foreach(char c; q.sequence)
>>            globalNucleic.collect(c);
>> //----
>>
>> So at this point, I'd need to test on my Linux box, or publish 
>> the code so you can tell me how I'm doing.
>>
>> I'm still tweaking the code to publish something readable, as 
>> there is a lot of sketchy code right now.
>>
>> I'm also implementing correct exception handling, so that if 
>> there is an erroneous entry, an exception is thrown. However, 
>> all the erroneous data is parsed out of the file and placed 
>> inside the exception (sketched below). This means that:
>> a) You can inspect the erroneous data.
>> b) You can skip the erroneous data and parse the rest of the 
>> file.
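>>
>> Roughly the shape of it (a sketch; ParseException and badEntry 
>> are just the names I'm using for now):
>> //----
>> class ParseException : Exception
>> {
>>     string badEntry; // the erroneous data, parsed out of the file
>>
>>     this(string msg, string badEntry,
>>          string file = __FILE__, size_t line = __LINE__)
>>     {
>>         super(msg, file, line);
>>         this.badEntry = badEntry;
>>     }
>> }
>>
>> // caller side: inspect the bad entry, then keep going
>> //try
>> //    process(parser.front);
>> //catch (ParseException e)
>> //    stderr.writeln("skipping bad entry: ", e.badEntry);
>> //parser.popFront();
>> //----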
>>
>> Once I deliver the version with the multi-threaded code 
>> activated, you should see better performance on Linux.
>>
>> When "1.0" is ready, I'll create a github project for it, so 
>> work can be done parallel on it.
>
> About the threaded version: it should be possible to get the 
> file size and use it to split the file across several threads.
> Use fseek to read to the end of a section and return that 
> position, to detect where each split should end.

You'd want to have 2 threads reading the same file at once? I 
don't think there is much more to be gained, since the I/O is 
the bottleneck anyway.

A better approach would be to have 1 file reader that passes data 
to two simultaneous parsers. This, however, would make things 
scarily complicated, and I doubt we'd even get much better 
results: I was not able to measure the actual amount of time 
spent working, compared to the time spent reading the file.
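
For reference, the two-parser layout would look roughly like this 
(a sketch only, using std.concurrency; countNucleotides() stands 
in for the real parsing, and it ignores the detail of splitting 
the data only on entry boundaries):
//----
import std.concurrency;
import std.stdio;

enum BufferSize = 1024 * 1024;

void parserThread()
{
    for (;;)
    {
        auto chunk = receiveOnly!(immutable(ubyte)[])();
        if (chunk.length == 0) // empty chunk means no more data
            break;
        //countNucleotides(chunk);
    }
}

void main(string[] args)
{
    auto input = File(args[1], "rb");
    Tid[2] parsers = [spawn(&parserThread), spawn(&parserThread)];

    size_t i;
    for (;;)
    {
        auto chunk = input.rawRead(new ubyte[](BufferSize));
        if (chunk.length == 0)
            break;
        parsers[i++ % 2].send(chunk.idup); // alternate the parsers
    }

    foreach (p; parsers)
        p.send((immutable(ubyte)[]).init); // tell both we're done
}
//----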

