How to read files fast (I/O operation)
monarch_dodra
monarchdodra at gmail.com
Tue Feb 12 08:45:33 PST 2013
On Tuesday, 12 February 2013 at 16:28:09 UTC, bioinfornatics
wrote:
> On Tuesday, 12 February 2013 at 12:45:26 UTC, monarch_dodra
> wrote:
>> On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics
>> wrote:
>>> Instead of using memcpy, I tried slicing, around line 136:
>>>
>>> _hardBuffer[0 .. moveSize] =
>>>     _hardBuffer[_bufPosition .. moveSize + _bufPosition];
>>>
>>> I get the same performance.
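
For context, here is a minimal, hypothetical sketch of that kind of
buffer compaction in D. The names (buf, pos, tail) are illustrative
and not taken from the actual parser; plain slice assignment requires
that the two regions do not overlap, so an overlap-safe memmove
fallback is included.
//----
import core.stdc.string : memmove;

// Move the unread tail of a buffer to its front before refilling it.
void compact(ubyte[] buf, ref size_t pos, ref size_t tail)
{
    immutable moveSize = tail - pos;
    if (moveSize <= pos)
        buf[0 .. moveSize] = buf[pos .. tail];     // slices do not overlap
    else
        memmove(buf.ptr, buf.ptr + pos, moveSize); // overlap-safe fallback
    pos  = 0;
    tail = moveSize;
}
//----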
>>
>> I think I figured out why I'm getting different results than
>> you guys are, on my Windows machine.
>>
>> AFAIK, file reads in Windows are done natively asynchronously.
>>
>> I wrote a multi-threaded version of the parser, with a thread
>> dedicated to reading the file, while the main thread parses
>> the read buffers.
>>
>> I'm getting EXACTLY 0% performance improvement. Not better,
>> not worse, just 0%.
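
Purely as an illustration of that two-thread layout (not the actual
code; the names and chunk size here are made up), a stripped-down
reader thread using std.concurrency might look like this:
//----
import std.concurrency;
import std.stdio;

// Hypothetical sketch: one thread reads fixed-size chunks, the main
// thread receives and parses them.
void reader(Tid owner, string path, size_t chunkSize)
{
    auto f = File(path, "rb");
    for (;;)
    {
        auto chunk = f.rawRead(new ubyte[](chunkSize));
        if (chunk.length == 0)
            break;
        send(owner, chunk.idup); // hand an immutable copy to the parser
    }
    send(owner, true);           // end-of-file marker
}

void main(string[] args)
{
    auto tid = spawn(&reader, thisTid, args[1], cast(size_t) 8 * 1024 * 1024);
    for (bool done = false; !done; )
    {
        receive(
            (immutable(ubyte)[] chunk) { /* parse the chunk here */ },
            (bool eof)                 { done = eof; }
        );
    }
}
//----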
>>
>> I'd have to try again on my SSD. Right now, I'm parsing the
>> 6 Gig file in 60 seconds, which is the limit of my HDD.
>> As a matter of fact, just *reading* the file takes the EXACT
>> same amount of time as parsing it...
>>
>> This takes 60 seconds.
>> //----
>> auto input = File(args[1], "rb");
>> ubyte[] buffer = new ubyte[](BufferSize); // BufferSize defined elsewhere
>> do {
>>     buffer = input.rawRead(buffer);       // rawRead returns the filled slice
>> } while (buffer.length);
>> //----
>>
>> This takes 60 seconds too.
>> //----
>> Parser parser = new Parser(args[1]);
>> foreach (q; parser)
>>     foreach (char c; q.sequence)
>>         globalNucleic.collect(c); // feed each sequence character to the collector
>> //----
>>
>> So at this point, I'd need to test on my Linux box, or publish
>> the code so you can tell me how I'm doing.
>>
>> I'm still tweaking the code to publish something readable, as
>> there is a lot of sketchy code right now.
>>
>> I'm also implementing correct exception handling, so that if
>> there is an erroneous entry, an exception is thrown. However,
>> all the erroneous data is parsed out of the file, and placed
>> inside the exception. This means that:
>> a) You can inspect the erroneous data
>> b) You can skip the erroneous data, and parse the rest of the
>> file.
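
Purely to illustrate that idea (the class and field names below are
made up, not the actual API), an exception type that carries the
offending entry might look something like this:
//----
// Hypothetical sketch: the bad entry is consumed from the file and
// stored in the exception, so the caller can inspect it and keep going.
class ParseException : Exception
{
    string badEntry; // the erroneous data, already parsed out of the file

    this(string msg, string badEntry,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        this.badEntry = badEntry;
    }
}
//----
The caller would wrap each iteration step in a try/catch, inspect or
log badEntry, and then continue with the next record.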
>>
>> Once I deliver the code with the multi-threaded path activated,
>> you should see better performance on Linux.
>>
>> When "1.0" is ready, I'll create a github project for it, so
>> work can be done parallel on it.
>
> About the threaded version: would it be possible to use a
> get-file-size function to split the file across several threads?
> Each thread could fseek to its section and read to the end of
> the section to detect where its split ends.
You'd want to have 2 threads reading the same file at once? I
don't think there is much more to be gained anyway, since the IO
is the bottleneck.
A better approach would be to have 1 file reader that passes data
to two simultaneous parsers. This, however, would make things
scarily complicated, and I doubt we'd even get much better
results: I was not able to measure the actual time spent parsing
as compared to the time spent reading the file.
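
To make that suggestion concrete, here is a hypothetical variant of
the earlier reader sketch, with the reader round-robining chunks
between two parser threads. In a real parser the chunk boundaries
would still have to fall on record boundaries, which is not handled
here.
//----
import std.concurrency;
import std.stdio;

// Hypothetical sketch: each parser thread consumes chunks until it is
// told to stop.
void parserThread()
{
    for (bool done = false; !done; )
    {
        receive(
            (immutable(ubyte)[] chunk) { /* parse the chunk here */ },
            (bool eof)                 { done = eof; }
        );
    }
}

void main(string[] args)
{
    enum chunkSize = 8 * 1024 * 1024;           // illustrative value
    auto parsers = [spawn(&parserThread), spawn(&parserThread)];

    auto f = File(args[1], "rb");
    size_t next;
    for (;;)
    {
        auto chunk = f.rawRead(new ubyte[](chunkSize));
        if (chunk.length == 0)
            break;
        send(parsers[next], chunk.idup);        // round-robin hand-off
        next = (next + 1) % parsers.length;
    }
    foreach (p; parsers)
        send(p, true);                          // tell both parsers to stop
}
//----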