How to read files fast (I/O operation)

monarch_dodra monarchdodra at gmail.com
Tue Feb 12 04:45:25 PST 2013


On Tuesday, 12 February 2013 at 12:02:59 UTC, bioinfornatics 
wrote:
> instead of using memcpy, I tried slicing (~ line 136):
> _hardBuffer[0 .. moveSize] = _hardBuffer[_bufPosition .. moveSize + _bufPosition];
>
> I get the same perf.
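
For reference, that slice assignment compacts the unread tail of the 
buffer back to its front. Here is a standalone sketch of the same 
move; only the variable names come from the quoted code, the wrapper 
function is assumed:

//----
void compact(ubyte[] _hardBuffer, size_t _bufPosition, size_t moveSize)
{
    // Slice assignment, as in the quoted code. Note that druntime
    // raises a runtime error if the two ranges overlap:
    _hardBuffer[0 .. moveSize] =
        _hardBuffer[_bufPosition .. moveSize + _bufPosition];

    // core.stdc.string.memmove is the overlap-safe equivalent:
    // memmove(_hardBuffer.ptr, _hardBuffer.ptr + _bufPosition, moveSize);
}
//----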

I think I figured out why I'm getting different results than you 
guys are, on my Windows machine.

AFAIK, file reads on Windows are natively asynchronous.

I wrote a multi-threaded version of the parser, with a thread 
dedicated to reading the file, while the main thread parses the 
read buffers.
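
Roughly, the split looks like this. This is only a minimal sketch 
using std.concurrency, not the actual parser code; BufferSize and the 
end-of-file convention are assumptions:

//----
import std.concurrency;
import std.stdio;

enum BufferSize = 1024 * 1024; // chunk size is arbitrary here

// Dedicated reader: fills buffers and hands them to the parsing thread.
void readerThread(string path, Tid parserTid)
{
    auto input = File(path, "rb");
    for (;;)
    {
        auto buffer = new ubyte[](BufferSize);
        auto chunk = input.rawRead(buffer);
        send(parserTid, chunk.idup); // immutable copy, safe to send
        if (chunk.length < BufferSize)
            break; // a short read means end of file
    }
}

void main(string[] args)
{
    spawn(&readerThread, args[1], thisTid);
    for (;;)
    {
        auto chunk = receiveOnly!(immutable(ubyte)[])();
        // ... parse chunk here, while the reader fetches the next one ...
        if (chunk.length < BufferSize)
            break;
    }
}
//----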

I'm getting EXACTLY 0% performance improvement. Not better, not 
worse, just 0%.

I'd have to try again on my SSD. Right now, I'm parsing the 6 Gig 
file in 60 seconds, which is the limit of my HDD. As a matter of 
fact, just *reading* the file takes the EXACT same amount of time 
as parsing it...

This takes 60 seconds.
//----
    import std.stdio;

    enum BufferSize = 1024 * 1024; // the buffer size is arbitrary here
    auto input = File(args[1], "rb");
    ubyte[] buffer = new ubyte[](BufferSize);
    do {
        // rawRead returns the slice it actually filled; an empty
        // slice signals end of file and ends the loop.
        buffer = input.rawRead(buffer);
    } while (buffer.length);
//----

This takes 60 seconds too.
//----
    // Parser and globalNucleic are from my parser code (not yet published).
    Parser parser = new Parser(args[1]);
    foreach(q; parser)
        foreach(char c; q.sequence)
            globalNucleic.collect(c);
//----

So at this point, I'd need to test on my Linux box, or publish 
the code so you can tell me how I'm doing.

I'm still tweaking the code to publish something readable, as 
there is a lot of sketchy code right now.

I'm also implementing correct exception handling, so that if there 
is an erroneous entry, an exception is thrown. However, all the 
erroneous data is parsed out of the file and placed inside the 
exception. This means that:
a) you can inspect the erroneous data;
b) you can skip the erroneous data and parse the rest of the file.
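
Here is a rough standalone sketch of that scheme; the names are 
illustrative, not the actual parser's API:

//----
import std.stdio;

// The erroneous data travels inside the exception.
class ParseException : Exception
{
    string badEntry;

    this(string msg, string badEntry,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(msg, file, line);
        this.badEntry = badEntry;
    }
}

// Toy entry parser: the bad entry is consumed from the input either
// way, so the caller can simply continue with the next one.
string parseEntry(ref string[] entries)
{
    auto entry = entries[0];
    entries = entries[1 .. $];
    if (entry.length == 0 || entry[0] != '>')
        throw new ParseException("malformed entry", entry);
    return entry;
}

void main()
{
    auto entries = [">ok1", "broken", ">ok2"];
    while (entries.length)
    {
        try
        {
            writeln("parsed: ", parseEntry(entries));
        }
        catch (ParseException e)
        {
            // a) inspect the erroneous data, b) skip it and keep going
            writeln("skipped bad entry: ", e.badEntry);
        }
    }
}
//----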

Once I deliver the code with the multi-threaded path activated, you 
should see better performance on Linux.

When "1.0" is ready, I'll create a github project for it, so work 
can be done parallel on it.

