How to read fastly files ( I/O operation)

Wed Feb 6 04:33:13 PST 2013

On Wednesday, 6 February 2013 at 11:15:22 UTC, monarch_dodra 
wrote:
> I'm going to try and see with some example files if I can't get 
> something running faster.

After some benchmarking and tweaking, I found three things that 
speed up your program:

1) Make computeLocal a compile-time constant. This gives you a 
tiny bit of performance. Whether it is an option depends on 
whether you plan to make it a run-time argument switch, I guess.

2) This makes things about 10%-20% faster:
Your "nucleic" and "amino" hash tables map a character to an 
index. However, given the range of the characters ('A' to 'Z'), 
you are better off using a flat array, where each index 
represents a character, e.g. A is index 0, B is index 1. This 
way, lookup is a simple array indexing, as opposed to a hash 
table lookup.

You may even get a bigger bang for your buck by simply laying out 
your "_stats" structure the same way ("A is index 0, B is index 
1"), and only "re-ordering" the data at the end, when you want to 
read it. (I haven't tried this, though.)

3) This makes things about 100% faster (it ran in half the time 
on my machine): I don't know how mmFile works, but a simple File 
+ "rawRead" seems to get the job done fast. Also, instead of 
keeping track of one (or several) indexes, I merely keep a single 
slice. The only thing I care about is whether my slice is empty, 
in which case I refill it.
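
To make points 1 and 2 concrete, here is a minimal sketch. It is not 
the code from your program: the alphabet "ACGTN", the helper names 
makeTable, lookup and tally, and the meaning I give computeLocal 
(whether the tally is updated at all) are assumptions made up for 
illustration.

```d
// Point 1: a manifest constant instead of a run-time variable.
// What computeLocal really controls is not shown here; in this sketch
// it is a hypothetical switch that enables the tally.
enum bool computeLocal = true;

// Point 2: build a flat 'A'..'Z' table, -1 meaning "not in the alphabet".
int[26] makeTable(string letters)
{
    int[26] table = -1; // every slot starts as -1
    foreach (i, c; letters)
        table[c - 'A'] = cast(int) i;
    return table;
}

// Evaluated at compile time because it initializes a module-level immutable.
// The alphabet "ACGTN" is an assumption.
immutable int[26] nucleicIndex = makeTable("ACGTN");

// Plain array indexing replaces the hash table lookup.
int lookup(char c)
{
    return (c >= 'A' && c <= 'Z') ? nucleicIndex[c - 'A'] : -1;
}

// Counts each known letter of seq into stats.
void tally(const(char)[] seq, ref size_t[5] stats)
{
    foreach (c; seq)
    {
        immutable idx = lookup(c);
        if (idx < 0)
            continue; // letter is not in the alphabet
        static if (computeLocal) // resolved at compile time, no run-time branch
            ++stats[idx];
    }
}
```

With this table, lookup('A') gives 0 and lookup('T') gives 3, while 
anything outside the alphabet gives -1. If the switch has to stay a 
run-time option, the enum would go back to being an ordinary bool and 
the static if an ordinary if.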

The modified code is here. I'm apparently getting the same output 
you are, but that doesn't mean there are no bugs in it. For 
example, I noticed that you don't strip leading whitespace, if 
any, before the first read.
http://dpaste.dzfl.pl/9b9353b8
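
Independently of the paste, the "single slice + refill" idea from 
point 3 can be sketched roughly like this. It only counts newline 
bytes, so it is a stand-in for the real parsing; the function name 
countNewlines and the 64 KiB default buffer size are assumptions, not 
something taken from the linked code.

```d
import std.stdio : File;

/// Counts '\n' bytes in a file, reading it in chunks with rawRead and
/// consuming each chunk through one slice.
size_t countNewlines(string path, size_t bufferSize = 64 * 1024)
{
    auto file = File(path, "rb");
    auto buffer = new ubyte[](bufferSize); // must not be empty
    ubyte[] slice;                         // the unread part of buffer
    size_t count;

    for (;;)
    {
        if (slice.length == 0)
        {
            // rawRead returns the part of buffer that was actually filled.
            slice = file.rawRead(buffer);
            if (slice.length == 0)
                break; // nothing more to read
        }
        if (slice[0] == '\n')
            ++count;
        slice = slice[1 .. $]; // advance: no separate index to maintain
    }
    return count;
}
```

Calling countNewlines("reads.fastq") (hypothetical file name) walks 
the whole file while only ever holding one buffer in memory. The only 
state besides the counter is the slice itself.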

----
I'd be tempted to rewrite the parser using a "byLine" approach, 
since my quick reading about FASTQ seems to imply it is a 
line-based format. Or just plain try to write a parser from 
scratch, putting my own logic and thought into it (all I did was 
modify your code, without caring about the actual algorithm).
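
A rough idea of what the "byLine" version could look like, assuming 
the common four-line FASTQ record layout (header, sequence, '+' 
separator, quality) with no wrapped sequence lines. I have not 
benchmarked it, and the function name countSequenceLetters is made up 
for this sketch.

```d
import std.stdio : File;

/// Tallies the letters 'A'..'Z' found on the sequence lines of a file,
/// assuming every record is exactly four lines long.
ulong[26] countSequenceLetters(string path)
{
    ulong[26] counts; // A is index 0, B is index 1, ...
    size_t lineNo;

    foreach (line; File(path).byLine())
    {
        // The sequence is the second line of each four-line record.
        if (lineNo++ % 4 != 1)
            continue;
        foreach (c; line)
            if (c >= 'A' && c <= 'Z')
                ++counts[c - 'A'];
    }
    return counts;
}
```

byLine reuses its internal buffer between iterations, so the line 
must not be stored without copying it first. For plain counting like 
this, that is not an issue.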