Speeding up text file parser (BLAST tabular format)

Tue Sep 15 01:51:00 PDT 2015

On Monday, 14 September 2015 at 16:33:23 UTC, Rikki Cattermole 
wrote:
>
> A lot of this hasn't been covered I believe.
>
> http://dpaste.dzfl.pl/f7ab2915c3e1
>
> 1) You don't need to convert char[] to string via to. No. Too 
> much. Cast it.
> 2) You don't need byKey, use foreach key, value syntax. That 
> way you won't go around modifying things unnecessarily.
>
> Ok, I disabled GC + reserved a bunch of memory. It probably 
> won't help much actually. In fact may make it fail so keep that 
> in mind.
>
> Humm what else.
>
> I'm worried about that first foreach. I don't think it needs to 
> exist as it does. I believe an input range would be far better. 
> Use a buffer to store the Hit[]'s. Have a subset per set of 
> them.
>
> If the first foreach is an input range, then things become 
> slightly easier in the second. Now you can turn that into its 
> own input range.
> Also that .array usage concerns me. Many an allocation there! 
> Hence why the input range should be the return from it.
>
> The last foreach is, let's assume, a dummy. Keep in mind, stdout 
> is expensive here. DO NOT USE. If you must buffer output then do 
> it in large quantities.
>
>
> Based upon what I can see, you are definitely not able to use 
> your CPUs to the max. There is no way that is the limiting 
> factor here. Maybe your usage of a core is. But not the CPUs 
> themselves.
>
> The thing is, you cannot use multiple threads on that first 
> foreach loop to speed things up. No. That needs to happen all 
> on one thread.
> Instead after that thread you need to push the result into 
> another.
>
> Perhaps, per thread one lock (mutex) + buffer for hits. Go 
> round robin over all the threads. If a mutex is still locked, 
> you'll need to wait. In this situation a locked mutex means all 
> your worker threads are working. So you can't do anything more 
> (anyway).
>
> Of course after all this, the HDD may still be getting hit too 
> hard. In which case I would recommend memory mapping the file. 
> Which should allow the OS to more efficiently handle reading it 
> into memory. But you'll need to rework .byLine for that.
>
>
> Wow that was a lot at 4:30am! So don't take it too seriously. 
> I'm sure somebody else will rip that to shreds!

Thanks for your suggestions! That sure is a lot of details. I'll 
have to go through them carefully to understand what to do with 
all this. Going multithreaded sounds fun but would effectively 
kill off all of my spare time, so I might have to skip that. :)

Using char[] all around might be a good idea, but it doesn't seem 
like the string conversions are really that taxing. What are the 
arguments for working on char[] arrays rather than strings?
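
To make the question concrete, here is a minimal sketch of the two 
conversions as I understand them. The helper names copyField and 
viewField are made up for illustration and are not from the dpaste 
code, and the char[] in main only stands in for the buffer that 
File.byLine reuses between lines.

```d
import std.conv : to;

// Hypothetical helper: converts via std.conv.to, which allocates a new
// immutable copy of the characters.
string copyField(char[] field)
{
    return field.to!string;
}

// Hypothetical helper: reinterprets the same memory as a string without
// copying. Only safe while nobody writes to the underlying buffer.
string viewField(char[] field)
{
    return cast(string) field;
}

void main()
{
    // Stand-in for the line buffer that File.byLine reuses on each iteration.
    char[] buf = "query1\tsubject1".dup;

    string copied = copyField(buf[0 .. 6]);
    string viewed = viewField(buf[0 .. 6]);

    assert(copied == "query1" && viewed == "query1");
    assert(copied.ptr !is buf.ptr); // separate allocation
    assert(viewed.ptr is buf.ptr);  // still points into the buffer
}
```

If that reading is right, the cast saves one allocation per field, 
but the resulting string is only valid until the buffer is 
overwritten, so anything kept beyond the current line would still 
need a copy.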

I'm aware that printing output like that is a performance killer, 
but it's not supposed to write anything in the final program. 
It's just there for me to be able to compare the results to my 
Python code.