Looking for a Code Review of a Bioinformatics POC

Jon Degenhardt jond at noreply.com
Fri Jun 12 07:25:09 UTC 2020


On Friday, 12 June 2020 at 06:20:59 UTC, H. S. Teoh wrote:
> I glanced over the implementation of byLine.  It appears to be 
> the unhappy compromise of trying to be 100% correct, cover all 
> possible UTF encodings, and all possible types of input streams 
> (on-disk file vs. interactive console).  It does UTF decoding 
> and resizing of arrays, and a lot of other frilly little 
> squirrelly things.  In fact I'm dismayed at how hairy it is, 
> considering the conceptual simplicity of the task!
>
> Given this, it will definitely be much faster to load in large 
> chunks of the file at a time into a buffer, and scanning 
> in-memory for linebreaks. I wouldn't bother with decoding at 
> all; I'd just precompute the byte sequence of the linebreaks 
> for whatever encoding the file is expected to be in, and just 
> scan for that byte pattern and return slices to the data.

This is basically what bufferedByLine in tsv-utils does. See: 
https://github.com/eBay/tsv-utils/blob/master/common/src/tsv_utils/common/utils.d#L793.

tsv-utils has the advantage of only needing to support UTF-8 
files with Unix newlines, so the code is simpler. (Windows 
newlines are detected, but that happens separately from 
bufferedByLine.) Still, as you describe, a wider variety of 
input cases could be supported without sacrificing basic 
performance. iopipe provides much more generic support, and it 
is quite fast.
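To make the idea concrete, here is a minimal sketch of the chunked 
approach (illustrative only, not the actual tsv-utils 
bufferedByLine; the function and parameter names are made up): read 
the file in large chunks and hand the caller slices between '\n' 
bytes. No UTF decoding is needed, because 0x0A never occurs inside 
a multi-byte UTF-8 sequence.

```d
import std.stdio : File;

// Hypothetical sketch: read large chunks, emit slices between '\n'
// bytes. Assumes UTF-8 input with Unix newlines.
void chunkedByLine(File f, void delegate(const(char)[]) sink,
                   size_t chunkSize = 1024 * 1024)
{
    auto readBuf = new ubyte[chunkSize];
    char[] leftover;                      // partial line carried across chunks

    foreach (chunk; f.byChunk(readBuf))
    {
        // Concatenation copies, so slices of 'data' stay valid even
        // though readBuf is reused on the next read.
        char[] data = leftover ~ cast(char[]) chunk;
        size_t start = 0;
        foreach (i, c; data)              // iterates code units, no decoding
        {
            if (c == '\n')
            {
                sink(data[start .. i]);   // slice excludes the newline
                start = i + 1;
            }
        }
        leftover = data[start .. $];      // unterminated tail for next chunk
    }
    if (leftover.length) sink(leftover);  // last line lacks a trailing '\n'
}
```

A real implementation would scan with something like memchr rather 
than a byte-at-a-time loop, and would avoid the per-chunk 
concatenation, but the structure is the same: one big read, then 
slices into the buffer.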

> Having said all of that, though: usually in non-trivial 
> programs reading input is the least of your worries, so this 
> kind of micro-optimization is probably unwarranted except for 
> very niche cases and for micro-benchmarks and other such toy 
> programs where the cost of I/O constitutes a significant chunk 
> of running times.  But knowing what byLine does under the hood 
> is definitely interesting information for me to keep in mind, 
> the next time I write an input-heavy program.

tsv-utils tools saw performance gains of 10-40% by moving from 
File.byLine to bufferedByLine, depending on tool and type of file 
(narrow or wide). Gains of 5-20% were obtained by switching from 
File.write to BufferedOutputRange, with some special cases 
improving by 50%. tsv-utils tools aren't micro-benchmarks, but 
they are not typical apps either. Most of the tools run a tight 
loop of some kind, applying a transformation to the input and 
writing to the output. Performance is a real benefit for these 
tools, as they get run on reasonably large data sets.
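The output side follows the same principle. Below is a hedged 
sketch of the idea behind a buffered output range (the names here 
are illustrative, not the actual tsv-utils BufferedOutputRange 
API): accumulate output in an in-memory buffer and write it to the 
underlying File in large blocks, avoiding the per-call overhead of 
File.write.

```d
import std.stdio : File;

// Hypothetical buffered writer: batches small writes into one
// large rawWrite once the buffer passes a threshold.
struct BufferedWriter
{
    private File _file;
    private char[] _buf;
    private size_t _flushSize;

    this(File f, size_t flushSize = 64 * 1024)
    {
        _file = f;
        _flushSize = flushSize;
        _buf.reserve(flushSize);
    }

    void put(const(char)[] s)
    {
        _buf ~= s;
        if (_buf.length >= _flushSize) flush();
    }

    void flush()
    {
        if (_buf.length)
        {
            _file.rawWrite(_buf);
            _buf.length = 0;
            _buf.assumeSafeAppend();  // reuse the buffer's capacity
        }
    }

    ~this() { flush(); }  // don't lose buffered output on scope exit
}
```

The win comes from replacing many small, individually locked writes 
with a few large ones; the destructor flush keeps the last partial 
buffer from being dropped.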



More information about the Digitalmars-d-learn mailing list