Looking for a Code Review of a Bioinformatics POC
Jon Degenhardt
jond at noreply.com
Fri Jun 12 07:25:09 UTC 2020
On Friday, 12 June 2020 at 06:20:59 UTC, H. S. Teoh wrote:
> I glanced over the implementation of byLine. It appears to be
> the unhappy compromise of trying to be 100% correct, cover all
> possible UTF encodings, and all possible types of input streams
> (on-disk file vs. interactive console). It does UTF decoding
> and resizing of arrays, and a lot of other frilly little
> squirrelly things. In fact I'm dismayed at how hairy it is,
> considering the conceptual simplicity of the task!
>
> Given this, it will definitely be much faster to load in large
> chunks of the file at a time into a buffer, and scanning
> in-memory for linebreaks. I wouldn't bother with decoding at
> all; I'd just precompute the byte sequence of the linebreaks
> for whatever encoding the file is expected to be in, and just
> scan for that byte pattern and return slices to the data.
This is basically what bufferedByLine in tsv-utils does. See:
https://github.com/eBay/tsv-utils/blob/master/common/src/tsv_utils/common/utils.d#L793.
tsv-utils has the advantage of only needing to support UTF-8
files with Unix newlines, so the code is simpler. (Windows
newlines are detected, but that happens separately from
bufferedByLine.) Still, as you describe, support for a wider
variety of input cases could be added without sacrificing
baseline performance. iopipe provides much more generic support,
and it is quite fast.
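For illustration, the chunked-scan idea described above (read large
blocks, search for the newline byte directly, and return slices into
the buffer with no decoding) can be sketched in Python; the function
name and chunk size here are my own for illustration, not taken from
tsv-utils or iopipe:

```python
import io

def chunked_lines(stream, chunk_size=64 * 1024):
    """Yield lines as bytes by scanning raw chunks for b'\\n'.

    No UTF decoding is done; the newline byte is searched for
    directly, which is sufficient for UTF-8 with Unix newlines.
    """
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = leftover + chunk
        start = 0
        while True:
            nl = buf.find(b"\n", start)
            if nl < 0:
                break
            yield buf[start:nl]       # one line, newline stripped
            start = nl + 1
        leftover = buf[start:]        # partial trailing line
    if leftover:
        yield leftover                # final line with no newline

# Usage with an in-memory stream standing in for a file:
data = io.BytesIO(b"alpha\nbeta\ngamma")
print(list(chunked_lines(data, chunk_size=4)))
```

A real implementation would reuse a fixed buffer rather than
concatenating bytes, but the control flow is the same.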
> Having said all of that, though: usually in non-trivial
> programs reading input is the least of your worries, so this
> kind of micro-optimization is probably unwarranted except for
> very niche cases and for micro-benchmarks and other such toy
> programs where the cost of I/O constitutes a significant chunk
> of running times. But knowing what byLine does under the hood
> is definitely interesting information for me to keep in mind,
> the next time I write an input-heavy program.
tsv-utils tools saw performance gains of 10-40% by moving from
File.byLine to bufferedByLine, depending on the tool and the type
of file (narrow or wide). Switching from File.write to
BufferedOutputRange yielded further gains of 5-20%, with some
special cases improving by 50%. tsv-utils tools aren't
micro-benchmarks, but they aren't typical apps either: most of
them run a tight loop that transforms each input line and writes
the result to the output. Performance is a real benefit for these
tools, as they get run on reasonably large data sets.
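The gain from BufferedOutputRange comes from batching many small
writes into one large write to the underlying sink. A minimal sketch
of that batching idea in Python (the class name, sizes, and API here
are illustrative, not the actual tsv-utils interface):

```python
import io

class BufferedWriter:
    """Accumulate small writes and flush them to the sink in one
    large write, mimicking the batching strategy of tsv-utils'
    BufferedOutputRange (illustrative sketch, not the real API)."""

    def __init__(self, sink, flush_size=64 * 1024):
        self.sink = sink
        self.flush_size = flush_size
        self.parts = []
        self.size = 0

    def write(self, data):
        self.parts.append(data)
        self.size += len(data)
        if self.size >= self.flush_size:
            self.flush()

    def flush(self):
        if self.parts:
            self.sink.write(b"".join(self.parts))  # one large write
            self.parts.clear()
            self.size = 0

# Usage: four small writes become a single write to the sink.
sink = io.BytesIO()
w = BufferedWriter(sink, flush_size=8)
for token in [b"ab", b"cd", b"ef", b"gh"]:
    w.write(token)
w.flush()
print(sink.getvalue())
```

The payoff is largest when each loop iteration would otherwise make
its own small, locked write to the output stream.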
More information about the Digitalmars-d-learn
mailing list