Looking for a Code Review of a Bioinformatics POC

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Jun 12 06:20:59 UTC 2020


On Fri, Jun 12, 2020 at 03:32:48AM +0000, Jon Degenhardt via Digitalmars-d-learn wrote:
[...]
> I haven't spent much time on results presentation, I know it's not
> that easy to read and interpret the results. Brief summary - On files
> with short lines buffering will result in dramatic throughput
> improvements over the standard phobos facilities. This is true for
> both input and output, though likely for different reasons. For input
> iopipe is the fastest available. tsv-utils buffered facilities are
> materially faster than phobos for both input and output, but not as
> fast as iopipe for input. Combining iopipe for input with tsv-utils
> BufferOutputRange for output works pretty well.
> 
> For files with long lines both iopipe and tsv-utils BufferedByLine are
> materially faster than Phobos File.byLine when reading. For writing
> there wasn't much difference from Phobos File.write.

Interesting.  Based on the OP's posted profile data, I got the
impression that input wasn't a big deal, but output was.  I wonder why.


> A note on File.byLine - I've had many opportunities to compare Phobos
> File.byLine to facilities in other programming languages, and it is
> not bad at all. But it is beatable.

I glanced over the implementation of byLine.  It appears to be the
unhappy compromise of trying to be 100% correct, to cover all possible
UTF encodings, and to handle all possible types of input streams
(on-disk file vs. interactive console).  It does UTF decoding, resizes
arrays, and does a lot of other frilly little squirrelly things.  In
fact, I'm dismayed at how hairy it is, considering the conceptual
simplicity of the task!

Given this, it will definitely be much faster to load large chunks of
the file into a buffer and scan in-memory for linebreaks.  I wouldn't
bother with decoding at all; I'd precompute the byte sequence of the
linebreak for whatever encoding the file is expected to be in, scan
for that byte pattern, and return slices into the data.
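
Something along these lines, say (just a rough, untested sketch of
that approach, assuming '\n'-terminated lines in an ASCII-compatible
encoding; the eachRawLine name and the 1 MiB chunk size are made up
for illustration):

import std.stdio : File, writeln;
import std.algorithm.searching : countUntil;

// Read the file in large chunks and hand out slices of each line,
// without any UTF decoding.  Lines straddling a chunk boundary are
// carried over to the next iteration.
void eachRawLine(string path, scope void delegate(const(ubyte)[]) sink)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[](1 << 20);   // 1 MiB read buffer (arbitrary)
    ubyte[] carry;                     // partial line from previous chunk

    while (true)
    {
        auto chunk = f.rawRead(buf);
        if (chunk.length == 0) break;

        size_t start = 0;
        while (true)
        {
            auto nl = countUntil(chunk[start .. $], cast(ubyte)'\n');
            if (nl < 0) break;
            auto line = chunk[start .. start + nl];
            if (carry.length) { sink(carry ~ line); carry.length = 0; }
            else sink(line);           // slice valid only until next rawRead
            start += nl + 1;
        }
        carry ~= chunk[start .. $];    // stash the unterminated tail
    }
    if (carry.length) sink(carry);     // last line without a trailing '\n'
}

void main(string[] args)
{
    size_t n;
    eachRawLine(args[1], (line) { ++n; });
    writeln(n, " lines");
}

The slices handed to the sink point into the read buffer, so they are
only valid until the next rawRead; anything that needs to outlive that
would have to be .dup'd.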


> About Memory Mapped Files - The benchmarks don't include compare
> against mmfile. They certainly make sense as a comparison point.
[...]

I'd seriously consider using std.mmfile if I/O is determined to be a
significant bottleneck.  Letting the OS page the file in on demand
instead of copying buffers across the C file API boundary is
definitely going to be a lot faster.  Plus it greatly simplifies the
code: you can arbitrarily scan and slice over the file data without
manually managing buffers of your own, leaving code that's simpler and
more conducive to the compiler squeezing the last bit of speed juice
out of it.
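
For instance, iterating over the lines of an mmapped file reduces to
something like this (a quick sketch, not the OP's code; assumes '\n'
line endings):

import std.mmfile : MmFile;
import std.algorithm.iteration : splitter;
import std.stdio : writeln;

void main(string[] args)
{
    // Map the whole file; the OS pages it in on demand.
    auto mm = new MmFile(args[1]);
    auto data = cast(const(ubyte)[]) mm[];

    size_t n;
    // splitter yields slices into the mapped memory: no copying, no
    // decoding.  (A trailing '\n' produces one final empty slice.)
    foreach (line; data.splitter(cast(ubyte)'\n'))
        ++n;
    writeln(n, " slices");
}

Each "line" there is just a pointer/length pair into the mapping, so
downstream code can pass it around freely, as long as the MmFile stays
alive.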

I'd definitely avoid stdio.byLine if input is determined to be a
bottleneck: decoding characters from the file data just to find
linebreaks strikes me as a slow way of doing things.

Having said all of that, though: usually in non-trivial programs
reading input is the least of your worries, so this kind of
micro-optimization is probably unwarranted except in niche cases,
micro-benchmarks, and other such toy programs where the cost of I/O
constitutes a significant chunk of the running time.  But knowing what
byLine does under the hood is definitely interesting information for
me to keep in mind the next time I write an input-heavy program.

(I'm reminded of that one time when, as a little diversion, I decided
to see if I could beat GNU wc at counting lines in a file.  It was not
easy, since wc is optimized to next year and back, but eventually a
combination of std.mmfile and std.parallelism to scan large chunks of
the file simultaneously managed to beat wc by a good margin.  In the
meantime, though, I also discovered that a file of very short lines
brings out poor performance in wc, whereas a file of very long lines
brings out its best: glibc's memchr appears to be tuned for a
micro-benchmark geared towards scanning arrays with only rare
occurrences of the sought character, but typical text files have much
more frequent matches (shorter lines).  With very short lines, the
per-call overhead of the hyper-optimized code added up significantly;
with sufficiently long lines, the gains far outweighed that overhead.
Optimization is a tricky beast: always measure and optimize for your
actual use case rather than making your code look good on some
artificial micro-benchmark, else your code may look good on the
benchmark but perform poorly on real-world data.)
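
(For the curious, the core of that approach looks roughly like the
following; this is not the actual program, just a minimal sketch of
the mmfile + parallelism combination, with an arbitrary 16 MiB chunk
size, counting '\n' bytes the way wc -l does:

import std.mmfile : MmFile;
import std.parallelism : taskPool;
import std.algorithm.searching : count;
import std.algorithm.iteration : sum;
import std.algorithm.comparison : min;
import std.range : iota;
import std.stdio : writeln;

void main(string[] args)
{
    auto mm = new MmFile(args[1]);
    auto data = cast(const(ubyte)[]) mm[];

    enum chunkSize = 16 << 20;   // 16 MiB per chunk (arbitrary)
    auto nChunks = (data.length + chunkSize - 1) / chunkSize;
    auto counts = new size_t[](nChunks);

    // Each worker counts '\n' bytes in its own chunk; the only shared
    // state is the per-chunk slot each task writes to.
    foreach (i; taskPool.parallel(iota(nChunks)))
    {
        auto lo = i * chunkSize;
        auto hi = min(lo + chunkSize, data.length);
        counts[i] = data[lo .. hi].count(cast(ubyte)'\n');
    }

    writeln(counts.sum, " lines");
}

That parallel scan over the mapping is what eventually pulled ahead of
wc.)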


T

-- 
May you live all the days of your life. -- Jonathan Swift

