Looking for a Code Review of a Bioinformatics POC

Jon Degenhardt jond at noreply.com
Fri Jun 12 03:32:48 UTC 2020


On Friday, 12 June 2020 at 00:58:34 UTC, duck_tape wrote:
> On Thursday, 11 June 2020 at 23:45:31 UTC, H. S. Teoh wrote:
>>
>> Hmm, looks like it's not so much input that's slow, but 
>> *output*. In fact, it looks pretty bad, taking almost as much 
>> time as overlap() does in total!
>>
>> [snip...]
>
> I'll play with that a bit tomorrow! I saw a nice implementation 
> on eBay's tsvutils that I may need to look closer at.
>
> Someone else suggested that stdout flushes per line by default. 
> I dug around the stdlib but couldn't confirm that. I also played 
> around with setvbuf but it didn't seem to change anything.
>
> Have you run into that before / know if stdout is flushing 
> on every newline? I'm not above opening '/dev/stdout' as a file 
> if that writes faster.

I put some comparative benchmarks in 
https://github.com/jondegenhardt/dcat-perf. It compares input 
and output using standard Phobos facilities (File.byLine, 
File.write), iopipe (https://github.com/schveiguy/iopipe), and 
the tsv-utils buffered input and buffered output facilities.

I haven't spent much time on presenting the results, and I know 
they are not that easy to read and interpret. Brief summary: on 
files with short lines, buffering results in dramatic throughput 
improvements over the standard Phobos facilities. This is true 
for both input and output, though likely for different reasons. 
For input, iopipe is the fastest available. The tsv-utils 
buffered facilities are materially faster than Phobos for both 
input and output, but not as fast as iopipe for input. Combining 
iopipe for input with tsv-utils BufferedOutputRange for output 
works pretty well.
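
To illustrate the output-buffering idea, here is a minimal sketch. 
It is not the tsv-utils BufferedOutputRange implementation and not 
the dcat-perf benchmark code. The 64 KB flush threshold and the use 
of an Appender as the buffer are assumptions chosen for 
illustration. It copies stdin to stdout, collecting lines in memory 
and writing them in large chunks rather than one write per line.

```d
import std.array : appender;
import std.stdio;

void main()
{
    // Assumed threshold for illustration; tune it for your workload.
    enum flushSize = 64 * 1024;

    // In-memory buffer that accumulates output between writes.
    auto buf = appender!(char[])();
    buf.reserve(flushSize + 1024);

    // KeepTerminator.yes keeps the newline, so lines are copied as-is.
    foreach (line; stdin.byLine(KeepTerminator.yes))
    {
        buf.put(line);

        // Write only once enough data has accumulated.
        if (buf.data.length >= flushSize)
        {
            stdout.rawWrite(buf.data);
            buf.clear();
        }
    }

    // Write whatever is left at end of input.
    if (buf.data.length > 0)
        stdout.rawWrite(buf.data);
}
```

With short lines this replaces many small writes with a few large 
ones, which is the effect the short-line results above point to.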

For files with long lines both iopipe and tsv-utils 
BufferedByLine are materially faster than Phobos File.byLine when 
reading. For writing there wasn't much difference from Phobos 
File.write.

A note on File.byLine - I've had many opportunities to compare 
Phobos File.byLine to facilities in other programming languages, 
and it is not bad at all. But it is beatable.

About memory-mapped files: the benchmarks don't include a 
comparison against mmfile. It would certainly make sense as a 
comparison point.

--Jon

