Reading from stdin significantly slower than reading file directly?
Jon Degenhardt
jond at noreply.com
Thu Aug 13 07:08:21 UTC 2020
On Wednesday, 12 August 2020 at 22:44:44 UTC, methonash wrote:
> Hi,
>
> Relative beginner to D-lang here, and I'm very confused by the
> apparent performance disparity I've noticed between programs
> that do the following:
>
> 1) cat some-large-file | D-program-reading-stdin-byLine()
>
> 2) D-program-directly-reading-file-byLine() using File() struct
>
> The D-lang difference I've noticed between options (1) and (2)
> is somewhere in the range of 80% more wall time (7.5s vs 4.1s),
> which seems pretty extreme.
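For concreteness, the two cases reduce to something like this
minimal sketch (case 1 reads stdin, case 2 opens the file
directly; both paths go through the same 'File.byLine', and
counting lines stands in for the real per-line work):

    import std.stdio;

    void main(string[] args)
    {
        // Open the file directly if a path is given (case 2);
        // otherwise read standard input (case 1). 'stdin' is
        // itself a File struct, so both paths use File.byLine.
        auto input = args.length > 1 ? File(args[1], "r") : stdin;

        size_t n = 0;
        foreach (line; input.byLine)
            ++n;
        writeln(n, " lines");
    }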
I don't know enough details of the implementation to really
answer the question, and I expect it's a bit complicated.
However, it's an interesting question, and I have relevant
programs and data files, so I tried to get some actuals.
The tests I ran don't directly answer the question posed, but may
be a useful proxy. I used Unix 'cut' (latest GNU version) and
'tsv-select' from the tsv-utils package
(https://github.com/eBay/tsv-utils). 'tsv-select' is written in
D, and works like 'cut'. 'tsv-select' reads from stdin or a file
via a 'File' struct. It doesn't use the built-in 'byLine' member,
though; it uses a version of 'byLine' that includes some
additional buffering. Both stdin and a file system file are read
this way.
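The general idea of the extra buffering is to read the input in
large chunks and find line boundaries in user code rather than
calling into the runtime for every line. A rough sketch of that
approach (not the actual tsv-utils implementation, which among
other things handles lines spanning chunk boundaries):

    import std.stdio;
    import std.algorithm : splitter;

    void main()
    {
        auto buffer = new ubyte[1024 * 1024];  // 1 MB read buffer
        foreach (chunk; stdin.byChunk(buffer))
        {
            // NOTE: a real reader must carry a partial last line
            // over to the next chunk; omitted here for brevity.
            foreach (line; splitter(cast(char[]) chunk, '\n'))
            {
                // process 'line'
            }
        }
    }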
I used a file from the Google ngram collection
(http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
and the file TREE_GRM_ESTN.csv from
https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html,
converted to a TSV file.
The ngram file is narrow (21 bytes/line, 4 columns); the TREE
file is wider (206 bytes/line, 49 columns). In both cases I cut
the 2nd and 3rd columns. This tends to focus the work on input
handling rather than on processing and output. I also timed
'wc -l' for another data point.
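For reference, the per-line work in the column-selection tests
amounts to something like the following (a sketch of the
operation, not the actual 'tsv-select' code):

    import std.stdio;
    import std.algorithm : splitter, joiner;
    import std.range : drop, take;

    void main()
    {
        foreach (line; stdin.byLine)
        {
            // Split on TAB, keep the 2nd and 3rd fields,
            // and re-join them with a TAB.
            auto fields = line.splitter('\t').drop(1).take(2);
            writeln(fields.joiner("\t"));
        }
    }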
I ran the benchmarks 5 times each way and recorded the median
times below. The machine used is a Mac Mini (so macOS) with 16 GB
RAM and SSD drives. The numbers are very consistent for this test
on this machine; differences in the reported times are real
deltas, not system noise. The commands timed were:
* bash -c 'tsv-select -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | tsv-select -f 2,3 > /dev/null'
* bash -c 'gcut -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | gcut -f 2,3 > /dev/null'
* bash -c 'gwc -l FILE > /dev/null'
* bash -c 'cat FILE | gwc -l > /dev/null'
Note that 'gwc' and 'gcut' are the GNU versions of 'wc' and 'cut'
installed by Homebrew.
Google ngram file (the 's' unigram file); times in seconds:

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE          10.28    0.42   9.85
cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
cut -f 2,3 FILE                 14.64    0.60  14.03
cat FILE | cut -f 2,3           14.36    1.03  14.19
wc -l FILE                       1.32    0.39   0.93
cat FILE | wc -l                 1.18    0.96   1.04
The TREE file (times in seconds):

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE           3.77    0.95   2.81
cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
cut -f 2,3 FILE                 17.78    1.53  16.24
cat FILE | cut -f 2,3            16.77    2.64  16.36
wc -l FILE                       1.38    0.91   0.46
cat FILE | wc -l                 2.02    2.63   0.77
What this shows is that 'tsv-select' (the D program) was faster
when reading from a file than when reading from standard input.
It doesn't indicate why, or whether the delta is due to the D
standard library or to code in 'tsv-select'.
Interestingly, 'cut' showed the opposite behavior. It was faster
when reading from standard input than when reading from the file.
For 'wc', the faster method depended on line length.
Again, I caution against reading too much into this regarding
performance of reading from standard input vs a disk file. Much
more definitive tests can be done. However, it is an interesting
comparison.
Also, the D program is still fast in both cases.
--Jon