Reading from stdin significantly slower than reading file directly?
Jon Degenhardt
jond at noreply.com
Thu Aug 13 07:08:21 UTC 2020
On Wednesday, 12 August 2020 at 22:44:44 UTC, methonash wrote:
> Hi,
>
> Relative beginner to D-lang here, and I'm very confused by the
> apparent performance disparity I've noticed between programs
> that do the following:
>
> 1) cat some-large-file | D-program-reading-stdin-byLine()
>
> 2) D-program-directly-reading-file-byLine() using File() struct
>
> The D-lang difference I've noticed between options (1) and (2)
> is somewhere in the range of 80% more wall time (7.5s vs 4.1s),
> which seems pretty extreme.
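For concreteness, the two cases reduce to something like this
minimal sketch (case 1 reads stdin, case 2 opens the file
directly; both paths go through the same 'File.byLine', and
counting lines stands in for the real per-line work):

    import std.stdio;

    void main(string[] args)
    {
        // Open the file directly if a path is given (case 2);
        // otherwise read standard input (case 1). 'stdin' is
        // itself a File struct, so both paths use File.byLine.
        auto input = args.length > 1 ? File(args[1], "r") : stdin;

        size_t n = 0;
        foreach (line; input.byLine)
            ++n;
        writeln(n, " lines");
    }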
I don't know enough details of the implementation to really
answer the question, and I expect it's a bit complicated.
However, it's an interesting question, and I have relevant
programs and data files, so I tried to get some actuals.
The tests I ran don't directly answer the question posed, but may
be a useful proxy. I used Unix 'cut' (latest GNU version) and
'tsv-select' from the tsv-utils package
(https://github.com/eBay/tsv-utils). 'tsv-select' is written in
D, and works like 'cut'. 'tsv-select' reads from stdin or a file
via a 'File' struct. It doesn't use the built-in 'byLine' member,
though; it uses a version of 'byLine' that includes some
additional buffering. Both stdin and a file system file are read
this way.
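The general idea of the extra buffering is to read the input in
large chunks and find line boundaries in user code rather than
calling into the runtime for every line. A rough sketch of that
approach (not the actual tsv-utils implementation, which among
other things handles lines spanning chunk boundaries):

    import std.stdio;
    import std.algorithm : splitter;

    void main()
    {
        auto buffer = new ubyte[1024 * 1024];  // 1 MB read buffer
        foreach (chunk; stdin.byChunk(buffer))
        {
            // NOTE: a real reader must carry a partial last line
            // over to the next chunk; omitted here for brevity.
            foreach (line; splitter(cast(char[]) chunk, '\n'))
            {
                // process 'line'
            }
        }
    }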
I used a file from the Google ngram collection
(http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
and the file TREE_GRM_ESTN.csv from
https://apps.fs.usda.gov/fia/datamart/CSV/datamart_csv.html,
converted to a TSV file.
The ngram file is narrow (21 bytes/line, 4 columns); the TREE
file is wider (206 bytes/line, 49 columns). In both cases I cut
the 2nd and 3rd columns. This tends to focus the work on input
handling rather than on processing and output. I also timed
'wc -l' for another data point.
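For reference, the per-line work in the column-selection tests
amounts to something like the following (a sketch of the
operation, not the actual 'tsv-select' code):

    import std.stdio;
    import std.algorithm : splitter, joiner;
    import std.range : drop, take;

    void main()
    {
        foreach (line; stdin.byLine)
        {
            // Split on TAB, keep the 2nd and 3rd fields,
            // and re-join them with a TAB.
            auto fields = line.splitter('\t').drop(1).take(2);
            writeln(fields.joiner("\t"));
        }
    }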
I ran the benchmarks 5 times each way and recorded the median
times below. The machine used is a Mac Mini (so macOS) with 16 GB
RAM and SSD drives. The numbers are very consistent for this test
on this machine; differences in the reported times are real
deltas, not system noise. The commands timed were:
* bash -c 'tsv-select -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | tsv-select -f 2,3 > /dev/null'
* bash -c 'gcut -f 2,3 FILE > /dev/null'
* bash -c 'cat FILE | gcut -f 2,3 > /dev/null'
* bash -c 'gwc -l FILE > /dev/null'
* bash -c 'cat FILE | gwc -l > /dev/null'
Note that 'gwc' and 'gcut' are the GNU versions of 'wc' and 'cut'
installed by Homebrew.
Google ngram file (the 's' unigram file); times in seconds:

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE          10.28    0.42   9.85
cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
cut -f 2,3 FILE                 14.64    0.60  14.03
cat FILE | cut -f 2,3           14.36    1.03  14.19
wc -l FILE                       1.32    0.39   0.93
cat FILE | wc -l                 1.18    0.96   1.04
The TREE file (times in seconds):

Test                          Elapsed  System   User
----                          -------  ------  -----
tsv-select -f 2,3 FILE           3.77    0.95   2.81
cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
cut -f 2,3 FILE                 17.78    1.53  16.24
cat FILE | cut -f 2,3            16.77    2.64  16.36
wc -l FILE                       1.38    0.91   0.46
cat FILE | wc -l                 2.02    2.63   0.77
What this shows is that 'tsv-select' (the D program) was faster
when reading from a file than when reading from standard input.
It doesn't indicate why, or whether the delta is due to the D
standard library or to code in 'tsv-select'.
Interestingly, 'cut' showed the opposite behavior. It was faster
when reading from standard input than when reading from the file.
For 'wc', the faster method depended on line length.
Again, I caution against reading too much into this regarding
performance of reading from standard input vs a disk file. Much
more definitive tests can be done. However, it is an interesting
comparison.
Also, the D program is still fast in both cases.
--Jon