Reading from stdin significantly slower than reading file directly?

Thu Aug 13 14:39:45 UTC 2020

On Thursday, 13 August 2020 at 07:08:21 UTC, Jon Degenhardt wrote:
> Test                          Elapsed  System   User
> ----                          -------  ------   ----
> tsv-select -f 2,3 FILE          10.28    0.42   9.85
> cat FILE | tsv-select -f 2,3    11.10    1.45  10.23
> cut -f 2,3 FILE                 14.64    0.60  14.03
> cat FILE | cut -f 2,3           14.36    1.03  14.19
> wc -l FILE                       1.32    0.39   0.93
> cat FILE | wc -l                 1.18    0.96   1.04
>
>
> The TREE file:
>
> Test                          Elapsed  System   User
> ----                          -------  ------   ----
> tsv-select -f 2,3 FILE           3.77    0.95   2.81
> cat FILE | tsv-select -f 2,3     4.54    2.65   3.28
> cut -f 2,3 FILE                 17.78    1.53  16.24
> cat FILE | cut -f 2,3           16.77    2.64  16.36
> wc -l FILE                       1.38    0.91   0.46
> cat FILE | wc -l                 2.02    2.63   0.77
>

Your table shows that piping the output of one process into
another spends a lot more time in kernel mode (the System
column). A switch from user mode to kernel mode is expensive [1].
A call to getpid() costs around 1000-1500 clock cycles on most
systems: roughly 100 clock cycles for the actual switch, and the
rest is overhead.
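
To get a rough feel for that cost on your own machine, here is a
minimal sketch that times a cheap system call against a call that
stays in user mode. The helper names (ns_per_call, user_space_only)
are made up for illustration. It reports nanoseconds rather than
clock cycles, and interpreter overhead is included in both numbers,
so only the difference is meaningful. Some older C libraries cache
the result of getpid(), which would hide the switch.

```python
import os
import time


def ns_per_call(fn, calls=200_000):
    """Average wall-clock nanoseconds per call of fn()."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    return (time.perf_counter_ns() - start) / calls


def user_space_only():
    """Baseline: a call that never leaves user mode."""
    return 0


if __name__ == "__main__":
    baseline = ns_per_call(user_space_only)
    syscall = ns_per_call(os.getpid)  # normally enters the kernel
    print(f"user-space call: {baseline:8.1f} ns")
    print(f"os.getpid():     {syscall:8.1f} ns")
    print(f"difference:      {syscall - baseline:8.1f} ns")
```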

My theory is this:
One likely reason for the slowdown is mutex locking and
unlocking, which is needed more often when multiple processes
and (global) resources are involved than with a single instance.
Another is copying buffers.
When you read a file, the data is first read into a kernel buffer
and then copied to the user space buffer, i.e. the buffer you
allocated in your program (the read from disk might not happen if
the data is still in the cache).
If you read the file directly in your program, the data is copied 
once from kernel space to user space.
When you read from stdin (which is technically a file), it would
seem that cat first reads the file, which means a copy from kernel
to user space (cat). Then cat writes that buffer to stdout (also
technically a file), which is another copy, this time from cat's
buffer into the kernel's pipe buffer. Finally you read from stdin
in your program, which causes a third copy from the pipe buffer
into your allocated buffer.
Each of those steps may involve a mutex un/lock.
Also, with pipes you start two programs. Starting a program takes
a few ms.
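
The two paths can be compared with a small sketch like the one
below. It is not how tsv-select or cut read their input. It only
counts bytes, once by reading the file directly and once by reading
a pipe fed by a second process. The child is a tiny Python script
standing in for cat (an assumption made to keep the sketch
portable), so its startup time is much larger than cat's. All names
(count_direct, count_via_pipe, make_sample, CHUNK) are made up for
this example.

```python
import os
import subprocess
import sys
import tempfile
import time

CHUNK = 64 * 1024  # read size in bytes, an arbitrary choice

# Child program standing in for `cat`: copies the named file to stdout.
CAT_STANDIN = (
    "import shutil, sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    shutil.copyfileobj(f, sys.stdout.buffer)\n"
)


def count_direct(path):
    """Read the file directly: one kernel-to-user copy per chunk."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, CHUNK)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total


def count_via_pipe(path):
    """Read the same file through a pipe written by a second process."""
    cmd = [sys.executable, "-c", CAT_STANDIN, path]
    total = 0
    # bufsize=0 gives an unbuffered pipe object, so each read() below
    # is a single read from the pipe.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=0) as proc:
        while True:
            chunk = proc.stdout.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    return total


def timed(fn, path):
    """Return (byte count, elapsed seconds) for fn(path)."""
    start = time.perf_counter()
    n = fn(path)
    return n, time.perf_counter() - start


def make_sample(size=8 * 1024 * 1024):
    """Create a throwaway tab-separated file of about `size` bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"a\tb\tc\n" * (size // 6))
        return f.name


if __name__ == "__main__":
    path = make_sample()
    try:
        for name, fn in (("direct", count_direct), ("via pipe", count_via_pipe)):
            n, secs = timed(fn, path)
            print(f"{name:8s} {n} bytes in {secs:.3f}s")
    finally:
        os.remove(path)
```

With a sample this small the process startup will dominate the
piped number, so use a file of several GB (as in the table above)
if you want to see the copying cost itself.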

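PS. If you do your own caching, or if you don't care about it
because you just read a file sequentially once, you may benefit
from opening your file with the O_DIRECT flag, which basically
means that the kernel bypasses its own cache and transfers the
data directly into your user space buffers. It is Linux-specific
and usually requires aligned buffers, offsets and read sizes.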
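
A minimal sketch of an O_DIRECT read, assuming Linux and a
filesystem that supports the flag. The function names and the block
size are made up for this example. An anonymous mmap is used only
because it gives a page-aligned buffer. If the flag is missing or
the open is refused, the sketch falls back to a normal open, so it
still runs elsewhere but then measures nothing special. Whether
O_DIRECT is actually faster depends on the workload, so measure
before relying on it.

```python
import mmap
import os
import tempfile


def open_maybe_direct(path):
    """Open read-only with O_DIRECT if possible.

    Returns (fd, used_direct).
    """
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct:
        try:
            return os.open(path, os.O_RDONLY | direct), True
        except OSError:
            pass  # e.g. the filesystem refuses O_DIRECT
    return os.open(path, os.O_RDONLY), False


def count_bytes(path, block=1 << 20):
    """Count the bytes in `path`, reading `block` bytes at a time.

    `block` should be a multiple of the device's logical block size
    (4096 is a common safe multiple).
    """
    fd, used_direct = open_maybe_direct(path)
    buf = mmap.mmap(-1, block)  # anonymous mapping, page-aligned
    total = 0
    try:
        while True:
            n = os.readv(fd, [buf])  # reads straight into buf
            total += n
            if n < block:  # short read on a regular file: end of file
                break
    finally:
        buf.close()
        os.close(fd)
    return total, used_direct


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (4 * 4096))
        name = f.name
    try:
        total, used_direct = count_bytes(name, block=4096)
        print(f"{total} bytes read, O_DIRECT used: {used_direct}")
    finally:
        os.remove(name)
```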

[1] https://en.wikipedia.org/wiki/Ring_(computer_security)