A few measurements of stat()'s speed

Vladimir Panteleev thecybershadow.lists at gmail.com
Tue Mar 26 22:04:05 UTC 2019


On Tuesday, 26 March 2019 at 18:06:08 UTC, Andrei Alexandrescu 
wrote:
> On a Linux moderately-loaded local directory (146 files) 
> mounted from an SSD drive, one failed stat() takes only about 
> 0.5 microseconds. That means e.g. if a module imports std.all 
> (which fails 142 times), the overhead accountable to failed 
> stat() calls is about 70 microseconds, i.e. negligible.

I have some related experience with this:

- The eternal battle of keeping The Server's load levels down 
involves a fair deal of I/O profiling. The pertinent observation 
was that opening a file by name can be much faster than 
enumerating the files in a directory. The reason is that many 
filesystems implement directories as some variant of a hash 
table: accessing a file by name is a single hash table lookup, 
while enumerating all files means reading the entire structure 
(the first sketch after this list illustrates the difference).

- stat() is slow. It fetches a lot of information, and many 
filesystems do not keep all of that information as readily 
accessible as the file name. This is observable through a simple 
test: on Ubuntu, drop caches, then, in a big directory, compare 
the execution time of `ls | cat` vs. `ls`. Explanation: when 
ls's output is a terminal, it fetches extra information (using 
stat()) to colorize entries depending on their properties, but 
it skips this when piped into a file or another program. I had 
to take this into account when implementing a fast directory 
iterator [1] (stat() only when necessary; see the second sketch 
below). dirEntries from std.file does some of this too, but not 
to the full extent.
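
To illustrate the first point, here is a minimal D sketch (the 
directory and file names are hypothetical placeholders) timing a 
by-name lookup against a full enumeration of the same directory:

    // Hypothetical micro-benchmark: one by-name lookup vs.
    // enumerating the whole directory. On filesystems with hashed
    // directories, the former is a single lookup, while the
    // latter must read every entry.
    import std.datetime.stopwatch : AutoStart, StopWatch;
    import std.file : dirEntries, exists, SpanMode;
    import std.stdio : writefln;

    void main()
    {
        enum dir = "testdir";           // assumed: a big directory
        enum name = dir ~ "/file0000";  // assumed: one known entry

        auto sw = StopWatch(AutoStart.yes);
        const found = exists(name);     // single by-name lookup
        writefln("by name:   %s (found: %s)", sw.peek, found);

        sw.reset();
        size_t count;
        foreach (entry; dirEntries(dir, SpanMode.shallow))
            count++;                    // touches every entry
        writefln("enumerate: %s (%s entries)", sw.peek, count);
    }

(Drop caches between runs, as above, to keep the comparison fair.)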
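
For the second point, the core of the trick on Linux is that 
readdir() already reports each entry's type via d_type on most 
filesystems, so stat() can be deferred. A rough sketch of the 
idea (not the actual code from [1]):

    // Defer stat(): d_type usually comes for free with readdir();
    // only fall back to a stat() call when the filesystem reports
    // DT_UNKNOWN (or when the caller needs sizes / timestamps).
    import core.stdc.string : strlen;
    import core.sys.posix.dirent : dirent, DT_DIR, DT_UNKNOWN;
    import core.sys.posix.sys.stat : S_ISDIR, stat, stat_t;
    import std.string : toStringz;

    bool isDir(string parent, ref dirent entry)
    {
        if (entry.d_type != DT_UNKNOWN)
            return entry.d_type == DT_DIR;  // free with readdir()
        // Fallback: pay for a stat() only when d_type is absent.
        auto name = entry.d_name[0 .. strlen(entry.d_name.ptr)].idup;
        stat_t st;
        return stat((parent ~ "/" ~ name).toStringz, &st) == 0
            && S_ISDIR(st.st_mode);
    }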

My suggestion: if we are going to read the file when it exists, 
don't even stat(), just open it directly. That might result in 
faster total performance.
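
A sketch of what that looks like at the syscall level (assuming 
POSIX; the helper name is made up):

    // Probe for a candidate file by opening it directly, instead
    // of stat()-then-open. A failed open() with errno == ENOENT
    // carries the same information as a failed stat(), while a
    // successful one already yields the descriptor we were going
    // to need, with no TOCTOU window between check and read.
    import core.stdc.errno : ENOENT, errno;
    import core.sys.posix.fcntl : O_RDONLY, open;
    import std.string : toStringz;

    /// Hypothetical helper: returns an open file descriptor, or
    /// -1 when the file does not exist. The caller must close() it.
    int tryOpenModule(string path)
    {
        immutable fd = open(path.toStringz, O_RDONLY);
        if (fd < 0 && errno != ENOENT)
            throw new Exception("unexpected error opening " ~ path);
        return fd;
    }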

I would not recommend tricks like enumerating directories with 
readdir() and caching the results. Caching like that ought to 
happen at the filesystem layer, and doing it ourselves smells of 
problems like TOCTOU and cache invalidation. In any case, I 
would not suggest spending time on it unless someone encounters 
a specific, real-life situation where the gains would justify 
the additional complexity of researching such workarounds.

> So the question is whether many projects are likely to import 
> files over network mounts, which would motivate the 
> optimization. Please share your thoughts, thanks.

Honestly, this sounds like you have a solution in search of a 
problem.

[1]: https://github.com/CyberShadow/ae/blob/25850209e03ee97640a9b0715efe7e25b1fcc62d/sys/file.d#L740

