A few measurements of stat()'s speed

Wed Mar 27 01:32:35 UTC 2019

On 3/26/19 6:04 PM, Vladimir Panteleev wrote:
> On Tuesday, 26 March 2019 at 18:06:08 UTC, Andrei Alexandrescu wrote:
>> On a Linux moderately-loaded local directory (146 files) mounted from 
>> an SSD drive, one failed stat() takes only about 0.5 microseconds. 
>> That means e.g. if a module imports std.all (which fails 142 times), 
>> the overhead accountable to failed stat() calls is about 70 
>> microseconds, i.e. negligible.
> 
> I have some related experience with this:
> 
> - The eternal battle of keeping The Server's load levels down involves 
> a good deal of I/O profiling. The pertinent observation was that opening a 
> file by name can be much faster than enumerating files in a directory. 
> The reason is that many filesystems implement directories using some 
> variant of a hash table, so accessing a file by name is one hash table 
> lookup, while enumerating all files means reading the entire thing.
> 
> - stat() is slow. It fetches a lot of information. Many filesystems do 
> not have all of that information as readily accessible as a file name. 
> This is observable through a simple test: on Ubuntu, drop caches, then, 
> in a big directory, compare the execution time of `ls|cat` vs. `ls`. 
> Explanation: when ls's output is a terminal, it will fetch extra 
> information to colorize objects depending on their properties. These are 
> fetched using stat(), but that's not done when it's piped into a file / 
> another program. I had to take this into account when implementing a 
> fast directory iterator [1] (stat only when necessary). dirEntries from 
> std.file does some of this too, but not to the full extent.
> 
> My suggestion is: if we are going to read the file if it exists, don't 
> even stat(), just open it. That might give better total performance.
> 
> I would not recommend tricks like readdir() and caching. This ought to 
> be done at the filesystem layer, and smells of problems like TOCTOU / 
> cache invalidation. In any case, I would not suggest spending time on it 
> unless someone encounters a specific, real-life situation where the 
> additional complexity would make it worthwhile to research workarounds.

That's solid, thanks very much!

Judging from 
https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d, a 
bunch of "exists" checks are invoked (presumably those call stat()). Then 
a filename is returned, which is used to create a File object, see 
https://github.com/dlang/dmd/blob/master/src/dmd/root/file.d. In turn, 
that calls open() and then fstat() on the opened handle, so the file's 
metadata is fetched a second time. Quite wasteful on the face of it, but 
if the measurable benefit is low, it's not worth optimizing.
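
For concreteness, here is a minimal C sketch of the two call sequences 
discussed above: probing with stat() before opening, versus just calling 
open() and letting it fail. The function names (slurp_fd, load_probe_first, 
load_open_first) are made up for illustration and are not taken from dmd; 
the sketch assumes a POSIX system and skips details such as retrying on 
EINTR.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read a whole regular file from an open descriptor.
   The size comes from fstat() on the descriptor, so no path lookup is
   repeated here. Returns a malloc'd, NUL-terminated buffer or NULL. */
static char *slurp_fd(int fd, size_t *len)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode))
        return NULL;

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);
    if (buf == NULL)
        return NULL;

    size_t got = 0;
    while (got < size) {
        ssize_t n = read(fd, buf + got, size - got);
        if (n < 0) {            /* read error: give up (no EINTR retry) */
            free(buf);
            return NULL;
        }
        if (n == 0)             /* file is shorter than fstat() reported */
            break;
        got += (size_t)n;
    }
    buf[got] = '\0';
    *len = got;
    return buf;
}

/* Sequence as described above: stat() probe, then open(), then fstat().
   That is three system calls touching metadata for a file that exists. */
char *load_probe_first(const char *path, size_t *len)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return NULL;            /* the "exists" check failed */

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;            /* file vanished between stat() and open() */

    char *buf = slurp_fd(fd, len);
    close(fd);
    return buf;
}

/* Suggested alternative: skip the probe and let open() report absence. */
char *load_open_first(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;            /* missing file costs one failed open() */

    char *buf = slurp_fd(fd, len);
    close(fd);
    return buf;
}
```

Both functions return the same contents for an existing file and NULL for 
a missing one. The open-first variant makes one fewer system call per 
successful lookup and has no window between the check and the open; 
whether that is measurable in practice is exactly the open question.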

>> So the question is whether many projects are likely to import files 
>> over network mounts, which would motivate the optimization. Please 
>> share your thoughts, thanks.
> 
> Honestly, this sounds like you have a solution in search of a problem.
> 
> [1]: 
> https://github.com/CyberShadow/ae/blob/25850209e03ee97640a9b0715efe7e25b1fcc62d/sys/file.d#L740 

Agreed. Just looking for low-hanging fruit to pluck.

Andrei