A few measurements of stat()'s speed
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Wed Mar 27 01:32:35 UTC 2019
On 3/26/19 6:04 PM, Vladimir Panteleev wrote:
> On Tuesday, 26 March 2019 at 18:06:08 UTC, Andrei Alexandrescu wrote:
>> In a moderately loaded local Linux directory (146 files) mounted from
>> an SSD, one failed stat() takes only about 0.5 microseconds. That
>> means, e.g., that if a module imports std.all (which fails 142 times),
>> the overhead attributable to failed stat() calls is about 70
>> microseconds, i.e. negligible.
>
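
For anyone who wants to reproduce a figure of this kind, here is a minimal Python sketch that times failed stat() calls. The helper name time_failed_stats and the probe file names are made up for illustration. This is not the harness behind the numbers quoted above, and Python's interpreter overhead will inflate the per-call cost compared with a C or D measurement.

```python
import os
import tempfile
import time


def time_failed_stats(directory, attempts=142):
    """Return (failures, seconds) for `attempts` stat() calls on missing files."""
    failures = 0
    start = time.perf_counter()
    for i in range(attempts):
        # Hypothetical names, assumed not to exist in `directory`.
        path = os.path.join(directory, f"no_such_module_{i}.d")
        try:
            os.stat(path)
        except FileNotFoundError:
            failures += 1
    return failures, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        failures, elapsed = time_failed_stats(d)
        print(f"{failures} failed stat() calls took "
              f"{elapsed * 1e6:.1f} microseconds")
        # At 0.5 microseconds per call, 142 failures cost about 71 microseconds.
```

Running it against a directory on a network mount instead of a temporary directory would be one way to check whether the picture changes there.
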
> I have some related experience with this:
>
> - The eternal battle of keeping The Server's load levels down involves
> a good deal of I/O profiling. The pertinent observation was that
> opening a file by name can be much faster than enumerating the files in
> a directory. The reason is that many filesystems implement directories
> using some variant of a hash table, so accessing a file by name is one
> hash table lookup, while enumerating all files means reading the entire
> thing.
>
> - stat() is slow. It fetches a lot of information. Many filesystems do
> not have all of that information as readily accessible as a file name.
> This is observable through a simple test: on Ubuntu, drop caches, then,
> in a big directory, compare the execution time of `ls|cat` vs. `ls`.
> Explanation: when ls's output is a terminal, it fetches extra
> information to colorize entries depending on their properties. That
> information is fetched using stat(), but this is not done when the
> output is piped into a file or another program. I had to take this into
> account when implementing a fast directory iterator [1] (which calls
> stat() only when necessary). dirEntries from std.file does some of this
> too, but not to the full extent.
>
> My suggestion is: if we are going to read the file if it exists, don't
> even stat(), just open it. That might give better overall performance.
>
> I would not recommend tricks like readdir() and caching. This ought to
> be done at the filesystem layer, and smells of problems like TOCTOU /
> cache invalidation. In any case, I would not suggest spending time on it
> unless someone encounters a specific, real-life situation where the
> additional complexity would make it worthwhile to research workarounds.

That's solid, thanks very much!

Judging from https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d,
a series of "exists" checks is invoked (presumably each of those calls
stat()). Then a filename is returned, which is used to create a File
object, see https://github.com/dlang/dmd/blob/master/src/dmd/root/file.d.
That, in turn, calls open() and then fstat() on the opened handle, so
the file's metadata is queried a second time. Quite wasteful on the face
of it, but if the measurable benefit is low, it's not worth optimizing.

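
To make the comparison concrete, here is a minimal Python sketch of the two lookup strategies over a list of candidate paths: the sequence described above (an existence check, then open(), then fstat()) versus simply trying to open each candidate. The function names and the candidate list are hypothetical and are not taken from dmd's source. os.path.exists, os.open, os.fstat and os.read are the standard Python wrappers for the corresponding calls. The sketch assumes a POSIX system.

```python
import os
import tempfile


def _read_all(fd, size):
    """Read up to `size` bytes from `fd`, looping over short reads."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_exists_then_open(candidates):
    """Check each candidate with exists(), then open() and fstat() the hit."""
    for path in candidates:
        if os.path.exists(path):              # one stat() per candidate
            fd = os.open(path, os.O_RDONLY)
            try:
                size = os.fstat(fd).st_size   # second metadata query
                return path, _read_all(fd, size)
            finally:
                os.close(fd)
    return None, None


def read_open_first(candidates):
    """Skip the existence check: a failed open() is the 'not found' signal."""
    for path in candidates:
        try:
            fd = os.open(path, os.O_RDONLY)
        except FileNotFoundError:
            continue                          # try the next candidate
        try:
            size = os.fstat(fd).st_size       # only metadata query for the hit
            return path, _read_all(fd, size)
        finally:
            os.close(fd)
    return None, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "b", "mod.d")
        os.makedirs(os.path.dirname(target))
        with open(target, "wb") as f:
            f.write(b"module mod;\n")
        candidates = [os.path.join(d, "a", "mod.d"), target]
        # Both strategies find the same file with the same contents.
        assert read_exists_then_open(candidates) == read_open_first(candidates)
```

The second variant saves one call per successful lookup and has no gap between the check and the open. Whether that saving is measurable in practice is exactly what remains to be shown.
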
>> So the question is whether many projects are likely to import files
>> over network mounts, which would motivate the optimization. Please
>> share your thoughts, thanks.
>
> Honestly, this sounds like you have a solution in search of a problem.
>
> [1]:
> https://github.com/CyberShadow/ae/blob/25850209e03ee97640a9b0715efe7e25b1fcc62d/sys/file.d#L740

Agreed. Just looking for low-hanging fruit to pluck.

Andrei
More information about the Digitalmars-d
mailing list