Speeding up importing Phobos files

Mon Jan 21 05:33:25 UTC 2019

On Saturday, 19 January 2019 at 16:30:39 UTC, Andrei Alexandrescu 
wrote:
> On 1/19/19 4:12 AM, FeepingCreature wrote:
>> On Saturday, 19 January 2019 at 09:08:00 UTC, Walter Bright 
>> wrote:
>>> On 1/19/2019 1:00 AM, Temtaime wrote:
>>>> C'mon, everyone has a SSD, OS tends to cache previously 
>>>> opened files. What's the problem ?
>>>> Better speedup compilation speed.
>>>
>>> You'd think that'd be true, but it isn't. File reads are 
>>> fast, but file lookups are slow. Searching for a file along a 
>>> path is particularly slow.
>> 
>> If you've benchmarked this, could you please post your 
>> benchmark source so people can reproduce it? Probably be good 
>> to gather data from more than one PC. Maybe make a minisurvey 
>> for the results.
>
> I've done a bunch of measurements while I was working on 
> https://github.com/dlang/DIPs/blob/master/DIPs/DIP1005.md, on a 
> modern machine with SSD and Linux (which aggressively caches 
> file contents). I don't think I still have the code, but it 
> shouldn't be difficult to sit down and produce some. The 
> overall conclusion of those experiments was that if you want to 
> improve compilation speed, you need to minimize the number of 
> files opened; once opened, whether it was 1 KB or 100 KB made 
> virtually no difference.
>
> One thing I didn't measure was whether opening the file was 
> most overhead, or closing also had a large share.

I deal with large compressed files. For large data, lz4 would 
probably be a better choice than zip these days. And even with 
cached lookups of directory entries, I think a single file that 
is read sequentially will always be an improvement. Note also 
that compressed files may even be faster to read than 
uncompressed ones on some system configurations (relatively slow 
disk I/O, many processors).
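
Since a reproducible benchmark was asked for upthread, here is a 
rough sketch of how one might compare the two access patterns. 
It is not the code used for the DIP1005 measurements, and the 
file names, file count and file size are arbitrary assumptions. 
It writes a set of small files plus one combined file holding 
the same bytes, then times reading each. Everything will be in 
the OS cache right after being written, so this mostly measures 
per-file open/close overhead rather than disk speed.

```python
import os
import tempfile
import time


def make_files(root, count, size):
    """Create `count` small files and one combined file with the same bytes."""
    payload = b"x" * size
    paths = []
    for i in range(count):
        # Hypothetical module-like names, chosen only for illustration.
        path = os.path.join(root, f"mod{i}.d")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    combined = os.path.join(root, "combined.bin")
    with open(combined, "wb") as f:
        for _ in range(count):
            f.write(payload)
    return paths, combined


def read_many(paths):
    """Open, read and close every small file; return total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(f.read())
    return total


def read_one(path):
    """Read the single combined file in one go; return bytes read."""
    with open(path, "rb") as f:
        return len(f.read())


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # 2000 files of 4 KiB each: assumed numbers, adjust to taste.
        paths, combined = make_files(root, count=2000, size=4096)
        n_many, t_many = timed(read_many, paths)
        n_one, t_one = timed(read_one, combined)
        print(f"{len(paths)} small files: {n_many} bytes in {t_many:.4f}s")
        print(f"1 combined file: {n_one} bytes in {t_one:.4f}s")
```

Results will vary by OS and filesystem, so it would be worth 
running this on several machines before drawing conclusions. It 
also does not model searching for a file along an import path, 
which is the lookup cost Walter mentioned.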

Another thing to look at is indexed compressed files, for example 
http://www.htslib.org/doc/tabix.html. Using those, we could 
partition Phobos into sensible subsections, in particular 
sectioning off the submodules people hardly ever use.
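
To illustrate the general idea of an indexed compressed file, 
here is a minimal sketch. It is not the tabix format and not a 
proposal for how dmd should do it: it simply compresses each 
section independently with zlib and keeps a small index of 
(offset, length) pairs, so one section can be decompressed 
without touching the others. The function names and the section 
names in the usage example are made up for illustration.

```python
import zlib


def build_archive(sections):
    """Compress each section independently; return (blob, index).

    `index` maps a section name to (offset, compressed_length) in `blob`.
    """
    blob = bytearray()
    index = {}
    for name, data in sections.items():
        packed = zlib.compress(data)
        index[name] = (len(blob), len(packed))
        blob += packed
    return bytes(blob), index


def read_section(blob, index, name):
    """Decompress only the requested section, leaving the rest untouched."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


if __name__ == "__main__":
    # Hypothetical split: commonly imported modules vs. rarely used ones.
    sections = {
        "common": b"module std.stdio; // ...\n" * 100,
        "rarely_used": b"module std.something_rare; // ...\n" * 100,
    }
    blob, index = build_archive(sections)
    text = read_section(blob, index, "common")
    print(len(blob), "compressed bytes;", len(text), "bytes in 'common'")
```

With a layout like this, a compiler would open one file, consult 
the index, and decompress only the sections a build actually 
imports.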