A file reading benchmark

Thu Feb 23 12:11:39 PST 2012

On 2/17/12 7:44 PM, bearophile wrote:
> A tiny little file lines reading benchmark I've just found on Reddit:
> http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/
>
> http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html
>
> The Ruby code that generates slowly the test data:
> https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
> But for my timings I have used only about a 40% of that file, the first 1_965_800 lines, because I have less memory.
>
> My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC).
>
> From many other benchmarks I've seen that file reading line-by-line is slow in D.
>
> -------------------------
> My D code:
[snip]

The thread in D.announce prompted me to go back to this, and I've run a 
simple test that isolates file reads from everything else. After 
generating the CSV data as described above, I ran this Python code:

import sys

rows = []
# Only the line reads matter here; the length test below never passes.
with open(sys.argv[1]) as f:
    for line in f:
        if len(line) > 10000:
            rows.append(line[:-1].split("\t"))

and this D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
     Appender!(string[][]) rows;
     auto f = File(args[1]);
     foreach (line; f.byLine()) {
         if (line.length > 10000) rows.put(line.idup.split("\t"));
     }
}

Both programs end up appending nothing because 10000 is larger than any 
line length.
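
For anyone repeating the measurement, here is a minimal sketch of timing 
the Python loop in-process. The function name time_line_reads is 
hypothetical, and this is not necessarily how the numbers below were 
obtained.

```python
import sys
import time


def time_line_reads(path, min_len=10000):
    """Return (seconds, lines_seen) for one pass over the file at `path`."""
    rows = []
    lines_seen = 0
    start = time.perf_counter()
    with open(path) as f:
        for line in f:
            lines_seen += 1
            # Same filter as the benchmark; it should never pass for this data.
            if len(line) > min_len:
                rows.append(line[:-1].split("\t"))
    return time.perf_counter() - start, lines_seen


if __name__ == "__main__" and len(sys.argv) > 1:
    seconds, count = time_line_reads(sys.argv[1])
    print(f"{count} lines in {seconds:.2f} s")
```

Timing inside the process leaves out interpreter startup, so the result 
may come out slightly lower than a wall-clock measurement of the whole 
program.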

On my machine (Mac OS X Lion), the Python code clocks in at around 1.2 
seconds and the D code at a whopping 9.3 seconds. I looked for where the 
problem lies, and sure enough the issue was a slow loop in the generic 
I/O implementation of readln. The commit 
https://github.com/D-Programming-Language/phobos/commit/94b21d38d16e075d7c44b53015eb1113854424d0 
brings the time for the test down to 2.1 seconds. We could and should 
reduce that further by taking buffering into our own hands, but for now 
this is a good piece of low-hanging fruit to pick.
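
To illustrate what doing the buffering ourselves could look like, here 
is a rough sketch in Python. It is not the Phobos code: the helper name 
iter_lines_chunked and the 64 KiB chunk size are assumptions made for 
the example. The idea is to read large binary blocks and split them into 
lines by hand, carrying any partial line over to the next block.

```python
import io


def iter_lines_chunked(stream, chunk_size=64 * 1024):
    """Yield lines (without the trailing newline) from a binary stream.

    Hypothetical helper: reads large blocks and splits them by hand
    instead of asking the I/O layer for one line at a time.
    """
    pending = b""  # partial line carried over between chunks
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces = (pending + chunk).split(b"\n")
        pending = pieces.pop()  # the last piece may be an incomplete line
        yield from pieces
    if pending:
        yield pending  # final line that had no terminating newline


if __name__ == "__main__":
    demo = io.BytesIO(b"a\tb\nc\td\nlast")
    print(list(iter_lines_chunked(demo, chunk_size=4)))
```

Whether this beats the library's own line reader depends on the 
implementation and the data, so it would have to be measured rather than 
assumed.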

Andrei