Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Feb 23 12:11:39 PST 2012

On 2/17/12 7:44 PM, bearophile wrote:
> A tiny little file lines reading benchmark I've just found on Reddit:
> http://www.reddit.com/r/programming/comments/pub98/a_benchmark_for_reading_flat_files_into_memory/
> http://steve.80cols.com/reading_flat_files_into_memory_benchmark.html
> The Ruby code that generates slowly the test data:
> https://raw.github.com/lorca/flat_file_benchmark/master/gen_data.rb
> But for my timings I have used only about a 40% of that file, the first 1_965_800 lines, because I have less memory.
> My Python-Psyco version runs in 2.46 seconds, the D version in 4.65 seconds (the D version runs in 13.20 seconds if I don't disable the GC).
>  From many other benchmarks I've seen that file reading line-by-line is slow in D.
> -------------------------
> My D code:

The thread in D.announce prompted me to go back to this, and I've run a 
simple test that isolates file reads from everything else. After 
generating the CSV data as described above, I ran this Python code:

import sys

rows = []
f = open(sys.argv[1])
for line in f:
    if len(line) > 10000: rows.append(line[:-1].split("\t"))

and this D code:

import std.stdio, std.string, std.array;

void main(in string[] args) {
    Appender!(string[][]) rows;
    auto f = File(args[1]);
    foreach (line; f.byLine()) {
        // byLine reuses its buffer, so idup before keeping the line
        if (line.length > 10000) rows.put(line.idup.split("\t"));
    }
}

Both programs end up appending nothing because 10000 is larger than any 
line length.
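
To make the effect of the threshold concrete, here is a minimal Python
sketch (not part of the benchmark above) that writes a small
tab-separated file and runs the same loop over it. The helper names
make_sample and read_rows, the temporary file, and the row sizes are
assumptions for illustration only; the real test uses the generated CSV
data.

```python
import os
import tempfile
import time


def make_sample(path, n_lines=1000, n_fields=10):
    """Write n_lines tab-separated lines to path (hypothetical stand-in data)."""
    with open(path, "w") as f:
        for i in range(n_lines):
            f.write("\t".join(str(i + j) for j in range(n_fields)) + "\n")


def read_rows(path, min_len=10000):
    """Read path line by line; split only lines longer than min_len."""
    rows = []
    with open(path) as f:
        for line in f:
            if len(line) > min_len:
                rows.append(line[:-1].split("\t"))
    return rows


def main():
    fd, path = tempfile.mkstemp(suffix=".tsv")
    os.close(fd)
    try:
        make_sample(path)
        start = time.perf_counter()
        rows = read_rows(path)
        elapsed = time.perf_counter() - start
        # No sample line comes close to 10000 characters, so nothing is kept
        print("rows kept:", len(rows), "seconds: %.4f" % elapsed)
    finally:
        os.remove(path)


if __name__ == "__main__":
    main()
```

Because the length check fails for every line, the split and append
never run, so the measured time is dominated by the line-by-line read
itself.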

On my machine (Mac OS X Lion), the Python code clocks in at around 1.2 
seconds and the D code at a whopping 9.3 seconds. I looked into where 
the problem lies, and sure enough the issue was a slow loop in the 
generic I/O implementation of readln. The commit 
brings the time for the test down to 2.1 seconds. We could and should 
reduce that further by taking buffering into our own hands, but for now 
this is good low-hanging fruit to pick.
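
For readers wondering what "taking buffering into our own hands" could
look like in general, here is a rough Python sketch of the idea: read
large fixed-size chunks and split them into lines by hand, carrying any
incomplete tail over to the next chunk. This is only an illustration of
the technique, not how readln or the commit mentioned above works; the
function name lines_from_chunks and the 64 KB chunk size are assumptions.

```python
import io


def lines_from_chunks(stream, chunk_size=64 * 1024):
    """Yield lines (without the trailing newline) from a binary stream,
    reading chunk_size bytes at a time and splitting by hand."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        parts = pending.split(b"\n")
        # The last piece may be an incomplete line; hold it for the next chunk
        pending = parts.pop()
        for part in parts:
            yield part
    if pending:
        # Final line with no terminating newline
        yield pending


if __name__ == "__main__":
    # Tiny chunk size so that lines straddle chunk boundaries
    data = io.BytesIO(b"a\tb\nc\td\n")
    for line in lines_from_chunks(data, chunk_size=3):
        print(line.split(b"\t"))
```

The point of the design is that the underlying read is called once per
chunk rather than once per character or per line. A line much longer
than the chunk size makes the repeated concatenation costly, so a
serious version would want to handle that case more carefully.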


