tolf and detab

bearophile bearophileHUGS at lycos.com
Sun Aug 8 05:44:56 PDT 2010


Andrei Alexandrescu:
> This makes me think we should have a range that detects and replaces 
> patterns lazily and on the fly.

In Python there is a helper module:
http://docs.python.org/library/fileinput.html
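
Something like this little D2 sketch shows the idea (just an illustration, not tested code; it uses std.algorithm.map and std.string.detab): each line gets its tabs expanded only while the range is iterated, so the replacement happens lazily and on the fly.

--------------------------

// D2 sketch: lazy per-line tab expansion, done only during iteration.
import std.stdio: File, writeln;
import std.algorithm: map;
import std.string: detab;

void main(string[] args) {
    auto file = File(args[1]);
    // Nothing is processed until the foreach consumes the range.
    auto detabbed = file.byLine().map!(line => line.idup.detab(4));
    foreach (line; detabbed)
        writeln(line);
}

--------------------------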


> I think it's worth targeting D2 to tasks that are usually handled by
> scripting languages. I've done a lot of that and it beats the hell out
> of rewriting in D a script that's grown out of control

Dynamic languages are handy, but they require some rigour when you program. Python is probably unfit for writing programs one million lines long, but if you train yourself a little and keep your code clean, you can usually write clean largish programs in Python.


> That would be great so we can tune our approach. Thanks!

In my dlibs I have the xio module, which reads files by lines efficiently; it was faster than iterating on the lines of a BufferedFile.
There are tons of different benchmarks you may use, but it's better to start with a simple one that just iterates over the lines of a file. See below.

Related: experiments have shown that the (oldish) Java GC improves its performance if it keeps strings (which are immutable) in a separate memory pool _and_ is able to recognize duplicated strings, of course keeping only one string for each equality set. It would be worthwhile to run a similar experiment with the D GC, but first you need applications that use the GC, to test whether this idea is an improvement :-)
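
A user-level sketch of that "one string for each equality set" idea (just an illustration with an associative array used as a pool, not something the GC does for you):

--------------------------

// D2 sketch: manual string interning through an associative array.
string[string] pool;

string intern(string s) {
    if (auto p = s in pool)
        return *p;   // reuse the copy already stored in the pool
    pool[s] = s;     // first time this text is seen: remember it
    return s;
}

--------------------------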


So I have used a minimal benchmark:

--------------------------

# Python 2.7 code
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])

--------------------------

// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;

    auto file = File(fileName);
    foreach (rawLine; file.byLine()) {
        // byLine() reuses its buffer, so copy to a new immutable string.
        string line = rawLine.idup;
        total += line.length;
    }
    file.close();

    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

--------------------------

In the D code I have added an idup to make the comparison fairer, because in the Python code "line" is a truly newly allocated string, so you can safely use it, for example, as a dictionary key.
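
A small sketch (not part of the benchmark) of why the idup matters: byLine() reuses its internal buffer, so the slice has to be copied before it can be stored, for example as an associative array key.

--------------------------

// D2 sketch: counting duplicated lines. The rawLine buffer is recycled by
// byLine(), so an immutable copy (idup) is needed to use it as an AA key.
import std.stdio: File;

size_t[string] countLines(string fileName) {
    size_t[string] counts;
    auto file = File(fileName);
    foreach (rawLine; file.byLine())
        counts[rawLine.idup]++;
    return counts;
}

--------------------------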

I have used Python 2.7 without the Psyco JIT (http://psyco.sourceforge.net/ ) to speed up the Python code, because it's not yet available for Python 2.7. The D code was compiled with dmd 2.047, optimized build.

As test text data I have used a concatenation of all text files here (they are copyrighted, but freely usable):
http://gnosis.cx/TPiP/

The result on Windows is a file of 1_116_552 bytes.

I have appended the file to itself several times, doubling its length each time; the result is a file of 71_459_328 bytes (this is not a fully realistic case, because you often have many small files to read instead of one very large file).
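
The test file can be built with a few lines of D, for example (a sketch that uses std.file.read and std.file.append; this is just one possible way to do it):

--------------------------

// D2 sketch: double the size of a file n times by appending it to itself.
import std.file: read, append;

void doubleFile(string name, int times) {
    foreach (i; 0 .. times)
        append(name, read(name));   // read the whole file, then append the copy
}

--------------------------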

The timings are taken with a warm disk cache, so the file is essentially read from RAM. This is not fully realistic, but if you want to write a benchmark you have to do this, because on Windows it's very hard for me to make sure the disk cache is fully empty. So it's better to do the opposite and benchmark a warm file.


The output of the Python code is:
Total: 69789888
Run time: 0.88 seconds (best of 6, the variance is minimal).


The output of the D code is:
Total: 69789888
Run time: 1.28 seconds (best of 6, minimal variance).


If in the D2 code I comment out the idup like this:

    foreach (rawLine; file.byLine()) {
        total += rawLine.length;
    }

The output of the D code without idup is:
Total: 69789888
Run time: 0.75 seconds (best of 6, minimal variance).

As you can see, it's a matter of GC efficiency too.

Besides the GC, the higher performance of the Python code comes from a tuned design; you can see the function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup
It also uses a "stack buffer" (char buf[MAXBUFSIZE];, where MAXBUFSIZE is 300).
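
Roughly the same technique in D looks like this (a rough sketch, not what Phobos does; lines longer than the buffer simply arrive in more than one chunk, which is fine when you only sum their lengths):

--------------------------

// D2 sketch: line reading through fgets with a 300-byte stack buffer.
import core.stdc.stdio: FILE, fopen, fgets, fclose;
import core.stdc.string: strlen;

size_t sumLineLengths(const(char)* fileName) {
    enum MAXBUFSIZE = 300;
    char[MAXBUFSIZE] buf;   // stack buffer, like the one in CPython
    size_t total = 0;
    FILE* f = fopen(fileName, "r");
    if (f is null)
        return 0;
    while (fgets(buf.ptr, MAXBUFSIZE, f) !is null)
        total += strlen(buf.ptr);   // partial chunks of long lines add up too
    fclose(f);
    return total;
}

--------------------------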

Bye,
bearophile

