howto count lines - fast

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue May 30 16:41:01 PDT 2017


On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
> After reading
> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , I
> was wondering how fast one can do a simple "wc -l" in D.
> 
> So I made a couple of short implementations and found myself
> confronted with slow results compared to "/usr/bin/wc -l".
> 
> What would a fast implementation look like in D?

This little challenge piqued my interest.  So I decided to take a shot
at seeing if I could beat my system's /usr/bin/wc -l.

First order of business: when it comes to performance, always choose
the right compiler for the job.  While dmd is the reference compiler and
has the latest and greatest features, it doesn't do too well in the
performance department.  So the first thing I did was to use gdc
instead. Specifically, gdc -O3, to get the most aggressive optimizations
the gcc backend can give me.
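
Assuming the source below is saved as linecount.d, the invocation looks
something like this (exact flags may vary between gdc versions):

	gdc -O3 -o linecount linecount.d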

Second order of business: the naïve algorithm of reading 1 character at
a time from the file is obviously suboptimal, so I didn't even consider
it.

Third order of business: I constructed a large text file containing
10634420 lines by concatenating 5 copies of a large CSV file I obtained
online. The sample had to be large enough that overhead like program
startup time and background system noise doesn't bias the test results
too much.
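
For instance, with data.csv standing in for whatever large file is at
hand:

	cat data.csv data.csv data.csv data.csv data.csv > input.txt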

With this setup, I measured the performance of my system's wc -l as the
baseline for comparing the performance of the D solutions. Here's a
typical output of my shell's `time wc -l` command:

	real    0m0.344s
	user    0m0.153s
	sys     0m0.191s

Since we're talking about beating wc, and others have already posted
the obvious solutions of loading the file into a buffer and scanning
it, the natural next step up is to use std.mmfile to memory-map the
file so that we don't waste effort managing file buffers ourselves: let
the OS do it for us.

So here are the algorithms I tried:

1) lineCount1(): use std.mmfile to memory-map the file so that we can
scan it as a single, contiguous array without any fussy
buffer-management details. Result: pretty fast, but still loses to wc.
Typical measurements:

	real    0m0.422s
	user    0m0.366s
	sys     0m0.054s

2) lineCount2(): just as a comparison, I wrote a read-into-buffer
algorithm to serve as a reference point for how that approach performs.
As expected, it's worse than the std.mmfile solution. Typical
measurements:

	real    0m0.519s
	user    0m0.320s
	sys     0m0.198s

3) lineCount3(): In lineCount1(), I used std.algorithm.searching.count
for counting newlines; so I thought, what if the range abstraction is
introducing too much overhead?  So I decided to write a simple foreach
loop instead, in the hope that the simpler code would let the gcc
optimizer do a better job. Sadly, the result is pretty much the same as
lineCount1:

	real    0m0.421s
	user    0m0.379s
	sys     0m0.042s

(The good news, though, is that this shows that using std.algorithm
does not introduce excessive overhead compared to a hand-written loop.
It's a nice demonstration that the high-level range abstraction does
not necessarily equate to slower performance.)

4) lineCount4(): Having failed to beat wc thus far, I decided it was
time to bring out the big guns.  Since we're essentially counting
newlines in the input, who says we have to process the data
sequentially from start to end?  Counting from front to back or from
back to front gives the same answer, as does counting from the middle
to the end, then from the front to the middle. In particular, counting
from the middle to the end *in parallel* with counting from the front
to the middle, then summing the two subtotals, gives the same answer.
I.e., it's time to bust out std.parallelism. :-)  So, in my 4th
algorithm, I split the input into 16KB blocks, scan them for newlines
in parallel to produce file_size/16384 subtotals, and sum the subtotals
at the end. Here's the result, tested on my 6-core AMD Phenom CPU:

	real    0m0.242s
	user    0m1.151s
	sys     0m0.057s

Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair
margin, too.  Eat your heart out, wc!!!  (The large user time is because
we're using all 6 cores at once. But the actual elapsed time is
shorter.)

Here's the code for all of the above:

---------snip---------
import std.stdio;

// real    0m0.422s
// user    0m0.366s
// sys     0m0.054s
// Algorithm 1: mmap the file and let std.algorithm count the newlines
// over the whole buffer in one go.
size_t lineCount1(string filename)
{
    import std.algorithm.searching : count;
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    return data.count('\n');
}

// real    0m0.519s
// user    0m0.320s
// sys     0m0.198s
// Algorithm 2: conventional read-into-a-buffer loop, for comparison.
size_t lineCount2(string filename)
{
    import std.algorithm.searching : count;

    auto f = File(filename);
    ubyte[] buf;
    size_t c = 0;

    buf.length = 32768;

    // Read the file in 32KB chunks and count the newlines in each one.
    do
    {
        auto d = f.rawRead(buf);
        c += d.count('\n');
    } while (!f.eof());

    return c;
}

// real    0m0.421s
// user    0m0.379s
// sys     0m0.042s
// Algorithm 3: mmap the file, but count newlines with a hand-written
// loop instead of std.algorithm.
size_t lineCount3(string filename)
{
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t c;

    foreach (i; 0 .. data.length)
    {
        if (data[i] == '\n') c++;
    }
    return c;
}

// real    0m0.242s
// user    0m1.151s
// sys     0m0.057s
// Algorithm 4: mmap the file, then count newlines in 16KB blocks in
// parallel and sum the per-block subtotals at the end.
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;
    import std.range : chunks;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;

    // Ceiling division: one subtotal slot per blockSize-sized block.
    subtotals.length = 1 + (data.length - 1) / blockSize;

    // Scan the blocks in parallel; each task writes only to its own
    // subtotal slot, so no synchronization is needed.
    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') subtotal++;
        }
    }
    return subtotals.sum;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("Please specify target file.");
        return 1;
    }

    auto filename = args[1];
    //auto count = lineCount1(filename);
    //auto count = lineCount2(filename);
    //auto count = lineCount3(filename);
    auto count = lineCount4(filename);
    writeln(count);
    return 0;
}

// vim:set sw=4 ts=4 et ai:
---------snip---------
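
To try it out, compile as above and compare the output against wc -l
(with input.txt standing in for your test file):

	$ ./linecount input.txt
	10634420
	$ wc -l input.txt
	10634420 input.txt

For completeness, here's a sketch of a fifth variant (not measured
above) that expresses the same parallel-count idea with
std.parallelism's taskPool.reduce, letting the task pool handle the
summing; the 16KB block size is carried over from lineCount4:

---------snip---------
// A sketch only: fold per-block newline counts across the worker pool.
size_t lineCount5(string filename)
{
    import std.algorithm.iteration : map;
    import std.algorithm.searching : count;
    import std.mmfile;
    import std.parallelism : taskPool;
    import std.range : chunks;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    if (data.length == 0) return 0;

    // Count newlines in each 16KB block, then sum the per-block counts
    // in parallel. chunks() over an array is a random-access range with
    // length, which is what taskPool.reduce requires.
    return taskPool.reduce!"a + b"(
        data.chunks(16384).map!(chunk => chunk.count('\n')));
}
---------snip---------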


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL

