D vs C++

Fri Dec 24 22:21:11 PST 2010

Caligo:

> I'm going to ignore the C version because it's ugly and uses a hash.

Some of the others too use a hash. You can write nice looking code in C too, but you need more skills :-)

> I'm also going to ignore the fastest C++ version because it uses a digital trie
> (it's very fast but extremely memory hungry; the complexity is constant over
> the size of the input and linear over the length of the word being searched
> for).

The fastest C++ version uses more memory, but sometimes if you need more performance it may become the right choice.

> I just wanted to focus on the language and the std library and not
> have to implement a data structure.

One of the few advantages of D over Python is that in D you are able to implement efficient and custom data structures without leaving the D language itself :-)

> For D I pretty much used the example from TDPL.  As far as I can tell, the
> associate array used is closer to std::map (or maybe std::unordered_map ?)

D built-in AAs are a hash map, but they use comparisons to resolve collisions. This makes D AAs strong against malicious attacks. Python dicts are faster but they are a pure hash map.

> than std::unordered_set, but I don't know of any other data structures in D
> for this (I'm still learning).

A unordered_set is not present in stc.collections yet.

> Here are the measurements (average of 3 runs):

Your timings lack information about the CPU, compilation switches used, and C++ compiler version used.
Are those really averages?

> I'm interested to see a better D version than the one I posted.

If you want to use only the built-ins and std lib I think you can't improve your code a lot. To go faster you need to go lower level.

Regarding your code, break and continue statements are not Structured Programming, so it's better to avoid them when possible. I write your code like this:

import std.stdio, std.string;

void main() {
    size_t[string] dictionary;

    foreach (line; stdin.byLine())
        foreach (word; line.strip().splitter())
            if (word !in dictionary)
                dictionary[word.idup] = 1;

    writeln("Words: ", dictionary.length);
}

This Python2 version is as fast as the D-DMD version with a 8.7 MB file that contains about 120_000 words:

from sys import stdin
import psyco

def main():
    dictionary = {}
    for line in stdin:
        for word in line.split():
            if word not in dictionary:
                dictionary[word] = 1
    print "Words:", len(dictionary)

psyco.bind(main)
main()

Bye,
bearophile