D vs C++

Iain Buclaw ibuclaw at ubuntu.com
Fri Dec 24 16:49:15 PST 2010


== Quote from Caligo (iteronvexor at gmail.com)'s article
> This is the page that would require your attention:
> http://unthought.net/c++/c_vs_c++.html
> I'm going to ignore the C version because it's ugly and uses a hand-rolled
> hash table.  I'm also going to ignore the fastest C++ version because it uses
> a digital trie (it's very fast but extremely memory hungry; lookup time is
> constant with respect to the size of the input and linear in the length of
> the word being searched for).  I just wanted to focus on the language and the
> standard library and not have to implement a data structure myself.
> Here is the C++ code:
> #include <unordered_set>
> #include <string>
> #include <iostream>
> #include <stdio.h>
> int main(int argc, char* argv[]){
>   using namespace std;
>   char buf[8192];
>   string word;
>   unordered_set<string> wordcount;
>   while( scanf("%s", buf) != EOF ) wordcount.insert(buf);
>   cout << "Words: " << wordcount.size() << endl;
>   return 0;
> }
> For D I pretty much used the example from TDPL.  As far as I can tell, the
> associative array used here is closer to std::map (or maybe std::unordered_map?)
> than std::unordered_set, but I don't know of any other data structure in D
> for this (I'm still learning).
> Here is the D code:
> import std.stdio;
> import std.string;
> void main(){
>   size_t[string] dictionary;
>   foreach(line; stdin.byLine()){
>     foreach(word; splitter(strip(line))){
>       if(word in dictionary) continue;
>       dictionary[word.idup] = 1;
>     }
>   }
>   writeln("Words: ", dictionary.length);
> }
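
The built-in associative array is a hash table, so it is effectively playing
the role of std::unordered_map here; as far as I know Phobos doesn't have a
dedicated hash-set container.  If you want the code to read more like a set,
you can just use a throwaway value type.  A rough, untested sketch along the
same lines as your version (with a current Phobos you may also need
std.algorithm for splitter):

import std.stdio;
import std.algorithm : splitter;
import std.string : strip;

void main()
{
    // The AA is used purely as a set; the bool values carry no information.
    bool[string] seen;
    foreach (line; stdin.byLine())
    {
        foreach (word; splitter(strip(line)))
        {
            // byLine() reuses its internal buffer, so the key must be
            // duplicated before it is stored.
            if (word !in seen)
                seen[word.idup] = true;
        }
    }
    writeln("Words: ", seen.length);
}
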
> Here are the measurements (average of 3 runs):
> C++
> ===
> Data size: 990K with 23K unique words
> real    0m0.055s
> user   0m0.046s
> sys     0m0.000s
> Data size: 9.7M with 23K unique words
> real    0m0.492s
> user   0m0.470s
> sys     0m0.013s
> Data size: 5.1M with 65K unique words
> real    0m0.298s
> user   0m0.277s
> sys     0m0.013s
> Data size: 51M with 65K unique words
> real    0m2.589s
> user   0m2.533s
> sys     0m0.070s
> DMD D 2.051
> ===
> Data size: 990K with 23K unique words
> real    0m0.064s
> user   0m0.053s
> sys     0m0.006s
> Data size: 9.7M with 23K unique words
> real    0m0.513s
> user   0m0.487s
> sys     0m0.013s
> Data size: 5.1M with 65K unique words
> real    0m0.305s
> user   0m0.287s
> sys     0m0.007s
> Data size: 51M with 65K unique words
> real    0m2.683s
> user   0m2.590s
> sys     0m0.103s
> GDC D 2.051
> ===
> Data size: 990K with 23K unique words
> real    0m0.146s
> user   0m0.140s
> sys     0m0.000s
> Data size: 9.7M with 23K unique words
> Segmentation fault
> Data size: 5.1M with 65K unique words
> Segmentation fault
> Data size: 51M with 65K unique words
> Segmentation fault
> GDC fails for some reason with a large number of unique words and/or a large
> data set.  Also, GDC doesn't always give correct results; the word count is
> usually off by a few hundred.
> D and C++ are very close.  Without scanf() (reading via iostreams instead) the
> C++ version is almost twice as slow.  Also, using std::unordered_set instead
> of std::set almost doubles the performance.
> I'm interested to see a better D version than the one I posted.
> P.S.
> No flame wars please.
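
Out of curiosity, by "without scanf()" I take it you mean reading with the
stream extractor.  Just so we're comparing the same thing, something like this
(a guess at what was measured, not necessarily your exact code):

#include <unordered_set>
#include <string>
#include <iostream>

int main()
{
    using namespace std;
    // Same counting logic as before, but reading whitespace-delimited
    // words through cin instead of scanf() into a fixed buffer.
    string word;
    unordered_set<string> wordcount;
    while (cin >> word)
        wordcount.insert(word);
    cout << "Words: " << wordcount.size() << endl;
    return 0;
}

If that is the variant, a good chunk of the gap usually comes from iostreams
staying synchronised with stdio; calling ios_base::sync_with_stdio(false) at
the top of main often narrows it considerably.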

System details, compiler flags, and the test data you used would be helpful.
Otherwise I can't be sure what you mean by "doesn't always give correct
results". :~)

