Here is the page in question: http://unthought.net/c++/c_vs_c++.html

I'm going to ignore the C version because it's ugly and rolls its own hash table, and I'm also going to ignore the fastest C++ version because it uses a digital trie (very fast but extremely memory hungry; the lookup cost is independent of the size of the input and linear in the length of the word being searched for). I just wanted to focus on the language and the standard library, not on implementing a data structure.
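(For context, the digital-trie idea looks roughly like this -- an untested sketch of mine, not the code from that page. Each node carries a full 256-way fanout, so a lookup touches one node per character no matter how much data is stored, and the memory cost is correspondingly brutal:)

class TrieNode
{
    TrieNode[256] next;   // one slot per possible byte value
    bool terminal;        // true if a word ends at this node
}

// Returns true if the word was not already present.
bool insertWord(TrieNode root, const(char)[] word)
{
    auto node = root;
    foreach (c; word)
    {
        if (node.next[c] is null)
            node.next[c] = new TrieNode;   // allocate nodes lazily
        node = node.next[c];
    }
    auto added = !node.terminal;
    node.terminal = true;
    return added;
}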
Here is the C++ code:

#include <unordered_set>
#include <string>
#include <iostream>
#include <stdio.h>

int main(int argc, char* argv[])
{
    using namespace std;

    char buf[8192];
    unordered_set<string> wordcount;
    // Read whitespace-delimited words until EOF.  (Note: a bare "%s"
    // can overflow buf on pathological input; "%8191s" would be safer.)
    while (scanf("%s", buf) != EOF)
        wordcount.insert(buf);
    cout << "Words: " << wordcount.size() << endl;

    return 0;
}

For D I pretty much used the example from TDPL. As far as I can tell, the built-in associative array is closer to std::map (or maybe std::unordered_map?) than to std::unordered_set, but I don't know of any other data structure in D for this (I'm still learning).
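As an aside, Phobos also has std.container.RedBlackTree, an ordered set that is the closer analogue of std::set -- though I'm not sure it was already in 2.051, so treat this as an untested sketch:

import std.stdio;
import std.string : strip;
import std.algorithm : splitter;
import std.container : RedBlackTree;

void main()
{
    // RedBlackTree rejects duplicates by default, so inserting every
    // word leaves exactly the set of unique words.
    auto words = new RedBlackTree!string;
    foreach (line; stdin.byLine())
        foreach (word; splitter(strip(line)))
            words.insert(word.idup);
    writeln("Words: ", words.length);
}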
Here is the D code:

import std.stdio;
import std.string;
import std.algorithm;   // for splitter (older Phobos pulled it in implicitly)

void main()
{
    size_t[string] dictionary;
    foreach (line; stdin.byLine())
    {
        foreach (word; splitter(strip(line)))
        {
            // byLine() reuses its buffer, so a word must be duplicated
            // (idup) before it can be stored as an immutable key.
            if (word in dictionary) continue;
            dictionary[word.idup] = 1;
        }
    }
    writeln("Words: ", dictionary.length);
}

Here are the measurements (average of 3 runs):

C++
===
Data size: 990K with 23K unique words
real 0m0.055s
user 0m0.046s
sys 0m0.000s

Data size: 9.7M with 23K unique words
real 0m0.492s
user 0m0.470s
sys 0m0.013s

Data size: 5.1M with 65K unique words
real 0m0.298s
user 0m0.277s
sys 0m0.013s

Data size: 51M with 65K unique words
real 0m2.589s
user 0m2.533s
sys 0m0.070s


DMD D 2.051
===
Data size: 990K with 23K unique words
real 0m0.064s
user 0m0.053s
sys 0m0.006s

Data size: 9.7M with 23K unique words
real 0m0.513s
user 0m0.487s
sys 0m0.013s

Data size: 5.1M with 65K unique words
real 0m0.305s
user 0m0.287s
sys 0m0.007s

Data size: 51M with 65K unique words
real 0m2.683s
user 0m2.590s
sys 0m0.103s

GDC D 2.051
===
Data size: 990K with 23K unique words
real 0m0.146s
user 0m0.140s
sys 0m0.000s

Data size: 9.7M with 23K unique words
Segmentation fault

Data size: 5.1M with 65K unique words
Segmentation fault

Data size: 51M with 65K unique words
Segmentation fault

GDC fails for some reason with a large number of unique words and/or a large data set. Also, GDC doesn't always give correct results; the word count is usually off by a few hundred.

D and C++ are very close. Without scanf() (reading via iostreams instead), the C++ version is almost twice as slow; and using std::unordered_set instead of std::set almost doubles the performance.

I'm interested to see a better D version than the one I posted.
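One tweak I'd expect to help (untested sketch; it assumes the input can be a file named on the command line instead of stdin): slurp the whole file as one immutable string and key the AA on slices into it, which gets rid of the per-word idup:

import std.stdio;
import std.array : split;
import std.file : readText;

void main(string[] args)
{
    // One big immutable buffer; every word below is a slice into it,
    // so no per-word copy is needed to build an immutable key.
    auto text = readText(args[1]);
    bool[string] seen;
    foreach (word; text.split())   // no-argument split() splits on whitespace
        seen[word] = true;
    writeln("Words: ", seen.length);
}

This trades memory (the whole file stays resident) for fewer allocations.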

P.S.
No flame wars, please.