If there are, say, 14 unique words, then the executable compiled with GDC
doesn't always output the correct result, and sometimes it gives a
segmentation fault. In that case 14 would be the correct result, and 32
would not. It seems to work fine with very small data sets, but things
start to go wrong with larger ones.

As for the system: it's 64-bit GNU/Linux, no multilib. What else do you need?

For GDC I used gcc-4.4.5 and the following compiler flags:

    gdc -O2 -o count_d count.d

I can't post the data because it's too large, but it shouldn't be too
difficult to generate; a 1 MB text file should work.
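
In case it helps with reproducing, here's a rough sketch of the kind of
generator I have in mind (the file name "data.txt", the "wordN" pattern, and
the sizes are placeholders of mine; any whitespace-separated text with a
known vocabulary size should do):

    import std.stdio;
    import std.string;

    void main() {
        // Writes roughly 1 MB of newline-separated words by cycling through
        // a fixed vocabulary, so the number of unique words in the file is
        // known exactly and the program's output can be checked against it.
        enum uniqueWords = 23_000;    // the count a correct run should print
        enum targetBytes = 1_000_000; // roughly 1 MB of output

        auto f = File("data.txt", "w");
        size_t written = 0, i = 0;
        while (written < targetBytes) {
            auto word = format("word%d", i++ % uniqueWords);
            f.writeln(word);
            written += word.length + 1; // +1 for the newline
        }
    }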

On Fri, Dec 24, 2010 at 6:49 PM, Iain Buclaw <ibuclaw@ubuntu.com> wrote:

== Quote from Caligo (iteronvexor@gmail.com)'s article
> This is the page that would require your attention:
> http://unthought.net/c++/c_vs_c++.html
> I'm going to ignore the C version because it's ugly and uses a hash. I'm
> also going to ignore the fastest C++ version because it uses a digital trie
> (it's very fast but extremely memory hungry; the complexity is constant in
> the size of the input and linear in the length of the word being searched
> for). I just wanted to focus on the language and the standard library, not
> on implementing a data structure.
> Here is the C++ code:
>
>     #include <unordered_set>
>     #include <string>
>     #include <iostream>
>     #include <stdio.h>
>
>     int main(int argc, char* argv[]) {
>         using namespace std;
>         char buf[8192];
>         string word;
>         unordered_set<string> wordcount;
>         while (scanf("%s", buf) != EOF) wordcount.insert(buf);
>         cout << "Words: " << wordcount.size() << endl;
>         return 0;
>     }
> For D I pretty much used the example from TDPL. As far as I can tell, the
> associative array used here is closer to std::map (or maybe std::unordered_map?)
> than to std::unordered_set, but I don't know of any other data structure in D
> for this (I'm still learning).
> Here is the D code:
>
>     import std.stdio;
>     import std.string;
>     import std.algorithm : splitter; // needed: splitter is not in std.stdio or std.string
>
>     void main() {
>         size_t[string] dictionary;
>         foreach (line; stdin.byLine()) {
>             foreach (word; splitter(strip(line))) {
>                 if (word in dictionary) continue;
>                 dictionary[word.idup] = 1;
>             }
>         }
>         writeln("Words: ", dictionary.length);
>     }
> Here are the measurements (average of 3 runs):
>
> C++
> ===
> Data size: 990K with 23K unique words
>     real    0m0.055s
>     user    0m0.046s
>     sys     0m0.000s
>
> Data size: 9.7M with 23K unique words
>     real    0m0.492s
>     user    0m0.470s
>     sys     0m0.013s
>
> Data size: 5.1M with 65K unique words
>     real    0m0.298s
>     user    0m0.277s
>     sys     0m0.013s
>
> Data size: 51M with 65K unique words
>     real    0m2.589s
>     user    0m2.533s
>     sys     0m0.070s
>
> DMD D 2.051
> ===
> Data size: 990K with 23K unique words
>     real    0m0.064s
>     user    0m0.053s
>     sys     0m0.006s
>
> Data size: 9.7M with 23K unique words
>     real    0m0.513s
>     user    0m0.487s
>     sys     0m0.013s
>
> Data size: 5.1M with 65K unique words
>     real    0m0.305s
>     user    0m0.287s
>     sys     0m0.007s
>
> Data size: 51M with 65K unique words
>     real    0m2.683s
>     user    0m2.590s
>     sys     0m0.103s
>
> GDC D 2.051
> ===
> Data size: 990K with 23K unique words
>     real    0m0.146s
>     user    0m0.140s
>     sys     0m0.000s
>
> Data size: 9.7M with 23K unique words
>     Segmentation fault
>
> Data size: 5.1M with 65K unique words
>     Segmentation fault
>
> Data size: 51M with 65K unique words
>     Segmentation fault
> GDC fails for some reason with a large number of unique words and/or large
> data sets. Also, GDC doesn't always give correct results; the word count is
> usually off by a few hundred.
> D and C++ are very close. Without scanf() C++ is almost twice as slow.
> Also, using std::unordered_set, the performance almost doubles.
> I'm interested to see a better D version than the one I posted.
>
> P.S.
> No flame wars, please.

System details, compiler flags, and the test data you used would be helpful;
otherwise I can't be sure what you mean by "doesn't always give correct
results". :~)