d word counting approach performs well but has higher mem usage
dwdv
dwdv at posteo.de
Sat Nov 3 14:26:02 UTC 2018
Hi there,
the task is simple: count word occurrences from stdin (around 150 MB in
this case) and print the sorted results to stdout in a somewhat idiomatic
fashion.
Now, D is quite elegant while maintaining high performance compared to
both C and C++, but I, as a complete beginner, can't identify where the
10x memory usage (~300 MB, see results below) is coming from.
Unicode overhead? Internal buffer? Is something slurping the whole file?
Associative array allocations? I couldn't find huge allocations with
dmd -vgc or -profile=gc either. What did I do wrong?
```d ===================================================================
void main()
{
    import std.stdio, std.algorithm, std.range;

    int[string] count;
    foreach(const word; stdin.byLine.map!splitter.joiner) {
        ++count[word];
    }

    //or even:
    //foreach(line; stdin.byLine) {
    //    foreach(const word; line.splitter) {
    //        ++count[word];
    //    }
    //}

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));
}
```
```c++ (for reference) =================================================
#include <iostream>
#include <vector>
#include <unordered_map>
#include <algorithm>
using namespace std;

int main() {
    string s;
    unordered_map<string, int> count;
    std::ios::sync_with_stdio(false);
    while (cin >> s) {
        count[s]++;
    }

    vector<pair<string, int>> temp {begin(count), end(count)};
    sort(begin(temp), end(temp),
         [](const auto& a, const auto& b) {return b.second < a.second;});
    for (const auto& elem : temp) {
        cout << elem.second << " " << elem.first << '\n';
    }
}
```
Results on an old Celeron dual core (wall clock time and resident memory):
0:08.78, 313732 kb <= d dmd
0:08.25, 318084 kb <= d ldc
0:08.38,  38512 kb <= c++ idiomatic (above)
0:07.76,  30276 kb <= c++ boost
0:08.42,  26756 kb <= c verbose, hand-rolled hashtable
Mem and time measured like so:
/usr/bin/time -v $cmd < input >/dev/null
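For anyone reproducing the numbers: a minimal sketch for reducing the verbose report to the two figures in the results list. It assumes GNU time, whose -v output labels the values "Elapsed (wall clock) time" and "Maximum resident set size (kbytes)"; the function name summarize_time is made up for illustration.

```shell
# Hypothetical helper: reads the output of GNU `time -v` on stdin and
# prints "<wall clock>, <max resident set size> kb" on a single line.
summarize_time() {
    awk -F': ' '
        /Elapsed \(wall clock\) time/ { wall = $NF }
        /Maximum resident set size/   { rss = $NF }
        END { printf "%s, %s kb\n", wall, rss }
    '
}

# Usage (time -v reports on stderr, so stderr goes into the pipe and
# the program's own stdout is discarded):
#   /usr/bin/time -v $cmd < input 2>&1 >/dev/null | summarize_time
```

Anything the measured program itself writes to stderr also goes through the pipe, but only the two labelled lines are picked up.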
Input words file creation (around 300k * 50 words):
tr '\n' ' ' < /usr/share/dict/$lang > joined
for i in {1..50}; do cat joined >> input; done
word count sample output:
[... snip ...]
50 ironsmith
50 gloried
50 quindecagon
50 directory's
50 hydrobiological
Compilation flags:
dmd -O -release -mcpu=native -ofwc-d-dmd wc.d
ldc2 -O3 -release -flto=full -mcpu=native -ofwc-d-ldc wc.d
clang -std=c11 -O3 -march=native -flto -o wp-c-clang wp.c
clang++ -std=c++17 -O3 -march=native -flto -o wp-cpp-clang wp-boost.cpp
Versions:
dmd: v2.082.1
ldc: 1.12.0 (based on DMD v2.082.1 and LLVM 6.0.1)
llvm/clang: 6.0.1