Trying to reduce memory usage
tsbockman
thomas.bockman at gmail.com
Tue Feb 23 00:08:40 UTC 2021
On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
> On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
>> I spent some time experimenting with this problem, and here is
>> the best solution I found, assuming that perfect
>> de-duplication is required. (I'll put the code up on GitHub /
>> dub if anyone wants to have a look.)
>
> It would be interesting to see how the performance compares to
> tsv-uniq
> (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The
> prebuilt binaries turn on all the optimizations
> (https://github.com/eBay/tsv-utils/releases).
My program (called line-dedup below) is modestly faster than
yours, with the gap gradually widening as files get bigger.
Similarly, when not using a memory-mapped scratch file, my
program is modestly less memory hungry than yours, with the gap
gradually widening as files get bigger.
In neither case is the difference very exciting though; the real
benefit of my algorithm is that it can process files too large
for physical memory. It might also handle frequent hash
collisions better, and could be upgraded to handle huge numbers
of very short lines efficiently.
Test 0 (139 MiB, 1537300 unique lines of 2000004):
sort source | uniq > dest
Wall time: 2.04 s
User time: 9.34 s
System time: 0.23 s
Max resident set: 385 MiB
tsv-uniq source > dest
Wall time: 0.82 s
User time: 0.73 s
System time: 0.13 s
Max resident set: 229 MiB
line-dedup source dest
Wall time: 0.58 s
User time: 0.52 s
System time 0.05 s
Max resident set: 206 MiB
Test 1 (1.4 GiB, 14338280 unique lines of 20000004):
sort source | uniq > dest
Wall time: 22.9 s
User time: 107 s
System time: 2.16 s
Max resident set: 3.76 GiB
tsv-uniq source > dest
Wall time: 9.94 s
User time: 9.25 s
System time: 1.3 s
Max resident set: 2.34 GiB
line-dedup source dest
Wall time: 6.34 s
User time: 5.88 s
System time 0.44 s
Max resident set: 1.98 GiB
Test 2 (4.2 GiB, 44379959 unique lines of 60000004):
sort source | uniq > dest
Wall time: 87.1 s
User time: 352.3 s
System time: 7.41 s
Max resident set: 7.84 GiB
tsv-uniq source > dest
Wall time: 42.3 s
User time: 40.7 s
System time: 4.82 s
Max resident set: 7.86 GiB
line-dedup source dest
Wall time: 23 s
User time: 19.6 s
System time 1.4 s
Max resident set: 5.96 GiB
The test files have a fractal distribution, with many lines
having a few duplicates, and a few lines having many duplicates.
More information about the Digitalmars-d-learn
mailing list