Trying to reduce memory usage

tsbockman thomas.bockman at gmail.com
Tue Feb 23 00:08:40 UTC 2021


On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
> On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
>> I spent some time experimenting with this problem, and here is 
>> the best solution I found, assuming that perfect 
>> de-duplication is required. (I'll put the code up on GitHub / 
>> dub if anyone wants to have a look.)
>
> It would be interesting to see how the performance compares to 
> tsv-uniq 
> (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The 
> prebuilt binaries turn on all the optimizations 
> (https://github.com/eBay/tsv-utils/releases).

My program (called line-dedup below) is modestly faster than 
yours, and when run without a memory-mapped scratch file it is 
also modestly less memory hungry; in both respects the gap 
gradually widens as the files get bigger.

In neither case is the difference very exciting, though; the 
real benefit of my algorithm is that it can process files too 
large to fit in physical memory. It might also handle frequent 
hash collisions better, and it could be upgraded to handle huge 
numbers of very short lines efficiently.
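
The core idea can be sketched in a few lines of D. Everything 
below (the bucket count, the scratch-file naming, and the use of 
plain temporary files rather than a single memory-mapped scratch 
file) is an illustrative assumption, not the actual line-dedup 
implementation:

    import std.conv : to;
    import std.file : remove;
    import std.stdio;

    enum bucketCount = 64; // illustrative; a real tool might size this adaptively

    void main(string[] args)
    {
        auto input = File(args[1], "r");
        auto output = File(args[2], "w");

        // Pass 1: scatter lines into scratch buckets by hash, so that
        // each bucket is small enough to deduplicate entirely in memory.
        File[bucketCount] buckets;
        foreach (i, ref b; buckets)
            b = File("scratch." ~ i.to!string, "w+");
        foreach (line; input.byLine)
            buckets[hashOf(line) % bucketCount].writeln(line);

        // Pass 2: identical lines always land in the same bucket, so an
        // in-memory set per bucket still gives exact global de-duplication.
        foreach (i, ref b; buckets)
        {
            b.rewind();
            bool[string] seen;
            foreach (line; b.byLine)
            {
                immutable key = line.idup; // byLine reuses its buffer
                if (key !in seen)
                {
                    seen[key] = true;
                    output.writeln(key);
                }
            }
            b.close();
            remove("scratch." ~ i.to!string);
        }
    }

Because identical lines always hash to the same bucket, 
deduplicating each bucket independently yields perfect global 
de-duplication, while peak memory is bounded by the largest 
bucket rather than by the whole file. Note that, unlike a 
streaming filter, this emits lines grouped by bucket rather than 
in input order.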

Test 0 (139 MiB, 1537300 unique lines out of 2000004):
     sort source | uniq > dest
         Wall time: 2.04 s
         User time: 9.34 s
         System time: 0.23 s
         Max resident set: 385 MiB
     tsv-uniq source > dest
         Wall time: 0.82 s
         User time: 0.73 s
         System time: 0.13 s
         Max resident set: 229 MiB
     line-dedup source dest
         Wall time: 0.58 s
         User time: 0.52 s
         System time: 0.05 s
         Max resident set: 206 MiB

Test 1 (1.4 GiB, 14338280 unique lines out of 20000004):
     sort source | uniq > dest
         Wall time: 22.9 s
         User time: 107 s
         System time: 2.16 s
         Max resident set: 3.76 GiB
     tsv-uniq source > dest
         Wall time: 9.94 s
         User time: 9.25 s
         System time: 1.3 s
         Max resident set: 2.34 GiB
     line-dedup source dest
         Wall time: 6.34 s
         User time: 5.88 s
         System time: 0.44 s
         Max resident set: 1.98 GiB

Test 2 (4.2 GiB, 44379959 unique lines out of 60000004):
     sort source | uniq > dest
         Wall time: 87.1 s
         User time: 352.3 s
         System time: 7.41 s
         Max resident set: 7.84 GiB
     tsv-uniq source > dest
         Wall time: 42.3 s
         User time: 40.7 s
         System time: 4.82 s
         Max resident set: 7.86 GiB
     line-dedup source dest
         Wall time: 23 s
         User time: 19.6 s
         System time: 1.4 s
         Max resident set: 5.96 GiB
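
For anyone reproducing these runs: all four figures above (wall, 
user, and system time, plus max resident set) correspond to 
fields of GNU time's verbose output, so a run can be measured 
with, e.g.:

     /usr/bin/time -v ./line-dedup source dest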

The test files have a fractal distribution of duplicates: many 
lines have only a few duplicates, while a few lines have a great 
many.
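
Similar inputs can be produced with a generator along these 
lines (the skew function, seed, and line format here are 
illustrative assumptions, not the exact generator used for the 
tests above):

    import std.random;
    import std.stdio;

    void main()
    {
        enum totalLines = 2_000_004; // roughly matches Test 0 above
        enum idRange = 1_600_000;    // approximate cap on distinct lines
        auto rng = Random(42);
        auto output = File("source", "w");
        foreach (i; 0 .. totalLines)
        {
            // Squaring a uniform variate skews picks toward small ids, so
            // a few lines recur many times while most appear only once or
            // twice; the exact distinct-line count varies with the seed.
            immutable u = uniform01(rng);
            immutable id = cast(size_t)(u * u * idRange);
            output.writeln("line ", id);
        }
    }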
