Trying to reduce memory usage

Jon Degenhardt jond at noreply.com
Tue Feb 23 04:22:41 UTC 2021


On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
> On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt 
> wrote:
>> It would be interesting to see how the performance compares to 
>> tsv-uniq 
>> (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The 
>> prebuilt binaries turn on all the optimizations 
>> (https://github.com/eBay/tsv-utils/releases).
>
> My program (called line-dedup below) is modestly faster than 
> yours, with the gap gradually widening as files get bigger. 
> Similarly, when not using a memory-mapped scratch file, my 
> program is modestly less memory hungry than yours, with the gap 
> gradually widening as files get bigger.
>
> In neither case is the difference very exciting though; the 
> real benefit of my algorithm is that it can process files too 
> large for physical memory. It might also handle frequent hash 
> collisions better, and could be upgraded to handle huge numbers 
> of very short lines efficiently.

Thanks for running the comparison! I appreciate seeing how other 
implementations compare.

I'd characterize the results a bit differently though. Based 
on the numbers, line-dedup is materially faster than tsv-uniq, 
at least on the tests run. To your point, it may not make much 
practical difference on data sets that fit in memory, since 
tsv-uniq is fast enough for most needs, but it's still a 
material performance delta. Nice job!

I also agree that the bigger pragmatic benefit is fast 
processing of files much larger than will fit in memory. There 
are other useful problems like this. One I often need is 
creating a random weighted ordering: easy to do for data sets 
that fit in memory, but hard to do fast for data sets that do 
not.
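
For the in-memory case, here's a minimal sketch (in D) of one 
standard approach, the Efraimidis-Spirakis method: give each 
record the key u^(1/weight) for a uniform random u in (0,1), 
then sort by key descending. The tab-separated "line<TAB>weight" 
input layout is just an assumption for the example:

    import std.algorithm : map, sort, splitter;
    import std.array : array;
    import std.conv : to;
    import std.math : pow;
    import std.random : uniform01;
    import std.stdio : stdin, writeln;
    import std.typecons : tuple;

    void main()
    {
        // Read "line \t weight" records; key = u ^ (1 / weight).
        auto records = stdin.byLineCopy
            .map!((string line) {
                auto fields = line.splitter('\t').array;
                auto weight = fields[1].to!double;  // assumed layout
                auto key = pow(uniform01(), 1.0 / weight);
                return tuple(key, line);
            })
            .array;

        // Sorting by key descending gives an ordering where each
        // record's chance of appearing earlier is proportional to
        // its weight.
        records.sort!((a, b) => a[0] > b[0]);

        foreach (rec; records)
            writeln(rec[1]);
    }

Everything before the sort streams, but the sort is what forces 
the whole data set into memory; that's the part that gets hard 
when the file doesn't fit.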

--Jon



