Trying to reduce memory usage
Jon Degenhardt
jond at noreply.com
Tue Feb 23 04:22:41 UTC 2021
On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
> On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt
> wrote:
>> It would be interesting to see how the performance compares to
>> tsv-uniq
>> (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The
>> prebuilt binaries turn on all the optimizations
>> (https://github.com/eBay/tsv-utils/releases).
>
> My program (called line-dedup below) is modestly faster than
> yours, with the gap gradually widening as files get bigger.
> Similarly, when not using a memory-mapped scratch file, my
> program is modestly less memory hungry than yours, with the gap
> gradually widening as files get bigger.
>
> In neither case is the difference very exciting though; the
> real benefit of my algorithm is that it can process files too
> large for physical memory. It might also handle frequent hash
> collisions better, and could be upgraded to handle huge numbers
> of very short lines efficiently.
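[For readers following along: the quoted approach — deduplicating files larger than physical memory with the help of scratch files — can be sketched roughly as follows. This is a hypothetical illustration in Python, not line-dedup's actual implementation; the partitioning scheme, bucket count, and helper names are assumptions.]

```python
import os
import tempfile

def dedup_external(in_path, out_path, num_buckets=16):
    """Deduplicate lines of a file too large for memory.

    Phase 1: partition lines into scratch buckets by hash, tagging
             each line with its original line number.
    Phase 2: within each bucket (now small enough to fit in memory),
             keep only the first occurrence of each line.
    Phase 3: merge survivors back into original input order.
    (For simplicity this sketch assumes the *unique* lines fit in
    memory for the final merge; a fully external version would
    external-sort the tagged survivors instead.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket{i}") for i in range(num_buckets)]
        buckets = [open(p, "w") for p in paths]
        with open(in_path) as f:
            for lineno, line in enumerate(f):
                # Identical lines always land in the same bucket.
                buckets[hash(line) % num_buckets].write(f"{lineno}\t{line}")
        for b in buckets:
            b.close()

        survivors = []  # (original line number, line)
        for p in paths:
            first_seen = {}  # line -> earliest line number
            with open(p) as f:
                for rec in f:
                    tag, line = rec.split("\t", 1)
                    if line not in first_seen:
                        first_seen[line] = int(tag)
            survivors.extend((n, ln) for ln, n in first_seen.items())

        survivors.sort()  # restore first-occurrence order
        with open(out_path, "w") as out:
            for _, line in survivors:
                out.write(line)
```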
Thanks for running the comparison! I appreciate seeing how other
implementations compare.
I'd characterize the results a bit differently though. Based on
the numbers, line-dedup is materially faster than tsv-uniq, at
least on the tests run. To your point, it may not make much
practical difference on data sets that fit in memory; tsv-uniq
is fast enough for most needs. But it's still a material
performance delta. Nice job!
I agree also that the bigger pragmatic benefit is fast processing
of files much larger than will fit in memory. There are other
useful problems like this. One I often need is creating a random
weighted ordering. Easy to do for data sets that fit in memory,
but hard to do fast for data sets that do not.
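[The in-memory case mentioned above really is easy; one standard way is the Efraimidis–Spirakis weighted-reservoir trick: assign each item the key u**(1/w) with u uniform in (0,1), then sort descending. A hypothetical Python sketch, not taken from any of the tools discussed here:]

```python
import random

def weighted_shuffle(items, weights, rng=random):
    """Return a random weighted ordering of items.

    Each item gets the key u**(1/w) (u uniform in (0,1));
    sorting by key descending yields an ordering where the
    probability an item comes first is proportional to its
    weight (Efraimidis-Spirakis). Requires all weights > 0.
    """
    keyed = [(rng.random() ** (1.0 / w), item)
             for item, w in zip(items, weights)]
    keyed.sort(key=lambda kv: kv[0], reverse=True)
    return [item for _, item in keyed]
```

The hard part Jon alludes to is doing this for data that does not fit in memory: the keys can still be generated in one streaming pass, but the final ordering then requires an external sort of the keyed records.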
--Jon
More information about the Digitalmars-d-learn mailing list