Trying to reduce memory usage

Wed Feb 17 04:10:24 UTC 2021

On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> I'm trying to read in a text file that has many duplicated 
> lines and output a file with all the duplicates removed. By the 
> end of this code snippet, the memory usage is ~5x the size of 
> the infile (which can be multiple GB each), and when this is in 
> a loop the memory usage becomes unmanageable and often results 
> in an OutOfMemory error or just a complete lock up of the 
> system. Is there a way to reduce the memory usage of this code 
> without sacrificing speed to any noticeable extent? My 
> assumption is the .sort.uniq needs improving, but I can't think 
> of an easier/not much slower way of doing it.

I spent some time experimenting with this problem, and here is 
the best solution I found, assuming that perfect de-duplication 
is required. (I'll put the code up on GitHub / dub if anyone 
wants to have a look.)

--------------------------
     0) Memory map the input file, so that the program can pass 
around slices to it directly
without making copies. This also allows the OS to page it in and 
out of physical memory
for us, even if it is too large to fit all at once.

     1) Pre-compute the required space for all large data 
structures, even if an additional pass is required to do so. This 
makes the rest of the algorithm significantly more efficient with 
memory, time, and lines of code.

     2) Do a top-level bucket sort of the file using a small (8-16 
bit) hash into some scratch space. The target can be either in 
RAM, or in another memory-mapped file if we really need to 
minimize physical memory use.

The small hash can be a few bits taken off the top of a larger 
hash (I used std.digest.murmurhash). The larger hash is cached 
for use later on, to accelerate string comparisons, avoid 
unnecessary I/O, and perhaps do another level of bucket sort.

If there is too much data to put in physical memory all at once, 
be sure to copy the full text of each line into a region of the 
scratch file where it will be together with the other lines that 
share the same small hash. This is critical, as otherwise the 
string comparisons in the next step turn into slow random I/O.

     3) For each bucket, sort, filter out duplicates, and write to 
the output file. Any sorting algorithm(s) may be used if all 
associated data fits in physical memory. If not, use a merge 
sort, whose access patterns won't thrash the disk too badly.

     4) Manually release all large data structures, and delete the 
scratch file, if one was used. This is not difficult to do, since 
their life times are well-defined, and ensures that the program 
won't hang on to GiB of space any longer than necessary.
--------------------------

I wrote an optimized implementation of this algorithm. It's fast, 
efficient, and really does work on files too large for physical 
memory. However, it is complicated at almost 800 lines.

On files small enough to fit in RAM, it is similar in speed to 
the other solutions posted, but less memory hungry. Memory 
consumption in this case is around (sourceFile.length + 32 * 
lineCount * 3 / 2) bytes. Run time is similar to other posted 
solutions: about 3 seconds per GiB on my desktop.

When using a memory-mapped scratch file to accommodate huge 
files, the physical memory required is around 
max(largestBucket.data.length + 32 * largestBucket.lineCount * 3 
/ 2, bucketCount * writeBufferSize) bytes. (Virtual address space 
consumption is far higher, and the OS will commit however much 
physical memory is available and not needed by other tasks.) The 
run time is however long it takes the disk to read the source 
file twice, write a (sourceFile.length + 32 * lineCount * 3 / 2) 
byte scratch file, read back the scratch file, and write the 
destination file.

I tried it with a 38.8 GiB, 380_000_000 line file on a magnetic 
hard drive. It needed a 50.2 GiB scratch file and took about an 
hour (after much optimization and many bug fixes).