Trying to reduce memory usage

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Feb 12 02:22:35 UTC 2021


On Fri, Feb 12, 2021 at 01:45:23AM +0000, mw via Digitalmars-d-learn wrote:
> On Friday, 12 February 2021 at 01:23:14 UTC, Josh wrote:
> > I'm trying to read in a text file that has many duplicated lines and
> > output a file with all the duplicates removed.
> 
> If you only need to remove duplicates, keeping (and comparing) a
> string hash for each line is good enough. Memory usage should be
> just n integers.
[...]

+1. This can even be done on the fly: you don't even need .sort or
.uniq.  Just something like this:

	import std.stdio;

	void main() {
		bool[size_t] hashes;
		// hypothetical output file; substitute whatever destination you need
		auto outfile = File("output.txt", "w");
		foreach (line; stdin.byLine) {
			auto h = hashOf(line); // use a suitable hash function here
			if (h !in hashes) {
				outfile.writeln(line);
				hashes[h] = true;
			}
			// else: this line has already been seen; just skip it
		}
	}
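
(With the hypothetical names above, compile that and run it as
something like ./dedup < input.txt; the deduplicated lines end up in
output.txt.)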

This turns the OP's O(n log n) algorithm into an O(n) one, avoids
having to load the entire file into memory, and uses far less memory
overall because only the hashes are stored.
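
For comparison, here's a minimal sketch of the sort-based approach the
OP described (reading from stdin, writing to stdout, and using
byLineCopy are just my assumptions for illustration): every line has
to be pulled into memory and sorted before .uniq can drop the
duplicates.

	import std.algorithm : sort, uniq;
	import std.array : array;
	import std.stdio;

	void main() {
		// Copy every line of the input into memory -- this is
		// where the big memory cost comes from.
		auto lines = stdin.byLineCopy.array;
		// The O(n log n) step.
		sort(lines);
		// uniq collapses runs of equal lines in the sorted range.
		foreach (line; lines.uniq)
			stdout.writeln(line);
	}

Besides the extra memory, sorting also changes the order of the lines
in the output, whereas the hash-based version keeps the first
occurrence of each line in its original position.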


T

-- 
MASM = Mana Ada Sistem, Man!

