Trying to reduce memory usage

frame frame86 at live.com
Fri Feb 12 07:57:56 UTC 2021


On Friday, 12 February 2021 at 07:23:12 UTC, frame wrote:
> On Friday, 12 February 2021 at 02:22:35 UTC, H. S. Teoh wrote:
>
>> This turns the OP's O(n log n) algorithm into an O(n) 
>> algorithm, doesn't need to copy the entire content of the file 
>> into memory, and also uses much less memory by storing only 
>> hashes.
>
> But this kind of hash may be insufficient to avoid hash 
> collisions. For such large amounts of data, slower but stronger 
> algorithms like SHA are advisable.
>
> Associative arrays also use the same weak algorithm, so you can 
> run into collision issues there as well. Using the hash of 
> string data as the key can therefore be a problem. I always use 
> a quick hash as the key but actually store a collection of 
> stronger hashes under it and do a lookup to be on the safe side.
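
Roughly what I mean, as a minimal D sketch (the function name and 
the choice of SHA-256 from std.digest.sha are only illustrative):

import std.algorithm.searching : canFind;
import std.digest.sha : sha256Of;

// Weak hash as the AA key, strong digests as the payload: a lookup
// in the digest list separates real duplicates from key collisions.
bool isDuplicate(ref ubyte[32][][size_t] seen, const(char)[] line)
{
    auto fast = hashOf(line);      // built-in weak hash (size_t)
    auto strong = sha256Of(line);  // 32-byte SHA-256 digest
    if (auto p = fast in seen)
    {
        if ((*p).canFind(strong))
            return true;           // same strong digest: real duplicate
        *p ~= strong;              // weak-hash collision, different line
        return false;
    }
    seen[fast] = [strong];
    return false;
}

The weak hash keeps the AA lookup cheap; the strong digests are 
only consulted to rule out false positives on a key collision.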

Forgot to mention that this kind of solution needs a refinement 
if you don't want to mistake two different lines for the same one:

You can use a weak hash, but as a pre-process track the line 
positions and count how often each hash occurs. In the 
post-process you look up those lines again and compare them to 
tell whether they are really identical or just hash collisions 
that need to be corrected.
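
A rough sketch of that two-pass idea in D; the file name 
"input.txt" and the built-in hashOf as the weak hash are just 
assumptions for illustration:

import std.algorithm.searching : canFind;
import std.stdio : File, writeln;

void main()
{
    // Pre-process: per weak hash, remember the byte offset of every
    // line that produced it.
    ulong[][size_t] offsetsByHash;
    auto f = File("input.txt", "rb");
    ulong offset = 0;
    foreach (line; f.byLine)
    {
        offsetsByHash[hashOf(line)] ~= offset;
        offset = f.tell;   // start of the next line
    }

    // Post-process: a hash with a single offset is certainly unique.
    // For hashes with several offsets, re-read those lines and compare
    // them directly to separate real duplicates from collisions.
    size_t unique;
    foreach (hash, offsets; offsetsByHash)
    {
        if (offsets.length == 1) { ++unique; continue; }
        string[] distinct;
        foreach (off; offsets)
        {
            f.seek(off);
            auto s = f.readln();
            if (!distinct.canFind(s))
                distinct ~= s;
        }
        unique += distinct.length;
    }
    writeln(unique, " unique lines");
}

Only hashes whose offset list has more than one entry need the 
second pass, so the extra I/O stays proportional to the number of 
suspected duplicates.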

