Help with an algorithm!

Thu Jun 15 04:48:54 PDT 2017

On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
> There are two arrays of string [] mas1, mas2; Size of each 
> about 5M lines. By the size they different, but lines in both 
> match for 95%. It is necessary to find all lines in an array of 
> mas2 which differ from mas1. The principal criterion - speed. 
> There are the 8th core processor and it is good to involve a 
> multithreading.

The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like in 
merge step of merge sort:

     sort (mas1);
     sort (mas2);

     size_t index1 = 0;
     foreach (str2; mas2)
     {
         while (index1 < mas1.length && mas1[index1] < str2)
             index1 += 1;
         if (mas1[index1] != str2)
             writeln (str2);
     }

Sorting takes O (n log n * c) time, where n is the size of the 
arrays, and c is the expected time of two strings comparison when 
sorting.  The subsequent step is O (n * c) which is faster than 
sorting.

2. Hashing.  Just put the contents of the first array into a bool 
[string], and then, for each string from the second array, check 
whether it is contained in the associative array.  The time will 
be O (total length of all strings) multiplied by a moderate 
constant, unless the strings are designed specifically to 
generate hash collisions, in which case it will be slower.

3. Trie.  Similar to hashing, but the constant multipliers will 
be much higher unless the strings have large common prefixes.

Whether we can do faster depends on context.  For example, if the 
strings tend to all have long common prefixes, any string 
comparison will be slow, but otherwise it can be thought of as 
taking constant time.

Ivan Kazmenko.