Help with an algorithm!
Ivan Kazmenko via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Jun 15 04:48:54 PDT 2017
On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
> There are two arrays of string [] mas1, mas2; Size of each
> about 5M lines. By the size they different, but lines in both
> match for 95%. It is necessary to find all lines in an array of
> mas2 which differ from mas1. The principal criterion - speed.
> There are the 8th core processor and it is good to involve a
> multithreading.
The approaches which come to mind are:
1. Sort both arrays, then traverse them in sorted order, like in
merge step of merge sort:
sort (mas1);
sort (mas2);
size_t index1 = 0;
foreach (str2; mas2)
{
while (index1 < mas1.length && mas1[index1] < str2)
index1 += 1;
if (mas1[index1] != str2)
writeln (str2);
}
Sorting takes O (n log n * c) time, where n is the size of the
arrays, and c is the expected time of two strings comparison when
sorting. The subsequent step is O (n * c) which is faster than
sorting.
2. Hashing. Just put the contents of the first array into a bool
[string], and then, for each string from the second array, check
whether it is contained in the associative array. The time will
be O (total length of all strings) multiplied by a moderate
constant, unless the strings are designed specifically to
generate hash collisions, in which case it will be slower.
3. Trie. Similar to hashing, but the constant multipliers will
be much higher unless the strings have large common prefixes.
Whether we can do faster depends on context. For example, if the
strings tend to all have long common prefixes, any string
comparison will be slow, but otherwise it can be thought of as
taking constant time.
Ivan Kazmenko.
More information about the Digitalmars-d-learn
mailing list