Storing and Searching large text lists

Thu Dec 31 18:22:14 PST 2015

On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
> I have a large list, B, of string items. For each item in that 
> large list, I need to see if it is in the smaller list, A.
>
> I have been using a simple string array for the storage of A
>
> string[] A
>
> and then using foreach to go through all the items of B and 
> check they are in A
>
> foreach(string;B)
> /* this looks hacky but wasn't working without the !=0 bit )
>     if(find(A,string) != 0)
>         writeln("Found a line: ", string);
>
> While this works for small datasets, but when either A or B get 
> large (A could be up to 150k records, B in the millions) it 
> takes quite a while to run.
>
> I'd like to know what is the best way to store lists of text 
> for searching? Is there a better container than a simply array? 
> Neither A nor B need to be ordered for my purpose, but would 
> sorting help the search? Would it help enough to be worth the 
> CPU expense?
>
> Regards
> B

Your problem is that your algorithm is O(m*n). (B is m ,A is n)
A simple speedup would be to sort A and then use a binary search 
(O(m*log(n)))
(partially sorting B may also help but only for cache)

Fully sorting may be worth doing if you have other steps after 
that also benefit in speed when working on sorted data (like 
searching does).

Changing A to a trie is another possibility.

But as always it help to know your data. How long are the 
strings(are they of the same length)? Are there any similarities 
(e.g. are they all email addresses)? Are there duplicates allowed 
in B?

Nic