Storing and Searching large text lists

Thu Dec 31 18:43:09 PST 2015

On 2016-01-01 13:22, Nicholas Wilson via Digitalmars-d-learn wrote: 

> On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
> 
>> I have a large list, B, of string items. For each item in that large list, I need to see if it is in the smaller list, A. I have been using a simple string array for the storage of A string[] A and then using foreach to go through all the items of B and check they are in A foreach(string;B) /* this looks hacky but wasn't working without the !=0 bit ) if(find(A,string) != 0) writeln("Found a line: ", string); While this works for small datasets, but when either A or B get large (A could be up to 150k records, B in the millions) it takes quite a while to run. I'd like to know what is the best way to store lists of text for searching? Is there a better container than a simply array? Neither A nor B need to be ordered for my purpose, but would sorting help the search? Would it help enough to be worth the CPU expense? Regards B
> 
> Your problem is that your algorithm is O(m*n). (B is m ,A is n)
> A simple speedup would be to sort A and then use a binary search (O(m*log(n)))
> (partially sorting B may also help but only for cache)
> 
> Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does).
> 
> Changing A to a trie is another possibility.
> 
> But as always it help to know your data. How long are the strings(are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are there duplicates allowed in B?
> 
> Nic

Thanks. I'll try your suggestions. 

The data I am looking at is the Microsoft Academic Graph data.
https://academicgraph.blob.core.windows.net/graph-2015-11-06/index.html 

I have picked a topic of interest, and am pulling the IDs for papers of
those topics.
There will be up to 150,000 in the list (A) and each id will look
something like: 

000241C7
0018BC77 

I then have a "Papers.txt" file, which has a list (B) of (a couple of
hundred million) papers and details about the papers. 
I am then searching through this list for the items in A.The Papers.txt
file has records that look like: 

007D0259 Business Intelligence analytics without conventional data
warehousing business intelligence analytics without conventional data
warehousing 2010 2010 19542 
0018BC77 Data mining email to discover social networks and communities
of practice data mining email to discover social networks and
communities of practice 2003 17584 

So I need to iterate through B, and see if the first entry, is one of
the papers I want, by comparing it to list A.
In the above example, 007D0259 I don't want, 0018BC77 I do want, and
will write that record to another file. 

I have started to delete entries from list A as I find them, so that the
search doesn't have as many records to look through each time, but I
amn't sure that that will decrease time, because it'll take time to find
the record on each delete. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.puremagic.com/pipermail/digitalmars-d-learn/attachments/20160101/0321c740/attachment.html>