<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

<html><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>

<p>On 2016-01-01 13:22, Nicholas Wilson via Digitalmars-d-learn wrote:</p>

<blockquote type="cite" style="padding-left:5px; border-left:#1010ff 2px solid; margin-left:5px"><!-- html ignored --><!-- head ignored --><!-- meta ignored -->

<pre>On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:</pre>

<blockquote type="cite" style="padding-left:5px; border-left:#1010ff 2px solid; margin-left:5px">I have a large list, B, of string items. For each item in that large list, I need to see if it is in the smaller list, A. I have been using a simple string array for the storage of A string[] A and then using foreach to go through all the items of B and check they are in A foreach(string;B) /* this looks hacky but wasn't working without the !=0 bit ) if(find(A,string) != 0) writeln("Found a line: ", string); While this works for small datasets, but when either A or B get large (A could be up to 150k records, B in the millions) it takes quite a while to run. I'd like to know what is the best way to store lists of text for searching? Is there a better container than a simply array? Neither A nor B need to be ordered for my purpose, but would sorting help the search? Would it help enough to be worth the CPU expense? Regards B</blockquote>

<pre>Your problem is that your algorithm is O(m*n). (B is m ,A is n)

A simple speedup would be to sort A and then use a binary search (O(m*log(n)))

(partially sorting B may also help but only for cache)

Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does).

Changing A to a trie is another possibility.

But as always it help to know your data. How long are the strings(are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are there duplicates allowed in B?

Nic

</pre>

</blockquote>

<p>Thanks. I'll try your suggestions. </p>

<p>The data I am looking at is the Microsoft Academic Graph data.<br />https://academicgraph.blob.core.windows.net/graph-2015-11-06/index.html</p>

<p>I have picked a topic of interest, and am pulling the IDs for papers of those topics.<br />There will be up to 150,000 in the list (A) and each id will look something like:</p>

<p>000241C7<br />0018BC77</p>

<p>I then have a "Papers.txt" file, which has a list (B) of (a couple of hundred million) papers and details about the papers. <br />I am then searching through this list for the items in A.The Papers.txt file has records that look like:</p>

<p>007D0259    Business Intelligence analytics without conventional data warehousing    business intelligence analytics without conventional data warehousing    2010    2010                        19542 <br />0018BC77    Data mining email to discover social networks and communities of practice    data mining email to discover social networks and communities of practice    2003                            17584 <br /><br /></p>

<p>So I need to iterate through B, and see if the first entry, is one of the papers I want, by comparing it to list A.<br />In the above example, 007D0259 I don't want, 0018BC77 I do want, and will write that record to another file.</p>

<p>I have started to delete entries from list A as I find them, so that the search doesn't have as many records to look through each time, but I amn't sure that that will decrease time, because it'll take time to find the record on each delete.</p>

<p> </p>

<div> </div>

</body></html>