[OT] Finding longest documents
KennyTM~
kennytm at gmail.com
Sun Oct 12 23:47:11 PDT 2008
Andrei Alexandrescu wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> nil '(("\\(!(\\)[^()]*\\()\\)"
>>
>> I guess this is why I don't use emacs. I don't want to hear any more
>> grousing about !( ) after that!!!
>
> I agree about that, but why don't you use std.algorithm? :o)
>
> Speaking of which, here's another challenge for everybody who'd like to
> waste some cycles.
>
> Say you have a simple API for accessing a large collection of files -
> e.g. all of Google's cached documents. The task, should you accept it,
> is to find the 1,000,000 largest ones of those files. The output should
> be filenames sorted in decreasing order of size. Assume the API gives
> you <filename, size> pairs in a serial fashion. You can't use parallel
> processing (e.g. map/reduce on clusters), but you are allowed to use
> threads on one machine if it fancies you. Speed is highly desirable.
> Devise your algorithm and explain how fast it runs with practical and/or
> theoretical arguments.
>
>
> Andrei
Straightforward solution: keep a bounded min-heap (priority queue) of the 1,000,000 largest documents seen so far.
auto list = new PriorityQueue!(Document);
// Document.opCmp compares two documents' sizes;
// the SMALLEST document gets the highest priority (min-heap),
// so list.pop() always evicts the smallest document kept so far.
foreach (doc; google_cache) {     // O(N) iterations
    list.push(doc);               // O(log 1M)
    if (list.length > 1_000_000)
        list.pop();               // evict current smallest, O(log 1M)
}
auto array = convertToArray(list); // O(1M log 1M)
array.reverse(); // O(1M)
write(array); // O(1M).
Total time complexity: O(N log 1M).
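
For concreteness, here is the same bounded-heap selection sketched in Python (a hedged illustration, not the poster's code: `top_k_largest` and the `(filename, size)` pair stream are made-up names; `heapq` is Python's standard min-heap, keyed here on the first tuple element):

```python
import heapq

def top_k_largest(pairs, k):
    """Given a serial stream of (filename, size) pairs, return the k
    largest files as (filename, size) pairs in decreasing size order."""
    heap = []  # min-heap of (size, filename); heap[0] is the smallest kept
    for filename, size in pairs:
        heapq.heappush(heap, (size, filename))  # O(log k)
        if len(heap) > k:
            heapq.heappop(heap)                 # evict current smallest
    # Drain and reverse: O(k log k) for the final sort.
    return [(name, size) for size, name in sorted(heap, reverse=True)]
```

Because only k elements ever live in the heap, each of the N stream items costs O(log k), matching the O(N log 1M) total above; pushing before the conditional pop avoids the subtle bug of evicting a kept document in favor of a smaller incoming one.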