[OT] Finding longest documents

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sun Oct 12 18:51:05 PDT 2008


Benji Smith wrote:
> Andrei Alexandrescu wrote:
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>>    nil '(("\\(!(\\)[^()]*\\()\\)"
>>>
>>> I guess this is why I don't use emacs. I don't want to hear any more 
>>> grousing about !( ) after that!!!
>>
>> I agree about that, but why don't you use std.algorithm? :o)
>>
>> Speaking of which, here's another challenge for everybody who'd like 
>> to waste some cycles.
>>
>> Say you have a simple API for accessing a large collection of files - 
>> e.g. all of Google's cached documents. The task, should you accept it, 
>> is to find the 1,000,000 largest ones of those files. The output 
>> should be filenames sorted in decreasing order of size. Assume the API 
>> gives you <filename, size> pairs in a serial fashion. You can't use 
>> parallel processing (e.g. map/reduce on clusters), but you are allowed 
>> to use threads on one machine if it fancies you. Speed is highly 
>> desirable. Devise your algorithm and explain how fast it runs with 
>> practical and/or theoretical arguments.
>>
>>
>> Andrei
> 
> I smell a clever trick to divert us all from rolling our eyes about the 
> emacs chevron trick :-)
> 
> Fine by me. I'll bite...
> 
>    // Capacity-bounded heap: retains at most the 1,000,000 largest
>    // <file, size> pairs seen so far; anything smaller falls out.
>    auto heap = new BinaryHeap!(File, ulong)(1_000_000);
>    while (googleApi.hasMoreFiles()) {
>       File f = googleApi.getNextFile();
>       ulong size = googleApi.sizeOfFile(f);
>       heap.add(f, size);   // O(log m) per insertion
>    }
> 
>    // Drain the heap from the top: largest first, so the output is
>    // already in decreasing order of size.
>    while (heap.size() > 0) {
>       ulong size = heap.getTopValue();
>       File f = heap.removeTopItem();
>       Stdout.formatln("file: {0}, size: {1}", f.getName(), size);
>    }
> 
> The BinaryHeap class I'm using is described pretty well here:
> 
> http://en.wikipedia.org/wiki/Binary_heap
> 
> Items with the highest values automatically bubble to the top of the 
> heap, while items with lower values sink to the bottom; once the heap 
> reaches its capacity, the lowest-valued items fall out of the 
> collection entirely.
> 
> The performance is O(n log m), where 'n' is the total number of 
> documents and 'm' is the number of <filename, size> pairs that we care 
> about (in this case 1,000,000).
> 
> --benji

Winner!

One nice touch is that you don't even bother to sort the heap; you only 
remove from it. I wonder whether that will be appreciably faster overall 
than just heapsorting the heap. Thoughts?
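
For concreteness, here is a minimal, self-contained sketch of the same 
idea in D. Everything in it is invented for illustration (the Entry 
record, the offer/pop names, the toy data); it is not the BinaryHeap 
class above. The detail worth spelling out is that the bounded heap is 
a *min*-heap on size: keeping the smallest retained entry at the root 
is what lets you evict it in O(log m) whenever a larger candidate 
arrives.

    import std.stdio;
    import std.algorithm : swap;

    // Stand-in for the API's <filename, size> pairs.
    struct Entry {
        string name;
        ulong size;
    }

    // Fixed-capacity min-heap keyed on size. At any moment a[0 .. n]
    // holds the largest entries seen so far, the smallest at the root.
    struct BoundedMinHeap {
        Entry[] a;
        size_t n;

        this(size_t capacity) { a = new Entry[capacity]; }

        void offer(Entry e) {
            if (n < a.length) {              // room left: sift-up insert
                size_t i = n++;
                a[i] = e;
                while (i > 0 && a[(i - 1) / 2].size > a[i].size) {
                    swap(a[i], a[(i - 1) / 2]);
                    i = (i - 1) / 2;
                }
            } else if (e.size > a[0].size) { // full: evict the minimum
                a[0] = e;
                siftDown(0);
            }                                // else: too small, ignore
        }

        // Removes and returns the minimum; assumes a non-empty heap.
        Entry pop() {
            Entry top = a[0];
            a[0] = a[--n];
            siftDown(0);
            return top;
        }

        private void siftDown(size_t i) {
            for (;;) {
                size_t l = 2 * i + 1, s = i;   // s: index of smallest
                if (l < n && a[l].size < a[s].size) s = l;
                if (l + 1 < n && a[l + 1].size < a[s].size) s = l + 1;
                if (s == i) break;
                swap(a[i], a[s]);
                i = s;
            }
        }
    }

    void main() {
        auto heap = BoundedMinHeap(3);   // 3 for the demo; 1_000_000 above
        foreach (e; [Entry("a", 10), Entry("b", 7), Entry("c", 99),
                     Entry("d", 1), Entry("e", 42)])
            heap.offer(e);

        // Popping a min-heap yields increasing size, so fill the output
        // array back to front to get decreasing order.
        auto result = new Entry[heap.n];
        foreach_reverse (ref slot; result) slot = heap.pop();
        foreach (e; result)
            writefln("file: %s, size: %s", e.name, e.size);
    }

This prints c/99, e/42, a/10. My own hunch on the question above: 
draining a heap of m elements one removal at a time is exactly the 
extraction phase of heapsort, so both variants should be Theta(m log m); 
if there is a win, it can only be in the constants.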


Andrei


