[OT] Finding longest documents

Benji Smith dlanguage at benjismith.net
Sun Oct 12 17:37:28 PDT 2008


Andrei Alexandrescu wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>>    nil '(("\\(!(\\)[^()]*\\()\\)"
>>
>> I guess this is why I don't use emacs. I don't want to hear any more 
>> grousing about !( ) after that!!!
> 
> I agree about that, but why don't you use std.algorithm? :o)
> 
> Speaking of which, here's another challenge for everybody who'd like to 
> waste some cycles.
> 
> Say you have a simple API for accessing a large collection of files - 
> e.g. all of Google's cached documents. The task, should you accept it, 
> is to find the 1,000,000 largest ones of those files. The output should 
> be filenames sorted in decreasing order of size. Assume the API gives 
> you <filename, size> pairs in a serial fashion. You can't use parallel 
> processing (e.g. map/reduce on clusters), but you are allowed to use 
> threads on one machine if it fancies you. Speed is highly desirable. 
> Devise your algorithm and explain how fast it runs with practical and/or 
> theoretical arguments.
> 
> 
> Andrei

I smell a clever trick to divert us all from rolling our eyes about the 
emacs chevron trick :-)

Fine by me. I'll bite...

    // A bounded heap: it holds at most 1,000,000 <file, size> entries,
    // evicting the smallest entry whenever a larger file arrives.
    auto heap = new BinaryHeap!(File, ulong)(1_000_000);
    while (googleApi.hasMoreFiles()) {
       File f = googleApi.getNextFile();
       ulong size = googleApi.sizeOfFile(f);
       heap.add(f, size);
    }

    // The root of the min-heap is the smallest surviving entry, so files
    // come out in increasing order of size; reverse the output (or push
    // it onto a stack) to get the required decreasing order.
    while (heap.size() > 0) {
       ulong size = heap.getTopValue();
       File f = heap.removeTopItem();
       Stdout.formatln("file: {0}, size: {1}", f.getName(), size);
    }

The BinaryHeap class I'm using is described pretty well here:

http://en.wikipedia.org/wiki/Binary_heap

To keep only the 1,000,000 largest items in bounded memory, the heap is 
a min-heap: the smallest surviving candidate sits at the root, and once 
the heap is full, each new item that beats the root evicts it. Items too 
small to make the cut eventually fall completely out of the collection.

The running time is O(n log m), where 'n' is the total number of 
documents and 'm' is the number of <filename, size> pairs that we care 
about (in this case 1,000,000): each of the n insertions costs at most 
log m.
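For anyone who wants to try the idea without a D BinaryHeap handy, here 
is a minimal sketch in Python using the standard-library heapq module (a 
min-heap); the input list and the cutoff m = 2 are made up for 
illustration:

```python
import heapq

def largest_m(pairs, m):
    """Return the names of the m largest files from a serial stream
    of (name, size) pairs, sorted in decreasing order of size.

    A min-heap of at most m entries holds the current candidates;
    the smallest candidate sits at the root and is evicted whenever
    a larger file arrives. Total cost: O(n log m) for n input pairs.
    """
    heap = []  # entries are (size, name) tuples; heapq orders by size
    for name, size in pairs:
        if len(heap) < m:
            heapq.heappush(heap, (size, name))
        elif size > heap[0][0]:
            # Pop the smallest candidate and push the new one in one step.
            heapq.heapreplace(heap, (size, name))
    # Popping a min-heap yields ascending sizes, so sort descending
    # for the required output order.
    return [name for size, name in sorted(heap, reverse=True)]

files = [("a.txt", 10), ("b.txt", 500), ("c.txt", 40), ("d.txt", 7)]
print(largest_m(files, 2))  # the two largest: ['b.txt', 'c.txt']
```

Note that only m entries are ever held in memory at once, which is the 
whole point when n is Google-sized.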

--benji


