[OT] Finding longest documents
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sun Oct 12 18:51:43 PDT 2008
Sergey Gromov wrote:
> Sun, 12 Oct 2008 19:11:00 -0500,
> Andrei Alexandrescu wrote:
>> Walter Bright wrote:
>>> Andrei Alexandrescu wrote:
>>>> nil '(("\\(!(\\)[^()]*\\()\\)"
>>> I guess this is why I don't use emacs. I don't want to hear any more
>>> grousing about !( ) after that!!!
>> I agree about that, but why don't you use std.algorithm? :o)
>>
>> Speaking of which, here's another challenge for everybody who'd like to
>> waste some cycles.
>>
>> Say you have a simple API for accessing a large collection of files -
>> e.g. all of Google's cached documents. The task, should you accept it,
>> is to find the 1,000,000 largest ones of those files. The output should
>> be filenames sorted in decreasing order of size. Assume the API gives
>> you <filename, size> pairs in a serial fashion. You can't use parallel
>> processing (e.g. map/reduce on clusters), but you are allowed to use
>> threads on one machine if it fancies you. Speed is highly desirable.
>> Devise your algorithm and explain how fast it runs with practical and/or
>> theoretical arguments.
>
> A straightforward solution would be to put the name-size pairs into a
> tree sorted by size; each insertion is O(log 1e6). Whenever the tree
> grows past a million elements, remove the smallest, which is also
> O(log 1e6). You can optimize this by skipping elements smaller than the
> smallest element in the tree once the tree is full.
>
> I haven't found any trees in Phobos though.
>
> Am I missing something?
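For reference, here is a minimal sketch of the tree approach described
above. It leans on std.container.rbtree's RedBlackTree, an addition that
came to Phobos after this thread (which is why no tree was to be found at
the time); the Entry tuple and the pairs range are assumed stand-ins for
the real <filename, size> API, not anything from the thread.

import std.array : array;
import std.container.rbtree : RedBlackTree;
import std.range : retro;
import std.typecons : Tuple;

// Assumed record shape; the challenge only promises <filename, size>.
alias Entry = Tuple!(string, "name", ulong, "size");

// Keep the m largest entries seen so far in a size-ordered tree.
Entry[] largestViaTree(R)(R pairs, size_t m)
{
    // allowDuplicates = true, since many files may share a size.
    auto tree = new RedBlackTree!(Entry, (a, b) => a.size < b.size, true);
    foreach (e; pairs)
    {
        if (tree.length < m)
            tree.insert(e);
        else if (e.size > tree.front.size)  // beats the current minimum?
        {
            tree.removeFront();             // evict the smallest
            tree.insert(e);
        }
        // else: skip in O(1) -- the optimization mentioned above
    }
    return tree[].retro.array;  // decreasing order of size
}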
The trick here was to figure out that you don't really need a fully
sorted container; an "almost sorted" container suffices, and a heap is
exactly what the doctor ordered.
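
Concretely, here is a minimal sketch of the heap idea using today's
std.container.heapify; as before, the Entry tuple and the pairs range are
assumptions standing in for the real API. Keep a min-heap (by size) of the
m largest entries seen so far: a new file either fits under the cap or
displaces the current minimum.

import std.algorithm.sorting : sort;
import std.container : heapify;
import std.typecons : Tuple;

// Assumed record shape; the challenge only promises <filename, size>.
alias Entry = Tuple!(string, "name", ulong, "size");

// With the reversed comparator, front() is the smallest kept entry.
Entry[] largestViaHeap(R)(R pairs, size_t m)
{
    Entry[] store;
    auto heap = heapify!((a, b) => a.size > b.size)(store);

    foreach (e; pairs)
    {
        if (heap.length < m)
            heap.insert(e);                 // still filling up: O(log m)
        else if (e.size > heap.front.size)
            heap.replaceFront(e);           // evict the minimum: O(log m)
        // else: skip in O(1) -- the common case once the heap is full
    }

    auto result = heap.release();              // the surviving entries
    sort!((a, b) => a.size > b.size)(result);  // decreasing size
    return result;
}

For N scanned files this is O(N log m) comparisons in the worst case and
O(m) memory; once the heap is full, most files fail the front() test at
the cost of a single comparison, which is where the heap beats a balanced
tree on constant factors and locality.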
Andrei