[OT] shuffling lines in a stream
BLS
nanali at nospam-wanadoo.fr
Fri Oct 10 13:32:49 PDT 2008
Benji Smith wrote:
> BLS wrote:
>> Jason House wrote:
>>> Andrei Alexandrescu Wrote:
>>>
>>>> Sergey Gromov wrote:
>>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>>> Andrei Alexandrescu wrote:
>>>>>> The high traffic enjoyed by "random k-sample of a file" confirms
>>>>>> that plenty of people around here have idle cycles ready to be
>>>>>> spent on fun tasks.
>>>>>>
>>>>>> In that vein, here's another problem. Say you want to randomly
>>>>>> shuffle lines in a stream. You don't know the size beforehand, and
>>>>>> one pass is preferable. The stream can be larger than the working
>>>>>> memory, but not unboundedly larger. Individual lines can also be of
>>>>>> unbounded length, but it is guaranteed that at least a nonzero
>>>>>> fraction of the file fits in working memory.
>>>>>>
>>>>>> You can use temporary files, but of course minimizing their total
>>>>>> size is recommended. Speed is of course important. Draft your
>>>>>> solution and explain why you think it's correct and advantageous.
>>>>> Well, my temporary file contains the whole stream, because there
>>>>> is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load
>>>> the entire file in memory.
>>>>
>>>> Andrei
>>>
>>> What you want to optimize is still unclear to me. At its heart, you
>>> must pass over the data twice: once to index it, and once to display
>>> it. Some library-based memory-mapped file or a seekable stream seems
>>> like enough to get the job done. Neither that bit nor the
>>> randomization of indices seems particularly interesting. What am I
>>> missing?
>>>
>>> I look forward to the next challenge ;)
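A minimal sketch of that two-pass, index-then-seek approach in D (the file name and the readln/seek plumbing are illustrative assumptions, not anything spelled out in the thread):

import std.stdio;
import std.random : randomShuffle;

void main()
{
    // Hypothetical input: a seekable file. A true pipe would first have to
    // be spooled to disk, since the second pass needs random access.
    auto f = File("input.txt", "rb");

    // Pass 1: index the byte offset at which every line starts.
    ulong[] offsets;
    for (;;)
    {
        immutable pos = f.tell();
        if (f.readln().length == 0)
            break;
        offsets ~= pos;
    }

    // Randomize the indices, not the data itself.
    randomShuffle(offsets);

    // Pass 2: seek to each offset and emit that line.
    foreach (off; offsets)
    {
        f.seek(off);
        write(f.readln());
    }
}

Only the offset array has to fit in memory; the price is one seek per line on the second pass.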
>> Hi Jason,
>> I think (I may be wrong) that the normal distribution is the key to
>> optimizing speed. Let's say the stream Andrei is talking about is a
>> book. Each sentence in this book consists of N characters. (End of line
>> is a dot.) Now we read the first ~10 pages, character by character.
>> Based on this information we can say that each sentence has an average
>> size of 100 characters and that the normal distribution looks similar
>> to:
>> no graphic possible :(
>> so Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution
>>
>> Indeed, NOT the solution Andrei is looking for, because indexing is
>> still necessary. But since finding the EOL/dot is probably faster when
>> we take a slice from position 90 to 110 (in sentence #1, +-10 percent
>> around our average sentence size), this could already be a win.
>> Further: during the indexing process we can also adjust our first
>> (base) normal distribution, as well as our +- tolerance.
>>
>> Heck it is late... hope this is not completely bullshi*
>> Bjoern
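As a rough illustration of that idea, a D sketch that samples the first lines to estimate the average length and then probes only a +-10% window for the next end of line (the file name, sample size, and window width are all assumptions):

import std.stdio;
import std.algorithm : countUntil;

void main()
{
    auto f = File("input.txt", "rb");   // assumed seekable input

    // Sample the first ~1000 lines to estimate the average line length.
    size_t total, count;
    while (count < 1000)
    {
        auto line = f.readln();         // terminator included in the length
        if (line.length == 0) break;
        total += line.length;
        ++count;
    }
    immutable avg = count ? total / count : 1;

    // For the next line, probe only a +-10% window around the predicted
    // end-of-line position instead of scanning from the start of the line.
    immutable lineStart = f.tell();
    immutable probeFrom = lineStart + avg - avg / 10;
    f.seek(probeFrom);
    auto window = new char[](avg / 5 + 1);
    immutable hit = f.rawRead(window).countUntil('\n');
    if (hit >= 0)
        writeln("newline found near offset ", probeFrom + hit);
    else
        writeln("no newline in the window; fall back to a linear scan");
}

Whether the probing actually beats a plain byte-by-byte scan would of course depend on how close to normal the line lengths really are.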
>
> Of course, this relies on the assumption that sentences have
> normally-distributed lengths (which may or may not be true for any
> specific document).
>
> You might be interested in checking out something called a "Kernel
> Density Estimator", which is the typical solution for estimating the
> shape of an unknown distribution based on a series of samples.
>
> Here's the relevant wikipedia article:
>
> http://en.wikipedia.org/wiki/Kernel_density_estimation
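For reference, the textbook Gaussian-kernel estimator boils down to something like this small D sketch (the sample values and the bandwidth are made up for illustration):

import std.math : exp, sqrt, PI;
import std.stdio;

// Gaussian-kernel density estimate at point x from observed samples,
// with smoothing bandwidth h:  f(x) = 1/(n*h) * sum K((x - s)/h).
double kde(const double[] samples, double x, double h)
{
    double sum = 0;
    foreach (s; samples)
    {
        immutable u = (x - s) / h;
        sum += exp(-0.5 * u * u) / sqrt(2 * PI);
    }
    return sum / (samples.length * h);
}

void main()
{
    // Made-up sentence lengths; the estimate peaks near the observed cluster.
    double[] lengths = [99, 100, 101, 98, 102];
    writeln("density at 100: ", kde(lengths, 100, 2.0));
    writeln("density at 150: ", kde(lengths, 150, 2.0));
}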
>
> I learned about this technique because it was the subject of a job
> interview question once. The question was something like: "Given a file
> containing 100 billion floating-point values, what's the most effective
> algorithm for finding the median value? Your machine only has room for a
> billion values in memory at any given time, and making multiple passes
> over the file is very expensive."
>
> In my solution, I built a histogram during the first pass, counting the
> number of entries in each bucket. At the end of the first pass, I could
> isolate the bucket containing the median, and then during the second
> pass, I'd be able to definitively identify the median value.
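One way that two-pass histogram scheme could look as a small D sketch (the bucket count, the assumed value range, and the tiny in-memory array standing in for the real 100-billion-value file are all illustrative assumptions):

import std.stdio;
import std.algorithm : filter, sort;
import std.array : array;

void main()
{
    // Stand-in for the huge file; real data would be streamed twice instead.
    double[] data = [3.1, 0.5, 7.2, 2.2, 9.9, 4.4, 1.0];

    // Pass 1: count values into fixed-width buckets over an assumed range.
    enum nBuckets = 1000;
    enum double lo = 0.0;
    enum double hi = 10.0;
    auto counts = new ulong[nBuckets];
    foreach (v; data)
    {
        auto b = cast(size_t)((v - lo) / (hi - lo) * nBuckets);
        if (b >= nBuckets) b = nBuckets - 1;   // clamp v == hi into the top bucket
        counts[b]++;
    }

    // Walk the histogram until the cumulative count passes the median rank.
    immutable target = data.length / 2;        // rank of the (upper) median
    ulong seen = 0;
    size_t medianBucket = 0;
    foreach (i, c; counts)
    {
        if (seen + c > target) { medianBucket = i; break; }
        seen += c;
    }

    // Pass 2: only that bucket's values need to fit in memory; sort them and
    // pick the exact median by its remaining rank inside the bucket.
    immutable bLo = lo + (hi - lo) * medianBucket / nBuckets;
    immutable bHi = lo + (hi - lo) * (medianBucket + 1) / nBuckets;
    auto inBucket = data.filter!(v => v >= bLo && v < bHi).array;
    sort(inBucket);
    writeln("median: ", inBucket[cast(size_t)(target - seen)]);
}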
>
> They pointed out several degenerate cases where my algorithm would
> collapse (because I always had to make assumptions about the shape of
> the distribution), and I left the interview stumped about the real answer.
>
> When I got home, I did my research and found the KDE technique, which
> I've used on a few projects since then.
>
> --benji
Thanks for the pointer, Benji. I should have a book about it somewhere...
well, I'll have to look for it another day. And yes, my idea is pretty weak.
bjoern