[OT] shuffling lines in a stream

Fri Oct 10 13:18:37 PDT 2008

BLS wrote:
> Jason House schrieb:
>> Andrei Alexandrescu Wrote:
>>
>>> Sergey Gromov wrote:
>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>> Andrei Alexandrescu wrote:
>>>>> The high traffic enjoyed by "random k-sample of a file" confirms 
>>>>> that plenty of people around here have idle cycles ready to be 
>>>>> spent on fun tasks.
>>>>>
>>>>> In that vein, here's another problem. Say you want to randomly 
>>>>> shuffle lines in a stream. You don't know the size beforehand, and 
>>>>> one pass is preferable. The stream can be larger than the working 
>>>>> memory, but not an unbounded amount of times larger. Also lines in 
>>>>> the stream can be unbounded, but it is guaranteed that at least a 
>>>>> nonzero fraction of the file fits in working memory.
>>>>>
>>>>> You can use temporary files, but of course minimizing their total 
>>>>> size is recommended. Speed is of course important. Draft your 
>>>>> solution and explain why you think it's correct and advantageous.
>>>> Well, my temporary file contains the whole stream.  Because there is 
>>>> a probability that the last line of the stream goes first.
>>> Yes, that probability exists, but that doesn't imply you must load 
>>> the entire file in memory.
>>>
>>> Andrei
>>
>> What you want to optimize is still unclear to me. At its heart, you 
>> must pass over the data twice. Once to index it, and once to display 
>> it. Some library-based memmory-mapped file or a seekable stream seems 
>> like enough to get the job done. Neither that bit nor the 
>> randomization of indicies seems particularly interesting. What am I 
>> missing?
>>
>> I look forward to the next challenge ;) 
> Hi Jason,
> I think (I may be wrong) that -Normal distribution- is the key
> to optimize speed.  Let's say the stream Andrei is talking about is a 
> book. Each sentence in this book consist of N characters. (End of Line 
> is a DOT) Now we read the first ~10 pages. Character by Character. Based 
> on this information we are able to say that each sentence has an average 
> size of 100 characters and the normal distribution looks similat to :
> no Graphic possible :(
> so WIKIPEDIA http://en.wikipedia.org/wiki/Normal_distribution
> 
> Indeed, NOT the solution Andrei is looking for, because indexing is 
> still necesary. But due to the fact that finding the EOL/DOT is 
> probabely faster by creating a slice from position 90 to 110 (in 
> sentence #1, +- 10 percent from our average sentence size), this could 
> be already a win.
> Further : During the indexing process we can also adjust our first(base) 
> normal distribution, as well as our +- tolerance.
> 
> Heck it is late... hope this is not completely bullshi*
> Bjoern

Of course, this relies on the assumption that sentences have 
normally-distributed length (which may or may not be true on any 
specific document).

You might be interested in checking out something called a "Kernel 
Density Estimator", which is the typical solution for estimating the 
shape of an unknown distribution based on a series of samples.

Here's the relevant wikipedia article:

http://en.wikipedia.org/wiki/Kernel_density_estimation

I learned about this technique because it was the subject of a job 
interview question once. The question was something like "Given a file 
containing 100 billion floating point values, what's the most effective 
algorithm for finding the median value. Your machine only has room for a 
billion values in memory at any given time, and making multiple passes 
of the file is very expensive."

In my solution, I built a histogram during the first pass, counting the 
number of entries in each bucket. At the end of the first pass, I could 
isolate the bucket containing the median, and then during the second 
pass, I'd be able to definitely identify the median value.

They pointed out several degenerate cases where my algorithm would 
collapse (because I always had to make assumptions about the shape of 
the distribution), and I left the interview stumped about the real answer.

When I got home, I did my research and found the KDE technique, which 
I've used on a few projects since then.

--benji