[OT] shuffling lines in a stream

BLS nanali at nospam-wanadoo.fr
Fri Oct 10 13:32:49 PDT 2008


Benji Smith schrieb:
> BLS wrote:
>> Jason House schrieb:
>>> Andrei Alexandrescu Wrote:
>>>
>>>> Sergey Gromov wrote:
>>>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>>>> Andrei Alexandrescu wrote:
>>>>>> The high traffic enjoyed by "random k-sample of a file" confirms 
>>>>>> that plenty of people around here have idle cycles ready to be 
>>>>>> spent on fun tasks.
>>>>>>
>>>>>> In that vein, here's another problem. Say you want to randomly 
>>>>>> shuffle lines in a stream. You don't know the size beforehand, and 
>>>>>> one pass is preferable. The stream can be larger than the working 
>>>>>> memory, but only by a bounded factor. Also, lines in the stream 
>>>>>> can be unboundedly long, but it is guaranteed that at least a 
>>>>>> nonzero fraction of the file fits in working memory.
>>>>>>
>>>>>> You can use temporary files, but of course minimizing their total 
>>>>>> size is recommended. Speed is of course important. Draft your 
>>>>>> solution and explain why you think it's correct and advantageous.
>>>>> Well, my temporary file contains the whole stream, because there 
>>>>> is a probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load 
>>>> the entire file in memory.
>>>>
>>>> Andrei
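
One way to make Andrei's point concrete: scatter each incoming line into one of k temporary files chosen uniformly at random, then shuffle each (now memory-sized) file and concatenate. This is a sketch of that idea, not necessarily the solution Andrei has in mind; the function name and bucket count are illustrative:

```python
import random
import tempfile

def stream_shuffle(lines, num_buckets=8, rng=None):
    """Shuffle an iterable of newline-terminated lines in one pass
    over the input plus bounded memory.

    Each line is scattered to a uniformly random temporary bucket file;
    each bucket is then assumed small enough to shuffle in memory.
    """
    rng = rng or random.Random()
    buckets = [tempfile.TemporaryFile(mode="w+") for _ in range(num_buckets)]
    for line in lines:                       # single pass over the stream
        buckets[rng.randrange(num_buckets)].write(line)
    out = []
    for b in buckets:                        # one bucket in memory at a time
        b.seek(0)
        chunk = b.readlines()
        rng.shuffle(chunk)
        out.extend(chunk)
        b.close()
    return out
```

Since scattering plus a within-bucket shuffle is equivalent to sorting the lines by independent random keys, the concatenated output should be a uniformly random permutation, while only one bucket at a time needs to fit in memory.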
>>>
>>> What you want to optimize is still unclear to me. At its heart, you 
>>> must pass over the data twice: once to index it, and once to display 
>>> it. Some library-based memory-mapped file or a seekable stream seems 
>>> like enough to get the job done. Neither that bit nor the 
>>> randomization of indices seems particularly interesting. What am I 
>>> missing?
>>>
>>> I look forward to the next challenge ;) 
>> Hi Jason,
>> I think (I may be wrong) that the -normal distribution- is the key
>> to optimizing speed.  Let's say the stream Andrei is talking about is a 
>> book. Each sentence in this book consists of N characters. (End of line 
>> is a DOT.) Now we read the first ~10 pages, character by character. 
>> Based on this information we are able to say that each sentence has an 
>> average size of 100 characters and the normal distribution looks 
>> similar to:
>> no graphic possible :(
>> so WIKIPEDIA http://en.wikipedia.org/wiki/Normal_distribution
>>
>> Indeed, NOT the solution Andrei is looking for, because indexing is 
>> still necessary. But since finding the EOL/DOT is 
>> probably faster when we only scan a slice from position 90 to 110 (in 
>> sentence #1, +- 10 percent from our average sentence size), this could 
>> already be a win.
>> Further: during the indexing process we can also adjust our 
>> first (base) normal distribution, as well as our +- tolerance.
>>
>> Heck, it is late... hope this is not complete bullshi*
>> Bjoern
> 
> Of course, this relies on the assumption that sentences have 
> normally-distributed length (which may or may not be true on any 
> specific document).
> 
> You might be interested in checking out something called a "Kernel 
> Density Estimator", which is the typical solution for estimating the 
> shape of an unknown distribution based on a series of samples.
> 
> Here's the relevant wikipedia article:
> 
> http://en.wikipedia.org/wiki/Kernel_density_estimation
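
For the curious, the estimator from that article is just an average of kernel "bumps" centered on the samples. A minimal pure-Python sketch with a Gaussian kernel; the fixed bandwidth argument is an illustrative simplification (real code would pick it with a rule such as Silverman's):

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a kernel density estimate f built from samples.

    f(x) = (1 / (n*h)) * sum_i K((x - x_i) / h), where K is the
    standard Gaussian density and h is the caller-chosen bandwidth.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def f(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return f
```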
> 
> I learned about this technique because it was the subject of a job 
> interview question once. The question was something like: "Given a file 
> containing 100 billion floating-point values, what's the most effective 
> algorithm for finding the median value? Your machine only has room for a 
> billion values in memory at any given time, and making multiple passes 
> over the file is very expensive."
> 
> In my solution, I built a histogram during the first pass, counting the 
> number of entries in each bucket. At the end of the first pass, I could 
> isolate the bucket containing the median, and then during the second 
> pass, I'd be able to definitively identify the median value.
> 
> They pointed out several degenerate cases where my algorithm would 
> collapse (because I always had to make assumptions about the shape of 
> the distribution), and I left the interview stumped about the real answer.
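
A sketch of the histogram scheme Benji describes, with the same weakness the interviewers pointed out: it assumes the median's bucket is small enough to hold in memory, which a sufficiently skewed distribution can break. The bucket count and helper names are illustrative:

```python
def histogram_median(values, num_buckets=1000):
    """Two-pass median (odd-length input) via a counting histogram.

    Pass 1 buckets every value by range and counts it; cumulative counts
    locate the bucket holding the median. Pass 2 keeps only that bucket's
    values (assumed to fit in memory) and selects the right one.
    """
    lo, hi = min(values), max(values)   # a real stream would track these online
    width = (hi - lo) / num_buckets or 1.0

    def bucket_of(v):
        return min(int((v - lo) / width), num_buckets - 1)

    counts = [0] * num_buckets
    for v in values:                    # pass 1: histogram of the data
        counts[bucket_of(v)] += 1

    target = (len(values) - 1) // 2     # 0-based rank of the median
    acc = 0
    for med, c in enumerate(counts):
        if acc + c > target:
            break                       # bucket `med` holds the median
        acc += c

    middle = sorted(v for v in values   # pass 2: only the median's bucket
                    if bucket_of(v) == med)
    return middle[target - acc]
```

Because the bucket mapping is monotone in the value, everything in earlier buckets is no larger than anything in bucket `med`, so the rank arithmetic stays valid regardless of how the counts are distributed.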
> 
> When I got home, I did my research and found the KDE technique, which 
> I've used on a few projects since then.
> 
> --benji

Thanks for the pointer, Benji. I should have a book about it somewhere 
... well, I'll have to look another day. And yes, my idea is pretty weak.
bjoern



More information about the Digitalmars-d mailing list