[OT] shuffling lines in a stream
BLS
nanali at nospam-wanadoo.fr
Fri Oct 10 13:03:59 PDT 2008
Jason House wrote:
> Andrei Alexandrescu Wrote:
>
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500,
>>> Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file" confirms that
>>>> plenty of people around here have idle cycles ready to be spent on fun
>>>> tasks.
>>>>
>>>> In that vein, here's another problem. Say you want to randomly shuffle
>>>> lines in a stream. You don't know the size beforehand, and one pass is
>>>> preferable. The stream can be larger than the working memory, but not an
>>>> unbounded amount of times larger. Also lines in the stream can be
>>>> unbounded, but it is guaranteed that at least a nonzero fraction of the
>>>> file fits in working memory.
>>>>
>>>> You can use temporary files, but of course minimizing their total size
>>>> is recommended. Speed is of course important. Draft your solution and
>>>> explain why you think it's correct and advantageous.
>>> Well, my temporary file contains the whole stream, because there is a
>>> probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load the
>> entire file in memory.
>>
>> Andrei
>
> What you want to optimize is still unclear to me. At its heart, you must pass over the data twice: once to index it, and once to display it. Some library-based memory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indices seems particularly interesting. What am I missing?
>
> I look forward to the next challenge ;)
Hi Jason,
I think (I may be wrong) that the normal distribution is the key to
optimizing speed. Let's say the stream Andrei is talking about is a
book. Each sentence in this book consists of N characters (the end of
line is a DOT). Now we read the first ~10 pages, character by
character. Based on this information we can say that each sentence has
an average size of 100 characters, and the normal distribution looks
similar to:
no graphic possible :(
so see Wikipedia: http://en.wikipedia.org/wiki/Normal_distribution
Indeed, NOT the solution Andrei is looking for, because indexing is
still necessary. But since finding the EOL/DOT is probably faster when
we search a slice from position 90 to 110 (in sentence #1, i.e. +- 10
percent around our average sentence size), this could already be a win.
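In code that indexing step could look roughly like the following.
Completely untested, and the names are made up for illustration: `data`
stands for the (memory-mapped) stream contents, `avgLen` and `tol` for
the estimate taken from the first ~10 pages.

import std.string : indexOf;

size_t[] indexLines(const(char)[] data, size_t avgLen, size_t tol)
{
    size_t[] starts;            // offset of the first character of every line
    size_t pos = 0;
    while (pos < data.length)
    {
        starts ~= pos;

        // Look for the EOL in a window around pos + avgLen first, i.e. the
        // "slice from 90 to 110" above, instead of scanning from pos.
        size_t lo = pos + (avgLen > tol ? avgLen - tol : 0);
        size_t hi = pos + avgLen + tol;
        if (lo > data.length) lo = data.length;
        if (hi > data.length) hi = data.length;

        ptrdiff_t eol = indexOf(data[lo .. hi], '\n');
        if (eol >= 0)
            eol += cast(ptrdiff_t)(lo - pos);    // window hit: make it relative to pos
        else
            eol = indexOf(data[pos .. $], '\n'); // window missed: plain scan
        // NB: a line much shorter than avgLen - tol ends *before* the window,
        // so the window can hit the next line's EOL -- that is the weak spot
        // of this heuristic.

        pos = (eol < 0) ? data.length : pos + eol + 1;
    }
    return starts;
}

After that, Jason's second pass is just a shuffle of `starts`
(Fisher-Yates or similar) followed by writing out the corresponding
slices.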
Further: during the indexing process we can also adjust our first
(base) normal distribution, as well as our +- tolerance.
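For that adjustment a running mean/variance (Welford's method) would
do. Again only an untested sketch, all names made up:

// Running estimate of the line length distribution (Welford's method).
struct LengthEstimate
{
    double mean = 0;     // current average line length
    double m2   = 0;     // sum of squared deviations from the mean
    size_t count = 0;    // number of lines seen so far

    // call this for every line whose length we learn while indexing
    void add(size_t lineLen)
    {
        ++count;
        immutable delta = lineLen - mean;
        mean += delta / count;
        m2 += delta * (lineLen - mean);
    }

    // e.g. use two standard deviations as the +- tolerance of the window
    double tolerance() const
    {
        import std.math : sqrt;
        return count > 1 ? 2 * sqrt(m2 / (count - 1)) : mean / 10;
    }
}

Feed it the lines of the first ~10 pages to get the base estimate, then
keep calling add() for every further line found while indexing;
tolerance() then gives the +- size of the search window.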
Heck, it is late... I hope this is not complete bullshi*
Bjoern