[OT] shuffling lines in a stream

Jason House jason.james.house at gmail.com
Fri Oct 10 06:23:55 PDT 2008


Andrei Alexandrescu Wrote:

> Sergey Gromov wrote:
> > Thu, 09 Oct 2008 23:42:47 -0500,
> > Andrei Alexandrescu wrote:
> >> The high traffic enjoyed by "random k-sample of a file" confirms that 
> >> plenty of people around here have idle cycles ready to be spent on fun 
> >> tasks.
> >>
> >> In that vein, here's another problem. Say you want to randomly shuffle 
> >> lines in a stream. You don't know the size beforehand, and one pass is 
> >> preferable. The stream can be larger than the working memory, but not an 
> >> unbounded amount of times larger. Also lines in the stream can be 
> >> unbounded, but it is guaranteed that at least a nonzero fraction of the 
> >> file fits in working memory.
> >>
> >> You can use temporary files, but of course minimizing their total size 
> >> is recommended. Speed is of course important. Draft your solution and 
> >> explain why you think it's correct and advantageous.
> > 
> > Well, my temporary file contains the whole stream.  Because there is a 
> > probability that the last line of the stream goes first.
> 
> Yes, that probability exists, but that doesn't imply you must load the 
> entire file in memory.
> 
> Andrei

What you want to optimize is still unclear to me. At its heart, you must pass over the data twice. Once to index it, and once to display it. Some library-based memmory-mapped file or a seekable stream seems like enough to get the job done. Neither that bit nor the randomization of indicies seems particularly interesting. What am I missing?

I look forward to the next challenge ;) 



More information about the Digitalmars-d mailing list