[OT] shuffling lines in a stream

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 10 06:47:18 PDT 2008


Jason House wrote:
> Andrei Alexandrescu Wrote:
> 
>> Sergey Gromov wrote:
>>> Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:
>>>> The high traffic enjoyed by "random k-sample of a file"
>>>> confirms that plenty of people around here have idle cycles
>>>> ready to be spent on fun tasks.
>>>> 
>>>> In that vein, here's another problem. Say you want to randomly
>>>> shuffle lines in a stream. You don't know the size beforehand,
>>>> and one pass is preferable. The stream can be larger than the
>>>> working memory, but not an unbounded amount of times larger.
>>>> Also lines in the stream can be unbounded, but it is guaranteed
>>>> that at least a nonzero fraction of the file fits in working
>>>> memory.
>>>> 
>>>> You can use temporary files, but of course minimizing their
>>>> total size is recommended. Speed is of course important. Draft
>>>> your solution and explain why you think it's correct and
>>>> advantageous.
>>> Well, my temporary file contains the whole stream.  Because there
>>> is a probability that the last line of the stream goes first.
>> Yes, that probability exists, but that doesn't imply you must load
>> the entire file in memory.
>> 
>> Andrei
> 
> What you want to optimize is still unclear to me.

At the end of the day, devise a means to take care of the task. Faster
is better, occupying less space is better.

> At its heart, you must pass over the data twice. Once to index it,
> and once to display it. Some library-based memmory-mapped file or a
> seekable stream seems like enough to get the job done. Neither that
> bit nor the randomization of indicies seems particularly interesting.
> What am I missing?

Your solution will be slow. Doing one random seek per line is too much, 
even in a memory-mapped file. Mapping into memory is not magic! We are 
talking about many millions of lines. Also you can't memory-map streams 
unless you create a full-blown copy first.

> I look forward to the next challenge ;)

I think you will be surprised by the juicy parts of this one.


Andrei



More information about the Digitalmars-d mailing list