[OT] shuffling lines in a stream
Christopher Wright
dhasenan at gmail.com
Sat Oct 11 06:58:44 PDT 2008
Sergey Gromov wrote:
> Fri, 10 Oct 2008 09:20:44 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Well, my temporary file contains the whole stream. Because there is a
>>>>> probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the
>>>> entire file in memory.
>>> This implies though that I cannot start output before I read all the
>>> lines from the stream. Therefore I must store those lines somewhere.
>> Creating temporary files is fine as long as you never need to store the
>> entire input in memory.
>
> Ok, let me re-formulate the task then, since the creation of a temporary
> file containing all the lines seems inevitable. You basically want to
> uniformly shuffle lines in a very large file.
>
> The idea is this. Let N be number of lines in the file, which you know
> after storing it. There are two passes, forward and backward. They are
> symmetric. On a forward pass you read as many lines as you can fit in
> memory, let it be m. Shuffle them. Then choose a number, k which is
> 0 < k < m and is somehow based upon the m and N, like k = m * m/N.
> Write the k first lines back into the file and discard them from memory.
> Append as many new lines to your memory array as possible. Shuffle.
> Choose k. Write. Repeat. The reverse pass is exactly the same but you
> start from the end of file.
>
> The tricky part is how to choose k. It obviously needs to take number
> of discarded lines, q, into account like in k = m * m/(N-q). But my
> math-fu is almost non-existent so I'm unable to prove if this approach
> can give an uniform distribution. I feel like it can though.
You have to continually alter k to maintain a random distribution, but
you have an invariant 0 <= k <= m. Those bounds mean you can't use this
approach.
More information about the Digitalmars-d
mailing list