[OT] shuffling lines in a stream
Sean Kelly
sean at invisibleduck.org
Fri Oct 10 15:37:49 PDT 2008
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
>> Fri, 10 Oct 2008 15:54:18 -0500,
>> Andrei Alexandrescu wrote:
>>> Sean Kelly wrote:
>>>> It would be slower than the seeking option, but something like a
>>>> randomized mergesort would work as well. If the program can buffer
>>>> k lines in a file containing n lines, then read the first k lines
>>>> into memory, shuffle them, and write them out to a temporary file.
>>>> Repeat until the input is exhausted. Now randomly pick two of the
>>>> temporary files and randomly merge them. Repeat until two temporary
>>>> files remain, then output the result of the final random merge to
>>>> the screen.
>>>>
>>>> For small files (i.e. where n < k) the file would be read, shuffled
>>>> in memory, and printed to the screen, assuming the proper checks
>>>> were in place.
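In code, the first pass described above might look something like this
(the chunk size k, the temp file naming, and the helper function are
illustrative, not fixed by the description):

import std.format : format;
import std.random : randomShuffle;
import std.stdio : File;

// First pass: buffer up to k lines, shuffle each block in memory, and
// write it out to its own temporary file.  Returns the temp file names.
string[] writeShuffledChunks(File input, size_t k)
{
    string[] names;
    string[] buf;
    foreach (line; input.byLineCopy)
    {
        buf ~= line;
        if (buf.length == k)
            flushChunk(buf, names);
    }
    if (buf.length)
        flushChunk(buf, names);   // final partial chunk
    return names;
}

void flushChunk(ref string[] buf, ref string[] names)
{
    randomShuffle(buf);                       // in-memory shuffle
    auto name = format("chunk%s.tmp", names.length);
    auto f = File(name, "w");
    foreach (line; buf)
        f.writeln(line);
    names ~= name;
    buf = null;                               // start a fresh buffer
}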
>>> I think I found a problem with your solution. Suppose you break the
>>> file into three chunks: a, b, c. Then you pick b and c at random and
>>> randomly merge them. The problem is, you decide early on that nothing
>>> in a will appear in the first two thirds of the result, so the
>>> quality of the randomness suffers. How would you address that?
>>
>> After merging b and c you end up with a and bc. Then you randomly
>> merge these two files, and lines from a have every chance to appear
>> anywhere within the result.
>>
>> When randomly merging, the probability of picking a line from a file
>> should be proportional to the number of lines left in that file.
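As a quick sanity check of that rule: merging a 2-line file a with a
1-line file b, the three possible orders aab, aba, and baa occur with
probabilities 2/3 * 1/2 = 1/3, 2/3 * 1/2 = 1/3, and 1/3 * 1 = 1/3, so
every interleaving of the two files is equally likely.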
>
> How do you "randomly merge"? Describe the process.
// flip a coin while both files still have lines
while( a.hasMore && b.hasMore )
{
    if( rand % 2 )
        r ~= a.readLine;
    else
        r ~= b.readLine;
}
// then drain whichever file still has lines left
while( a.hasMore )
    r ~= a.readLine;
while( b.hasMore )
    r ~= b.readLine;
The downside to randomly choosing two files is that you could repeatedly
end up pairing a small file with a large one, increasing the chance that
the large file will have a long run of leftover lines that are
essentially appended unshuffled.
The easy way around this would be to maintain a list of files so that
only equal-sized files are merged (similar to how mergesort works).
However, I think a better way would be to bias the random check above so
that different-sized files are on equal footing. Perhaps write out each
temporary file with the first 4 bytes containing the number of lines in
that file and use that count to construct the bias. I'll leave it to
those with better math-fu than mine to come up with a formula that would
be fair.
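That said, Sergey's rule above suggests the formula: with m lines left
in one file and n in the other, take the next line from the first with
probability m/(m+n). A sketch in D (the function name is illustrative,
and the counts m and n are assumed known up front, e.g. from the 4-byte
header):

import std.random : uniform;
import std.stdio : File;

// Merge a and b into r, biased by the remaining line counts: with m
// lines left in a and n left in b, the next line comes from a with
// probability m/(m+n).
void biasedMerge(File a, size_t m, File b, size_t n, File r)
{
    while (m > 0 && n > 0)
    {
        // uniform(0, m + n) yields an integer in [0, m+n); it is
        // less than m with probability exactly m/(m+n)
        if (uniform(0, m + n) < m)
        {
            r.write(a.readln());   // readln keeps the newline
            --m;
        }
        else
        {
            r.write(b.readln());
            --n;
        }
    }
    // at most one of these loops does anything: once one file is
    // empty, the rest of the other must come out in order anyway
    while (m-- > 0)
        r.write(a.readln());
    while (n-- > 0)
        r.write(b.readln());
}

With this bias every interleaving of the two files is equally likely, so
pairing a small file with a large one stops being a problem.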
Sean