[OT] shuffling lines in a stream
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Oct 10 15:45:28 PDT 2008
Sean Kelly wrote:
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 15:54:18 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sean Kelly wrote:
>>>>> It would be slower than the seeking option, but something like a
>>>>> randomized mergesort would work as well. If the program can buffer
>>>>> k lines of a file containing n lines, then read the first k lines
>>>>> into memory, shuffle them, and write them out to a temporary file.
>>>>> Repeat until the input is exhausted. Now randomly pick two of the
>>>>> temporary files and randomly merge them. Repeat until two
>>>>> temporary files remain, then output the result of the final random
>>>>> merge to the screen.
>>>>>
>>>>> For small files (i.e. where n < k) the file would be read, shuffled in
>>>>> memory, and printed to the screen, assuming the proper checks were
>>>>> in place.
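
For concreteness, a first pass along those lines might look like the sketch
below, in D. The helper's name and the chunk-naming scheme are made up for
the example, and it assumes each k-line buffer fits comfortably in memory.

import std.format : format;
import std.random : randomShuffle;
import std.stdio : File;

// Buffer up to k lines at a time, shuffle each buffer in memory, and
// flush it to its own temporary file.  Returns the temporary file names.
string[] writeShuffledChunks(File input, size_t k)
{
    string[] names;
    string[] buf;

    void flush()
    {
        if (buf.length == 0) return;
        buf.randomShuffle();
        auto name = format("chunk-%s.tmp", names.length);
        auto f = File(name, "w");
        foreach (line; buf) f.writeln(line);
        f.close();
        names ~= name;
        buf.length = 0;
    }

    foreach (line; input.byLine())
    {
        buf ~= line.idup;          // byLine reuses its buffer, so copy
        if (buf.length == k) flush();
    }
    flush();                       // leftover lines, if any
    return names;
}
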
>>>> I think I found a problem with your solution. Suppose you break the
>>>> file into three chunks: a, b, c. Then you pick b and c at random and
>>>> randomly merge them. The problem is, you make the decision early that
>>>> nothing from a will appear in the first two thirds of the result.
>>>> So the quality of the randomness suffers. How would you address that?
>>>
>>> After merging b and c you end up with a and bc. Then you randomly
>>> merge these two files, and lines from a have every chance to appear
>>> anywhere within the result.
>>>
>>> When randomly merging, the probability of picking a line from a file
>>> should be proportional to the number of lines left in that file.
>>
>> How do you "randomly merge"? Describe the process.
>
>
> while( a.hasMore && b.hasMore )
> {
>     if( rand % 2 )
>         r ~= a.readLine;
>     else
>         r ~= b.readLine;
> }
> while( a.hasMore )
>     r ~= a.readLine;
> while( b.hasMore )
>     r ~= b.readLine;
>
> The downside to randomly choosing two files is that it's possible you
> could end up always choosing a small and a large file, thus increasing
> the chance that the large file will have a bunch of lines left over
> which are essentially appended.
Yah, Sergey got me thinking too. What if you have a large file that
doesn't fit in memory by exactly one line?
Then I shuffle (n - 1) lines at random, and the last line remains to be
written. If I choose from either fragment with probability 0.5, that last
line will, with high probability, land among the first lines of the result!
(With a fair coin it comes out at position i with probability 2^-i, so it
almost always shows up within the first few lines.)
> The easy way around this would be to maintain a list of files so only
> equal-sized files are merged (similar to how mergesort works). However,
> I think a better way would be to use some sort of bias for the random
> check above so that different-sized files are on equal footing. Perhaps
> write out the file with the first 4 bytes containing the number of lines
> in that file and use that to construct the bias. I'll leave it to those
> with better math-fu than I to come up with a formula which would be fair.
(In similar situations I've put the number of lines right in the name of
the temporary file. I kid you not! And it's O(1) too!)
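
Something like that could be as simple as the following; the exact naming
scheme below is of course made up for the example.

import std.array : split;
import std.conv : to;
import std.format : format;
import std.path : stripExtension;

// Hypothetical scheme: embed the line count in the temporary file's name,
// e.g. "chunk-3-25000.tmp", and parse it back out when setting up a merge.
string chunkName(size_t index, size_t lineCount)
{
    return format("chunk-%s-%s.tmp", index, lineCount);
}

size_t linesIn(string name)
{
    return name.stripExtension.split("-")[2].to!size_t;
}

No seek into the file and no header to parse; the count travels with the
name.
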
It looks like each fragment should be chosen from with a probability
proportional to its length. The lingering question for me is whether
that probability needs to be updated as lines are consumed, or can be
fixed at the beginning of the merge. I think fixed doesn't quite work
unless special measures are taken whenever a range is exhausted.
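
Here is one way to do that merge, sketched in D, with the probability
recomputed from the remaining line counts at every step. The function name
and signature are made up; the counts would come from wherever they are
stored, e.g. the temporary file names.

import std.random : uniform;
import std.stdio : File;

// Merge two shuffled fragments: at each step, draw the next line from a
// fragment with probability proportional to how many lines it has left.
// remA and remB are the fragments' remaining line counts.
void randomMerge(File a, size_t remA, File b, size_t remB, File dest)
{
    auto linesA = a.byLine();
    auto linesB = b.byLine();
    while (remA + remB > 0)
    {
        // True with probability remA / (remA + remB), re-evaluated on
        // every iteration; an exhausted fragment is never picked again.
        if (uniform(0, remA + remB) < remA)
        {
            dest.writeln(linesA.front);
            linesA.popFront();
            --remA;
        }
        else
        {
            dest.writeln(linesB.front);
            linesB.popFront();
            --remB;
        }
    }
}

With the counts updated like that, every interleaving of the two fragments
is equally likely, and nothing special has to happen when one fragment runs
dry: its probability of being picked simply drops to zero.
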
Again many thanks to all participants for your illuminating insights.
Andrei