randomSample with unknown length

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Feb 2 07:32:25 PST 2011


On 2/2/11 6:03 AM, Magnus Lie Hetland wrote:
> Reading the doc for std.random.randomSample, I saw that "The total
> length of r must be known". There are rather straightforward algorithms
> for drawing random samples *without* knowing this. This might be useful
> if one wants to support input ranges, I guess?
>
> Take, for example, the method described by Knuth (TAoP 2), for selecting
> n elements uniformly at random from an input range:
>
> - Select the first n elements as the current sample.
> - Each subsequent element is rejected with a probability of 1 - n/t,
> where t is the number seen so far.
> - If a new item is selected, it replaces a random item in the current
> sample.
>
> A cool property of this is that at any time, the current sample is one
> drawn randomly (i.e., uniformly, without replacement) from the items
> you've seen so far, so you could really stop at any point. That is, stop
> iterating over the input; you can't really give the output as a
> small-footprint range here (as far as I can see). Seems you have to
> allocate room for n pointers. (Or you *could* just keep track of which
> objects were swapped in -- might be worth the overhead if n is large
> compared to the input size.)

I posted a problem solved by the algorithm above (and others, more 
sophisticated ones) as a challenge to this group a couple of years ago.

randomSample is designed to subsample a large stream in constant space 
and without needing to scan the entire stream in order to output the 
first element. I used in in my dissertation where e.g. I had to select 
100K samples from a stream of many millions.

Having a reservoir sample available would be nice. I'd be thrilled if 
you coded up a reservoirSample(r, n) function for addition to std.random.


Andrei


More information about the Digitalmars-d mailing list