random k-sample of a file

Fri Oct 10 08:09:18 PDT 2008

bearophile wrote:
> Andrei Alexandrescu:
>>> Ah, ok. It works if you do that. So bearophile's code must say
>>>
>>> if random() > (1.0 / (i+1)):
>>>
>>> instead of
>>>
>>> if random() < (1.0 / (i+1)):
>>>
>>> (that is, if random() returns a real number between 0 and 1 with uniform 
>>> distribution)
>> His prize gets retired. Now how does Kirk's code fare? :o)
> 
> I don't like contests...
> 
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
> 
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

No, that's a different problem. They want to extract n random samples 
from N records *without replacement*. It's a very different ballgame. 
See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html

The problem as I stated is reasonably easy and my implementation should 
be correct.

Andrei