random k-sample of a file
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Oct 9 12:44:43 PDT 2008
bearophile wrote:
> I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory:
>
> from sys import argv
> from random import random
>
> filename = argv[1]
> k = int(argv[2])
> nlines = sum(1 for _ in file(filename))
>
> if k >= nlines:
> for line in file(filename):
> print line
> else:
> select = k
> remaining = nlines
>
> for line in file(filename):
> if random() < float(select) / remaining:
> print line
> select -= 1
> remaining -= 1
>
> I'll think for a solution that avoids reading the file twice then...
>
> Bye,
> bearophile
This is unfair. First line in the original file is included with
probability k / n. The probability of including the second line is
dependent on the event of including the first line. If it wasn't
included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines
should be given equal chance.
Andrei
More information about the Digitalmars-d
mailing list