Random sampling next steps
Christophe Travert
travert at phare.normalesup.org
Mon Jul 23 08:42:51 PDT 2012
Joseph Rushton Wakeling , dans le message (digitalmars.D:172997), a
écrit :
> In other words, either your input range needs hasLength!R == true or
> you need to manually specify the total number of items when calling randomSample:
>
> But what if the total number of items is not known in advance? E.g. if you are
> reading a file, line by line, or reading records from a tape; you may know the
> total is finite, but you don't know what it actually _is_.
[snip]
> ... but doing something similar within RandomSample doesn't seem so easy. Why?
> Because the static if()s that you'd require within the struct would not depend
> just on whether hasLength!R == true, but also on whether you'd passed a size_t
> total to the constructor.
Why not using takeExactly ? this is the standard way select from a
subset of the original range. I wouldn't even have provided the overload
with 3 arguments, the user being able to use takeExactly when necessary
(which could be advised in the doc in case the user doesn't know).
struct RandomSample(R) if (isInputRange!R && hasLength!R)
{
...// always use r.length, never total/available
}
auto randomSample(R)(R r, size_t n, size_t total)
if(isInputRange!R)
{
return randomSample!(R, void)(takeExactly(r, total), n);
}
struct RandomSample(R) if(isInputRange!R && !hasLength!R)
{
...// always reservoir random sample
}
There is no more issue here.
> I also think it would be a good idea for the reservoir sampling technique to
> emit a warning when in debug mode, to prompt the user to be _sure_ that they
> can't specify the total number of points to sample from. Is there a recommended
> method of doing something like this?
I don't think library polluting compiler warnings is recommended.
> Alternatively, would people prefer to entirely separate the known-total and
> unknown-total sampling methods entirely, so the choice is always manual?
RandomSample is a lazy range. RandomReservoirSample is not, and has a
completely different implementation. IMHO, there is a fundamental
difference that justifies to have a separate function with a different
name.
> Finally, if hasLength!R == false, is there any way of guaranteeing that the
> input range is still going to be ultimately finite? There could be some very
> nasty worst-case behaviour in the case of infinite ranges.
IsInfinite!Range.
However, a finite range could return false on empty indefinitely, would
the implementer of the range just forget to make empty an enum, or the
user meet a corner case (e.g. repeat(1).until(2)). But that's a general
problem, that would make most eager algorithm result in an infinite
loop, starting with array and copy...
--
Christophe
More information about the Digitalmars-d
mailing list