Random sampling next steps

Mon Jul 23 08:42:51 PDT 2012

Joseph Rushton Wakeling , dans le message (digitalmars.D:172997), a
 écrit :
> In other words, either your input range needs hasLength!R == true or 
> you need to manually specify the total number of items when calling randomSample:
> 
> But what if the total number of items is not known in advance?  E.g. if you are 
> reading a file, line by line, or reading records from a tape; you may know the 
> total is finite, but you don't know what it actually _is_.
[snip]
> ... but doing something similar within RandomSample doesn't seem so easy.  Why? 
>   Because the static if()s that you'd require within the struct would not depend 
> just on whether hasLength!R == true, but also on whether you'd passed a size_t 
> total to the constructor.

Why not using takeExactly ? this is the standard way select from a 
subset of the original range. I wouldn't even have provided the overload 
with 3 arguments, the user being able to use takeExactly when necessary 
(which could be advised in the doc in case the user doesn't know).

struct RandomSample(R) if (isInputRange!R && hasLength!R)
{
 ...// always use r.length, never total/available
}

auto randomSample(R)(R r, size_t n, size_t total)
if(isInputRange!R)
{
     return randomSample!(R, void)(takeExactly(r, total), n);
}

struct RandomSample(R) if(isInputRange!R && !hasLength!R)
{
...// always reservoir random sample
}

There is no more issue here.

> I also think it would be a good idea for the reservoir sampling technique to 
> emit a warning when in debug mode, to prompt the user to be _sure_ that they 
> can't specify the total number of points to sample from.  Is there a recommended 
> method of doing something like this?

I don't think library polluting compiler warnings is recommended.

> Alternatively, would people prefer to entirely separate the known-total and 
> unknown-total sampling methods entirely, so the choice is always manual?

RandomSample is a lazy range. RandomReservoirSample is not, and has a 
completely different implementation. IMHO, there is a fundamental 
difference that justifies to have a separate function with a different 
name.

> Finally, if hasLength!R == false, is there any way of guaranteeing that the 
> input range is still going to be ultimately finite?  There could be some very 
> nasty worst-case behaviour in the case of infinite ranges.

IsInfinite!Range.

However, a finite range could return false on empty indefinitely, would 
the implementer of the range just forget to make empty an enum, or the 
user meet a corner case (e.g. repeat(1).until(2)). But that's a general 
problem, that would make most eager algorithm result in an infinite 
loop, starting with array and copy...

-- 
Christophe