Timon Gehr timon.gehr at gmx.ch
Fri Feb 17 18:40:15 PST 2012

```On 02/18/2012 03:16 AM, H. S. Teoh wrote:
> On Sat, Feb 18, 2012 at 02:52:15AM +0100, Alf P. Steinbach wrote:
>> On 18.02.2012 02:39, H. S. Teoh wrote:
>>> // Outputs a randomly selected line from standard input with equal
>>> // likelihood.
>>> import std.random;
>>> import std.stdio;
>>>
>>> void main() {
>>> 	auto n = 0;
>>> 	string choice;
>>> 	foreach (line; stdin.byLine()) {
>>> 		n++;
>>> 		if (uniform(0,n) == 0)
>>> 			choice = line.idup;
>>> 	}
>>> 	writeln(choice);
>>> }
>>>
>>>
>>> P.S. Proof that the probability any line is selected is exactly 1/n
>>> (where n is the total number of lines read) is left as an exercise
>>
>> Assuming that by "any" you mean "any particular", you would have to
>> read all the lines first. Otherwise, if the code selects the first
>> line with probability 1/K, then I can just input some other number of
>> lines.
>
> But therein lies the trick. The algorithm self-adapts to the number of
> lines it reads. It does not know in advance how many lines there are,
> but guarantees that by the end of the input, the probability that any
> particular line is selected is exactly 1/n, where n is the number of
> lines read. The key lies in the way uniform() is called. :)
>
> (This has been tested by running the program repeatedly on the same
> input and analysing the number of occurrences of each line. The proof of
> the algorithm is left as an exercise for the reader. ;-) )
>
>
>>> P.S.S. The .idup is a bit ugly, but necessary, since apparently byLine()
>>> calls readln() with a static buffer, so choice will be silently
>>> overwritten if the .idup is omitted.
>>
>> That sounds ominous. One should never have to be aware of low level
>> details in order to do simple string assignment or initialization,
>> when the source already is a string. Does one really have to do that
>> in D?
> [...]
>
> Well, there are two issues here. One is that stdin.byLine() returns
> char[], which cannot be assigned to string directly. Two is that the
> first version of the code didn't have the .idup (I used char[] instead
> of string for choice), and the result was garbage in the output due to
> said overwriting of buffer.
>
> Now I agree that one shouldn't need to know how byLine() is implemented
> in order to be able to use it, but this is the way the current Phobos
> implementation is, unfortunately.
>

You don't need to know how it is implemented. Everything you need to
know is stated in its interface + documentation comment.

> However, the one upside to all this is that the type system, in a sense,
> forces you to do the right thing: byLine() returns char[] which cannot
> be assigned to string, so naturally you have to write .idup, which
> automatically also avoids the buffer overwriting issue. Using string
> instead of char[] is a logical choice in this case (pun intended ;-)),
> because it's something whose value you want to retain until the next
> time it's modified. So using string for 'choice' naturally leads to
> needing an .idup in the loop. The system isn't perfect, but it's pretty
> good.
>
>
> T
>

Its pretty perfect. If byLine would return string it would be horribly
inefficient for the common case of processing the input without storing
it. It even allows in-place modification of the current input line.
```