Random string samples & unicode - Reprise

Sun Sep 12 17:28:12 PDT 2010

On Sunday 12 September 2010 17:09:04 bearophile wrote:
> Andrei Alexandrescu:
> > This goes into "bearophile's odd posts coming now and then".
> 
> I assume you have missed most of the things I was trying to say, maybe you
> have not even read the original post. So I try to explain better a subset
> of the things I have written.
> 
> This is a quite common piece of Python code:
> 
> from random import sample
> d = "0123456789"
> print "".join(sample(d, 2))

You do seem to try to do a lot of things that most other folks never even think 
of doing, let alone have a need to. This is one of them. That's probably why 
Andrei reacted the way that he did.

> I need to perform the same thing in D.
> For me it's not easy to do that in D2 with Phobos2.
> 
> This doesn't work:
> 
> import std.stdio, std.random, std.array, std.range;
> void main() {
>     string d = "0123456789";
>     string res = array(take(randomCover(d, rndGen), 2));
>     writeln(res);
> }
> 
> It returns:
> test.d(4): Error: cannot implicitly convert expression
> (array(take(randomCover(d,rndGen()),2u))) of type dchar[] to string

I've found that if you want a string out of array(), what you need to do is
to!string(array(...)). I don't know about this particular case, and it's a bit
annoying, particularly when you started with a string in the first place. So
perhaps take(), until(), and the others like them that have this problem
should be altered so that array() would produce a string if you passed them a
string, but for the moment to!string seems to be the solution.
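
As a minimal sketch of just the conversion step: the dchar[] below is a
made-up stand-in for whatever array(take(...)) hands back, and to!string from
std.conv re-encodes it as a UTF-8 string.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    // Stand-in for the dchar[] that array(take(...)) produces;
    // the two characters are made up for illustration.
    dchar[] picked = ['4', '2'];

    // to!string transcodes the UTF-32 code points into a UTF-8 string.
    string res = to!string(picked);
    writeln(res);
}
```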

I would point out, however, that if you're trying to grab random characters from 
a string, that's likely to work best with a dstring because it supports random 
access, so there's a decent chance that dstring is really what you want anyway, 
and trying to use string is just going to mean a lot of conversions no matter how 
well put together the Phobos functions are, simply because the underlying 
algorithm works best with random access and string doesn't provide it. Just one 
of the irritations of UTF-8 vs UTF-16 vs UTF-32. Unicode is wonderful and 
Unicode sucks. At least D handles it explicitly as part of the language, which 
is a big improvement over languages like C, C++, or Java.
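
Here is a rough sketch of the dstring approach, assuming a reasonably recent
Phobos (not checked against the 2010 release). sampleChars is a hypothetical
helper name, not something in Phobos. The source is taken as a dstring so that
randomCover can index it directly, and the only conversion left is the final
to!string.

```d
import std.array : array;
import std.conv : to;
import std.random : randomCover, rndGen;
import std.range : isRandomAccessRange, take;
import std.stdio : writeln;

// In current Phobos, string is not treated as a random-access range
// (its UTF-8 code units vary in width), while dstring is.
static assert(!isRandomAccessRange!string);
static assert(isRandomAccessRange!dstring);

// Hypothetical helper (not part of Phobos): pick n distinct
// characters from source in random order.
string sampleChars(dstring source, size_t n)
{
    // randomCover visits each element once in random order;
    // take stops after the first n of them.
    auto picked = array(take(randomCover(source, rndGen), n));

    // Single conversion back to a UTF-8 string at the very end.
    return to!string(picked);
}

void main()
{
    writeln(sampleChars("0123456789"d, 2));
}
```

If the input starts life as a string, to!dstring(d) converts it once up front
instead of on every access.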

- Jonathan M Davis