std.string and ranges

Tue Feb 10 12:50:26 PST 2009

Andrei Alexandrescu:

>Well at some point I hope to be able to rally a few volunteers to help.<

I hope to be able to understand how to do the things :-)

>I searched google for bearophile std.random site:digitalmars.com and couldn't find the message, so I'd appreciate a link<

I can just repeat the things I have said, improving them.

>There's plenty of more work to do on random, particularly on generating non-uniform distributions, and probably others so I'd appreciate your input.<

In my dlibs there's a random module too, I have not finished it, but it already contains some useful stuff.
http://www.fantascienza.net/leonardo/so/dlibs/random.html

A module like that in a standard lib needs functions like:
- choice: given an iterable of items, picks one at random
- randInt: gives a random integer in an interval, the lower bound is zero by default.
- shuffle: to shuffle randomly a given array (or range with random access, I'd say) in place.
- shuffled: like shuffle, but creates a copy of the items. Return an array.
- uniform: gives a random real/float/double in a given interval
- sample: extracts n items from an iterable with uniform distribution, without replacement
- weightedChoice: like choice, but the user can specify the relative probability of choosing each item.

I can assure you that such functions are hugely useful in tons of programs. Yet, you don't need more than few days to implement all those things (I already have implementations of most of them).

Few basic and easy distributions may be useful too.

A random module has to be used by lot of different people in many different situations:
- Some people need very good random numbers, even if they are slow to produce. Other people may also need a thread safe random generator.
- Some people need just to write a 15-lines long D program that looks like a script. Their code often doesn't need to be very fast, but the code must be as simple as possible, and the faster possible to write, so the syntax has to be short and very intuitive. The programmer doesn't want to read the docs to write such code.
- Some people need to produce tons of random numbers, as fast as possible, but they don't need to be high quality. In such situation thread safety may be less important.
- Some people just need to write safe and fast code, something normal.

So to fulfill all such purposes I think the random module needs needs three different random generators:
- A really fast one. I think R250/521 is the best for this purpose, it's faster than the usual one used in C and it's much better. It needs a certain amount of good random numbers to start, that can be given by the other slower and better random generators. Such code is designed to be as fast as possible.
- A multi-purpose average random generator, very easy to use, so it doesn't need to instantiate a class if you are writing a single threaded program, you just use simple functions. A generator like KISS of Tango is good here, but the API can be made simpler and very easy.
- A very good generator, thread safe. For this a variant of the Mersenne Twister is good.

Info about R250/521:
http://www.informs-cs.org/wsc97papers/0127.PDF

C implementation of R250/521:

/*
Copyright (c) 2006 the authors listed at the following URL:
http://literateprograms.org/R250/521_(C)?action=history&offset=20060711142344

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Retrieved from: http://literateprograms.org/R250/521_(C)?oldid=6914
*/

#include <stdlib.h>

#define R250_LEN     250
#define R521_LEN     521
#define ULONG_BITS   (8*sizeof(unsigned long))

unsigned long r250_buffer[R250_LEN];
unsigned long r521_buffer[R521_LEN];

int r250_index;
int r521_index;

void r250_521_init() {
    unsigned long mask1 = 1;
    unsigned long mask2 = (unsigned long)(-1L);

    int i = R521_LEN;

    while (i-- > R250_LEN) {
        r521_buffer[i] = rand();
    }
    while (i-- > ULONG_BITS) {
        r250_buffer[i] = rand();
        r521_buffer[i] = rand();
    }

    while (i-- > 0) {
        r250_buffer[i] = (rand() | mask1) & mask2;
        r521_buffer[i] = (rand() | mask1) & mask2;
        mask2 ^= mask1;
        mask1 >>= 1;
    }
    r250_buffer[0] = mask1;
    r521_buffer[0] = mask2;

    r250_index = 0;
    r521_index = 0;

}

#define R250_IA  (sizeof(unsigned long)*103)
#define R250_IB  (sizeof(unsigned long)*R250_LEN - R250_IA)
#define R521_IA  (sizeof(unsigned long)*168)
#define R521_IB  (sizeof(unsigned long)*R521_LEN - R521_IA)

unsigned long r250_521_random() {
    unsigned long r, s;
    unsigned char * b1 = (unsigned char*)r250_buffer;
    unsigned char * b2 = (unsigned char*)r521_buffer;
    unsigned long * tmp1, * tmp2;

    int i1 = r250_index;
    int i2 = r521_index;
    int j1, j2;

    j1 = i1 - R250_IB;
    if (j1 < 0)
        j1 = i1 + R250_IA;
    j2 = i2 - R521_IB;
    if (j2 < 0)
        j2 = i2 + R521_IA;

    tmp1 = (unsigned long *)(b1 + i1);
    r = (*(unsigned long *)(b1 + j1)) ^ (*tmp1);
    *tmp1 = r;
    tmp2 = (unsigned long *)(b2 + i2);
    s = (*(unsigned long *)(b2 + j2)) ^ (*tmp2);
    *tmp2 = s;

    i1 = (i1 != sizeof(unsigned long)*(R250_LEN-1)) ? (i1 + sizeof(unsigned long)) : 0;
    r250_index = i1;
    i2 = (i2 != sizeof(unsigned long)*(R521_LEN-1)) ? (i2 + sizeof(unsigned long)) : 0;
    r521_index = i2;

    return r ^ s;
}

>> What I have recently said regarding the lazy generators can be applied to other things: having two ways to do things is generally bad, but if one of those ways is really easy and short to write and the other is very flexible and fast, then two ways to do something may coexist. For example this idea can applied to the foreach, the std.random, etc.<<

>I'm not sure I understand this.<

I have expressed that idea once in the past too, and I have just explained an example of it regarding std.random. I can try again to explain myself.

When you design a feature in a language you have to find compromises among different needs:
- Such feature needs a short and easy syntax that is not error-prone.
- Such feature has to be efficient in space and time, and very flexible.

Often such two purposes are almost opposite, it's not easy to have both at the same time. A language like C++ often choses the second, and a language like Python or Ruby the first (but there are many exceptions).

So I may be tempted to ask you to put two different implementations of such feature:
- One very easy to use, short and easy to remember, not much efficient.
- One more flexible, more complex, less easy to remember (you may have to look into the docs to use it), very effficient.

But generally in a language duplicating things and having two ways to do the same thing is bad, it's extra complexity.

Let's call C1 the complexity (for the programmer) to use the first way, and C2 the complexity of using the second way.

In several situations I have seen that if C1 <<< C2 then C1+C2 ~= C2, so C1 doesn't add much complex to the language, so they both can coexist.

>I think your ideas on improving I/O are very sensible.<

I/O is used during debugging, logging, etc. What I'd like is to have a very nice, very readable, little/unambigous way to show any built-in type. I may also want two ways to represent such data, a readable one and one for ASCII serialization (but readable still, if possible), as in Python. Note this may need another default method beside toString, somethign like toRepr, or toSerialize, that returns a textual representation that given to the object (to a standard method, if you want, like fromSerialize? or fromRepr), re-creates the original built-in data or object.

I also print things often, so the find the printing syntax used by Tango confusing, and too much long. I think function names like put/putr/putf/putfr are the best, because they are clean, short to type, etc. put prints, put prints with a newline, and the versions with f format too in some way (using % as in C or {} as in C#/Tango, C# may be better here).

>(I wouldn't want to format tuples the same way though.)<

Well, nothing I have done is perfect, everything can be improved.
Do you mean structs or tuples?
If you give a tuple to one of my printing functions you just receive a string-fication of all the items, with no spaces beteen them. It's a bare-bones thing.
My put/putr show a standard representation of structs, if they miss a toString. I'd also like every struct to have a built in copy, toHash, opEquals (they already have this), a default toString, a default opCmp too, etc. My record()/Record!() structs already have such qualities, even recursively, even if you nest normal structs into them.

Bye,
bear hugs,
bearophile