Cantor tuple function, BigInt, and integer overflow

Thu Dec 20 11:28:35 PST 2012

On Thu, Dec 20, 2012 at 07:46:39PM +0100, Joseph Rushton Wakeling wrote:
> Hello all,
> 
> I've been playing around trying to implement the Cantor tuple function
> (see attached code), which is essentially a way to map a sequence x1,
> x2, ..., xN of non-negative integers to a single unique number.  See:
> https://en.wikipedia.org/wiki/Pairing_function#Cantor_pairing_function
> 
> The inspiration came from a need to check pairs of numbers for
> uniqueness.

This is an interesting problem. It is related to the computation of the
Cartesian product of infinite ranges: your unique number is simply the
index of the given pair in the range over the N-ary Cartesian product of
the natural numbers.  Of course, it's very inefficient to compute the
Cartesian product just to find the index of a particular sequence, so a
direct mapping is desirable.

The simplest method, mathematically speaking, is to make use of the
Prime Factorization theorem, that every non-negative integer is the
product of a unique set of primes powers. So to find the index of a
sequence, simply assign consecutive primes to them, say 2, 3, 5, 7, 11,
13, ... up to the N'th prime. Then the index is given by:

	I = 2^x1 * 3^x2 * 5^x3 * ... * pN^xN

This index is mathematically guaranteed to be unique for every sequence.
However, the problem with this is that it tends to produce very, VERY
large numbers, and is rather expensive to compute to boot.

Given that we're working on computers where the bit width of x1...xN is
fixed (currently 32-bit or 64-bit), we can do better. If we concatenate
the elements of the sequence to form a 32*N (or 64*N) bit BigInt, that
also gives us a unique index for the sequence. Assuming that the numbers
in your sequence cover a good part of the range of a 32-bit (or 64-bit)
integer, this encoding produces much smaller numbers than the prime
product method.

But then, concatenation just means that you might as well just store the
sequence as-is (since that's what concatenation amounts to!), and make
comparisons based on that.

However, you probably want to have a faster means of comparison. And so
this is where hashtables come in: if you only have a relatively small
number of sequences (compared with the potentially wide range of values
of elements in the sequence and wide variance of sequence length), then
computing huge unique indices for them is overkill. A more efficient
method is to compute a hash for the sequence that has a low collision
rate, and for colliding entries just keep a simple list of some sort
containing the actual sequences, so that you can fallback to bitwise
comparison.

IOW, bool[Sequence] may be all you need to solve your problem, depending
on what you're trying to do.

T

-- 
I see that you JS got Bach.