[OT] n-way union

Mon May 25 09:06:27 PDT 2009

Andrei Alexandrescu wrote:
> This is somewhat OT but I think it's an interesting problem. Consider 
> the following data:
> 
> double[][] a =
> [
>     [ 1, 4, 7, 8 ],
>     [ 1, 7 ],
>     [ 1, 7, 8],
>     [ 4 ],
>     [ 7 ],
> ];
> 
> We want to compute an n-way union, i.e., efficiently span all elements 
> in all arrays in a, in sorted order. You can assume that each individual 
> array in a is sorted. The output of n-way union should be:
> 
> auto witness = [
>     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
> ];
> assert(equal(nWayUnion(a), witness[]));
> 
> The STL and std.algorithm have set_union that does that for two sets: 
> for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
> n-way unions poses additional challenges. What would be a fast 
> algorithm? (Imagine a could be a very large range of ranges).
> 
> Needless to say, nWayUnion is a range :o).

If we'd know anything about the data, such as, the max value is always 
smaller than the total number of elements in the subarrays, then we'd 
probably more easily invent a decent algorithm.

But the totally general algorithm has to be more inefficient. And 
constructing (not worst-case, but) tough-case data is trivial. For 
example, take a thousand subarrays, each a thousand elements long, 
containing random uints from the inclusive range 0..uint.max.

It's likely that you then can pop the first element in a[0], but the 
second element is larger, and the first element in a[1] is also larger. 
This results in that a is still sorted, except that a[0] may belong to a 
later position.

On the other hand, in your example, after popping all the ones, a is 
severely out of order.

---------------------

You probably have to have an array of structs, where each struct 
contains a reference to a subarray, and a copy of the first value of 
this subarray.

Sort the array of structs, and beginning with the first struct, pop from 
the pointed-to subarray as long as it contains elements equal to the 
first one. Then pop from the subarray that belongs to following struct, 
and so on, until a subbarray's first element already is larger.

While popping, you obviously update the value copies, and append to an 
output array.

Then sort the array of structs, trim it in case some of the structs now 
point to empty subarrays, and start over.

This is only moderately efficient, but had Stepanof or anybody else come 
up with something essentially better, you'd have heard of it already.

The only upshot of this algorithm is that the array a doesn't have to be 
sorted initially. (The subarrays, of course still do.)

The algorithm could be enhanced by, instead of an array of structs, 
having a singly liked list of these structs. Each time you pop, you move 
the relevant struct to another list, poppedStructs. When done, you only 
have to sort the poppedStructs list, and then /merge/ it with the list 
of unpopped structs, which still is sorted. This would obviously 
drastically improve efficiency, since it is likely that what needs to be 
sorted now is only a few items per round.

Hmmm. Maybe this isn't all that inefficient, after all...