[OT] n-way union

Tue May 26 06:09:59 PDT 2009

Andrei Alexandrescu wrote:
> Don wrote:
>> BCS wrote:
>>> in sudocode
>>>
>>> 1. build a min heap out of the elements based on the first element of 
>>> each array
>>> 2. remove the first value from the root of the heap,
>>> 3. if empty array, drop it
>>> 4. if not empty re-heap
>>> 5. goto 2 if anything left
>>>
>>>
>> In practice, I bet it'll go faster with a couple more steps:
>> 2. remove the first value (from array X) from the root of the heap,
>> 2a. Determine new root of the heap (don't need to do a full re-heap yet).
>> 2b. move all values from X until empty or >= new root.
>> 4. reheap
> 
> Correct. Unfortunately my heap primitives didn't quite allow that so 
> currently I do it the "hard" way: pop from top, remove top from heap, 
> reinsert (if not empty). But this is definitely an optimization worth 
> doing. I think I'll implement it.
> 
>> There are interesting cases like:
>> [5,6,7,8,9,9,......9]
>> [30, 31,32,33,34...38]
>> [1,1,1,1,1,1,1,1,1,1,...1]
>> [10,11,12,13,14,15,16,16,......., 16]
>>
>> where in theory, the total number of comparisons required is a 
>> function only of the number of arrays, and is independent of the 
>> actual number of elements.
>> If it's common to have runs where consecutive values come from a 
>> single array, and the arrays are random access, you could do MUCH 
>> better than O(n).
> 
> I think you can't do better. Complexity is O(n) at a minimum because you 
> need to go through each element. The heap-based algorithm has O(n log 
> m). What you could improve on is the log m factor (m = the number of sets).

Depending on what "going through each element" means. Certainly, if you 
need to move them, you can't beat O(n) moves, but you can definitely get 
O(m*m) < < O(n) comparisons in the best case I described. And that would 
also be the number of splicing operations.

You could approximate this without hurting the worst-case too much, by 
(for example) once you get a run of more than K entries from the same 
array, step forward 2 entries; if that's still less than the root, copy 
both to the output, then step forward 4 entries, etc.

That should give a best-case O(m log n) comparisons/splices.
Which might not be of any practical use, of course.