Idempotent partition around median of 5?

Sat Feb 6 02:42:20 PST 2016

On Saturday, 6 February 2016 at 07:06:27 UTC, Ivan Kazmenko wrote:
> 1. Primitive types (cheap swap, cheap comparison).
>
> 2. Heavy structs A (expensive swap, cheap comparison - e.g., 
> compare one field of primitive type).
>
> 3. Heavy structs B (expensive swap, expensive comparison - 
> e.g., call a heavy external function).
>
> 4. Heavy classes (cheap swap - pointers only, expensive 
> comparison).
>
> So there's perhaps no single best solution.

That's right, but other factors are more important: preventing 
pipeline stalls. If you are collecting from 5 different 
cachelines in an array you are likely to get several 40-120 
cycles delays unless you do prefetching, and if you do, you need 
to have other instructions to fill in the latency gaps.

But also instructions have latency and concurrency issues. Which 
is why your version did reasonably well as it made the 
compares/swaps independent so that they could be concurrently 
scheduled.

Yet, Haswell has SIMD instructions that can do 8-16x 32-bit 
max/min operations per cycle, with a latency of only 1 cycle, and 
4-8x 64bit compares with a latency of 1 cycle.

So if you use as  small fixed N, like 5, it makes very little 
sense to count compares/swaps.

What makes sense is to focus on how you can avoid branching and 
build an algorithm with no pipeline stalls.

If sorting large arrays you also might want to look at 
multi-threaded parallel sort functions.