Idempotent partition around median of 5?
Ola Fosheim Grøstad via Digitalmars-d
digitalmars-d at puremagic.com
Sat Feb 6 02:42:20 PST 2016
On Saturday, 6 February 2016 at 07:06:27 UTC, Ivan Kazmenko wrote:
> 1. Primitive types (cheap swap, cheap comparison).
>
> 2. Heavy structs A (expensive swap, cheap comparison - e.g.,
> compare one field of primitive type).
>
> 3. Heavy structs B (expensive swap, expensive comparison -
> e.g., call a heavy external function).
>
> 4. Heavy classes (cheap swap - pointers only, expensive
> comparison).
>
> So there's perhaps no single best solution.
That's right, but other factors are more important: preventing
pipeline stalls. If you are collecting from 5 different
cachelines in an array you are likely to get several 40-120
cycles delays unless you do prefetching, and if you do, you need
to have other instructions to fill in the latency gaps.
But also instructions have latency and concurrency issues. Which
is why your version did reasonably well as it made the
compares/swaps independent so that they could be concurrently
scheduled.
Yet, Haswell has SIMD instructions that can do 8-16x 32-bit
max/min operations per cycle, with a latency of only 1 cycle, and
4-8x 64bit compares with a latency of 1 cycle.
So if you use as small fixed N, like 5, it makes very little
sense to count compares/swaps.
What makes sense is to focus on how you can avoid branching and
build an algorithm with no pipeline stalls.
If sorting large arrays you also might want to look at
multi-threaded parallel sort functions.
More information about the Digitalmars-d
mailing list