Without release, only the Euclidean benchmark shows a more dramatic speed difference:

Serial reduce: 6298 milliseconds.
Parallel reduce with 4 cores: 567 milliseconds.

I forgot to mention that I'm on XP32 (32-bit Windows XP). I could run these on a virtualized Linux if that's worth testing.