Btw, timings on my Core i5, in a debug build:

    reduce (using double):    11 ms
    non-parallel:             37 ms
    parallel with atomicOp:  123 ms

So that is the reason for using parallel reduce, assuming the ulong range issue gets fixed.
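For anyone wondering why the atomicOp version can end up slower than even the non-parallel one, here is a rough C++ sketch of the two strategies. This is not the code that was timed: the function names `sum_with_atomic` and `sum_with_partials` are made up for illustration, and it sums integers instead of doubles so the result is exact. The first version has every thread hit one shared atomic for each element; the second gives each thread a private partial sum and combines them once at the end, which is roughly what a parallel reduce does.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Strategy A (hypothetical name): all threads add every element to one
// shared atomic accumulator, so they contend on a single memory location.
long long sum_with_atomic(const std::vector<long long>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                total.fetch_add(data[i], std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Strategy B (hypothetical name): each thread sums its own chunk into a
// local variable, writes it once, and the partial results are combined
// after all threads have joined.
long long sum_with_partials(const std::vector<long long>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;  // one write per thread, no sharing in the loop
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Both functions return the same sum; the difference is only in how often the threads touch shared state. Actual timings will depend on the machine, the build flags, and the data size, so treat this as a sketch of the idea and not as a benchmark.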