Floating Point + Threads?

Fawzi Mohamed fawzi at gmx.ch
Sat Apr 16 05:07:38 PDT 2011


On 16-apr-11, at 05:22, dsimcha wrote:

> I'm trying to debug an extremely strange bug whose symptoms appear
> in a std.parallelism example, though I'm not at all sure the root
> cause is in std.parallelism.  The bug report is at
> https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
>
> Basically, the example in question sums up all the elements of a  
> lazy range (actually, std.algorithm.map) in parallel.  It uses  
> taskPool.reduce, which divides the summation into work units to be  
> executed in parallel.  When executed in parallel, the results of the  
> summation are non-deterministic after about the 12th decimal place,  
> even though all of the following properties are true:
>
> 1.  The work is divided into work units in a deterministic fashion.
>
> 2.  Within each work unit, the summation happens in a deterministic  
> order.
>
> 3.  The final summation of the results of all the work units is done  
> in a deterministic order.
>
> 4.  The smallest term in the summation is about 5e-10.  This means  
> the difference across runs is about two orders of magnitude smaller  
> than the smallest term.  It can't be a concurrency bug where some  
> terms sometimes get skipped.
>
> 5.  The results for the individual tasks, not just the final  
> summation, differ in the low-order bits.  Each task is executed in a  
> single thread.
>
> 6.  The rounding mode is apparently the same in all of the threads.
>
> 7.  The bug appears even on machines with only one core, as long as  
> the number of task pool threads is manually set to >0.  Since it's a  
> single core machine, it can't be a low level memory model issue.
>
> What could possibly cause such small, non-deterministic differences  
> in floating point results, given everything above?  I'm just looking  
> for suggestions here, as I don't even know where to start hunting  
> for a bug like this.
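
For reference, the example in question appears to be the parallel pi
summation from the std.parallelism documentation.  A minimal sketch of
that pattern (essentially the documented example; the exact code in
the bug report may differ):

    import std.algorithm, std.parallelism, std.range, std.stdio;

    void main()
    {
        // Approximate pi by midpoint integration of 4/(1 + x^2) on [0, 1].
        // map builds the billion terms lazily; taskPool.reduce splits the
        // range into work units, sums each one on a pool thread, and then
        // combines the per-work-unit results.
        immutable n = 1_000_000_000;
        immutable delta = 1.0 / n;

        real getTerm(int i)
        {
            immutable x = (i - 0.5) * delta;
            return delta / (1.0 + x * x);
        }

        immutable pi = 4.0 * taskPool.reduce!"a + b"(map!getTerm(iota(n)));
        writefln("pi = %.12f", pi);
    }

The smallest term here is delta / 2 = 5e-10, consistent with point 4
above.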

It might be due to thread context switches: a switch might push a
value out of the higher-precision 80-bit x87 FPU register into a
64-bit memory slot, losing the extra precision.  Where in the
computation that happens varies from run to run, so intermediate
results get rounded at different points and the low-order bits change.
SSE math, or float, should not have this problem.  gcc has an option
(-ffloat-store) to always store results to memory and avoid the extra
precision; having such an option in dmd to debug issues like this
would be a nice thing.
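
To see the effect in isolation, here is a minimal sketch (assuming
x86, where real is the 80-bit x87 format; the values and names are
illustrative).  Summing many terms that are each below half a ulp of
the running total, rounding to 64-bit double after every addition
gives a different answer than keeping 80 bits and rounding once at
the end; a register spill decides which of the two happens at each
step.

    import std.stdio;

    void main()
    {
        enum n = 1000;

        // Keep the running total in 80-bit extended precision (real on
        // x86) and round to double only once, at the end:
        real extSum = 1.0;
        foreach (i; 0 .. n)
            extSum += 1e-17;

        // Round to 64-bit double after every addition, which is the
        // effect of spilling the x87 register to a double-sized slot:
        double dblSum = 1.0;
        foreach (i; 0 .. n)
            dblSum += 1e-17;

        // 1e-17 is below half a ulp of 1.0 in double, so every rounded
        // addition is absorbed; in extended precision the terms add up.
        writefln("rounded once at end: %.17g", cast(double) extSum); // ~1.00000000000001
        writefln("rounded every step:  %.17g", dblSum);              // 1
    }

(Depending on code generation, the double loop may itself be kept in
an 80-bit register, in which case the two results agree; that the
outcome depends on where values live is exactly the non-determinism
at issue.)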

Fawzi

