problem with parallel foreach

Rikki Cattermole via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue May 12 08:42:10 PDT 2015


On 13/05/2015 2:59 a.m., Gerald Jansen wrote:
> I am a data analyst trying to learn enough D to decide whether to use D
> for a new project rather than Python + Fortran. I have recoded a
> non-trivial Python program to do some simple parallel data processing
> (using the map function in Python's multiprocessing module and parallel
> foreach in D). I was very happy that my D version ran considerably
> faster than the Python version when running a single job but was soon
> dismayed to find that the performance of my D version deteriorates
> rapidly beyond a handful of jobs whereas the time for the Python version
> increases linearly with the number of jobs per CPU core.
>
> The server has 4 quad-core Xeons and abundant memory compared to my
> needs for this task even though there are several million records in
> each dataset. The basic structure of the D program is:
>
> import std.parallelism; // and other modules
> void main()
> {
>      // ...
>      // read common data and store in arrays
>      // ...
>      foreach (job; parallel(jobs, 1)) {
>          runJob(job, arr1, arr2.dup);
>      }
> }
> void runJob(string job, in int[] arr1, int[] arr2)
> {
>      // read the job-specific data file and modify the copy of arr2
>      // write job specific output data file
> }
>
> The output of /usr/bin/time is as follows:
>
> Lang Jobs    User  System  Elapsed %CPU
> Py      1   45.17    1.44  0:46.65   99
> D       1    8.44    1.17  0:09.24  104
>
> Py      2   79.24    2.16  0:48.90  166
> D       2   19.41   10.14  0:17.96  164
>
> Py     30 1255.17   58.38  2:39.54  823 * Pool(12)
> D      30  421.61 4565.97  6:33.73 1241
>
> (Note that the Python program was somewhat optimized with NumPy
> vectorization and a bit of Numba JIT compilation.)
>
> The system time varies widely between repetitions for D with multiple
> jobs (e.g. from 3.8 to 21.5 seconds for 2 jobs).
>
> Clearly my simple approach with parallel foreach has some problem(s).
> Any suggestions?
>
> Gerald Jansen

A couple of things come to mind for the start of the main function.

---
import core.memory : GC;
GC.disable;                      // no collections from now on
GC.reserve(1024 * 1024 * 1024);  // pre-reserve 1 GB up front
---

That reserves 1 GB of RAM for the GC to work with, and it also stops 
the GC from trying to collect.
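
If memory use still climbs too high with collection disabled, you can 
trigger one yourself at a safe point (e.g. between jobs) with 
GC.collect(), or simply call GC.enable once the parallel loop is done.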

I would HIGHLY recommend giving each worker thread its own preallocated 
block of memory, and reading the job data into that instead of 
allocating anew for every job; see the sketch below.
Basically you are just being sloppy, memory-allocation-wise.
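
Here is a minimal sketch of what I mean, dropping into your main from 
above (the name workBuffer is mine). The trick is that a static local 
is thread-local in D, so each worker allocates once and then reuses 
the buffer for all of its jobs:

---
foreach (job; parallel(jobs, 1)) {
    // `static` locals are thread-local in D: one buffer per worker,
    // allocated on first use and reused for every later job.
    static int[] workBuffer;
    if (workBuffer.length != arr2.length)
        workBuffer = new int[](arr2.length);
    workBuffer[] = arr2[];          // copy the contents, no allocation
    runJob(job, arr1, workBuffer);
}
---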

Try your best to move your processing into @nogc-annotated functions, 
then make them run with as few allocations as possible.
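
As a rough example of the shape that takes (processRecords is a 
made-up name, not from your program):

---
// The compiler rejects any hidden GC allocation in here.
@nogc nothrow
void processRecords(in int[] arr1, int[] arr2)
{
    foreach (i, v; arr1)
    {
        if (i < arr2.length)
            arr2[i] += v;   // mutate the preallocated copy in place
    }
}
---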

Remember: you can pass slices of arrays around freely, without any 
memory allocation, as long as they are not mutated!
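
For instance (a tiny standalone illustration, nothing from your code):

---
int[] whole = new int[](1_000_000);   // one allocation up front
// Slicing never allocates: a slice is just a pointer/length pair
// that views the same memory.
const(int)[] view = whole[250_000 .. 500_000];
---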

