problem with parallel foreach

John Colvin via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed May 13 06:40:32 PDT 2015


On Wednesday, 13 May 2015 at 11:33:55 UTC, John Colvin wrote:
> On Tuesday, 12 May 2015 at 18:14:56 UTC, Gerald Jansen wrote:
>> On Tuesday, 12 May 2015 at 16:35:23 UTC, Rikki Cattermole 
>> wrote:
>>> On 13/05/2015 4:20 a.m., Gerald Jansen wrote:
>>>> At the risk of great embarrassment ... here's my program:
>>>> http://dekoppel.eu/tmp/pedupg.d
>>>
>>> Would it be possible to give us some example data?
>>> I might give it a go to try rewriting it tomorrow.
>>
>> http://dekoppel.eu/tmp/pedupgLarge.tar.gz (89 Mb)
>>
>> Contains two largish datasets in a directory structure 
>> expected by the program.
>
> I only see 2 traits in that example, so it's hard for anyone to 
> explore your scaling problem, seeing as there are a maximum of 
> 2 tasks.

Either way, a few small changes were enough to cut the runtime by 
a factor of ~6 in the single-threaded case and to improve the 
scaling a bit, although printing to the output files still looks 
like a bottleneck.

http://dpaste.dzfl.pl/80cd36fd6796
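
If you want to chip away at that output bottleneck, one option is 
for each task to build its whole output in memory and write it with 
a single call, so writes from different tasks don't interleave. A 
rough sketch of the idea (the trait names and the per-trait work 
below are stand-ins, not the real code from the paste):

    import std.array : appender;
    import std.format : formattedWrite;
    import std.parallelism : parallel;
    import std.stdio : File;

    void processTraits(string[] traits)
    {
        // One task per trait; each buffers its full output and writes once.
        foreach (trait; parallel(traits, 1))
        {
            auto buf = appender!string();
            foreach (i; 0 .. 1000)      // stand-in for the real per-trait work
                buf.formattedWrite("%s %d\n", trait, i);
            File(trait ~ ".out", "w").rawWrite(buf.data);
        }
    }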

The key change was reducing the number of allocations: more 
std.algorithm.splitter copying into static arrays, less 
std.array.split, and avoiding File.byLine. Other people in this 
thread have mentioned alternatives to byLine that may be faster or 
use less memory; I just read the whole files into memory and then 
lazily split them with std.algorithm.splitter. I ended up with some 
blank lines coming through, so I added if (line.empty) continue; 
in a few places. You might want to look more carefully at that; it 
could be my mistake.
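
For reference, the pattern looks roughly like this (the field count 
and separator here are hypothetical; adjust them to the real record 
layout):

    import std.algorithm : splitter;
    import std.file : readText;
    import std.range : empty;

    void parseFile(string path)
    {
        // Read the whole file once, then split lazily: no per-line allocation.
        foreach (line; readText(path).splitter('\n'))
        {
            if (line.empty) continue;   // the trailing newline yields an empty slice

            // Copy fields into a fixed-size array instead of allocating
            // a new one with std.array.split; assumes at most 8 columns.
            const(char)[][8] fields;
            size_t n;
            foreach (field; line.splitter(' '))
            {
                if (n == fields.length) break;
                fields[n++] = field;
            }
            // ... work with fields[0 .. n] ...
        }
    }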

The use of std.array.appender for `info` is just good practice, 
but it doesn't make much difference here.
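
i.e. something like this instead of growing the array with ~= 
(the names here are purely illustrative):

    import std.array : appender;

    void buildInfo(string[] records)
    {
        // appender amortizes growth instead of reallocating on every ~=
        auto info = appender!(string[])();
        foreach (rec; records)
            info.put(rec);
        auto all = info.data;   // the accumulated string[]
    }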

