# First time using Parallel

Era Scarecrow rtcvb32 at yahoo.com
Sun Dec 26 15:22:48 UTC 2021

On Sunday, 26 December 2021 at 11:24:54 UTC, rikki cattermole wrote:
> I would start by removing the use of stdout in your loop kernel
> - I'm not familiar with what you are calculating, but if you
> can basically have the (parallel) loop operate from (say) one
> array directly into another then you can get extremely good
> parallel scaling with almost no effort.

I'm basically generating a default list of LFSRs for my
Reed-Solomon codes. LFSRs can be used for pseudo-random numbers,
but in this case one is used to build a Galois field for error
correction.

Using one is simple: you need a binary tap value such that, when
it is xored in whenever a 1 bit shifts out of the range, the
register cycles through the maximum number of states (*excluding
zero*). So with 4 bits (xor taps of 3) you'd get:

```
0 0001 -- initial
0 0010
0 0100
0 1000
1 0011 <- 0000
0 0110
0 1100
1 1011 <- 1000
1 0101 <- 0110
0 1010
1 0111 <- 0100
0 1110
1 1111 <- 1100
1 1101 <- 1110
1 1001 <- 1010
1 0001 <- 0010 -- back to our initial value
```
As such, the bulk of the work is done in this function. The other
functions leading up to it mostly figure out which candidate
values are worth testing, according to some rules I set
beforehand (*quite a few only need 2 bits on*).

```d
bool testfunc(ulong value, ulong bitswide) {
    ulong cnt = 1, lfsr = 2, up = 1UL << bitswide;
    value |= up; // eliminates the need to AND the result back into range
    while (cnt < up && lfsr != 1) {
        lfsr <<= 1;
        if (lfsr & up)
            lfsr ^= value;
        cnt++;
    }
    return cnt == up - 1;
}

// within main, cyclebits will call testfunc when a value is calculated
for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
    ulong v = cyclebits(bitwidth, bitson, &testfunc);
    if (v) {
        // the only place IO takes place
        writeln("\t0x", cast(void*)v, ",\t/*", bitwidth, "*/");
        break;
    }
}
```

rikki cattermole wrote:
>Your question at the moment doesn't really have much context to
>it so it's difficult to suggest where you should go directly.

I suppose if I started doing work where threads share resources
(*probably memory*), I would have to go with semaphores and
locks. I remember trying to read up on threads in C/C++ in the
past, and the setup was such a headache that I just gave up.

I assume it's best to divide the work up so it can be completed
without competing for resources or risking race conditions, and
in chunks hefty enough to be worth the cost of spinning up the
thread in the first place. So aside from the library
documentation, is there a good source for learning/using
`parallel` and its best practices? I'd love to use more of this
in the future if it isn't as big a blunder as threading is often
made out to be.

> Not using [IO] in the actual loop should make the code faster
> even without threads because having a function call in the hot
> code will mean the compiler's optimizer will give up on certain
> transformations - i.e. do all the work as compactly as possible
> then output the data in one step at the end.

In this case I'm not sure how long each step takes, so I'm
hoping intermediate results I can copy by hand will work (*a
step may take a second or several minutes*). If this weren't a
brute-force elimination of so many combinations, I'm sure a
different approach would work.

On 27/12/2021 12:10 AM, max haughton wrote:
> It'll speed it up significantly.
>
> Standard IO has locks in it. So you end up with all
> calculations grinding to a halt waiting for another thread to
> finish doing something.

I assume that's only when they actually try to use it? Early in
the cycle (*under 30 bits*) they were outputting quickly, but
past 31 bits it can be minutes between results, and each thread
(*if I'm right*) is working on a different number. So the ones
found with small values like 3, 5, and 9 come pretty fast, while
all the others go through a lot of failures before I get a good
result.