# First time using Parallel

Era Scarecrow rtcvb32 at yahoo.com
Sun Dec 26 15:22:48 UTC 2021

On Sunday, 26 December 2021 at 11:24:54 UTC, rikki cattermole wrote:
> I would start by removing the use of stdout in your loop kernel
> - I'm not familiar with what you are calculating, but if you
> can basically have the (parallel) loop operate from (say) one
> array directly into another then you can get extremely good
> parallel scaling with almost no effort.

I'm basically generating a default list of LFSRs for my
Reed-Solomon codes. LFSRs can be used for pseudo-random numbers,
but in this case one is used to build a Galois field for error
correction.

Using one is simple: you need a binary tap value such that, when
it is xored in whenever a 1 bit shifts out of the range, the
register cycles through the maximum number of states (*excluding
zero*). So with 4 bits (xor taps of 3) you'd get:

```
0 0001 -- initial
0 0010
0 0100
0 1000
1 0011 <- 0000
0 0110
0 1100
1 1011 <- 1000
1 0101 <- 0110
0 1010
1 0111 <- 0100
0 1110
1 1111 <- 1100
1 1101 <- 1110
1 1001 <- 1010
1 0001 <- 0010 -- back to our initial value
```
As such, the bulk of the work is done in this function. The other
functions leading up to it mostly figure out which candidate
values are worth testing, according to some rules I set
beforehand (*quite a few only need 2 bits on*).

```d
bool testfunc(ulong value, ulong bitswide) {
    ulong cnt = 1, lfsr = 2, up = 1UL << bitswide;
    value |= up; // eliminates the need to AND the result back into range
    while (cnt < up && lfsr != 1) {
        lfsr <<= 1;
        if (lfsr & up)
            lfsr ^= value;
        cnt++;
    }
    return cnt == up - 1;
}

// within main, cyclebits will call testfunc when a value is calculated
for (ulong bitson = 2; bitson <= bitwidth; bitson += 1) {
    ulong v = cyclebits(bitwidth, bitson, &testfunc);
    if (v) {
        // the only place IO takes place
        writeln("\t0x", cast(void*)v, ",\t/*", bitwidth, "*/");
        break;
    }
}
```

rikki cattermole wrote:
>Your question at the moment doesn't really have much context to
>it so it's difficult to suggest where you should go directly.

I suppose if I started doing work where threads share resources
(*probably memory*), I would have to go with semaphores and
locks. I remember trying to read up on threads in C/C++ in the
past, and the setup was such a headache that I just gave up.

I assume it's best to divide the work up so it can be completed
without competing for resources or risking race conditions, and
in chunks hefty enough to be worth the cost of spinning up the
thread in the first place. So aside from the library
documentation, is there a good source for learning/using
`parallel` and its best practices? I'd love to use more of this
in the future if it isn't as big a blunder as threading is often
made out to be.

> Not using [IO] in the actual loop should make the code faster
> even without threads because having a function call in the hot
> code will mean the compiler's optimizer will give up on certain
> transformations - i.e. do all the work as compactly as possible
> then output the data in one step at the end.

In this case I'm not sure how long each step takes, so I'm
hoping intermediate results I can copy by hand will work (*a
step may take a second or several minutes*). If this weren't a
brute-force elimination of so many combinations, I'm sure a
different approach would work.

On 27/12/2021 12:10 AM, max haughton wrote:
> It'll speed it up significantly.
>
> Standard IO has locks in it. So you end up with all
> calculations grinding to a halt waiting for another thread to
> finish doing something.

I assume that's only when they actually try to use it? Early in
the cycle (*under 30 bits*) they were outputting quickly, but
past 31 bits it can be minutes between results, and each thread
(*if I'm right*) is working on a different number. So the ones
found with small values like 3, 5, and 9 come pretty fast, while
all the others go through a lot of failures before I get a good
result.