Performance of tables slower than built in?

Ola Fosheim Grøstad ola.fosheim.grostad at gmail.com
Fri May 24 12:56:55 UTC 2019


On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
>> So it seems like a quarter-wave LUT is 27 times faster than 
>> sin…
>>
>
> If so then that is great and what I'd expected to achieve 
> originally.
>
> I guess this is using LDC though? I wasn't able to compile with 
> LDC since after updating I'm getting linker errors that I have 
> to go figure out.

Yes, the gist linked above is just your code with minor 
changes; that version was 4-5 times faster. To get to 27 times 
faster you need to use the integer bit-manipulation scheme I 
suggested above. Just beware that I haven't checked the code, 
so it might be off by ±1 and such.

Anyway, it is more fun for you to code up your own version than 
to try to figure out mine. Just follow the principles and you 
should get close to that performance, I think. (I'll refine the 
code later, but I don't really have time now.)


>> You just have to make sure that the generated instructions 
>> fills the entire CPU pipeline.
>
> What exactly does this mean? I realize the pipeline in cpu's is 
> how the cpu decodes and optimizes the instructions but when you 
> say "You have to make sure" that pre-supposes there is a method 
> or algorithm to know.

Yes, you have to look up information about the CPU in your 
computer. Each core has a set of "lanes" that are computed 
simultaneously. Some instructions can go into many lanes, but 
not all. Then there might be bubbles in the pipeline (the lane) 
that can be filled with integer/bit-manipulation instructions. 
It is tedious to look that stuff up, so as a last resort just 
try to mix simple integer and simple double computations (and 
avoid division).


> Are you saying that I did not have enough instructions that the 
> pipeline could take advantage of?

Yes, you most likely got bubbles: empty space where the core has 
nothing to send down a lane because it is waiting for some 
computation to finish so that it can figure out what to do next.

Basic optimization:

Step 1: reduce dependencies between computations

Step 2: make sure you generate a mix of simple integer/double 
instructions that can fill up all the computation lanes at the 
same time

Step 3: make sure loops only contain a few instructions; the 
CPU can unroll short loops in hardware. (Not applicable here, 
though.)
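As an illustration of step 1, here is a small C sketch (my own 
example, not code from the thread) of breaking up a dependency 
chain with several independent accumulators:

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one, so
   the adds form a single long dependency chain and the other
   lanes sit idle (bubbles). */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the four adds per iteration
   do not depend on each other, so the core can issue them to
   several lanes at the same time. */
double sum_unrolled(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)        /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

(Note that this reorders the floating-point additions, which a 
compiler is normally not allowed to do on its own; the results 
can differ from the serial sum in the last bits.)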


> Of course, a lot of that might simply be due to LDC and I 
> wasn't able to determine this.

I think I got better performance because I filled more lanes in 
the pipeline.


> Half sin was done above but quarter sine can be used(there are 
> 4 quadrants but only one has to be tabularized because all the 
> others differ by sign and reversion(1 - x), it's a matter of 
> figuring out the sign).

Yes, as I mentioned, the first bit of the phase is the sign and 
the second bit of the phase is the reversion of the indexing.


> Of course it requires extra computation so it would be 
> interesting to see the difference in performance for the extra 
> logic.

It adds perhaps 2-5 cycles or so, at a guess.


> exp(x) can be written as exp(floor(x) + {x}) = 
> exp(floor(x))*exp({x})
[...]
> With linear interpolation one can get very accurate(for all 
> practical purposes) LUT table methods that, if your code is 
> right, is at least an order of magnitude faster. The memory 
> requirements will be quite small with linear interpolation

I think you need to do something to x before you look it up, so 
that you have some kind of fast nonlinear mapping to the 
indexes.

But for exp() you might prefer an approximation instead, 
perhaps a polynomial such as a truncated Taylor series.

Searching the web should give some ideas.


> It seems you already have the half-sin done.

I did the quarter sin though, not the half-sin (but that is 
almost the same, just drop the reversion of the indexing).

(Let's talk about this later, since we both have other things on 
our plate. Fun topic! :-)



More information about the Digitalmars-d-learn mailing list