Performance of tables slower than built in?

Alex AJ at gmail.com
Fri May 24 12:24:02 UTC 2019


On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
> On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad 
> wrote:
>> On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
>>> Either way, sin it's still twice as fast. Also, in the code 
>>> the sinTab version is missing the writeln so it would have 
>>> been faster.. so it is not being optimized out.
>>
>> Well, when I run this modified version:
>>
>> https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c
>>
>> on https://run.dlang.io/
>>
>> then I get:
>>
>> LUT:    709
>> sin(x): 2761
>>
>> So the LUT is 3-4 times faster even with your quarter-LUT 
>> overhead.
>
> FWIW, as far as I can tell I managed to get the lookup version 
> down to 104 by using bit manipulation tricks like these:
>
> auto fastQuarterLookup(double x){
>     const ulong mantissa = cast(ulong)( (x - floor(x)) * 
> (cast(double)(1UL<<63)*2.0) );
>     const double sign = 
> cast(double)(-cast(uint)((mantissa>>63)&1));
>     … etc
>
> So it seems like a quarter-wave LUT is 27 times faster than sin…
>

If so, then that is great and is what I'd expected to achieve 
originally.

I guess this is using LDC though? I wasn't able to compile with 
LDC since after updating I'm getting linker errors that I have to 
go figure out.

> You just have to make sure that the generated instructions 
> fills the entire CPU pipeline.

What exactly does this mean? I realize the pipeline is how the 
CPU decodes and schedules instructions, but when you say "you 
have to make sure", that presupposes there is a method or 
algorithm for knowing.

Are you saying that I did not have enough instructions that the 
pipeline could take advantage of?


In any case, I'll check your code out and try to figure out the 
details and see exactly what is going on.

If it truly is 27x faster, then that is very relevant, and 
knowing why is important.

Of course, a lot of that might simply be due to LDC and I wasn't 
able to determine this.

Can you do some stats for dmd and ldc?

You seem to be interested in this, are you up for a challenge?


The idea is to use tables to optimize these functions.

Half-sine was done above, but a quarter-sine table can be used 
(there are 4 quadrants, but only one has to be tabulated because 
the others differ only by sign and reversal (1 - x); it's a 
matter of figuring out the sign).

Of course it requires extra computation, so it would be 
interesting to see how much the extra logic costs in performance.
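
For reference, here is a rough sketch of what I mean by a 
quarter-wave table in D (just an illustration, not your exact 
code: the table size and names are made up, and the input is in 
full turns like in your snippet, not radians):

import std.math : sin, PI, floor;

enum size_t N = 1024;                   // entries covering one quarter wave
__gshared double[N + 1] quarterSinLUT;  // +1 so the mirrored index N is valid

shared static this()
{
    foreach (i; 0 .. N + 1)
        quarterSinLUT[i] = sin((cast(double) i / N) * (PI / 2));
}

// sin of x full turns via a quarter-wave table: fold x into [0, 1),
// pick the quadrant, mirror the index in quadrants 1 and 3, and flip
// the sign in quadrants 2 and 3.
double lutSin(double x)
{
    const double t = x - floor(x);              // fractional turns, in [0, 1)
    const size_t idx = cast(size_t)(t * 4 * N);
    const size_t quadrant = (idx / N) & 3;
    size_t i = idx % N;
    if (quadrant & 1) i = N - i;                // mirror within the quadrant
    const double s = quarterSinLUT[i];
    return (quadrant & 2) ? -s : s;             // negative half of the wave
}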

Then there is exp.

exp(x) can be written as exp(floor(x) + {x}) = 
exp(floor(x))*exp({x})

and so one can optimize this by tabulating exp({x}) for 
0 <= {x} < 1, where {x} is the fractional part of x.

Then tabulate it for a wide range of integers (probably in 
powers of two).


e.g.,

exp(3.5) = exp(3)*exp(.5)

both come from a lookup table.

or one could do

exp(3) = exp(1 + 1 + 1) = exp(1)*exp(1)*exp(1)

(this requires iteration if we do not tabulate exp(3)).

Hence one would limit the iteration count by tabulating things 
like exp(10^k) and exp(k) for -10 < k < 10.
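
Roughly something like this (again only a sketch: the table 
sizes and the covered integer range are arbitrary choices, and 
anything outside that range just falls back to std.math.exp):

import std.math : exp, floor;

enum size_t FRAC_N = 1024;   // resolution of the fractional table
enum int INT_MIN = -32;      // integer range covered directly
enum int INT_MAX = 32;

__gshared double[FRAC_N] expFracLUT;                // exp(f) for f in [0, 1)
__gshared double[INT_MAX - INT_MIN + 1] expIntLUT;  // exp(k) for integer k

shared static this()
{
    foreach (i; 0 .. FRAC_N)
        expFracLUT[i] = exp(cast(double) i / FRAC_N);
    foreach (k; INT_MIN .. INT_MAX + 1)
        expIntLUT[k - INT_MIN] = exp(cast(double) k);
}

// exp(x) = exp(floor(x)) * exp(frac(x)); both factors come from tables.
double lutExp(double x)
{
    const double fl = floor(x);
    const int k = cast(int) fl;
    if (k < INT_MIN || k > INT_MAX)
        return exp(x);           // outside the tabulated range: fall back
    const double f = x - fl;     // fractional part, in [0, 1)
    const size_t i = cast(size_t)(f * FRAC_N);
    return expIntLUT[k - INT_MIN] * expFracLUT[i];
}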

The idea is basically that one can build really dense LUTs over 
a small range and then use them to compute the function for 
arbitrary inputs.

With linear interpolation one can get very accurate (for all 
practical purposes) LUT methods that, if your code is right, are 
at least an order of magnitude faster. The memory requirements 
will be quite small with linear interpolation (and ideally 
quadratic or cubic interpolation, if the cost is not too high).
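
The interpolation step itself is cheap, e.g. something along 
these lines (illustrative only; the table is assumed to sample 
the function uniformly on [0, 1)):

// Linear interpolation between adjacent entries of a table that
// samples a function uniformly on [0, 1); t is assumed to be in [0, 1).
double lerpLookup(const double[] table, double t)
{
    const double pos = t * (table.length - 1);  // fractional index
    const size_t i = cast(size_t) pos;
    const double frac = pos - i;
    return table[i] * (1.0 - frac) + table[i + 1] * frac;
}

For a smooth function the linear-interpolation error shrinks 
with the square of the step size, so even a few thousand entries 
should be plenty for all practical purposes.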

That was what I was starting to work on before I got thrown off 
by it being much slower.

It seems you already have the half-sin done.

One could do things like sin, cos (obviously easy), exp, 
exp()^2, erf(x), sinh, cosh, etc. Things like sec could also be 
done, as it would save a division, which seems to take about 30 
cycles. But it would depend on the memory used.

[I can't mess with this now because I've got to work on other 
things at the moment.]

Thanks.