Performance of tables slower than built in?

Alex AJ at gmail.com
Thu May 23 21:30:47 UTC 2019


On Thursday, 23 May 2019 at 15:20:22 UTC, Timon Gehr wrote:
> On 23.05.19 12:21, Alex wrote:
>> On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
>>> On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
>>>> I am trying to create some fast sin, sinc, and exponential 
>>>> routines to speed up some code by using tables... but it 
>>>> seems it's slower than the function itself?!?
>>>
>>> There are intrinsic CPU instructions for some of those that can 
>>> do the math faster than waiting on memory access.
>>>
>>> It is quite likely calculating it is actually faster. Even 
>>> carefully written and optimized tables tend to just have a 
>>> very small win relative to the cpu nowadays.
>> 
>> Surely not? I'm not sure what method is used to calculate them, 
>> and maybe a table method is used internally for the common 
>> functions (maybe the periodic ones), but memory access surely is 
>> faster than multiplying doubles?
>> ...
>
> Depends on what kind of memory access, and what kind of faster. 
> If you hit L1 cache then a memory access might be (barely) 
> faster than a single double multiplication. (But modern 
> hardware usually can do multiple double multiplies in parallel, 
> and presumably also multiple memory reads, using SIMD and/or 
> instruction-level parallelism.)
>
> I think a single in-register double multiplication will be 
> roughly 25 times faster than an access to main memory. Each 
> access to main memory will pull an entire cache line from main 
> memory to the cache, so if you have good locality (you usually 
> won't with a LUT), your memory accesses will be faster on 
> average. There are a lot of other microarchitectural details 
> that can matter quite a lot for performance.
>

I realize a lot of optimizations could be made, but I still find 
it hard to believe that it could be done faster even with special 
hardware, unless a LUT is already used in the hardware or some 
highly optimized algorithm is involved.

The whole point of general-purpose computing was to get away from 
specialized hardware because it was cheaper, and the resources 
could be spent on improving raw cycle performance, which would 
benefit the entire system.


But I guess because of the complexity of hardware nowadays it's 
hard to really figure out what is going on.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

Maybe when it is an "intrinsic" it can avoid some of the costs 
that a LUT would otherwise incur (such as a context switch).
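
One way to check rather than guess is to time the two approaches 
directly. Here is a minimal sketch; the table size, the tableSin 
helper and the loop counts are arbitrary choices of mine, and a 
naive truncating lookup like this is also less accurate than the 
library call, so treat it only as a starting point:

import std.datetime.stopwatch : benchmark;
import std.math : sin, PI;
import std.stdio : writeln;

enum size_t N = 4096;   // table resolution, an arbitrary choice
double[N] sinTable;     // filled once at startup

static this()
{
    foreach (i; 0 .. N)
        sinTable[i] = sin(i * (2.0 * PI / N));
}

double tableSin(double x)
{
    // Range-reduce into [0, 2*PI) and truncate to an index.
    // No interpolation, and x is assumed non-negative for brevity,
    // so accuracy is limited by the table resolution.
    immutable r = x % (2.0 * PI);
    immutable idx = cast(size_t)(r * (N / (2.0 * PI))) % N;
    return sinTable[idx];
}

void main()
{
    double sink = 0.0;   // accumulate so the loops aren't optimized away
    void useBuiltin() { foreach (i; 0 .. 1_000_000) sink += sin(i * 1e-3); }
    void useTable()   { foreach (i; 0 .. 1_000_000) sink += tableSin(i * 1e-3); }

    auto results = benchmark!(useBuiltin, useTable)(10);
    writeln("std.math.sin: ", results[0]);
    writeln("tableSin:     ", results[1]);
    writeln(sink);
}

Whether the table wins will depend on the compiler flags, on 
whether it stays in L1, and on how the real code accesses it, so 
the numbers from a toy loop like this only say so much.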




>> And most of the time these functions are computed by some 
>> series that requires many terms. I'd expect that computing 
>> sin would require at least 10 multiplies for any 
>> accuracy... and surely that is much slower than simply 
>> accessing a table (it's true that my code is more complex due 
>> to the modulos and maybe that is eating up the difference).
>> 
>> Do you have any proof of your claims? Like a paper that 
>> discusses such things so I can see what's really going on and 
>> how they achieve such performance (and how accurate it is)?
>
> Not exactly what you asked, but this might help:
> https://www.agner.org/optimize
>
> Also, look up the CORDIC algorithm.

I don't see how that can necessarily be faster. A LUT can give 
full 64-bit precision with one operation, while CORDIC needs 
iteration, at least 10 rounds to be of any use. LUTs are also 
precision-independent, assuming the creation cost is not included.

I couldn't find anything in that link that specifically addressed 
speeding up typical math functions.

It would be nice to know exactly what is going on: which 
functions are optimized, and the typical speedup over a LUT.
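
For concreteness, the "series with many terms" mentioned above 
typically boils down to a short polynomial after range reduction, 
all of it staying in registers. The sketch below is only a 
truncated Taylor series on [-PI/4, PI/4], not what any particular 
libm actually ships (real implementations use minimax coefficients 
and far more careful reduction), but the shape is representative:

import std.math : PI, abs, sin;
import std.stdio : writefln;

// Illustrative only: truncated Taylor series for sin, valid on
// roughly [-PI/4, PI/4], evaluated with Horner's scheme.
double polySin(double x)
{
    immutable x2 = x * x;
    // x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
    return x * (1.0 + x2 * (-1.0/6.0 + x2 * (1.0/120.0
             + x2 * (-1.0/5040.0 + x2 * (1.0/362880.0)))));
}

void main()
{
    foreach (i; 0 .. 5)
    {
        immutable x = i * (PI / 16.0);   // stays within [0, PI/4]
        writefln("x=%.4f  poly=%.12f  libm=%.12f  diff=%.2e",
                 x, polySin(x), sin(x), abs(polySin(x) - sin(x)));
    }
}

That is about six multiplies and a few adds per call, which is why 
a computed sin can be competitive with a memory access.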

I have some code that uses sin, exp and a few other primitive 
algebraic functions, and in one case the code is extremely slow 
(it uses exp). I haven't done much testing of all this, but 
something just seems off somewhere. Either sin is heavily 
optimized and exp is not (which could also have a CORDIC-like 
algorithm, since exp(a+b) = exp(a)*exp(b)), or something else is 
happening that I'm not aware of... which is why I went to 
tabularizing these functions in the first place.
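
That identity is in fact how exp is usually range-reduced: split 
the argument, handle the coarse part exactly (real libm code 
reduces against powers of two; a small table works for an 
illustration), and use a short series for the small remainder. 
The sketch below is my own simplification of that idea (splitExp, 
the 21-entry expInt table and the integer/fraction split are 
illustrative choices, not the actual library scheme):

import std.math : exp, floor, abs;
import std.stdio : writefln;

double[21] expInt;   // exp(-10) .. exp(10), filled once at startup

static this()
{
    foreach (i; 0 .. 21)
        expInt[i] = exp(cast(double)(i - 10));
}

// exp(x) = exp(k) * exp(f) with k = floor(x) taken from the table
// and exp(f) from a truncated series on [0, 1). Only valid for
// x in roughly [-10, 11), and accurate to about seven digits.
double splitExp(double x)
{
    immutable k = cast(int) floor(x);
    immutable f = x - k;             // fractional part in [0, 1)

    double ef = 1.0, term = 1.0;
    foreach (n; 1 .. 10)             // sum f^n / n! for n = 1 .. 9
    {
        term *= f / n;
        ef += term;
    }
    return expInt[k + 10] * ef;
}

void main()
{
    foreach (x; [-3.7, -0.5, 0.0, 1.25, 4.9])
    {
        writefln("x=%+.2f  split=%.10g  libm=%.10g  rel.err=%.2e",
                 x, splitExp(x), exp(x),
                 abs(splitExp(x) - exp(x)) / exp(x));
    }
}

So exp should not be fundamentally harder than sin; if one of them 
is dramatically slower in practice, that is worth profiling rather 
than assuming.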

Also, I don't know what kind of accuracy they have, which doesn't 
help much, but I could use a more accurate algorithm even if it is 
slower, since the table is pre-computed anyway.

That site I linked does talk about some of these issues... I guess 
I just need to update my thinking on modern CPUs a bit.


I guess there is no tool that can tell one exactly what is 
happening to a piece of code in the CPU... basically an 
instruction-level profiler?

