Can this implementation of Damm algorithm be optimized?
Era Scarecrow via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Feb 9 15:49:19 PST 2017
On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
> OK I changed the approach using a multidimensional array for
> the matrix so I could ditch arithmetic operations altogether,
> but curiously after measuring a few thousand runs of both
> implementations through avgtime, I see no noticeable
> difference. Why?
Truthfully, because you'd need inputs tens or hundreds of millions
of digits long before the difference became measurable. Addition,
subtraction and simple memory lookups take very little time, and
since the entire array (100 bytes) fits in the cache, it is going
to perform very well regardless of whether you optimize it further.
If you tested this on a much slower system, say an 8-bit 6502, the
differences would be more pronounced, but still not dramatic.
Since the algorithm is O(n) either way, micro-optimizing the
per-digit step won't make much difference.
It's possible you could get a speedup by making the table entries
ints instead of chars, since that avoids any penalty for loads from
addresses not divisible by 4 (a penalty which is more of an issue on
ARM architectures and less on x86).
Other optimizations could be to make the lookup multiple levels
deep: take the basic 100-element table and expand it 2-3 levels so
several digits are consumed in a single lookup (100 bytes for 1
level, 10,000 for 2 levels, 1,000,000 for 3 levels, 100,000,000 for
4 levels, etc.). But the extra step of combining digits into the
larger index eats into the gain, and although there are fewer
memory lookups, the bigger tables won't stay in cache, so any
advantage from that is probably lost. Alternatively, if you bump
the rows up to 16 wide instead of 10, you could index with a shift
instead of a *10, which would be slightly faster (at the cost of
some unused padded slots).
In theory, if you could avoid the memory lookup altogether you
might gain some speed, depending on how the table would otherwise
be searched; but even generating all the cases with a switch-case
and a mixin, it feels like it wouldn't give you any gain...
Division is about the slowest operation you can do, but otherwise
most instructions run really fast. Unless you're trying to get the
code smaller (fewer bytes for the call) or shaving for speed by
counting instruction cycles (like on the 6502), I doubt you'll get
much benefit.