Can this implementation of Damm algorithm be optimized?
Era Scarecrow via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Feb 9 15:49:19 PST 2017
On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
> OK I changed the approach using a multidimensional array for
> the matrix so I could ditch arithmetic operations altogether,
> but curiously after measuring a few thousand runs of both
> implementations through avgtime, I see no noticeable
> difference. Why?
Truthfully, because you'd need inputs tens or hundreds of millions
of digits long before the difference became measurable. Addition,
subtraction and simple memory lookups take very little time, and
since the entire array (100 bytes) fits in the cache, it is going
to perform very well regardless of whether you optimize it further.
If you tested this on a much slower system, say an 8-bit 6502, the
differences would be more pronounced, but still not dramatic.
Since the algorithm is O(n) either way, micro-optimizing the
per-digit step won't make much difference.
It's possible you could get a speedup by making the table entries
ints instead of chars, since that avoids any penalty for loads from
addresses not divisible by 4 (a penalty which is more of an issue on
ARM architectures and less on x86).
Other optimizations could be to make the lookup multiple levels
deep: take the basic 100-element table and expand it 2-3 levels so
several digits are consumed in a single lookup (100 bytes for 1
level, 10,000 for 2 levels, 1,000,000 for 3 levels, 100,000,000 for
4 levels, etc.). But the extra step of combining digits into the
larger index eats into the gain, and although there are fewer
memory lookups, the bigger tables won't stay in cache, so any
advantage from that is probably lost. Alternatively, if you bump
the rows up to 16 wide instead of 10, you could index with a shift
instead of a *10, which would be slightly faster (at the cost of
some unused padded slots).
In theory, if you could avoid the memory lookup altogether you
might gain some speed, depending on how the table would otherwise
be searched; but even generating all the cases with a switch-case
and a mixin, it feels like it wouldn't give you any gain...
Division is about the slowest operation you can do, but otherwise
most instructions run really fast. Unless you're trying to get the
code smaller (fewer bytes for the call) or shaving for speed by
counting instruction cycles (like on the 6502), I doubt you'll get
much benefit.