Can this implementation of the Damm algorithm be optimized?

Era Scarecrow via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sun Feb 12 17:11:34 PST 2017


On Monday, 13 February 2017 at 00:56:37 UTC, Nestor wrote:
> On Sunday, 12 February 2017 at 05:54:34 UTC, Era Scarecrow 
> wrote:
>> Ran some more tests.
>
> Wow!
> Thanks for the interest and effort.

  Certainly. But the bulk of the answer comes down to this: the two 
levels of lookup I've already provided are probably the fastest 
you're going to get. We could certainly test with shorts or bytes 
instead, but the results would likely only get worse.
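
  For reference, the two-level idea amounts to folding two input 
digits into a single table lookup. Below is a minimal sketch of how 
that might look in D; the names damm2 and dammCheck2 are mine, not 
the exact code posted earlier in the thread. The 10x10 table is the 
standard Damm quasigroup.

// Standard Damm quasigroup table (weakly totally anti-symmetric).
immutable ubyte[10][10] damm = [
    [0, 3, 1, 7, 5, 9, 8, 6, 4, 2],
    [7, 0, 9, 2, 1, 5, 4, 8, 6, 3],
    [4, 2, 0, 6, 8, 7, 1, 3, 5, 9],
    [1, 7, 5, 0, 9, 8, 3, 4, 2, 6],
    [6, 1, 2, 3, 0, 4, 5, 9, 7, 8],
    [3, 6, 7, 4, 2, 0, 9, 5, 8, 1],
    [5, 8, 6, 9, 7, 2, 0, 1, 3, 4],
    [8, 9, 4, 5, 3, 6, 2, 0, 1, 7],
    [9, 4, 3, 8, 6, 1, 7, 2, 0, 5],
    [2, 5, 8, 1, 4, 3, 6, 7, 9, 0],
];

// Built at compile time: damm2[i][a * 10 + b] == damm[damm[i][a]][b],
// so one lookup consumes two digits instead of one.
immutable ubyte[100][10] damm2 = () {
    ubyte[100][10] t;
    foreach (ubyte i; 0 .. 10)
        foreach (ubyte a; 0 .. 10)
            foreach (ubyte b; 0 .. 10)
                t[i][a * 10 + b] = damm[damm[i][a]][b];
    return t;
}();

ubyte dammCheck2(const(char)[] digits) pure nothrow @nogc @safe {
    ubyte interim = 0;
    size_t i = 0;
    if (digits.length & 1) {          // odd length: eat one digit first
        interim = damm[interim][digits[0] - '0'];
        i = 1;
    }
    for (; i < digits.length; i += 2) // then two digits per lookup
        interim = damm2[interim][(digits[i] - '0') * 10
                                 + (digits[i + 1] - '0')];
    return interim;                   // 0 means the string checks out
}

  For example, dammCheck2("572") returns 4 (the check digit to 
append), and dammCheck2("5724") returns 0 (valid).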

  Note that my tests were run strictly on my x86 system; it would be 
better to also test on other systems and architectures, such as PPC 
and ARM, to see how they perform, and possibly tweak the code as 
appropriate.

  Still, we did find that some optimization of the Damm algorithm is 
possible; it just isn't going to be a lot.

  Hmmm... a thought does come to mind: parallelizing the code. 
However, that would probably require 11 instances to get a 2x 
speedup (calculating the second half with all 10 possibilities for 
the carried-over interim digit, calculating the first half at the 
same time, and then choosing one of the 10 results based on the 
first half's output). That only really works if you have a ton of 
cores and the input is REALLY REALLY large, like a megabyte or more. 
In practice the Damm algorithm is mostly used to append a check 
digit to the end of a short code, like a UPC or barcode, for error 
detection, and expecting inputs longer than 32 digits in real 
applications is unlikely. A rough sketch of the speculative scheme 
follows anyway.
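
  Purely to illustrate the idea (my own untested sketch, not code 
from the thread; the name dammSpeculative is made up, and it reuses 
the damm table from the sketch above, plus std.parallelism):

import std.parallelism : task, taskPool;

// Speculative parallel Damm: 10 tasks each process the second half
// of the input assuming one of the 10 possible interim values
// carried over from the first half; meanwhile this thread processes
// the first half for real, then picks the matching result.
ubyte dammSpeculative(const(char)[] digits) {
    // Plain one-digit-at-a-time Damm loop from a given start interim.
    static ubyte run(ubyte start, const(char)[] s) {
        ubyte interim = start;
        foreach (c; s)
            interim = damm[interim][c - '0'];
        return interim;
    }

    const firstHalf = digits[0 .. $ / 2];
    const secondHalf = digits[$ / 2 .. $];

    // Launch one speculative task per possible carry-in value.
    typeof(task!run(ubyte(0), secondHalf))[10] jobs;
    foreach (ubyte start; 0 .. 10) {
        jobs[start] = task!run(start, secondHalf);
        taskPool.put(jobs[start]);
    }

    // This thread computes the first half while the tasks run.
    immutable carry = run(0, firstHalf);

    // The correct final interim is the speculative run whose assumed
    // carry-in matches the first half's actual output.
    return jobs[carry].yieldForce;
}

  Whether that ever beats the plain loop is doubtful for exactly the 
reasons above: spawning 11 workers costs far more than just scanning 
a short digit string.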

  But at this point I'm rambling.

