Can this implementation of Damm algorithm be optimized?

Cym13 via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu Feb 9 12:56:15 PST 2017


On Thursday, 9 February 2017 at 19:39:49 UTC, Nestor wrote:
> On Thursday, 9 February 2017 at 18:34:30 UTC, Era Scarecrow 
> wrote:
>> On Thursday, 9 February 2017 at 17:36:11 UTC, Nestor wrote:
>>> I was trying to port C code from the article in Wikiversity 
>>> [1] to D, but I'm not sure this implementation is the most 
>>> efficient way to do it in D, so suggestions to optimize it 
>>> are welcome:
>>>
>>> import std.stdio;
>>>
>>> static immutable char[] QG10Matrix =
>>>   "03175986427092154863420687135917509834266123045978" ~
>>>   "36742095815869720134894536201794386172052581436790";
>>>
>>> char checkDigit(string str) {
>>>   char tmpdigit = '0';
>>>   foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') + 
>>> (tmpdigit - '0') * 10];
>>>   return tmpdigit;
>>> }
>>
>> Well, one thing you could do is reduce them from chars to just 
>> bytes: instead of subtracting '0' on every lookup, you add it 
>> back once at the end. Although unless you're working with a VERY 
>> large input you won't see a difference.
>>
>> Actually since you're also multiplying by 10, you can 
>> incorporate that in the table too... (although a mixin might 
>> be better for the conversion than by hand)
>>
>>
>>  static immutable char[] QG10Matrix = [
>>     00,30,10,70,50,90,80,60,40,20,
>>     70,00,90,20,10,50,40,80,60,30,
>>     40,20,00,60,80,70,10,30,50,90,
>>     10,70,50,00,90,80,30,40,20,60,
>>     60,10,20,30,00,40,50,90,70,80,
>>     30,60,70,40,20,00,90,50,80,10,
>>     50,80,60,90,70,20,00,10,30,40,
>>     80,90,40,50,30,60,20,00,10,70,
>>     90,40,30,80,60,10,70,20,00,50,
>>     20,50,80,10,40,30,60,70,90,00];
>>
>>  char checkDigit(string str) {
>>    char tmpdigit = 0;
>>    foreach(chr; str) tmpdigit = QG10Matrix[(chr - '0') +
>>  tmpdigit];
>>    return (tmpdigit/10) + '0';
>>  }
>
>
>
> OK I changed the approach using a multidimensional array for 
> the matrix so I could ditch arithmetic operations altogether, 
> but curiously after measuring a few thousand runs of both 
> implementations through avgtime, I see no noticeable 
> difference. Why?
>
> import std.stdio;
>
> static immutable ubyte[][] QG10Matrix = [
>   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
>   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
>   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
>   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
>   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
> ];
>
> static int charToInt(char chr) {
>   scope(failure) return -1;
>   return cast(int)(chr - '0');
> }
>
> ubyte checkDigit(string str) {
>   ubyte tmpdigit;
>   foreach(chr; str) tmpdigit = 
> QG10Matrix[tmpdigit][charToInt(chr)];
>   return tmpdigit;
> }
>
> enum {
>   EXIT_SUCCESS = 0,
>   EXIT_FAILURE = 1,
> }
>
> int main(string[] args) {
>   scope(failure) {
>     writeln("Invalid arguments. You must pass a number.");
>     return EXIT_FAILURE;
>   }
>   assert(args.length == 2);
>   ubyte digit = checkDigit(args[1]);
>   if(digit == 0) writefln("%s is a valid number.", args[1]);
>   else {
>     writefln("%s is not a valid number (but it would be, 
> appending digit %s).",
>       args[1], digit);
>   }
>
>   return EXIT_SUCCESS;
> }

The question is, why do you expect it to be noticeably faster? On 
modern hardware the optimizations are such that so small a change 
in code is very hard to link to a difference in running time. If 
you really want to show one, the following code does the benchmark 
the right way (i.e. using a very long input, tens of thousands of 
runs, and avoiding I/O and the program's load time so as to 
compare the bare function implementations):

import std.conv;
import std.stdio;
import std.range;
import std.array;
import std.algorithm;

string testcase;

static immutable char[] QG10MatrixOne =
   "03175986427092154863420687135917509834266123045978" ~
   "36742095815869720134894536201794386172052581436790";

char checkDigitOne(string str) {
   char tmpdigit = 0;
   foreach(chr; str) tmpdigit = QG10MatrixOne[(chr - '0') + 
(tmpdigit - '0') * 10];
   return tmpdigit;
}

void testCheckDigitOne() {
     checkDigitOne(testcase);
}

static immutable ubyte[][] QG10MatrixTwo = [
   [0,3,1,7,5,9,8,6,4,2],[7,0,9,2,1,5,4,8,6,3],
   [4,2,0,6,8,7,1,3,5,9],[1,7,5,0,9,8,3,4,2,6],
   [6,1,2,3,0,4,5,9,7,8],[3,6,7,4,2,0,9,5,8,1],
   [5,8,6,9,7,2,0,1,3,4],[8,9,4,5,3,6,2,0,1,7],
   [9,4,3,8,6,1,7,2,0,5],[2,5,8,1,4,3,6,7,9,0],
];

static int charToInt(char chr) {
   scope(failure) return -1;
   return cast(int)(chr - '0');
}

ubyte checkDigitTwo(string str) {
   ubyte tmpdigit;
   foreach(chr; str) {
       tmpdigit = QG10MatrixTwo[tmpdigit][charToInt(chr)];
   }
   return tmpdigit;
}

void testCheckDigitTwo() {
     checkDigitTwo(testcase);
}

void main(string[] args) {
   testcase = 
iota(10).cycle.take(100000).map!(to!string).array.join;

   import std.datetime: benchmark;
   benchmark!(testCheckDigitOne, 
testCheckDigitTwo)(10000).each!writeln;
}


Yet on my machine I have the following times (note that the times 
themselves depend on my hardware; what matters is the difference):

TickDuration(15785255852)
TickDuration(15784826803)

So it seems that the first version is slower than the second, but 
by so little that it's hard to attribute to the actual 
implementation. If anything it shows the futility of such 
low-level tricks for meaningless optimization. And by the way, 
note that the benchmark avoids measuring I/O because we were 
interested in the functions themselves. If we were measuring it, 
it would represent four fifths of the running time (on my machine, 
with numbers as long as the one used in the benchmark). This means 
even a 20% improvement in checkDigit would only represent a 4% 
improvement of the program.
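Spelling out that last calculation: if I/O accounts for four fifths 
of the total running time T, then checkDigit accounts for the 
remaining T/5, and a 20% speedup of checkDigit saves

    0.20 * (T/5) = 0.04 * T

i.e. only 4% of the whole program. This is Amdahl's law in 
miniature: the achievable overall speedup is bounded by the 
fraction of time the optimized part actually occupies.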

