Strange counter-performance in an alternative `decimalLength9` function

Basile B. b2.temp at gmx.com
Thu Feb 27 09:41:20 UTC 2020


On Thursday, 27 February 2020 at 09:33:28 UTC, Dennis Cote wrote:
> On Wednesday, 26 February 2020 at 00:50:35 UTC, Basile B. wrote:
>> So after reading the translation of RYU I was interested too 
>> see if the decimalLength() function can be written to be 
>> faster, as it cascades up to 8 CMP.
>
> Perhaps you could try something like this.
>
> <code>
> int decimalDigitLength(ulong n) {
> 	if (n < 10000)
> 		if (n < 100)
> 			return n < 10 ? 1 : 2;
> 		else
> 			return n < 1000 ? 3 : 4;
> 	else
> 		if (n < 100000000)
> 			if (n < 1000000)
> 				return n < 100000 ? 5 : 6;
> 			else
> 				return n < 10000000 ? 7 : 8;
> 		else
> 			if (n < 1000000000000)
> 				if (n < 10000000000)
> 					return n < 1000000000 ? 9 : 10;
> 				else
> 					return n < 100000000000 ? 11 : 12;
> 			else
> 				if (n < 10000000000000000)
> 					if (n < 100000000000000)
> 						return n < 10000000000000 ? 13 : 14;
> 					else
> 						return n < 1000000000000000 ? 15 : 16;
> 				else
> 					if (n < 1000000000000000000)
> 						return n < 100000000000000000 ? 17 : 18;
> 					else
> 						return n < 10000000000000000000UL ? 19 : 20;
> }
> </code>
>
> This uses at most 6 compares for any 64-bit number and only 3 
> for the most common small numbers less than 10000.
>
> I was glad to see that with ldc at run.dlang.io, using the -O3 
> optimization, I could change the function signature to match 
> yours and the compiler eliminated all the unreachable dead code 
> for the larger values. The compiler produced the following assembly:
>
> <code>
> 	.section	.text.ubyte onlineapp.decimalLength9(uint),"axG",@progbits,ubyte onlineapp.decimalLength9(uint),comdat
> 	.globl	ubyte onlineapp.decimalLength9(uint)
> 	.p2align	4, 0x90
> 	.type	ubyte onlineapp.decimalLength9(uint),@function
> ubyte onlineapp.decimalLength9(uint):
> 	.cfi_startproc
> 	cmpl	$9999, %edi
> 	ja	.LBB1_5
> 	cmpl	$99, %edi
> 	ja	.LBB1_4
> 	cmpl	$10, %edi
> 	movb	$2, %al
> 	sbbb	$0, %al
> 	retq
> .LBB1_5:
> 	cmpl	$99999999, %edi
> 	ja	.LBB1_9
> 	cmpl	$999999, %edi
> 	ja	.LBB1_8
> 	cmpl	$100000, %edi
> 	movb	$6, %al
> 	sbbb	$0, %al
> 	retq
> .LBB1_4:
> 	cmpl	$1000, %edi
> 	movb	$4, %al
> 	sbbb	$0, %al
> 	retq
> .LBB1_9:
> 	cmpl	$1000000000, %edi
> 	movb	$10, %al
> 	sbbb	$0, %al
> 	retq
> .LBB1_8:
> 	cmpl	$10000000, %edi
> 	movb	$8, %al
> 	sbbb	$0, %al
> 	retq
> .Lfunc_end1:
> 	.size	ubyte onlineapp.decimalLength9(uint), .Lfunc_end1-ubyte onlineapp.decimalLength9(uint)
> 	.cfi_endproc
> </code>
>
> for the same body with signature ubyte decimalLength9(uint n).
>
> This may be faster than your sequential comparison function 
> depending upon the distribution of numbers. In real 
> applications, small numbers are far more common so the reduced 
> number of compares for those values should be beneficial in 
> most cases.
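
(As an aside, the comparison tree above can be sanity-checked at the 
digit boundaries with a unittest along these lines; this is only a 
sketch and assumes the quoted decimalDigitLength is in the same module:)

---
unittest
{
    // Boundary values around each comparison in the tree above.
    assert(decimalDigitLength(0) == 1);
    assert(decimalDigitLength(9) == 1);
    assert(decimalDigitLength(10) == 2);
    assert(decimalDigitLength(9_999) == 4);
    assert(decimalDigitLength(10_000) == 5);
    assert(decimalDigitLength(999_999_999) == 9);
    assert(decimalDigitLength(1_000_000_000) == 10);
    assert(decimalDigitLength(ulong.max) == 20); // 18446744073709551615
}
---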

Sorry but no. I think that you have missed how this has changed 
since the first message.
1. The way it was tested initially was wrong, because LLVM was 
optimizing some stuff in some tests and not in others, due to the 
literal constants.
2. Apparently there is a branchless version that is fast when 
tested with unbiased input (still to be verified).

This version is:

---
ubyte decimalLength9_4(const uint v) pure nothrow
{
     return 1 +  (v >= 10) +
                 (v >= 100) +
                 (v >= 1000) +
                 (v >= 10000) +
                 (v >= 100000) +
                 (v >= 1000000) +
                 (v >= 10000000) +
                 (v >= 100000000) ;
}
---
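
What makes this branchless form work: each `v >= N` comparison yields 
a bool, which implicitly converts to 0 or 1 in the addition, so the 
expression simply counts how many powers of ten v has reached; and if 
I read the RYU source right, decimalLength9 is only ever called with 
values below 10^9, which is why stopping at the 10^8 comparison is 
enough. As a quick illustration (my sketch, not part of the measured 
code), it can be checked against a naive loop at the boundaries:

---
unittest
{
    // Naive reference: count digits by repeated division.
    static ubyte naive(uint v)
    {
        ubyte len = 1;
        while (v >= 10) { v /= 10; ++len; }
        return len;
    }

    // 1- to 9-digit boundary values, the range decimalLength9 targets.
    foreach (v; [0u, 9, 10, 99, 100, 9_999, 10_000,
                 99_999_999, 100_000_000, 999_999_999])
        assert(decimalLength9_4(v) == naive(v));
}
---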

But I cannot see the improvement when using `time` on the test 
program with 100000000 calls fed with random numbers.
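
For what it's worth, a self-contained harness along these lines keeps 
the random number generation out of the timed loop (only a sketch: it 
assumes decimalLength9_4 above is in scope, and an optimizer is of 
course still free to play tricks with such a micro-benchmark):

---
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.random : Xorshift, uniform;
import std.stdio : writefln;

void main()
{
    // Pre-generate the inputs once; 100 passes over 1_000_000 values
    // gives the same 100_000_000 calls without timing the RNG itself.
    auto rng = Xorshift(42);
    auto inputs = new uint[](1_000_000);
    foreach (ref v; inputs)
        v = uniform(0u, 1_000_000_000u, rng); // uniform, "unbiased" input

    ulong sink; // accumulate results so the calls cannot be dropped
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. 100)
        foreach (v; inputs)
            sink += decimalLength9_4(v);
    sw.stop();
    writefln("decimalLength9_4: %s ms (checksum %s)",
             sw.peek.total!"msecs", sink);
}
---

Swapping the other variants into the inner loop then gives directly 
comparable numbers for the same input distribution.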

See 
https://forum.dlang.org/post/ctidwrnxvwwkouprjszw@forum.dlang.org 
for the latest evolution of the discussion.

