Bits rotations

Wed Oct 17 19:36:58 PDT 2012

In cryptographic code, and generally in bit-twiddling code, 
rotation of the bits in a word is a common operation. It's so 
common that Intel has made it an asm instruction.

A demo program:

uint foo(in uint x) pure nothrow {
     return (x << 11) | (x >> (32 - 11));
}

uint bar(uint x, in ubyte n) pure nothrow {
     asm {
         mov EAX, x;
         mov CL, n;
         rol EAX, CL;
         mov x, EAX;
     }
     return x;
}

void main() {
     import std.stdio;
     uint x = 4290772992U;
     writefln("%032b", x);
     uint y = foo(x);
     writefln("%032b", y);
     uint z = bar(x, 11);
     writefln("%032b", z);
}

Its output, dmd -O -release -inline:

11111111110000000000000000000000
00000000000000000000011111111110
00000000000000000000011111111110

Even with full optimizations DMD seems not able to detect the 
rotation in foo(), and writing assembly as in bar() is not an 
option, because the code is even longer and bar() can't be 
inlined. This is the ASM generated (32 bit):

_D4test3fooFNaNbxkZk:
         push    EAX
         mov ECX,[ESP]
         shl EAX,0Bh
         shr ECX,015h
         or  EAX,ECX
         pop ECX
         ret

_D4test3barFNaNbkxhZk:
         push    EBP
         mov EBP,ESP
         push    EAX
         mov EAX,8[EBP]
         mov CL,-4[EBP]
         rol EAX,CL
         mov 8[EBP],EAX
         mov EAX,8[EBP]
         mov ESP,EBP
         pop EBP
         ret 4

GDC 4.6.3 does better, recognizing the rol (foo() can be inlined, 
usually becoming 1 instruction in the middle of other code) (I 
like the demangling here!):

pure nothrow uint example.foo(const(uint)):
     movl    %edi, %eax
     roll    $11, %eax
     ret

pure nothrow uint example.bar(uint, const(ubyte)):
     movl  %edi, -4(%rsp)
     movb  %sil, -5(%rsp)
     movl -4(%rsp), %eax
     movb -5(%rsp), %cl
     roll %cl, %eax
     movl %eax, -4(%rsp)
     movl -4(%rsp), %eax
     ret

So I'd like to write:

uint spam(in uint x) pure nothrow @safe {
     import core.bitop: rol;
     return rol(x, 11);
}

This is better because:
- It's standard. If a CPU supports rol (or ror) the compiler uses 
it. If it doesn't support it, the compiler uses shifts and an or.
- It's shorter and more readable. For me "rol(x, 11)" is rather 
more easy to read and debug than code like "(x << 11) | (x >> (32 
- 11))".
- spam() is inlinable, just like foo() and unlike bar().
- It doesn't rely on compiler optimizations, that sometimes are 
not present. If the CPU supports the rol, the compiler doesn't 
need pattern matching, it just spits out a rol. DMD currently has 
such optimization, but apparently here it's not working, I don't 
know why. For such basic operation, that is built in many CPUs 
there is no need for compiler optimizations.

Bye,
bearophile