value range propagation for _bitwise_ OR

Tue Apr 13 09:09:15 PDT 2010

Adam D. Ruppe wrote:
> On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
>> That's strange. Looking at src/backend/cod4.c, function cdbscan, in the dmd sources, bsr seems to be implemented in terms of the bsr opcode [1] (which I guess is the reason it's an intrinsic in the first place). I would have expected this to be much, much faster than a user function. Anyone care enough to check the generated assembly?
> 
> The opcode is fairly slow anyway (as far as opcodes go) - odds are the
> implementation inside the processor is similar to Jerome's method, and
> the main savings come from it loading fewer bytes into the pipeline.
> 
> I remember a line from a blog, IIRC it was the author of the C++ FQA
> writing it, saying hardware and software are pretty much the same thing -
> moving an instruction to hardware doesn't mean it will be any faster,
> since it is the same algorithm, just done in processor microcode instead of
> user opcodes.
> 
The bsr opcode is fast on Intel and slow on AMD. I bet the speed difference
comes from inlining max().
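
For anyone who wants to compare the generated assembly, here is a minimal sketch of a software bit-scan-reverse in C. It is not Jerome's actual code (that isn't reproduced in this message) and not what dmd emits; it is just a generic binary-search version, and the name soft_bsr is made up for illustration. A function of this shape is what the bsr opcode would be measured against:

```c
#include <stdint.h>

/* Hypothetical helper (name is illustrative only): index of the highest
   set bit of x, in the range 0..31. As with the x86 bsr instruction, the
   result is not meaningful for x == 0; this sketch returns 0 there. */
static int soft_bsr(uint32_t x)
{
    int pos = 0;

    /* Binary search: at each step, test whether the upper half of the
       remaining bits contains a set bit; if so, shift it down and
       record how far we moved. */
    if (x & 0xFFFF0000u) { x >>= 16; pos += 16; }
    if (x & 0x0000FF00u) { x >>= 8;  pos += 8;  }
    if (x & 0x000000F0u) { x >>= 4;  pos += 4;  }
    if (x & 0x0000000Cu) { x >>= 2;  pos += 2;  }
    if (x & 0x00000002u) {           pos += 1;  }

    return pos;
}
```

That is five compare-and-shift steps for a 32-bit value, so whether a single opcode beats it depends on how the processor implements that opcode and on whether the surrounding code gets inlined.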