value range propagation for _bitwise_ OR

Tue Apr 13 09:15:09 PDT 2010

Don wrote:
> Adam D. Ruppe wrote:
>> On Tue, Apr 13, 2010 at 11:10:24AM -0400, Clemens wrote:
>>> That's strange. Looking at src/backend/cod4.c, function cdbscan, in 
>>> the dmd sources, bsr seems to be implemented in terms of the bsr 
>>> opcode [1] (which I guess is the reason it's an intrinsic in the 
>>> first place). I would have expected this to be much, much faster than 
>>> a user function. Anyone care enough to check the generated assembly?
>>
>> The opcode is fairly slow anyway (as far as opcodes go) - odds are the
>> implementation inside the processor is similar to Jerome's method, and
>> the main savings come from it loading fewer bytes into the pipeline.
>>
>> I remember a line from a blog, IIRC it was the author of the C++ FQA
>> writing it, saying hardware and software are pretty much the same thing -
>> moving an instruction to hardware doesn't mean it will be any faster,
>> since it is the same algorithm, just done in processor microcode instead
>> of user opcodes.
>>
> It's fast on Intel, slow on AMD. I bet the speed difference comes from 
> inlining max().

Specifically, bsr is 7 uops on AMD, 1 uop on Intel since the original 
Pentium. AMD's performance is shameful.

And bsr() is supported in the compiler; in fact DMC uses it extensively, 
which is why it's included in DMD!
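
For anyone who wants to time a user function against the intrinsic, here is a minimal sketch of a software bit-scan-reverse in C. It is a generic shift-and-test version, not necessarily Jerome's method, and the name soft_bsr is made up for illustration. Like the opcode, it has no defined result for an input of 0, so the sketch asserts on that case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from DMD or DMC): returns the index (0..31)
   of the highest set bit of x, which is what the bsr opcode computes.
   The result is undefined for x == 0, so that case is rejected here. */
static unsigned soft_bsr(uint32_t x)
{
    unsigned pos = 0;

    assert(x != 0);

    /* Binary search: halve the window in which the top bit can lie. */
    if (x & 0xFFFF0000u) { x >>= 16; pos += 16; }
    if (x & 0x0000FF00u) { x >>= 8;  pos += 8;  }
    if (x & 0x000000F0u) { x >>= 4;  pos += 4;  }
    if (x & 0x0000000Cu) { x >>= 2;  pos += 2;  }
    if (x & 0x00000002u) {           pos += 1;  }

    return pos;
}
```

That is five test-and-shift steps for a 32-bit value, so a 1-uop hardware bsr should beat it comfortably, while a 7-uop one has much less of an edge.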