suggested improvements to D

Tue Jan 9 02:12:35 PST 2007

Warren D Smith wrote:
> popcount and first1:
> Often one represents a set by means of an array of booleans, or
> equivalently by the bits of a machine word.
> Now, the cray used to have a wonderful "popcount" machine-instruction
>  a = popcount(b);
> which would cause the number of 1-bits in the uint b, to be placed in
> the integer a.
> E.g. popcount(21)=3.
> Also, some machines had a "first1" instruction which would cause, e.g,
>   c=0;
>   printf("here are the locations of the 1bits in b: ");
>   for(a = first1(b); b!=0; b >>= a){
>      c += a;
>      printf("%d,", c);
>   }
>   printf("\n");
> to work as a way to loop thru the elements of the set.
> 
> This was all very fine.  It led to TREMENDOUS speedup and
> simplification of the right kinds of programs.
> BUT, later hardware got rid of the instruction and languages
> did not have these as features. 

We have first1 in the form of the bsf and bsr intrinsics.
popcount is being added to X86 together with the next SSE revision, so 
it's coming back into hardware.
I would like to see a library implementation of popcount added.
Here's one I wrote some time ago:

/**
  *  Calculates the number of set bits in a 32-bit integer.
  *
  */
int bitcount(uint x)
{
     // Avoid branches, and the potential for cache misses which
     // could be incurred with a table lookup.

     // We need to mask alternate bits to prevent the
     // sum from overflowing.
     // add neighbouring bits. Each bit is 0 or 1.
     x = x - ((x>>1) & 0x5555_5555);
     // now each two bits of x is a number 00,01 or 10.
     // now add neighbouring pairs
     x = ((x&0xCCCC_CCCC)>>2) + (x&0x3333_3333);
     // now each nibble holds 0000-0100. Adding them won't
     // overflow any more, so we don't need to mask any more

     // Now add the nibbles, then the bytes, then the words
     // We still need to mask to prevent double-counting.
     // Note that if we used a rotate instead of a shift, we
     // wouldn't need the masks, and could just divide the sum
     // by 8 to account for the double-counting.
     // On some CPUs, it may be faster to perform a multiply.

     x += (x>>4);
     x &= 0x0F0F_0F0F;
     x += (x>>8);
     x &= 0x00FF_00FF;
     x += (x>>16);
     x &= 0xFFFF;
     return x;
}

unittest {
   assert(bitcount(0)==0);
   assert(bitcount(7)==3);
   assert(bitcount(0xAA)==4);
   assert(bitcount(0x8421_1248)==8);
   assert(bitcount(0xFFFF_FFFF)==32);
   assert(bitcount(0xCCCC_CCCC)==16);
   assert(bitcount(0x7777_7777)==24);
}

> [Aside:
> Another excellent hardware instruction that as far as I know was never
> implemented is
> the "dovetail" and "undovetail" instructions
> which place the even-indexed bits of a word into a halfword, and the
> odd-index ones
> into another (undovetail) and dovetail is the reverse.]

Great! I never had a name for those.

> Undenying access to arithmetic:
> hardware goes to a great amount of trouble to provide add-with-carry,
> multiply-two-singlewords-to-get-a-doubleword,
> long-division-with-quotient&remainder
> (if a is a doubleword divide a by b to get quotient q and remainder r,
> all singlewords), etc. to us.
> Another good thing available in a lot of hardware is shift-with-carry,
> and circular shift with and without carry.
> Why the hell does the language then tell the user to go screw
> themselves - we intend to deny
> you access to that power??!!!
> Hello?  This makes it very painful or
> difficult or slow
> to build a bignum package.  It is just gratuitously asinine.  If D
> delivered this it'd (a) be veyr easy, (b) instantly attract a lot of converts
> and you could build a portable bignum code that ran about 4
> times faster.

Rotations without carry would be very nice. But for the others, are 
there many applications of these OTHER than bignum arithmetic? My 
feeling is that you'd always want to do bignum in asm anyway.

Note that D is a great language for writing asm in; you can have a 
single source code shared between Windows, *nix, and the new x86 Macs.