tooling quality and some random rant

Sat Feb 19 12:19:44 PST 2011

Hello,

"Walter Bright" <newshound2 at digitalmars.com> wrote in message 
news:ijnt3o$22dm$1 at digitalmars.com...
> nedbrek wrote:
>> Reordering happens in the scheduler. A simple model is "Fetch", 
>> "Schedule", "Retire".  Fetch and retire are done in program order.  For 
>> code that is hitting well in the cache, the biggest bottleneck is that 
>> "4" decoder (the complex instruction decoder).  Reducing the number of 
>> complex instructions will be a big win here (and settling them into the 
>> 4-1-1(-1) pattern).
>>
>> Of course, on anything after Core 2, the "1" decoders can handle pushes, 
>> pops, and load-ops (r+=m) (although not load-op-store (m+=r)).
>>
>> Also, "macro op fusion" allows you can get a branch along with the last 
>> instruction in decode, potentially giving you 5 macroinstructions per 
>> cycle from decode.  Make sure it is the flags producing instruction 
>> (cmp-br).
>>
>
> I can't find any Intel documentation on this. Can you point me to some?

The best available source is the optimization reference manual 
(http://www.intel.com/products/processor/manuals/).  The latest version is 
248966.pdf, which mentions "Decodes up to four instructions, or up to five 
with macro-fusion" (page 33).  Also, page 36: "Macro-fusion merges two 
instructions into a single µop. Intel Core microarchitecture is capable of 
one macro-fusion per cycle in 32-bit operation".  It's unclear whether 
macro-fusion is disabled entirely in 64-bit mode, or whether this has 
changed in more recent processors.

They recommend against scheduling code for the 4-1-1-1 pattern in general 
(also page 36); I'd assume it only pays off in a very targeted application.  
As always, it is best to run things both ways and measure.
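
To make the 4-1-1-1 idea concrete, here is a rough sketch of a toy decode 
model.  It is my own simplification, not anything from the manual: decoder 0 
takes any instruction of up to 4 uops, the other three decoders take only 
single-uop instructions, decode is strictly in program order, and 
macro-fusion and microcoded instructions are ignored.  The function name and 
parameters are made up for illustration.

```python
def decode_cycles(uops_per_insn, simple_decoders=3, complex_limit=4):
    """Count decode cycles for a list of per-instruction uop counts.

    Toy 4-1-1-1 model (an assumption, not a description of real hardware):
    each cycle decoder 0 takes the next instruction, then up to
    `simple_decoders` following instructions are taken only while each
    one is a single uop.
    """
    cycles = 0
    i = 0
    n = len(uops_per_insn)
    while i < n:
        if uops_per_insn[i] > complex_limit:
            raise ValueError("microcoded instructions are outside this toy model")
        cycles += 1
        i += 1  # decoder 0 (the complex decoder) takes one instruction
        taken = 0
        # The simple decoders stop at the first multi-uop instruction.
        while i < n and taken < simple_decoders and uops_per_insn[i] == 1:
            i += 1
            taken += 1
    return cycles


# Same instruction mix, two orderings.
interleaved = [2, 1, 1, 1, 2, 1, 1, 1]   # fits the 4-1-1-1 pattern
clustered = [2, 2, 1, 1, 1, 1, 1, 1]     # complex instructions back to back
print(decode_cycles(interleaved), decode_cycles(clustered))
```

In this model the interleaved ordering decodes in fewer cycles than the 
clustered one, which is the whole point of the pattern.  Real front ends have 
more going on (fetch alignment, queues, fusion), so measurement still wins.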

The next section (2.1.2.5) talks about stack pointer tracking, which allows 
macro instructions that used to be 2 uops (pop r -> load r = [esp]; inc esp) 
to become one (just the load).  Pushes, which used to be 3 uops 
(store_address esp, store_data r, dec esp), should also be one fused uop (via 
sta/std fusion and stack pointer tracking).
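
As a back-of-the-envelope illustration of what that saves, the sketch below 
just adds up the uop counts quoted above for a run of pushes and pops.  The 
table names and the example sequence are mine, and the counts are the ones 
from this post, not measurements.

```python
# uop counts per instruction as described above (not measured).
UOPS_WITHOUT_TRACKING = {"push": 3, "pop": 2}
UOPS_WITH_TRACKING = {"push": 1, "pop": 1}  # push counted as one fused uop


def total_uops(ops, table):
    """Sum the uops for a sequence of 'push'/'pop' mnemonics."""
    return sum(table[op] for op in ops)


# Hypothetical function that saves and restores four registers.
ops = ["push"] * 4 + ["pop"] * 4
print(total_uops(ops, UOPS_WITHOUT_TRACKING))  # 4*3 + 4*2 = 20
print(total_uops(ops, UOPS_WITH_TRACKING))     # 4*1 + 4*1 = 8
```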

----
Another good resource is "Real World Tech", particularly:
http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144

Page 4 covers the front end: "Macro-op fusion lets the decoders combine two 
macro instructions into a single uop. Specifically, x86 compare or test 
instructions are fused with x86 jumps to produce a single uop and any 
decoder can perform this optimization."
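
A minimal sketch of that pairing rule, assuming the simplest possible 
reading: a cmp or test immediately followed by a conditional jump counts as 
one decoded op.  Which conditional jumps actually qualify depends on the 
microarchitecture; treating every j* mnemonic except jmp as fusable is my 
simplification, and the function name is made up.

```python
def ops_after_macro_fusion(mnemonics):
    """Count decoded ops when each cmp/test + conditional-jump pair fuses.

    Assumption: any mnemonic starting with 'j' other than 'jmp' is treated
    as a fusable conditional jump. Real hardware is more selective.
    """
    count = 0
    i = 0
    n = len(mnemonics)
    while i < n:
        cur = mnemonics[i].lower()
        nxt = mnemonics[i + 1].lower() if i + 1 < n else ""
        if cur in ("cmp", "test") and nxt.startswith("j") and nxt != "jmp":
            i += 2  # the pair is consumed as a single op
        else:
            i += 1
        count += 1
    return count


code = ["mov", "cmp", "jne", "add", "test", "jz", "jmp"]
print(ops_after_macro_fusion(code))  # 7 instructions -> 5 ops
```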

----
Finally, the Intel Technology Journal has some really good details (when you 
can find them! :)

For example:
http://download.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/vol7iss2_art03.pdf

describes the first processor to use micro-op fusion (Pentium M or Banias, 
which was the base design for Dothan and Yonah).  See page 26 (epage 7/18), 
which starts the section "MICRO-OPS FUSION".  It gives a lot of detail on 
the store address / store data fusion.

Hope that helps,
Ned