tooling quality and some random rant

Sat Feb 19 12:36:02 PST 2011

nedbrek Wrote:

> Hello,
> 
> "Walter Bright" <newshound2 at digitalmars.com> wrote in message 
> news:ijnt3o$22dm$1 at digitalmars.com...
> > nedbrek wrote:
> >> Reordering happens in the scheduler. A simple model is "Fetch", 
> >> "Schedule", "Retire".  Fetch and retire are done in program order.  For 
> >> code that is hitting well in the cache, the biggest bottleneck is that 
> >> "4" decoder (the complex instruction decoder).  Reducing the number of 
> >> complex instructions will be a big win here (and settling them into the 
> >> 4-1-1(-1) pattern).
> >>
> >> Of course, on anything after Core 2, the "1" decoders can handle pushes, 
> >> pops, and load-ops (r+=m) (although not load-op-store (m+=r)).
> >>
> >> Also, "macro op fusion" allows you to get a branch along with the last 
> >> instruction in decode, potentially giving you 5 macroinstructions per 
> >> cycle from decode.  Make sure it is the flags producing instruction 
> >> (cmp-br).
> >>
> >
> > I can't find any Intel documentation on this. Can you point me to some?
> 
> The best available source is the optimization reference manual 
> (http://www.intel.com/products/processor/manuals/).  The latest version is 
> 248966.pdf, which mentions "Decodes up to four instructions, or up to five 
> with macro-fusion" (page 33).  Also, page 36: "Macro-fusion merges two 
> instructions into a single µop. Intel Core microarchitecture is capable of 
> one macro-fusion per cycle in 32-bit operation".  It's unclear if macro 
> fusion is off entirely in 64 bit mode, and whether this has changed in more 
> recent processors...

I remember reading that macro fusion is entirely off in 64-bit mode on Core 2, that Nehalem added support for it in 64-bit mode, and that Sandy Bridge extended the set of instruction pairs that can fuse.

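When generating code for loops, the compiler could also make use of the Loop Stream Detector (LSD): a hot loop that is small enough to fit in the loop buffer is replayed from there instead of going through instruction fetch again on every iteration.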
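
To make the 4-1-1-1 pattern and the macro-fusion point from the quoted text a bit more concrete, here is a toy model in Python. It is only a sketch under my own assumptions, not a description of what the hardware really does: instructions are decoded strictly in order, slot 0 accepts an instruction of up to 4 µops, the other three slots accept single-µop instructions only, and a "cmp" immediately followed by a conditional jump can share one slot, at most once per cycle. The function name `decode_cycles` and the mnemonic "jcc" (standing for any conditional jump) are made up for illustration.

```python
def decode_cycles(instrs, macro_fusion=False):
    """Count decode cycles for an in-order list of (mnemonic, uop_count).

    Toy 4-1-1-1 model (an assumption, not a hardware spec):
    slot 0 takes an instruction of up to 4 uops, slots 1..3 take
    single-uop instructions only.
    """
    cycles = 0
    i = 0
    n = len(instrs)
    while i < n:
        cycles += 1
        fused = False  # assume at most one macro-fusion per cycle
        for slot in range(4):
            if i >= n:
                break
            name, uops = instrs[i]
            limit = 4 if slot == 0 else 1
            if uops > limit:
                if slot == 0:
                    # more than 4 uops would need microcode; not modeled here
                    raise ValueError("instruction too complex for this model")
                break  # complex instruction waits for slot 0 of the next cycle
            i += 1
            # a cmp directly followed by a conditional jump shares the slot
            if (macro_fusion and not fused and name == "cmp"
                    and i < n and instrs[i][0] == "jcc"):
                i += 1
                fused = True
    return cycles


# Four 2-uop instructions in a row each have to wait for slot 0.
all_complex = [("complex", 2)] * 4
# The same number of complex instructions, interleaved 4-1-1-1 style.
interleaved = [("complex", 2), ("add", 1), ("add", 1), ("add", 1)] * 4
# Five instructions ending in cmp + conditional jump.
loop_tail = [("add", 1), ("add", 1), ("add", 1), ("cmp", 1), ("jcc", 1)]

print(decode_cycles(all_complex))
print(decode_cycles(interleaved))
print(decode_cycles(loop_tail), decode_cycles(loop_tail, macro_fusion=True))
```

In this model the interleaved stream decodes sixteen instructions in the same four cycles that the back-to-back complex stream needs for four, and the fused cmp/branch lets the five-instruction loop tail through in a single cycle, which is the "5 macroinstructions per cycle" case mentioned above.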