tooling quality and some random rant

Fri Feb 18 05:37:05 PST 2011

Hello all,

"Walter Bright" <newshound2 at digitalmars.com> wrote in message 
news:ijeih9$2aso$2 at digitalmars.com...
> Don wrote:
>> That would really be fun.
>> BTW, the current Intel processors are basically the same as Pentium Pro, 
>> with a few improvements. The strange thing is, because of all of the 
>> reordering that happens, swapping the order of two (non-dependent) 
>> instructions makes no difference at all. So you always need to look at 
> every instruction in a loop before you can do any scheduling.
>
> I was looking at Agner's document, and it looks like ordering the 
> instructions in the 4-1-1 or 4-1-1-1 for optimal decoding could work. This 
> would fit right in with the way the scheduler works.
>
> I had thought that with the CPU automatically reordering instructions, 
> that scheduling them was obsolete.

Reordering happens in the scheduler. A simple model is "Fetch", "Schedule", 
"Retire".  Fetch and retire are done in program order.  For code that is 
hitting well in the cache, the biggest bottleneck is that "4" decoder (the 
complex instruction decoder).  Reducing the number of complex instructions, 
and settling the ones that remain into the 4-1-1(-1) pattern, will be a big 
win here.
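
To make the 4-1-1(-1) idea concrete, here is a toy model in Python. It is 
only a sketch under stated assumptions: decode is strictly in program 
order, the first decoder takes any instruction of 1 to 4 uops, the other 
decoders take only 1-uop instructions, and a cycle ends as soon as the next 
instruction does not fit. The function name and the uop counts are made up 
for illustration; real front ends have more constraints than this.

```python
def decode_cycles(uop_counts, simple_decoders=3):
    """Count decode cycles in a toy 4-1-1(-1) decoder model.

    uop_counts: uops per instruction, in program order (each 1..4).
    simple_decoders: 2 models 4-1-1, 3 models 4-1-1-1.
    This is an illustrative model, not a description of real hardware.
    """
    for count in uop_counts:
        if not 1 <= count <= 4:
            raise ValueError("toy model only handles 1..4 uops per instruction")

    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        # Slot 0 is the complex ("4") decoder: it accepts any instruction.
        i += 1
        # The remaining slots are simple ("1") decoders: 1-uop instructions only.
        used = 0
        while i < n and used < simple_decoders and uop_counts[i] == 1:
            i += 1
            used += 1
    return cycles


# Same six instructions, two different orders (hypothetical uop counts).
well_ordered = [2, 1, 1, 2, 1, 1]    # complex instruction leads each group
badly_ordered = [1, 1, 2, 1, 1, 2]   # complex instruction lands on a "1" slot
print(decode_cycles(well_ordered, simple_decoders=2))
print(decode_cycles(badly_ordered, simple_decoders=2))
```

In this model the first ordering decodes in 2 cycles and the second in 3, 
because each 2-uop instruction that reaches a simple decoder has to wait 
for the complex decoder in the next cycle.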

Of course, on anything after Core 2, the "1" decoders can handle pushes, 
pops, and load-ops (r+=m) (although not load-op-store (m+=r)).

Also, "macro op fusion" lets you get a branch along with the last 
instruction in decode, potentially giving you 5 macroinstructions per cycle 
from decode.  Make sure the instruction right before the branch is the 
flags-producing one (cmp-br).
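
The adjacency requirement can be sketched the same way. This Python 
snippet is a toy model, not a statement of which pairs any particular CPU 
fuses: the set of fusible flag producers (cmp, test) and the rule for 
spotting a conditional branch are assumptions chosen for illustration, and 
the helper names are hypothetical.

```python
# Assumption for this sketch: only these flag producers can fuse.
FLAG_PRODUCERS = {"cmp", "test"}


def is_cond_branch(mnemonic):
    """Toy rule: any 'j...' mnemonic except the unconditional 'jmp'."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse_macro_ops(mnemonics):
    """Merge a flag producer with the conditional branch directly after it."""
    fused = []
    i = 0
    while i < len(mnemonics):
        cur = mnemonics[i]
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if cur in FLAG_PRODUCERS and nxt is not None and is_cond_branch(nxt):
            # Adjacent pair: counts as one macro-op in this model.
            fused.append(cur + "+" + nxt)
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


print(fuse_macro_ops(["mov", "add", "cmp", "jne"]))  # cmp and jne are adjacent
print(fuse_macro_ops(["cmp", "mov", "jne"]))         # mov sits in between
```

In the first sequence the cmp and jne collapse into one entry; in the 
second the mov between them prevents the pairing even though it does not 
touch the flags, so all three instructions are decoded separately.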

(I used to work for Intel :)
Ned