GDC review process.

Don Clugston dac at nospam.com
Wed Jun 20 07:15:28 PDT 2012


On 20/06/12 13:22, Manu wrote:
> On 20 June 2012 13:59, Don Clugston <dac at nospam.com> wrote:
>
>     You and I seem to be from different planets. I have almost never
>     written an asm function which was suitable for inlining.
>
>     Take a look at std.internal.math.biguintX86.d
>
>     I do not know how to write that code without inline asm.
>
>
> Interesting.
> I wish I could paste some counter-examples, but they're all proprietary >_<
>
> I think the key detail here is that you stated they _always_ include
> a loop. Is this because it's hard to manipulate the compiler into the
> correct interaction with the flags register?

No. It's just because speed doesn't matter outside loops. A consequence 
of having the loop inside the asm code is that the parameter passing is 
much less significant for speed; the calling convention is the big 
per-call cost, and it is paid once per call rather than once per iteration.
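
A minimal sketch of that pattern, assuming 32-bit x86, the DMD inline 
assembler, and len > 0 (the routine and its names are illustrative, not 
lifted from biguintX86.d):

// dest[0 .. len] += src[0 .. len]; returns the final carry.
// The loop lives inside the asm block, so call overhead and parameter
// passing are paid once per call, not once per limb.
uint addAssignCarry(uint* dest, const(uint)* src, uint len)
{
    asm
    {
        push ESI;
        push EDI;            // ESI and EDI are callee-saved in the D ABI
        mov EDI, dest;
        mov ESI, src;
        mov ECX, len;
        xor EDX, EDX;        // zero EDX and clear the carry flag
    L1:
        mov EAX, [ESI];
        adc [EDI], EAX;      // dest[i] += src[i] + carry
        lea ESI, [ESI+4];    // lea leaves the carry flag untouched
        lea EDI, [EDI+4];
        dec ECX;             // dec sets ZF but preserves CF
        jnz L1;
        setc DL;             // capture the final carry
        mov EAX, EDX;        // return value goes in EAX
        pop EDI;
        pop ESI;
    }
}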

> I'd be interested to compare the compiled D code, and your hand written
> asm code, to see where exactly the optimiser goes wrong. It doesn't look
> like you're exploiting too many tricks (at a brief glance); it's just
> nice, tight hand-written code, which the optimiser should theoretically
> be able to get right...

Theoretically, yes. In practice, DMD doesn't get anywhere near, and gcc 
isn't much better. I don't think there's any reason why they couldn't, 
but I don't have much hope that they will.

As you say, the code looks fairly straightforward, but actually there 
are very many similar ways of writing the code, most of which are much 
slower. There are many bottlenecks you need to avoid. I was only able to 
get it to that speed by using the processor profiling registers.
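
For reference, the simplest relative of that technique that can be written 
directly in D is reading the time-stamp counter; the performance-monitoring 
counters proper need rdpmc and OS support. A minimal sketch, assuming 32-bit 
x86 and the DMD inline assembler:

// Sketch only: read the x86 time-stamp counter with rdtsc.
ulong readTimestampCounter()
{
    asm
    {
        rdtsc;    // EDX:EAX = cycle count; EDX:EAX is also where a ulong
                  // return value lives on 32-bit x86, so nothing more to do
    }
}

// Rough usage, comparing two candidate inner loops:
//     auto t0 = readTimestampCounter();
//     ... loop under test ...
//     auto t1 = readTimestampCounter();   // t1 - t0 is roughly cycles spent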

So, my original two uses for asm are actually:
(1) when the language prevents you from accessing low-level 
functionality; and
(2) when the optimizer isn't good enough.

> I find optimisers are very good at code simplification, assuming that
> you massage the code/expressions to neatly match any architectural quirks.
> I also appreciate that x86 is possibly the hardest architecture for an
> optimiser to generate good code for...

Optimizers improved enormously during the 80's and 90's, but the rate of 
improvement seems to have slowed.

With x86, out-of-order execution has made it very easy to get reasonably 
good code, and much harder to achieve perfection. Still, Core i7 is much 
easier than Core 2, since Intel removed one of the most complicated 
bottlenecks (on Core 2 and earlier there is a maximum of 3 reads per 
cycle of registers you haven't written in the previous 3 cycles).
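
Purely as an illustration of that bottleneck (a sketch, not a cycle-accurate 
account, since exactly which reads collide depends on how the decoder groups 
the instructions), assuming 32-bit x86 and the DMD inline assembler:

void registerReadDemo(uint a, uint b, uint c, uint d)
{
    asm
    {
        push EBX;        // EBX is callee-saved in the D ABI
        mov EAX, a;
        mov EBX, b;
        mov ECX, c;
        mov EDX, d;
        // ... enough unrelated work that all four registers go "cold" ...
        add EAX, EBX;    // two reads of registers not written recently
        add ECX, EDX;    // two more: four "cold" reads in one issue group
                         // exceeds the limit of three on Core 2 and earlier,
                         // but costs nothing extra on Core i7
        pop EBX;
    }
}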

