Andrei Alexandrescu needs to read this

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Oct 23 23:51:16 UTC 2019


On Wed, Oct 23, 2019 at 04:22:08PM -0700, Walter Bright via Digitalmars-d wrote:
> On 10/23/2019 3:03 PM, Jonathan Marler wrote:
> > What I find funny is that there are a lot of clever tricks you can
> > do to make your code execute fewer operations, but with modern CPUs
> > it's more about making your code more predictable so that the cache
> > can predict what to load next and which branches you're more likely
> > to take.  So in a way, as CPUs get smarter, you want to make your
> > code "dumber" (i.e. more predictable) in order to get the best
> > performance.  When hardware was "dumber", it was better to make your
> > code smarter.  An odd switch in paradigms.

Indeed!  In the old days it was all about minimizing instructions. But
nowadays, minimizing instructions can actually make your code perform
worse if it increases the number of branches, thereby causing more
branch hazards.

On the flip side, some good optimizers can eliminate branch hazards in
certain cases, e.g.:

	bool cond;
	int x, y;
	x = cond ? y + 1 : y;

can be rewritten by the optimizer as:

	x = y + cond;

which allows for a branchless translation into machine code.
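
As a quick sanity check, here's a minimal, self-contained D sketch of
the two forms (the function names are mine, purely for illustration);
it relies on the fact that bool implicitly converts to 0 or 1 in D:

	// Minimal sketch of the two forms above; function names are
	// illustrative only.
	int withBranch(bool cond, int y)
	{
	    // Naive form: a conditional that a less capable backend
	    // might turn into a compare-and-branch.
	    return cond ? y + 1 : y;
	}

	int branchless(bool cond, int y)
	{
	    // Equivalent branchless form: bool converts to 0 or 1,
	    // so plain addition does the job.
	    return y + cond;
	}

	unittest
	{
	    foreach (y; [-3, 0, 42])
	    {
	        assert(withBranch(true, y)  == branchless(true, y));
	        assert(withBranch(false, y) == branchless(false, y));
	    }
	}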

Generally, though, it's a bad idea to write this sort of optimization
in the source code: it runs the risk of confusing the optimizer, which
may then give up on that piece of code altogether, resulting in poor
generated code.  It's usually better to just trust the optimizer to do
its job.

Another recent development is the occasional divergence of performance
characteristics across CPUs of the same family, i.e., the same
instruction on two different CPU models may perform quite differently.
This means that this sort of low-level optimization is really best left
to the optimizer, which can target the actual CPU, rather than
hard-coding a fixed series of instructions in an asm block that may
perform poorly on some CPUs.  (This is also where JIT compilation can
win over static compilation, if you ship a generic binary that isn't
specifically targeted at the customer's CPU model.)
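
On that note, even a statically compiled generic binary can adapt
somewhat by detecting the CPU at run time and dispatching to different
code paths.  A rough D sketch using druntime's core.cpuid (I'm assuming
the avx / sse42 query properties here; check your druntime version):

	// Rough sketch of run-time dispatch on the actual CPU, using
	// druntime's core.cpuid.  (Treat the exact property names as an
	// assumption about the druntime version at hand.)
	import core.cpuid : avx, sse42, processor;
	import std.stdio : writeln;

	void main()
	{
	    writeln("Running on: ", processor());

	    // Pick a code path based on what this particular CPU
	    // supports, instead of hard-coding one instruction sequence
	    // for all models.
	    if (avx)
	        writeln("would take the AVX code path");
	    else if (sse42)
	        writeln("would take the SSE4.2 code path");
	    else
	        writeln("would take the generic fallback");
	}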


> Keep in mind that starting in the late 70's, CPUs started being
> designed around the way compilers generate code. (Before then,
> instruction sets were a wacky collection of seemingly unrelated
> instructions. Compilers like orthogonality, and specialized
> instructions to do things like stack frame setup / teardown.)
> 
> This means that unusual instruction sequences tend to perform less
> well than the ordinary stuff a compiler generates.

Yeah, nowadays with microcode, you can't trust the surface appearance of
the assembly instructions anymore. What looks like the same number of
instructions can have very different performance characteristics
depending on how they're actually implemented in microcode.


> It's also true that code optimizers are tuned to what the local C/C++
> compiler generates, even if the optimizer is designed to work with
> multiple diverse languages.

Interesting, I didn't know this.


T

-- 
Guns don't kill people. Bullets do.

