Do you think if statement as expression would be nice to have in D?

Bruce Carneal bcarneal at gmail.com
Tue Jun 7 21:09:08 UTC 2022


On Tuesday, 7 June 2022 at 18:21:57 UTC, Walter Bright wrote:
> On 6/7/2022 2:23 AM, Bruce Carneal wrote:
...
>
> I've never much liked autovectorization:

Same here, which is why my initial CPU-side implementation was 
all explicit __vector/intrinsics code (with corresponding static 
arrays to get a sane unaligned load/store capability).
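
For flavor, a minimal sketch of that style (not my actual code;
the function, the saxpy-like operation, and the 4-wide float
vectors are just illustrative) looks something like this, with
static-array copies standing in for unaligned loads and stores:

import core.simd;

// assumes a target where 128-bit float vectors (float4) are available
void addScaled(float[] dst, const(float)[] src, float k)
{
    assert(src.length >= dst.length);
    size_t i = 0;
    for (; i + 4 <= dst.length; i += 4)
    {
        // "unaligned load": plain copies into static arrays, then viewed as vectors
        float[4] a; a[] = src[i .. i + 4];
        float[4] b; b[] = dst[i .. i + 4];
        float4 va, vb, vk;
        va.array[] = a[];
        vb.array[] = b[];
        vk.array[] = k;                 // broadcast the scalar across the lanes
        vb += va * vk;                  // element-wise SIMD arithmetic
        dst[i .. i + 4] = vb.array[];   // "unaligned store" back through the array view
    }
    foreach (j; i .. dst.length)        // scalar tail
        dst[j] += src[j] * k;
}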

>
> 1. you never know if it is going to vectorize or not. The 
> vector instruction sets vary all over the place, and whether 
> they line up with your loops or not is not determinable in 
> general - you have to look at the assembler dump.

I now take this as an argument for auto-vectorization: the vector 
instruction sets vary so much that I'd rather write the loop once 
and let the compiler map it to whichever instruction set the 
target actually offers.

>
> 2. when autovectorization doesn't happen, the compiler reverts 
> to non-vectorized slow code. Often, you're not aware this has 
> happened, and the expected performance doesn't happen. You can 
> usually refactor the loop so it will autovectorize, but that's 
> something only an expert programmer can accomplish, but he 
> can't do it if he doesn't *realize* the autovectorization 
> didn't happen.  You said it yourself: "if perf drops"!

Well, presumably you're "unittesting" performance anyway in order 
to know where the hot spots are.  It's always nicer to know things 
at compile time, but for me finding out at "unittest time" is 
acceptable, since those measurements will be part of any 
performance code development setup regardless.
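
In my case that boils down to perf "unittests" that time the hot 
paths and complain when a budget is blown.  A minimal sketch (the 
kernel, sizes, and threshold are placeholders, not my actual 
setup; a real version would compare against a recorded baseline):

unittest
{
    import std.datetime.stopwatch : benchmark;
    import core.time : msecs;

    auto dst = new float[1 << 20];
    auto src = new float[1 << 20];
    src[] = 1.0f;
    dst[] = 0.0f;

    // placeholder hot loop under test
    static void kernel(float[] dst, const(float)[] src, float k)
    {
        foreach (i; 0 .. dst.length)
            dst[i] += src[i] * k;
    }

    // time 100 runs; fail loudly if the (generous) budget is blown
    auto elapsed = benchmark!(() => kernel(dst, src, 2.0f))(100)[0];
    assert(elapsed < 500.msecs, "kernel perf regression?");
}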

>
> 3. it's fundamentally a backwards thing. The programmer writes 
> low level code (explicit loops) and the compiler tries to work 
> backwards to create high level code (vectors) for it! This is 
> completely backwards to how compilers normally work - specify a 
> high level construct, and the compiler converts it into low 
> level.

I see it as a choice along the "time to develop" <==> "performance 
achieved" axis.  Fortunately auto-vectorization can be a win here: 
develop simple/correct code with an eye to compiler-visible 
indexing, and hand-vectorize only if there's a problem.  (I 
actually went the other way, starting with hand-optimized core 
functions, and discovered that auto-vectorization worked as well 
as or better than the hand-written versions for many of those 
functions.)
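
Concretely, "simple/correct code with compiler-visible indexing" 
is just the boring indexed loop.  A sketch (assuming an optimizing 
build, e.g. LDC/GDC at -O2/-O3 with an appropriate -mcpu; the name 
is illustrative):

// plain indexed loop: lengths visible, no data-dependent control flow,
// so the optimizer is free to vectorize at whatever width the target offers
void addScaledSimple(float[] dst, const(float)[] src, float k)
{
    assert(src.length >= dst.length);
    foreach (i; 0 .. dst.length)
        dst[i] += src[i] * k;
}

Whether it actually vectorized still gets confirmed by the asm 
dump and/or the perf regression tests.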

>
> 4. with vector code, the compiler will tell you when the 
> instruction set won't map onto it, so you have a chance to 
> refactor it so it will.

Yes, better to know things at compile time but OK to know them at 
perf "unittest" time.

Here are some of the reasons I'm migrating much of my code from 
the initial __vector/intrinsic implementation to 
auto-vectorization backed by perf regression tests:

1) It's more readable.

2) It is auto-upgradeable (with @target metaprogramming for 
multi-target deployability; see the sketch just after this list).

3) It's measurably (slightly) faster in many instances (it helps 
that I can shape the operand flows for this app)

4) It fits more readily with upcoming CPU-centric vector 
architectures (SVE, SVE2, RVV, ...).  Cray vectors ride again! :-)

5) It aligns stylistically with SIMT (I think in terms of index 
spaces and memory subsystem blocking rather than HW details).  
SIMT is where I believe we should be looking for future, 
significant performance gains (the PCIe bottleneck is a stumbling 
block but SoCs and consoles have the right idea).
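
For item 2, the shape of that @target metaprogramming is roughly 
the following (a sketch assuming LDC's ldc.attributes/@target and 
core.cpuid; the ISA strings, names, and dispatch policy are 
illustrative):

import ldc.attributes : target;

// generate one copy of the kernel per ISA level; each copy carries a
// @target attribute, so the optimizer can auto-vectorize it for that feature set
static foreach (isa; ["sse2", "avx2"])
{
    mixin(`
    @target("` ~ isa ~ `")
    void kernel_` ~ isa ~ `(float[] dst, const(float)[] src, float k)
    {
        foreach (i; 0 .. dst.length)
            dst[i] += src[i] * k;
    }`);
}

// pick an implementation once at run time, based on what the host CPU supports
void kernel(float[] dst, const(float)[] src, float k)
{
    import core.cpuid : avx2;
    if (avx2)
        kernel_avx2(dst, src, k);
    else
        kernel_sse2(dst, src, k);
}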

The mid-range goal is to develop in an it-just-works, no-big-deal 
SIMT environment where the traditional SIMD awkwardness is in the 
rear view mirror and where we can surf the improving HW 
performance wave (clock increases were nice while they lasted but 
...).  dcompute is already a good ways down that road but it can 
be friendlier and more capable.  As I've mentioned elsewhere, I 
already prefer it to CUDA.

Finally, thanks for creating D.  It's great.



