Do you think if statement as expression would be nice to have in D?
Bruce Carneal
bcarneal at gmail.com
Tue Jun 7 21:09:08 UTC 2022
On Tuesday, 7 June 2022 at 18:21:57 UTC, Walter Bright wrote:
> On 6/7/2022 2:23 AM, Bruce Carneal wrote:
...
>
> I've never much liked autovectorization:
Same here, which is why my initial CPU-side implementation was
all explicit __vector/intrinsics code (with corresponding static
arrays to get a sane unaligned load/store capability).
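Roughly this kind of pairing, as a minimal sketch (the helper names and the union trick are illustrative here, not lifted from my actual code):

import core.simd;

// Overlay a vector with a static array so we can do element-wise copies
// that carry no alignment requirement.
union F4
{
    float4 vec;
    float[4] arr;
}

float4 loadUnaligned4(const(float)* p)
{
    F4 u;
    u.arr[] = p[0 .. 4];   // plain slice copy, alignment-agnostic
    return u.vec;
}

void storeUnaligned4(float* p, float4 v)
{
    F4 u;
    u.vec = v;
    p[0 .. 4] = u.arr[];   // slice copy back out
}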
>
> 1. you never know if it is going to vectorize or not. The
> vector instruction sets vary all over the place, and whether
> they line up with your loops or not is not determinable in
> general - you have to look at the assembler dump.
I now take this as an argument for auto vectorization: since the
instruction sets vary so widely, letting the compiler retarget the
loop beats hand-writing and maintaining a variant per ISA.
>
> 2. when autovectorization doesn't happen, the compiler reverts
> to non-vectorized slow code. Often, you're not aware this has
> happened, and the expected performance doesn't happen. You can
> usually refactor the loop so it will autovectorize, but that's
> something only an expert programmer can accomplish, but he
> can't do it if he doesn't *realize* the autovectorization
> didn't happen. You said it yourself: "if perf drops"!
Well, presumably you're "unittesting" performance to know where
the hot spots are, so a missed vectorization will show up there.
It's always nicer to know things at compile time, but for me it's
acceptable at "unittest time", since the measurements will be part
of any performance code development setup.
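The shape of such a perf "unittest" can be as simple as the sketch below (the kernel name, sizes, and throughput floor are illustrative; a real run would pin the machine and repeat measurements):

import std.datetime.stopwatch : StopWatch, AutoStart;

// A simple kernel written so the indexing is visible to the optimizer.
void fma(float[] dst, const(float)[] a, const(float)[] b)
{
    foreach (i; 0 .. dst.length)
        dst[i] += a[i] * b[i];
}

unittest
{
    enum n = 1 << 20;
    enum reps = 100;
    auto a = new float[n], b = new float[n], c = new float[n];
    a[] = 1.0f; b[] = 2.0f; c[] = 0.0f;

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. reps)
        fma(c, a, b);
    sw.stop();

    const secs = sw.peek.total!"nsecs" / 1e9;
    const gbPerSec = reps * 3.0 * n * float.sizeof / secs / 1e9;

    // Flag a perf regression, e.g. a loop that quietly stopped vectorizing.
    assert(gbPerSec > 1.0, "fma kernel below expected throughput floor");
}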
>
> 3. it's fundamentally a backwards thing. The programmer writes
> low level code (explicit loops) and the compiler tries to work
> backwards to create high level code (vectors) for it! This is
> completely backwards to how compilers normally work - specify a
> high level construct, and the compiler converts it into low
> level.
I see it as a choice on the "time to develop" <==> "performance
achieved" axis. Fortunately autovectorization can be a win here:
develop simple/correct code with an eye to compiler-visible
indexing, then hand-vectorize only if there's a problem. (I
actually went the other way, starting with hand-optimized core
functions, and discovered that auto-vectorization worked as well
as or better than my hand coding for many of those functions.)
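By "compiler-visible indexing" I mean the difference between loops like these (a hedged illustration; whether the second loop vectorizes depends on the compiler and target):

// Likely to auto-vectorize: unit-stride slice indexing, known trip count,
// no loop-carried dependence visible to the optimizer.
void saxpy(float[] y, const(float)[] x, float a)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

// Often stays scalar: the early exit makes the trip count data-dependent,
// which many auto-vectorizers won't handle.
size_t firstNegative(const(float)[] x)
{
    foreach (i; 0 .. x.length)
        if (x[i] < 0) return i;
    return x.length;
}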
>
> 4. with vector code, the compiler will tell you when the
> instruction set won't map onto it, so you have a chance to
> refactor it so it will.
Yes, better to know things at compile time but OK to know them at
perf "unittest" time.
Here are some of the reasons I'm migrating much of my code from
the initial __vector/intrinsic implementation to
auto-vectorization backed by perf regression tests:
1) It's more readable.
2) It is auto-upgradeable (with @target meta programming for
multi-target deployability; see the sketch after this list)
3) It's measurably (slightly) faster in many instances (it helps
that I can shape the operand flows for this app)
4) It fits more readily with upcoming CPU-centric vector
architectures (SVE, SVE2, RVV...); Cray vectors ride again! :-)
5) It aligns stylistically with SIMT (I think in terms of index
spaces and memory subsystem blocking rather than HW details).
SIMT is where I believe we should be looking for future,
significant performance gains (the PCIe bottleneck is a stumbling
block but SoCs and consoles have the right idea).
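Regarding 2), the @target meta programming amounts to something like the sketch below (LDC-specific; the feature string, function names, and cpuid dispatch are illustrative, not my actual code):

import ldc.attributes : target;

// One generic body; each @target wrapper gets its own auto-vectorized copy.
pragma(inline, true)
private void axpyBody(float[] y, const(float)[] x, float a)
{
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

@target("avx2")
void axpyAVX2(float[] y, const(float)[] x, float a) { axpyBody(y, x, a); }

void axpyBaseline(float[] y, const(float)[] x, float a) { axpyBody(y, x, a); }

// Pick the best compiled variant at run time.
void axpy(float[] y, const(float)[] x, float a)
{
    import core.cpuid : avx2;
    if (avx2) axpyAVX2(y, x, a);
    else      axpyBaseline(y, x, a);
}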
The mid-range goal is to develop in an it-just-works, no-big-deal
SIMT environment where the traditional SIMD awkwardness is in the
rear view mirror and where we can surf the improving HW
performance wave (clock increases were nice while they lasted but
...). dcompute is already a good ways down that road but it can
be friendlier and more capable. As I've mentioned elsewhere, I
already prefer it to CUDA.
Finally, thanks for creating D. It's great.