__restrict, architecture intrinsics vs asm, consoles, and other

Don nospam at nospam.com
Fri Sep 23 22:25:15 PDT 2011


On 24.09.2011 00:47, Manu Evans wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s
> article
>> On 9/22/11 1:39 AM, Don wrote:
>>> On 22.09.2011 05:24, a wrote:
>>>> How would one do something like this without intrinsics (the code
>>>> is c++ using gcc vector extensions):
>>>
>>> [snip]
>>> At present, you can't do it without ultimately resorting to inline
>>> asm. But, what we've done is to move SIMD into the machine model:
>>> the D machine model assumes that float[4] + float[4] is a more
>>> efficient operation than a loop.
>>> Currently, only arithmetic operations are implemented, and on DMD
>>> at least, they're still not proper intrinsics. So in the long term
>>> it'll be possible to do it directly, but not yet.
>>>
>>> At various times, several of us have implemented 'swizzle' using
>>> CTFE, giving you a syntax like:
>>>
>>> float[4] x, y;
>>> x[] = y[].swizzle!"cdcd"();
>>> // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
>>>
>>> which compiles to a single shufps instruction.
>>>
>>> That "cdcd" string is really a tiny DSL: the language consists of
>>> four characters, each of which is a, b, c, or d.
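
To make that concrete, here's a minimal sketch of that kind of
CTFE-checked swizzle, using only what's in the language today. The
names and details are hypothetical; note that this naive version just
does four scalar copies and leaves everything to the optimizer, whereas
the versions people actually wrote were more elaborate in order to
genuinely come out as a single shufps.

// CTFE helper: is every character of the pattern one of 'a'..'d'?
bool validSwizzle(string pattern)
{
    if (pattern.length != 4) return false;
    foreach (c; pattern)
        if (c < 'a' || c > 'd') return false;
    return true;
}

float[4] swizzle(string pattern)(const float[4] v)
    if (validSwizzle(pattern))   // checked at compile time via CTFE
{
    float[4] r;
    foreach (i; 0 .. 4)
        r[i] = v[pattern[i] - 'a'];   // 'a'->0, 'b'->1, 'c'->2, 'd'->3
    return r;
}

// Usage:
//   float[4] x, y;
//   x = y.swizzle!"cdcd"();   // x = [y[2], y[3], y[2], y[3]]
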
>> I think we should put swizzle in std.numeric once and for all. Is
>> anyone interested in taking up that task?
>>> A couple of years ago I made a DSL compiler for BLAS1 operations.
>>> It was capable of doing some pretty wild stuff, even then. (The DSL
>>> looked like normal D code).
>>> But the compiler has improved enormously since that time. It's now
>>> perfectly feasible to make a DSL for the SIMD operations you need.
>>>
>>> The really nice thing about this, compared to normal asm, is that
>>> you have access to the compiler's symbol table. This lets you add
>>> compile-time error messages, for example.
>>>
>>> A funny thing about this, which I found after working on the DMD
>>> back-end, is that it is MUCH easier to write an optimizer/code
>>> generator in a DSL in D than in a compiler back-end.
>> A good argument for (a) moving stuff from the compiler into the
>> library, (b) continuing Don's great work on making CTFE a solid
>> proposition.
>> Andrei
>
> This sounds really dangerous to me.
> I really like the idea where CTFE can be used to produce some pretty
> powerful almost-like-intrinsics code, but applying it in this
> context sounds like a really bad idea.
>
> Firstly, so I'm not misunderstanding, is this suggestion building on
> Don's previous post saying that float[4] is somehow intercepted and
> special-cased by the compiler, reinterpreting it as a candidate for
> hardware vector operations?

No, it's completely unrelated; the two have nothing in common.

> I think that's a wrong decision in itself, and a poor foundation
> for this approach.
> Let me try and convince you that the language should have an
> explicit hardware vector type, and not attempt to make use of any
> clever language tricks...
>
> If float[4] is considered a hardware vector by the compiler,
>    - How do I define an ACTUAL float[4]?
>    - How can I be confident that it actually WILL be a hardware
> vector?

float[4] is not considered to be a hardware vector. It is only passed as 
one. To pass it the C++ way, declare the parameter as float[], or pass 
it by ref.
Everything after that is the responsibility of the compiler/optimizer.
A big difference compared to C++ is that, generally, it's pretty strange 
to pass fixed-length arrays as value parameters.
At this stage we don't have any way of forcing it to be a hardware 
vector. We've just introduced the parameter passing and the vector 
operations to make it easier for the compiler to use hardware registers.
Very little else has been decided yet.
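
As a rough illustration of the intent (the names are made up; nothing
here forces the compiler to use SSE, it merely gives it the
opportunity):

// A float[4] parameter is passed by value, in a form that lets the
// back-end keep it in a vector register on x86.
void scaleAdd(float[4] a, float[4] b, float s, out float[4] result)
{
    // Array-wise operation: the machine model assumes this is more
    // efficient than an explicit loop, so a good back-end can map it
    // to whole-register SSE instructions (mulps/addps).
    result[] = a[] + b[] * s;
}

// Usage:
//   float[4] x = [1, 2, 3, 4], y = [5, 6, 7, 8], r;
//   scaleAdd(x, y, 2.0f, r);   // r = x + y*2, element-wise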

You make some excellent points.

> Hardware vectors are NOT float[4]'s, they are a reference to a
> 128-bit hardware register upon which various vector operations may be
> supported, they are probably aligned, and they are only accessible
> in 128-bit quantities. I think they should be explicitly defined as
> such.
>
> They may be float4, u/int4, u/short8, u/byte16, double2... All these
> types are interchangeable within the one register; do you intend to
> special-case fixed-length arrays of all those types to support the
> hardware functionality for them?
>
> Hardware vectors are NOT floats; they cannot interact with the
> floating point unit. Dereferencing in the style of 'float x =
> myVector[0]' is NOT supported by the hardware, and it should not be
> exposed to the programmer as a trivial possibility. This seemingly
> harmless line of code will undermine the entire reason for using the
> vector hardware in the first place.
>
> Allowing easy access to individual floats within a hardware vector
> breaks the language's stated premise that the path of least
> resistance should also be the 'correct', optimal choice: a seemingly
> simple line of code may ruin the entire function.
>
> float[4] is not even a particularly conveniently sized vector for
> most inexperienced programmers; the majority will want float[3].
> That does NOT map trivially to float[4], and programmers should be
> well aware that there is inherent complexity in using the hardware
> vector architecture, and forced to think it through.
>
> Most inexperienced programmers think of the results of operations
> like dot product and magnitude as being scalar values, but they are
> not: they are a scalar value repeated across all 4 components of a
> 4-element vector, and this should be explicit too.
>
> ....
>
> I know I'm nobody around here, so I can't expect to be taken too
> seriously, but I'm very excited about the language, so here's what I
> would consider instead:
>
> Add a hardware vector type, let's call it 'v128' for the exercise.
> It is a primitive type, aligned by definition, and does not have any
> members.
> You may use this to refer to hardware vector registers explicitly,
> as vector register function arguments, or as arguments to inline asm
> blocks.
>
> Add some functions to the standard library (ideally implemented as
> compiler intrinsics) which do very specific stuff to vectors, and
> ideally expandable by hardware vendors or platform holders.
>
> You might want to have some classes in the standard library which
> wrap said v128, and expose the concept as a float4, int4, byte16,
> etc. These classes would provide maths operators, comparisons,
> initialisation and immediate assignment, and casts between various
> vector types.
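
(For concreteness, the sort of wrapper you describe might look roughly
like this. It is purely a hypothetical sketch: 'v128' doesn't exist, so
it's faked here as an aligned float[4] just to keep the example
self-contained.)

align(16) struct v128 { float[4] data; }   // stand-in for the proposed core type

struct float4
{
    v128 reg;

    this(float x, float y, float z, float w)
    {
        reg.data[0] = x; reg.data[1] = y;
        reg.data[2] = z; reg.data[3] = w;
    }

    // Maths operators; ideally each would lower to a single SSE instruction.
    float4 opBinary(string op)(float4 rhs) const
        if (op == "+" || op == "-" || op == "*")
    {
        float4 r;
        mixin("r.reg.data[] = reg.data[] " ~ op ~ " rhs.reg.data[];");
        return r;
    }
}

// Usage:
//   auto a = float4(1, 2, 3, 4), b = float4(5, 6, 7, 8);
//   auto c = a * b + a;
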
>
> Different vector units support completely different methods of
> permutation. I would be very careful about adding intrinsic support
> into the library for generalised permutation. And if so, at least
> leave the capability of implementing intrinsic architecture-specific
> permutation to the hardware vendors/platform holders.
>
> At the end of the day, it is imperative that the code generator and
> optimiser still retain the concept of hardware vectors, and can
> perform appropriate load/store elimination, apply hardware specific
> optimisation to operations like permutes/swizzles, component
> broadcasts, load immediates, etc.
>
> ....
>
> The reason I am so adamant about this is that on almost all
> architectures (SSE is the most tolerant by far), using the hardware
> vector unit is an all-or-nothing choice. If you move data between
> the vector and float registers, you will almost certainly end up
> with slower code than if you had just used the float unit outright.
> Also, since people usually use hardware vectors in areas of extreme
> performance optimisation, it's not tolerable for the compiler to be
> making mistakes. As a minimum the programmer needs to be able to
> explicitly address the vector registers, pass them to and from
> functions, and perform explicit (probably IHV-supplied) functions on
> them. The code generator and optimiser need all the information
> possible, as explicitly as possible, so IHVs can implement the
> best possible support for their architecture. The API should reflect
> this, and not allow easy access to functionality that would violate
> hardware support.
>
> Ease of programming should be a SECONDARY goal, at which point
> something like the typed wrapper classes I described would come in,
> allowing maths operators, comparisons and all I mentioned above,
> i.e. making them look like a real mathematical type, but still
> keeping their distance from primitive float/int types, to discourage
> interaction at all costs.
>
> I hope this doesn't sound too much like an overly long rant! :)
> And hopefully I've managed to sell my point...
>
> Don: I'd love to hear counter arguments to justify float[4] as a
> reasonable solution. Currently no matter how I slice it, I just
> can't see it.
>
> Criticism welcome?
>
> Cheers!
> - Manu


