Any usable SIMD implementation?

Wed Apr 6 20:27:31 PDT 2016

On 4/6/2016 7:43 PM, Manu via Digitalmars-d wrote:
>> 1. This has been characterized as a blocker, it is not, as it does not
>> impede writing code that takes advantage of various SIMD code generation at
>> compile time.
>
> It's sufficiently blocking that I have not felt like working any
> further without this feature present. I can't feel like it 'works' or
> it's 'done', until I can demonstrate this functionality.
> Perhaps we can call it a psychological blocker, and I am personally
> highly susceptible to those.

I can understand that it might be demotivating for you, but that is not a 
blocker. A blocker has no reasonable workaround. This has a trivial workaround:

    gdc -simd=AFX foo.d

becomes:

    gdc -simd=AFX -fversion=AFX foo.d

(gdc spells the switch -fversion=; with dmd it is -version=.)

It's even simpler if you use a makefile variable:

     FPU=AFX

     gdc -simd=$(FPU) -fversion=$(FPU) foo.d
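
For illustration, the source side of that workaround could look roughly like 
the sketch below. AFX is simply the identifier from the command lines above; 
simdName and activePath are hypothetical names made up for this example, not 
part of any library.

```d
// Set at compile time by -version=AFX (dmd) or -fversion=AFX (gdc).
version (AFX)
    enum simdName = "AFX";
else
    enum simdName = "baseline";

// Hypothetical helper: reports which code path was compiled in.
string activePath() { return simdName; }

void main()
{
    import std.stdio : writeln;
    writeln("compiled for: ", activePath());
}
```

Real code would put the SIMD-specific implementation in the version (AFX) 
branch and a fallback in the else branch.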

You also mentioned being blocked (i.e. demotivated) for *years* by this, and I 
assume you may have concluded that we don't care about SIMD support. That would 
be wrong, as I care a lot about it. But I had no idea you were having a problem 
with this, as you did not file any bug reports. Suffering in silence is never 
going to work :-)

>> 2. I'm not sure these global settings are the best approach, especially if
>> one is writing applications that dynamically adjust based on the CPU the
>> user is running on.
>
> They are necessary to provide a baseline. It is typical when building
> code that you specify a min-spec. This is what's used by default
> throughout the application.

It is not necessary to do it that way. Use core.cpuid to determine what is 
available at runtime, and issue an error message if the required instruction set 
is missing. A one-time check at startup has no meaningful runtime cost. In fact, 
it has to be done ANYWAY, as it isn't user friendly to crash trying to execute 
instructions that do not exist.

> Runtime selection is not practical in a broad sense. Emitting small
> fragments of SIMD here and there will probably take a loss if they are
> all surrounded by a runtime selector. SIMD is all about pipelining,
> and runtime branches on SIMD version are the antithesis of good SIMD
> usage; they can't be applied for small-scale deployment.
> In my experience, runtime selection is desirable for large scale
> instantiations at an outer level of the work loop. I've tried to
> design this intent in my library, by making each simd API capable of
> receiving SIMD version information via template arg, and within the
> library, the version is always passed through to dependent calls.
> The idea is that if you follow this pattern, propagating a SIMD version
> template arg through to your outer function, you can instantiate
> your higher-level work function for any number of SIMD feature
> combinations you feel are appropriate.

Doing it at a high level is what I meant, not for each SIMD code fragment.
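
A minimal sketch of that arrangement, under stated assumptions: SIMDVer, dot, 
lengthSquared and work are hypothetical names invented for this example, and 
plain scalar code stands in for the real SIMD intrinsics so that it runs 
anywhere. The point is only that the version travels as a template argument, so 
the inner calls contain no runtime test.

```d
enum SIMDVer { sse2, sse41 }   // hypothetical version tags

// Leaf routine: a real library would choose intrinsics based on `ver`.
// Scalar code stands in here so the sketch runs on any target.
float dot(SIMDVer ver)(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    float sum = 0;
    foreach (i; 0 .. a.length)
        sum += a[i] * b[i];
    return sum;
}

// Dependent call: passes `ver` straight through, no runtime branch.
float lengthSquared(SIMDVer ver)(const(float)[] v)
{
    return dot!ver(v, v);
}

// Outer work function: instantiated once per supported version.
float work(SIMDVer ver)(const(float)[] data)
{
    return lengthSquared!ver(data);
}

void main()
{
    import std.stdio : writeln;

    float[] data = [1.0f, 2.0f, 3.0f];
    // Two separate instantiations, each with its own mangled name.
    writeln(work!(SIMDVer.sse2)(data));
    writeln(work!(SIMDVer.sse41)(data));
}
```

Which instantiation to call is then decided once, outside the work loop.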

> Naturally, this process requires a default, otherwise this usage
> baggage will cloud the API everywhere (rather than in the few cases
> where a developer specifically wants to make use of it), and many
> developers in 2015 feel SSE2 is a weak default. I would choose SSE4.1
> in my applications, xbox developers would choose AVX1, it's very
> application/target-audience specific, but SSE2 is the only reasonable
> selection if we are not to accept a hint from the command line.

I still don't see how it is a problem to do the switch at a high level. Heck, 
you could put the ENTIRE ENGINE inside a template, have a template parameter be 
the instruction set, and instantiate the template for each supported instruction 
set.

Then,

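     import core.cpuid : sse2, sse41;
     import core.stdc.stdio : fprintf, stderr;

     enum { SIMD, SIMD4 }  // illustrative IDs, assumed to mean SSE2 and SSE4.1

     void app(int simd)() { /* ... my fabulous app ... */ }

     int main() {
       // core.cpuid exposes one flag per feature, so test the best one first
       if (sse41)     app!(SIMD4)();
       else if (sse2) app!(SIMD)();
       else { fprintf(stderr, "unsupported FPU\n"); return 1; }
       return 0;
     }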

> I've done it with a template arg because it can be manually
> propagated, and users can extrapolate the pattern into their outer
> work functions, which can then easily have multiple versions
> instantiated for runtime selection.
> I think it's also important to mangle it into the symbol name for the
> reasons I mention above.

Note that version identifiers are not usable directly as template parameters. 
You'd have to set up a mapping.
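
One way such a mapping could be set up, as a sketch only: the version 
identifiers AVX and SSE41 are assumed to be passed on the command line, and 
SIMDVer, defaultSIMD and describe are hypothetical names made up for this 
example.

```d
enum SIMDVer { sse2, sse41, avx }   // hypothetical names

// Version identifiers live in their own namespace, so translate them
// into an ordinary compile-time constant once, in one place.
version (AVX)
    enum SIMDVer defaultSIMD = SIMDVer.avx;
else version (SSE41)
    enum SIMDVer defaultSIMD = SIMDVer.sse41;
else
    enum SIMDVer defaultSIMD = SIMDVer.sse2;

// The constant can then serve as a default template argument.
string describe(SIMDVer ver = defaultSIMD)()
{
    import std.conv : to;
    return ver.to!string;
}

void main()
{
    import std.stdio : writeln;
    writeln(describe());                 // uses the mapped default
    writeln(describe!(SIMDVer.avx)());   // explicit override
}
```

Built with no version flag, the default falls back to sse2; with 
-version=SSE41 (or -fversion=SSE41 for gdc) it becomes sse41.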

And yes, if mangled in as part of the symbol, the linker won't pick the wrong one.