Targeting Vulkan and SPIR-V
karl via Digitalmars-d
digitalmars-d at puremagic.com
Fri Mar 13 11:44:16 PDT 2015
SPIR-V may be producible from HLL tools, but that doesn't mean
it's perfectly OK to use any HLL. The capability to compile an
HLL to SPIR-V is exposed partly for syntax sugar and shallow
precompile optimisations, but mostly to avoid the vendor-specific
HLL compiler bugs that have plagued GLSL and HLSL (those billion
d3dx_1503.dll files on your system are bugfixes), and to give the
community access to one or several open-source HLL compilers that
they can find issues with and submit fixes for, so everyone
benefits. So, it's mostly about getting a flawless open-source
GLSL compiler. Dlang's strengths are simply not directly
applicable, though with a bit of work they can actually be
applied (I've done this in/with our GLSL/backend compilers).
- malloc. SPIR-V and such don't have malloc. Fix: preallocate a
big chunk of memory and implement a massively-parallel allocator
yourself (it should handle ~2000 allocation requests per cycle;
that's the gist of it). An "atomic_add" on a memory location will
help; see the sketch below. If you don't want to preallocate too
much, have a CPU thread poll while a GPU thread stalls (it should
stall itself and 60000 other threads) until the CPU allocates a
new chunk for the heap and provides a base address (and hope the
CPU thread responds quickly enough, or your GPU tasks will be
mercilessly killed).
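
To make that concrete, here is a minimal sketch of such a bump
allocator. I'm writing it as CUDA C purely because that's the
easiest GPU dialect to show in compilable form; the same shape
maps onto SPIR-V/GLSL atomics. All names and sizes (g_heap,
g_heap_cursor, gpu_bump_alloc, the 16 MiB heap) are made up for
illustration, and note there is no free() and no reuse:

#include <cstdint>

// Hypothetical preallocated heap: one big device buffer plus a cursor.
__device__ uint8_t      g_heap[16 * 1024 * 1024]; // preallocated chunk
__device__ unsigned int g_heap_cursor = 0;        // bytes handed out so far

// Massively-parallel "malloc": every thread bumps a shared cursor with
// one atomic add; only the bump is serialised, not the use of the
// returned memory.
__device__ void* gpu_bump_alloc(unsigned int bytes)
{
    unsigned int aligned = (bytes + 15u) & ~15u;   // 16-byte alignment
    unsigned int offset  = atomicAdd(&g_heap_cursor, aligned);
    if (offset + aligned > sizeof(g_heap))
        return nullptr;                            // chunk exhausted
    return g_heap + offset;
}

__global__ void alloc_demo(int** out)
{
    int* p = static_cast<int*>(gpu_bump_alloc(sizeof(int)));
    if (p) {
        *p = threadIdx.x;
        out[blockIdx.x * blockDim.x + threadIdx.x] = p;
    }
}

Growing the heap on demand is the hard part described above: the
kernel has to park tens of thousands of threads while a host
thread maps in another chunk and publishes its base address.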
- function pointers: largely a no-no. Extensions might give you
that capability, but otherwise implement them as big switch-case
tables (see the sketch below). Even with the extensions, you will
need to guarantee that an arbitrary number (64) of threads all
happened to call the same actual function.
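
The switch-table workaround looks roughly like this; again CUDA C
for readability, and the enum and function names are mine:

// Instead of storing a function pointer, store a small integer tag
// and dispatch through a switch.
enum Op : int { OP_ADD = 0, OP_MUL = 1, OP_NEG = 2 };

__device__ float apply_op(int op, float a, float b)
{
    switch (op) {            // the "function pointer" is just this tag
    case OP_ADD: return a + b;
    case OP_MUL: return a * b;
    case OP_NEG: return -a;
    default:     return 0.0f;
    }
}

__global__ void dispatch_demo(const int* ops, const float* a,
                              const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = apply_op(ops[i], a[i], b[i]);
    // If neighbouring threads pick different ops, the hardware runs
    // every taken case one after another (divergence) -- which is
    // exactly why you want all ~64 threads to agree on one target.
}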
- stack. I don't know how to break it to you: there's no stack.
There are only around 256 dwords, which 8-200 threads get to
allocate from. Your notion of a stack gets statically flattened
by the compilers. So your whole program has e.g. 4 dwords to play
with and 64 threads in flight to hide latency; or 64 dwords but
only 4 threads to hide latency, which is 2-4x slower for
rudimentary things (and utterly fails at latency hiding, becoming
50 times slower with memory accesses); or 1 thread with 256
dwords, which is 8-16 times slower at rudimentary stuff and 50+
times slower if you access memory, even if it's cached. Add a
manually-managed programmable memory stack, and your performance
goes poof (a sketch of what such a stack degenerates into is
below).
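
For a feel of what "your stack gets flattened" means in practice,
a hand-rolled per-thread stack ends up being a fixed-size local
array, and its size is exactly the occupancy trade-off described
above. CUDA C again; STACK_DWORDS and all other names here are
purely illustrative:

#define STACK_DWORDS 16  // more dwords per thread = fewer threads in flight

__global__ void manual_stack_demo(const int* input, int* output, int n)
{
    // Lives in registers or spills to local memory; either way the
    // size is fixed at compile time -- the compiler will not grow it.
    unsigned int stack[STACK_DWORDS];
    int sp = 0;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // "Push" intermediates instead of making real nested calls.
    stack[sp++] = (unsigned int)input[i];
    stack[sp++] = (unsigned int)input[i] * 2u;

    // "Pop" them back. A deeper stack here directly costs the
    // latency-hiding threads the item above talks about.
    unsigned int b = stack[--sp];
    unsigned int a = stack[--sp];
    output[i] = (int)(a + b);
}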
- exceptions. A combined issue of the things above. Combine the
limitations of function pointers and the stack, and I hope you
get the point - or rather, how pointless the exercise of getting
Dlang as we know and love it onto a GPU would be. A
single-threaded JavaScript app on a CPU would beat it in
performance at everything that's not trivial.