Targeting Vulkan and SPIR-V
karl via Digitalmars-d
digitalmars-d at puremagic.com
Fri Mar 13 11:44:16 PDT 2015
SPIR-V may be producible from HLL tools, but that doesn't mean
it's perfectly OK to use any HLL. The capability to compile an
HLL to SPIR-V is exposed partly for syntax sugar and shallow
precompile optimisations, but mostly to avoid the vendor-specific
HLL compiler bugs that have plagued GLSL and HLSL (those billion
d3dx_1503.dll files on your system are bugfixes), and to give the
community access to one or several open-source HLL compilers that
they can find issues with and submit fixes for, so everyone
benefits. So, it's mostly about getting a flawless open-source
GLSL compiler. Dlang's strengths are simply not directly
applicable, though with a bit of work they can actually be
applied (I've done this in/with our GLSL/backend compilers).
- malloc. SPIR-V and such don't have malloc. Fix: preallocate a
big chunk of memory and implement a massively-parallel allocator
yourself (it should handle ~2000 allocation requests per cycle;
that's the gist of it). An "atomic_add" on a memory location will
help; see the sketch below. If you don't want to preallocate too
much, have a CPU thread poll while a GPU thread stalls (it should
stall itself and 60000 other threads) until the CPU allocates a
new chunk for the heap and provides a base address (and hope the
CPU thread responds quickly enough, or your GPU tasks will be
mercilessly killed).
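
To make that concrete, here is a minimal sketch of such a bump
allocator. I'm writing it as CUDA C purely because that's the
easiest GPU dialect to show in compilable form; the same shape
maps onto SPIR-V/GLSL atomics. All names and sizes (g_heap,
g_heap_cursor, gpu_bump_alloc, the 16 MiB heap) are made up for
illustration, and note there is no free() and no reuse:

#include <cstdint>

// Hypothetical preallocated heap: one big device buffer plus a cursor.
__device__ uint8_t      g_heap[16 * 1024 * 1024]; // preallocated chunk
__device__ unsigned int g_heap_cursor = 0;        // bytes handed out so far

// Massively-parallel "malloc": every thread bumps a shared cursor with
// one atomic add; only the bump is serialised, not the use of the
// returned memory.
__device__ void* gpu_bump_alloc(unsigned int bytes)
{
    unsigned int aligned = (bytes + 15u) & ~15u;   // 16-byte alignment
    unsigned int offset  = atomicAdd(&g_heap_cursor, aligned);
    if (offset + aligned > sizeof(g_heap))
        return nullptr;                            // chunk exhausted
    return g_heap + offset;
}

__global__ void alloc_demo(int** out)
{
    int* p = static_cast<int*>(gpu_bump_alloc(sizeof(int)));
    if (p) {
        *p = threadIdx.x;
        out[blockIdx.x * blockDim.x + threadIdx.x] = p;
    }
}

Growing the heap on demand is the hard part described above: the
kernel has to park tens of thousands of threads while a host
thread maps in another chunk and publishes its base address.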
- function pointers: largely a no-no. Extensions might give you
that capability, but otherwise implement them as big switch-case
tables (see the sketch below). Even with the extensions, you will
need to guarantee that an arbitrary number (64) of threads all
happened to call the same actual function.
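
The switch-table workaround looks roughly like this; again CUDA C
for readability, and the enum and function names are mine:

// Instead of storing a function pointer, store a small integer tag
// and dispatch through a switch.
enum Op : int { OP_ADD = 0, OP_MUL = 1, OP_NEG = 2 };

__device__ float apply_op(int op, float a, float b)
{
    switch (op) {            // the "function pointer" is just this tag
    case OP_ADD: return a + b;
    case OP_MUL: return a * b;
    case OP_NEG: return -a;
    default:     return 0.0f;
    }
}

__global__ void dispatch_demo(const int* ops, const float* a,
                              const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = apply_op(ops[i], a[i], b[i]);
    // If neighbouring threads pick different ops, the hardware runs
    // every taken case one after another (divergence) -- which is
    // exactly why you want all ~64 threads to agree on one target.
}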
- stack. I don't know how to break it to you: there's no stack.
There are only around 256 dwords, which 8-200 threads get to
allocate from. Your notion of a stack gets statically flattened
by the compilers. So your whole program has e.g. 4 dwords to play
with and 64 threads in flight to hide latency; or 64 dwords but
only 4 threads to hide latency, which is 2-4x slower for
rudimentary things (and utterly fails at latency hiding, becoming
50 times slower with memory accesses); or 1 thread with 256
dwords, which is 8-16 times slower at rudimentary stuff and 50+
times slower if you access memory, even if it's cached. Add a
manually-managed programmable memory stack, and your performance
goes poof (a sketch of what such a stack degenerates into is
below).
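
For a feel of what "your stack gets flattened" means in practice,
a hand-rolled per-thread stack ends up being a fixed-size local
array, and its size is exactly the occupancy trade-off described
above. CUDA C again; STACK_DWORDS and all other names here are
purely illustrative:

#define STACK_DWORDS 16  // more dwords per thread = fewer threads in flight

__global__ void manual_stack_demo(const int* input, int* output, int n)
{
    // Lives in registers or spills to local memory; either way the
    // size is fixed at compile time -- the compiler will not grow it.
    unsigned int stack[STACK_DWORDS];
    int sp = 0;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // "Push" intermediates instead of making real nested calls.
    stack[sp++] = (unsigned int)input[i];
    stack[sp++] = (unsigned int)input[i] * 2u;

    // "Pop" them back. A deeper stack here directly costs the
    // latency-hiding threads the item above talks about.
    unsigned int b = stack[--sp];
    unsigned int a = stack[--sp];
    output[i] = (int)(a + b);
}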
- exceptions. A combined issue of the things above. Combine the
limitations of function pointers and the stack, and I hope you
get the point - or rather, how pointless the exercise of getting
Dlang as we know and love it onto a GPU would be. A
single-threaded JavaScript app on a CPU would beat it in
performance at everything that's not trivial.