Targeting Vulkan and SPIR-V
John Colvin via Digitalmars-d
digitalmars-d at puremagic.com
Sat Mar 14 06:55:04 PDT 2015
On Friday, 13 March 2015 at 18:44:18 UTC, karl wrote:
> Spir-V may be producible from HLL tools, but that doesn't mean
> it's perfectly ok to use any HLL. Capability for HLL-to-spir is
> exposed mainly for syntax sugar and shallow precompile
> optimisations, but mostly to avoid vendor-specific HLL bugs
> that have plagued GLSL and HLSL (those billion d3dx_1503.dll on
> your system are bugfixes). Plus, to give the community access
> to one or several opensource HLL compilers that they can find
> issues with and submit for everyone to benefit. So, it's mostly
> to get a flawless opensource GLSL compiler. Dlang's strengths
> are simply not applicable directly. Though with a bit of work
> they can actually be applied completely. (I've done this in/with our
> GLSL/backend compilers)
>
> - malloc. SpirV and such don't have malloc. Fix: Preallocate a
> big chunk of memory, and implement a massively-parallel
> allocator yourself (it should handle ~2000 requests to allocate
> per cycle, that's the gist of it). "atomic_add" on a memory
> location will help. If you don't want to preallocate too much,
> have a cpu thread poll while a gpu thread stalls (it should
> stall itself and 60000 other threads) until the cpu allocates a
> new chunk for the heap and provides a base address. (hope the
> cpu thread responds quickly enough, or your gpu tasks will be
> mercilessly killed).
>
> - function-pointers, largely a no-no. Extensions might give you
> that capability, but implement as big switch-case tables. With
> the extensions, you will need to guarantee an arbitrary number
> (64) of threads all happened to call the same actual function.
>
> - stack. I don't know how to break it to you, there's no stack.
> Only around 256 dwords, that 8-200 threads get to allocate
> from. Your notion of a stack gets statically flattened by the
> compilers. So, your whole program has e.g. 4 dwords to play
> around and have 64 things hide latency, or 64 dwords but only 4
> threads to hide latency - and is 2-4x slower for rudimentary
> things (and utterly fail at latency hiding, becoming 50 times
> slower with memory-accesses), or 1 thread with 256 dwords,
> which is 8-16 times slower at rudimentary stuff and 50+ times
> slower if you access memory even if cached. Add a
> manually-managed programmable memory-stack, and your
> performance goes poof.
>
> - exceptions. A combined issue of the things above.
>
> Combine the limitations of function-pointers and stack, and I
> hope you get the point. Or well, how pointless the exercise is of
> getting Dlang as we know and love it onto a gpu. A single-threaded
> javascript app on a cpu will beat it at performance on
> everything that's not trivial.
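For reference, the "preallocate a chunk and atomic_add" allocator
described in the quote can be sketched in D roughly as follows. This is
only a CPU-side illustration under stated assumptions: the names
`heap`, `heapTop`, `heapSize` and `bumpAlloc` are hypothetical, the
chunk is a fixed 1024 bytes, and the "stall until the cpu provides a
new chunk" path is left out (an exhausted heap simply yields null).

```d
import core.atomic : atomicOp;

// Hypothetical preallocated chunk; on a GPU this would be a device buffer.
enum size_t heapSize = 1024;
__gshared ubyte[heapSize] heap;

// Shared bump pointer: the offset of the first free byte in `heap`.
shared size_t heapTop = 0;

/// Hands out `n` bytes from the chunk, or null once it is exhausted.
/// Memory is never freed individually; the whole chunk is reset at once.
ubyte[] bumpAlloc(size_t n) @nogc nothrow
{
    // atomicOp!"+=" returns the updated value, so this request owns
    // the range [end - n, end) and no other thread can receive it.
    immutable end = atomicOp!"+="(heapTop, n);
    if (end > heapSize)
        return null; // note: the counter stays bumped after a failure
    return heap[end - n .. end];
}

unittest
{
    auto a = bumpAlloc(16);
    auto b = bumpAlloc(16);
    assert(a.length == 16 && b.length == 16);
    // The two blocks must not overlap.
    assert(a.ptr + 16 <= b.ptr || b.ptr + 16 <= a.ptr);
}
```

Each request costs a single atomic add, which is why this style suits
many threads allocating at once, at the price of having no per-block
free.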
The reason to use D for kernels / shaders would be for its
metaprogramming, code-generation abilities and type-system
(slices and structs in particular). Of course you wouldn't be
allocating heap memory, using function pointers or exceptions.
There's still a lot that D has to offer without those. I
regularly write thousands of lines of D in that subset.
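To make that subset concrete, here is a minimal sketch of what it can
look like. Everything in it is illustrative rather than taken from real
kernel code: `Vec3`, `square`, `halve`, `negate`, `opNames`, `dispatch`
and `applyAll` are hypothetical names. It uses a struct with operators
generated by a string mixin, a compile-time generated switch in place
of a function-pointer table, and a slice over caller-provided memory,
all under `@nogc nothrow`.

```d
struct Vec3
{
    float x, y, z;

    // One template generates +, - and * via a string mixin.
    Vec3 opBinary(string op)(Vec3 r) const @nogc nothrow pure
        if (op == "+" || op == "-" || op == "*")
    {
        return mixin("Vec3(x " ~ op ~ " r.x, y " ~ op ~ " r.y, z " ~ op ~ " r.z)");
    }
}

float square(float v) @nogc nothrow pure { return v * v; }
float halve(float v) @nogc nothrow pure { return v * 0.5f; }
float negate(float v) @nogc nothrow pure { return -v; }

// Names of the operations; the index in this list is the operation id.
enum string[] opNames = ["square", "halve", "negate"];

// The switch is generated at compile time from `opNames`, so no
// function pointers are needed at run time.
float dispatch(uint id, float v) @nogc nothrow pure
{
    switch (id)
    {
        static foreach (i, name; opNames)
        {
            case i:
                return mixin(name ~ "(v)");
        }
        default:
            return v; // unknown id: leave the value unchanged
    }
}

// Operates in place on a slice; the caller owns the memory.
void applyAll(float[] data, uint id) @nogc nothrow pure
{
    foreach (ref v; data)
        v = dispatch(id, v);
}

unittest
{
    float[3] buf = [1.0f, 2.0f, 3.0f]; // stack storage, no heap
    applyAll(buf[], 0);
    assert(buf == [1.0f, 4.0f, 9.0f]);
    assert(Vec3(1, 2, 3) + Vec3(1, 1, 1) == Vec3(2, 3, 4));
}
```

Adding an operation means adding one function and one entry in
`opNames`; the compiler regenerates the switch.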
P.S. D is in pretty much the same boat as any other C-based
language w.r.t. stack space. You have to be careful with the
stack in OpenCL C, you would have to be careful with the stack in
SPIR-D.