Targeting Vulkan and SPIR-V

John Colvin via Digitalmars-d digitalmars-d at puremagic.com
Sat Mar 14 06:55:04 PDT 2015


On Friday, 13 March 2015 at 18:44:18 UTC, karl wrote:
> Spir-V may be producible from HLL tools, but that doesn't mean 
> it's perfectly ok to use any HLL. Capability for HLL-to-spir is 
> exposed mainly for syntax sugar and shallow precompile 
> optimisations, but mostly to avoid vendor-specific HLL bugs 
> that have plagued GLSL and HLSL (those billion d3dx_1503.dll on 
> your system are bugfixes). Plus, to give the community access 
> to one or several opensource HLL compilers that they can find 
> issues with and submit for everyone to benefit. So, it's mostly 
> to get a flawless opensource GLSL compiler. Dlang's strengths 
> are simply not applicable directly. Though with a bit of work 
> can actually be applied completely. (I've done them in/with our 
> GLSL/backend compilers)
>
> - malloc. SpirV and such don't have malloc. Fix: Preallocate a 
> big chunk of memory, and implement a massively-parallel 
> allocator yourself (it should handle ~2000 requests to allocate 
> per cycle, that's the gist of it). "atomic_add" on a memory 
> location will help. If you don't want to preallocate too much, 
> have a cpu thread poll while a gpu thread stalls (it should 
> stall itself and 60000 other threads) until the cpu allocates a 
> new chunk for the heap and provides a base address. (hope the 
> cpu thread responds quickly enough, or your gpu tasks will be 
> mercilessly killed).
>
> - function-pointers, largely a no-no. Extensions might give you 
> that capability, but implement as big switch-case tables. With 
> the extensions, you will need to guarantee an arbitrary number 
> (64) of threads all happened to call the same actual function.
>
> - stack. I don't know how to break it to you, there's no stack. 
> Only around 256 dwords, that 8-200 threads get to allocate 
> from. Your notion of a stack gets statically flattened by the 
> compilers. So, your whole program has e.g. 4 dwords to play 
> around and have 64 things hide latency, or 64 dwords but only 4 
> threads to hide latency - and is 2-4x slower for rudimentary 
> things (and utterly fails at latency hiding, becoming 50 times 
> slower with memory-accesses), or 1 thread with 256 dwords, 
> which is 8-16 times slower at rudimentary stuff and 50+ times 
> slower if you access memory even if cached. Add a 
> manually-managed programmable memory-stack, and your 
> performance goes poof.
>
> - exceptions. A combined issue of the things above.
>
> Combine the limitations of function-pointers and stack, and I 
> hope you get the point. Or well, how pointless the exercise to 
> get Dlang as we know and love it on a gpu. A single-threaded 
> javascript app on a cpu will beat it at performance on 
> everything that's not trivial.
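
For readers who have not seen the allocator scheme described in the quoted malloc point, here is a minimal sketch of the "preallocate a big chunk and atomic_add on a memory location" idea. It is plain C11 on the CPU, not OpenCL C or SPIR-V, and the names HEAP_BYTES, heap, heap_top and bump_alloc are made up for illustration. It never frees, and it does not attempt the stall-and-refill step where a CPU thread hands over a new chunk.

```c
#include <stdatomic.h>
#include <stddef.h>

#define HEAP_BYTES 1024          /* size of the preallocated chunk (arbitrary) */

static unsigned char heap[HEAP_BYTES];
static atomic_size_t heap_top;   /* offset of the next free byte, starts at 0 */

/* Hypothetical helper: reserve `size` bytes from the chunk.
 * Returns NULL once the chunk cannot satisfy the request. */
static void *bump_alloc(size_t size)
{
    if (size > HEAP_BYTES)
        return NULL;             /* can never fit; also avoids overflow below */

    size_t aligned = (size + 15u) & ~(size_t)15u;   /* round up to 16 bytes */

    /* One atomic add claims the range [offset, offset + aligned);
     * concurrent callers each get a distinct offset. */
    size_t offset = atomic_fetch_add(&heap_top, aligned);

    if (offset > HEAP_BYTES || aligned > HEAP_BYTES - offset)
        return NULL;             /* chunk exhausted */

    return heap + offset;
}
```

Every request costs a single atomic add, which is what lets many threads allocate at once. The price is that nothing can be released individually; the whole chunk is reset or discarded in one go.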

The reason to use D for kernels / shaders would be for its 
metaprogramming, code-generation abilities and type-system 
(slices and structs in particular). Of course you wouldn't be 
allocating heap memory, using function pointers or exceptions. 
There's still a lot that D has to offer without those. I 
regularly write thousands of lines of D in that subset.
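
As a rough illustration of what code in such a subset looks like, here is a sketch in C rather than D (C being the closest common ground with OpenCL C). FloatSlice mimics the layout of a D slice, a length plus a pointer with no ownership, and the operation is chosen with a switch on a tag instead of a function pointer. All names here (FloatSlice, Op, apply_op, map_in_place) are hypothetical. In D the tag could presumably be a template parameter, so the switch would be resolved at compile time.

```c
#include <stddef.h>

/* Rough C analogue of a D slice: length plus pointer, no ownership. */
typedef struct {
    size_t length;
    float *ptr;
} FloatSlice;

/* The operation is selected by a tag, not through a function pointer. */
typedef enum { OP_SCALE, OP_OFFSET, OP_SQUARE } Op;

static float apply_op(Op op, float x, float param)
{
    switch (op) {
    case OP_SCALE:  return x * param;
    case OP_OFFSET: return x + param;
    case OP_SQUARE: return x * x;
    }
    return x;   /* unknown tag: leave the value unchanged */
}

/* Kernel-style loop: on a GPU each index would be one work-item.
 * No heap allocation, no function pointers, no exceptions. */
static void map_in_place(FloatSlice data, Op op, float param)
{
    for (size_t i = 0; i < data.length; ++i)
        data.ptr[i] = apply_op(op, data.ptr[i], param);
}
```

Calling map_in_place with a FloatSlice over a caller-owned buffer keeps all storage decisions outside the kernel, which is the property that matters for the restrictions listed in the quoted message.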

P.S. D is in pretty much the same boat as any other C-based 
language w.r.t. stack space. You have to be careful with the 
stack in OpenCL C, you would have to be careful with the stack in 
SPIR-D.

