Targeting Vulkan and SPIR-V

John Colvin via Digitalmars-d digitalmars-d at
Sat Mar 14 06:55:04 PDT 2015

On Friday, 13 March 2015 at 18:44:18 UTC, karl wrote:
> SPIR-V may be producible from HLL tools, but that doesn't mean 
> it's perfectly OK to use any HLL. The HLL-to-SPIR-V capability 
> is exposed mainly for syntax sugar and shallow precompile 
> optimisations, but mostly to avoid the vendor-specific HLL bugs 
> that have plagued GLSL and HLSL (those billion d3dx_1503.dll 
> files on your system are bug fixes), and to give the community 
> access to one or several open-source HLL compilers that they 
> can find issues with and submit fixes to, for everyone's 
> benefit. So it's mostly about getting a flawless open-source 
> GLSL compiler. D's strengths are simply not directly 
> applicable, though with a bit of work they actually can be 
> applied completely. (I've done these things in/with our 
> GLSL/backend compilers.)
> - malloc: SPIR-V and such don't have malloc. Fix: preallocate a 
> big chunk of memory and implement a massively parallel 
> allocator yourself (it should handle ~2000 allocation requests 
> per cycle; that's the gist of it). An "atomic_add" on a memory 
> location will help. If you don't want to preallocate too much, 
> have a CPU thread poll while a GPU thread stalls (it should 
> stall itself and 60000 other threads) until the CPU allocates a 
> new chunk for the heap and provides a base address (and hope 
> the CPU thread responds quickly enough, or your GPU tasks will 
> be mercilessly killed).
> - function pointers: largely a no-no. Extensions might give you 
> the capability; otherwise, implement them as big switch-case 
> tables. With the extensions, you will need to guarantee that an 
> arbitrary number (64) of threads all happen to call the same 
> actual function.
> - stack: I don't know how to break it to you - there's no 
> stack. Only around 256 dwords, which 8-200 threads get to 
> allocate from. Your notion of a stack gets statically flattened 
> by the compilers. So your whole program has, e.g., 4 dwords to 
> play with and 64 threads to hide latency; or 64 dwords but only 
> 4 threads to hide latency, making it 2-4x slower at rudimentary 
> things (and utterly failing at latency hiding, becoming 50 
> times slower on memory accesses); or 1 thread with 256 dwords, 
> which is 8-16 times slower at rudimentary stuff and 50+ times 
> slower if you access memory, even cached. Add a manually 
> managed programmable memory stack, and your performance goes 
> poof.
> - exceptions: a combined issue of the things above. Combine the 
> limitations of function pointers and the stack, and I hope you 
> get the point - or rather, how pointless the exercise of 
> getting D as we know and love it onto a GPU is. A 
> single-threaded JavaScript app on a CPU would beat it on 
> performance at everything that's not trivial.
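The preallocate-plus-atomic_add scheme described above can be sketched 
on the CPU side as a bump allocator over a fixed arena. This is an 
illustrative analogue, not GPU code; the type and member names are made 
up. The key idea survives the translation: one atomic fetch_add hands 
each concurrent request a disjoint range, which is why it can service 
thousands of allocations per cycle.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical bump allocator over a preallocated arena.  Each request
// claims a disjoint byte range with a single atomic fetch_add - the
// CPU-side equivalent of "atomic_add on a memory location" on a GPU.
struct BumpArena {
    std::byte* base;
    std::size_t capacity;
    std::atomic<std::size_t> next{0};

    BumpArena(std::byte* mem, std::size_t bytes)
        : base(mem), capacity(bytes) {}

    // Returns nullptr when the arena is exhausted - the point at which,
    // in the scheme above, the GPU threads would stall until a CPU
    // thread provides a new chunk.  Note: next still advances on a
    // failed request; a real allocator would have to handle that.
    void* alloc(std::size_t bytes) {
        std::size_t offset = next.fetch_add(bytes,
                                            std::memory_order_relaxed);
        if (offset + bytes > capacity) return nullptr;
        return base + offset;
    }
};
```

Because the entire hot path is one atomic instruction, contention is the 
only scaling limit; freeing individual blocks, alignment, and refilling 
the arena are all left out of this sketch.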
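The switch-case workaround for function pointers mentioned above can be 
sketched as follows; the enum and function names are invented for 
illustration. Instead of a pointer, each thread carries a small integer 
tag, and a switch the compiler can flatten performs the dispatch:

```cpp
#include <cstdint>

// Hypothetical operation tags standing in for function pointers.
enum class Op : std::uint32_t { Add = 0, Mul = 1 };

// GPU-friendly dispatch: a switch over a tag instead of an indirect
// call.  Divergence is still possible, but there is no pointer chase
// and the compiler sees every target statically.
int dispatch(Op op, int a, int b) {
    switch (op) {
        case Op::Add: return a + b;
        case Op::Mul: return a * b;
    }
    return 0;  // unreachable for valid tags
}
```

This also makes the uniformity requirement mentioned above concrete: if 
all 64 threads in a group hold the same tag, the switch takes one path 
and costs little more than a direct call.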

The reason to use D for kernels/shaders would be its 
metaprogramming and code-generation abilities and its type 
system (slices and structs in particular). Of course you 
wouldn't be allocating heap memory, using function pointers, or 
throwing exceptions; there's still a lot that D has to offer 
without those. I regularly write thousands of lines of D in that 
subset.

P.S. D is in pretty much the same boat as any other C-based 
language w.r.t. stack space. You have to be careful with the 
stack in OpenCL C, and you would have to be careful with the 
stack in D.
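Being careful with the stack in that subset mostly means keeping frames 
small and statically bounded: no deep recursion, no dynamic stack 
allocation. A hypothetical illustration (in C++ here rather than 
OpenCL C or D) of rewriting a recursive definition into constant-size 
locals:

```cpp
#include <cstdint>

// Recursive form: each call consumes a frame, so the stack cost grows
// with n - exactly what a statically flattened GPU "stack" cannot give.
std::uint64_t fib_rec(unsigned n) {
    return n < 2 ? n : fib_rec(n - 1) + fib_rec(n - 2);
}

// Kernel-friendly form: the same function with two locals and a loop,
// so the compiler sees a fixed, tiny frame.
std::uint64_t fib_iter(unsigned n) {
    std::uint64_t a = 0, b = 1;
    for (unsigned i = 0; i < n; ++i) {
        std::uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;  // a holds fib(n) after n iterations
}
```

The same discipline applies whether the target is OpenCL C, GLSL, or a 
restricted D subset: the compiler must be able to bound every frame at 
compile time.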
