The trouble with SPIR-V
If you have been dealing with Vulkan or other modern GPU APIs in any capacity over the last 5 years, you have probably heard of SPIR-V. To recap, SPIR-V is a binary format for writing programs that run on the GPU, designed to be consumed by OpenCL and Vulkan. Besides those two, OpenGL 4.6 added support for SPIR-V shaders, and WebGPU's WGSL is essentially a text form of SPIR-V.
In the case of Vulkan specifically, SPIR-V replaces GLSL as the default way to feed code to the GPU. This is without question a major improvement: GLSL is a human-readable programming language with a lot of syntactic forms that need to be parsed correctly, and it has been a steady source of implementation bugs, runtime overhead and intellectual property concerns (as you essentially ship the source of every shader in your app!). SPIR-V, much like Vulkan itself, reduces the surface area of driver responsibilities, here by moving all the front-end stages of compilation to the outside.
I'm going to complain a lot about SPIR-V soon, so I want to make it clear I consider it a vast improvement over its predecessor for graphics (GLSL). I don't think there is anything cripplingly terrible about its fundamentals, but the implementation of SPIR-V in the real world is problematic for a few reasons. It can and should be improved; this blog post is an argument about what is wrong and how to improve it.
Merging compute and graphics? Think again
The SPIR-V specification defines it as an intermediate language for graphical shaders and compute kernels. The same specification is indeed shared between OpenCL kernels and Vulkan shaders! This is often used to claim that Vulkan is an OpenCL replacement, since they share an intermediate language and Vulkan driver support is a lot better. The issue is that these people did not read the fine print:
While they are defined in the same document, SPIR-V for Vulkan shaders and SPIR-V for OpenCL kernels are disjoint subsets. To be more precise, OpenCL kernels use the `Kernel` execution model and capability, while Vulkan compute shaders use the `GLCompute`1 model and the `Shader` capability. For brevity I will from now on refer to SPIR-V programs using the `Shader` capability as shaders, and likewise for those using the `Kernel` capability. There is no interoperability between kernels & shaders: Vulkan won't run kernels and OpenCL won't run shaders, not even compute ones.
As you might have guessed, compute shaders are not as powerful as compute kernels, and I will go into more detail in the next sections. It's crucial to understand that the limitations I'm going to discuss are entirely down to software nonsense: all PC graphics hardware that runs Vulkan also supports OpenCL and/or an even more capable compute API like CUDA or ROCm. This is not about what SPIR-V could be if the hardware were futuristic; it's about what SPIR-V could be if it allowed unfettered access to cutting-edge PC hardware from 10 years ago.
Idio(ma)tic control flow
SPIR-V is an SSA-form IR and it mirrors LLVM: modules are made out of functions, functions are made out of basic blocks, basic blocks are sequences of instructions without any branches, terminated by a single control flow operation (jumping to another BB, some form of branching or exiting the function and returning a value). The basic blocks and their connections form a Control Flow Graph, which is an essential tool for program optimization.
SPIR-V shaders have a fundamental, crippling limitation on control flow: it must be structured. Loops & branches/switches (called selections in SPIR-V parlance) are to be augmented with headers & merge blocks, and control flow in and out of them must go through those. What that means in regular programmer speak is that the graph has to be expressible using only if/else chains and loops, without the ability to break out of multiple loop levels at once. It also means anything involving `goto` is illegal.
This is an issue because even though we've been told goto is considered harmful2, and that may or may not be a valid argument in language front-ends, inside a compiler you really do want the ability to do arbitrary jumps. They are essential to efficiently implementing features such as exceptions, multi-level breaks, or simply dealing with previously altered control flow graphs. In fact, with structured control flow the graph looks more like a tree than a general directed graph.
A sadly common limitation
Forcing structured control flow in an intermediate language is not unique to SPIR-V, but it's universally a bad idea: compilers do not (and should not) work that way. The argument that goto is bad because it is error-prone is idiotic here, because no human writes control flow graphs: compilers do. Being stuck with structured control flow inside a compiler creates a lot of unnecessary friction and limits what the optimizer can do, and emitting structured control flow from unstructured control flow is a lossy process which results in worse code.
The reason I suspect we have to deal with this limitation in SPIR-V is probably the exact same as the reason WebAssembly has it: whenever a new standard intermediate language is created, a lot of old compilers get retooled to work with it instead of starting fresh. Some vendors have compilers which have had the structured control flow constraints baked in since the start, many years ago, which means re-engineering them is very expensive in terms of man-hours.
What does this mean for people writing compilers targeting these intermediate languages? It means you must mangle your control flow back into a structured form, or (more likely) make up a structured form for code that never had one. This is a process of wedging a square peg through a round hole, so in the general case it means turning control flow into data flow and/or duplicating code, both operations being harmful to performance either directly or by crippling the compiler's ability to optimize further.
SPIR-V pointers, explained
The story of how pointers work in SPIR-V will surprise people most familiar with programming modern CPUs. This information, despite being freely available, is very poorly documented: by that I mean only the specification really explains it, and it does so in a very formal and hard-to-follow manner. The limitations and precise capabilities are not explained clearly upfront but have to be pieced together after reading enough of the spec. This is an issue: the feasibility of targeting SPIR-V/Vulkan is very hard to assess because of it, and you routinely have to re-evaluate your mental model of it.
I will do my best to boil away the specalese3 and explain this in plain English, as faithfully as possible, but I'm only human. There are a bunch of storage classes in Vulkan SPIR-V; storage classes can be thought of a bit like different address spaces (and may well map to exactly that). Here's a quick recap of the relevant ones:
| Storage class | Purpose |
|---|---|
| `Input` / `Output` | Inputs / outputs for graphics pipelines |
| `StorageBuffer` | GPU global memory, GLSL SSBOs4 |
| `PushConstant` | Vulkan push constants |
| `Private` | Private (to invocation) memory |
| `Function` | Like `Private` but further limited to the current function scope |
You'll notice there is no `Generic` storage class. Well, actually there is, but guess what: it's OpenCL-only! This isn't too bad though: Vulkan is meant to abstract over a bunch of hardware, and pretending to have flat memory would be an over-abstraction that doesn't reflect reality very well. Even then, that OpenCL generic class was only ever generic over global, shared and private memory5.
Addressing & storage classes
According to the Vulkan specification before any extensions (we'll get there), shader programs have to use the `Logical` addressing model. What the logical addressing model does is make it so pointers have no physical representation: you can only create pointers from known objects, and you cannot load/store pointers. These are not really pointers at all: since the pointee is always known at compile time, it is equivalent to simply using the pointee directly.
This is a sane way to handle things like pipeline outputs or push constants: these can be implemented in ways where talking about actual addresses makes no sense, but making them use the standard load/store instructions makes for a simpler IR that can encode reading/writing to those special areas, whether or not they're pointers under the hood. This leaves textures (which I won't discuss here in detail6) & the three tiers of "real" memory to address.
So we have arrived here: using stock Vulkan, you're basically stuffed if you try to write algorithms or create/consume data structures with pointers. Data structures that would have used pointers have to be flattened, either automatically or manually, and this massively limits the usefulness of Vulkan as a compute API. On top of the annoyance, having to flatten all your data structures means you pay a performance and complexity price to use arrays and offsets instead, and sharing code between CPU & GPU becomes far less appealing in the presence of such barriers.
Extensions to the rescue
Two major extensions address, mostly but not quite comprehensively, the limitations stock Vulkan has versus classical compute APIs:
`SPV_KHR_variable_pointers`, consolidated into SPIR-V 1.3 as a core feature (but still optional), eases the restriction of the `Logical` addressing model by allowing pointers to be created from `OpPtrAccessChain` (pointer arithmetic!) and a couple of other misc instructions. This means a lot of algorithms which use pointers to local variables can now be expressed neatly. `OpLoad` & `OpStore` are now allowed to deal with pointers into `Private` memory, meaning they were physical pointers all along! Sadly the courtesy was not extended to pointers to shared and global memory.
While the easing of restrictions suggests pointers are implemented as physical ones, they are still logical as far as the specification is concerned: you still cannot cast between pointers & integral types to mess with the bit-pattern directly, for example.
BDA: a turning point ?
The Vulkan extension `VK_KHR_buffer_device_address`, along with `SPV_KHR_physical_storage_buffer`, gives us what we really wanted all along: normal pointers. 64 bits of freely computable, storable and addressable goodness. On the SPIR-V shader side, a new storage class called `PhysicalStorageBuffer` is introduced, and that class has actual physical addressing. The way this works is that you (still) allocate buffers on the host side as usual, but instead of binding them you query the driver for their address; they are pinned to device memory so you can access them whenever you damn please.
There’s a catch: as those familiar with Vulkan terminology would have picked up, these pointers are only for buffers in global memory. You can’t use them to point to stuff in local/shared memory.
Bearing that in mind, you may do everything you want with buffer device addresses (BDAs), including building linked lists & trees with them, subtracting them from a base pointer to get offsets for more compact pointers, whatever you want. They're not exactly new functionality, but they're new as cross-vendor functionality. Even the mobile vendors are getting on board, this is serious now…
BDAs are awesome and bypass many of the worries of having to bind buffers to descriptors: you can just put all the arguments to a shader into push constants, and at worst they might spill, but there is no more thinking in terms of buffer binding points & SSBO layouts: you just have pointers into GPU global memory and you can build whatever you want with them. Heck, you could even garbage-collect them…
Until next time
So is this it? No, of course not. While BDA sure is nice, it opens the door to more questions, namely: why can't we have physical addresses into shared memory7? What about unified addressing with the host, or even unified memory? I'm still looking into the Vulkan & SPIR-V ecosystem, trying to understand its many complexities and finding my way around the limitations I outlined. There is much more to talk about: atomics, subgroup operations, sync, memory models, device-specific intrinsics, …
I want to make it clear that both the control flow & many of the pointer restrictions have no basis in hardware constraints: the hardware on PC is plenty capable of understanding pointers, has been for a long time, and such functionality is exposed through compute APIs; in fact the OpenCL flavour of SPIR-V itself uses the `Physical64` addressing model.
The story of bridging the gap between compute & graphics stacks is more than can fit into a blog post. I hope this has been informative for everyone interested in GPUs and the APIs to target them, and I hope, in my wildest dreams, that some CTO at a sentient sand machines company will see this and figure their company should be the first to unlock the full power of their GPUs in a graphics API on PC, including arbitrary control flow & proper pointers. I think it’s high time for this.
As far as the SPIR-V spec is concerned, OpenGL and Vulkan compute shaders are the same thing. ↩︎
I should explain that this particular quote is used in a somewhat sarcastic manner: Dijkstra's comment applied to high-level languages of the time and to the process of verifying invariants and characterizing forward progress during a program's execution. Not only are these comments severely outdated by now, they are routinely taken out of context in online discussions to support flawed arguments against supporting non-local jumps inside functions.
In particular, the case Dijkstra made in his letter does not apply here: in SSA form the usage of `goto` is not "unbridled", but in fact encoded explicitly in a control flow graph's edges. Indeed, the very reason to use this representation for programs is to ease performing flow analyses! ↩︎
A portmanteau of Specification and Legalese (i.e. lawyer speak). It's a great word, you should use it :D ↩︎
SPIR-V 1.3 added an explicit storage class for general-purpose buffers in global memory, `StorageBuffer`, and deprecated the usage of the `Uniform` storage class for them. ↩︎
`CrossWorkgroup` being the OpenCL flavour of global memory. ↩︎
And what about private memory? Well, with variable pointers you may use them inside data structures as long as they're not visible to other invocations (the `Private` & `Function` storage classes). So considering we have the capacity to do pointer arithmetic on them, and that they by definition should not leak, only messing with their bit-pattern is disallowed, which isn't such a big deal. ↩︎