The trouble with SPIR-V, 2022 edition

May 23, 2022

2022-05-02: I have started rewriting this article for improved style, reduced snark and more accurate information. I also added a new section about control flow that should be easier to digest and better express my thoughts on the topic. The section about memory has been removed and will reappear in a later entry.

You have probably heard of SPIR-V. If not: SPIR-V is a binary format for programs that run on the GPU, designed to be consumed by OpenCL and Vulkan. Beyond those two, OpenGL 4.6 added support for SPIR-V shaders, and WebGPU’s WGSL is essentially a close cousin.

In the case of Vulkan specifically, SPIR-V replaces GLSL as the default way to feed code to the GPU. GLSL is a human-readable programming language with a lot of syntactic forms that need to be parsed correctly, and it has been a steady source of implementation bugs, runtime overhead and intellectual property concerns (as you essentially ship the source of every shader in your app!).

SPIR-V, much like Vulkan itself, trades a bigger API surface for dramatically less driver complexity: by feeding the driver an intermediate language instead of source code, all the complexity and bug risk of implementing the high-level language moves outside of the driver.

This is, IMO, an unquestionable upgrade. We can now genuinely expect shaders to “just work” on another vendor’s driver. We can use any shading language we want, make up our own, or even implement our own Vulkan software driver with considerably less work. SPIR-V, I argue, is a major component of Vulkan’s success, and I hope it continues to expand in this direction.

In this series, I’ll be talking about SPIR-V as a compiler target and a language for GPUs, from the perspective of a graphics hobbyist turned research compiler author. We’ll focus mostly on compute aspects, at least for now, since for graphics I believe SPIR-V is acquitting itself quite well and I don’t have a lot to say.

Merging compute and graphics: not quite.

The SPIR-V specification defines it as an intermediate language for graphical shaders and compute kernels. The same specification is indeed shared by OpenCL kernels and Vulkan shaders, which often leads to the claim that Vulkan is an OpenCL replacement, since they share an intermediate language. Sadly, while they are defined in the same document, SPIR-V for Vulkan shaders and SPIR-V for OpenCL kernels are disjoint subsets.

To be more precise, OpenCL kernels use the Kernel execution model and capability, while Vulkan compute shaders use the GLCompute1 execution model and the Shader capability. For brevity I will from now on refer to SPIR-V programs using the Shader capability as shaders, and likewise to those using the Kernel capability as kernels. There is no interoperability between kernels and shaders: Vulkan won’t run kernels and OpenCL won’t run shaders, not even compute ones.

As you might have guessed, compute shaders are quite different from CL kernels, and lack certain important features. I want to stress that all PC graphics hardware that runs Vulkan also supports OpenCL and/or an even more capable dedicated compute API like CUDA or ROCm. The reason there is any feature gap at all is how we’ve approached building software for this hardware, mostly the compilers.

Divergent control flow stories

SPIR-V models programs similarly to LLVM: modules are made of functions, functions are made of basic blocks (BBs), and basic blocks are straight-line sequences of instructions with no internal control flow, terminated by a single control-flow operation (a jump to another BB, some form of branch, or a return from the function). The basic blocks and their connections form a Control Flow Graph (CFG), which tells you which paths through the basic blocks can be taken.
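To make that concrete, here is a hand-written sketch of a single function with two basic blocks in SPIR-V assembly (illustrative only, not run through the validator; ids like %void and %void_fn are assumed to be declared earlier in the module):

%foo   = OpFunction %void None %void_fn
%entry = OpLabel                 ; first basic block
         OpBranch %next          ; terminator: unconditional jump to another BB
%next  = OpLabel                 ; second basic block
         OpReturn                ; terminator: exit the function
         OpFunctionEnd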

We will see that shaders have significant restrictions on their control flow: it must be structured. I found when rewriting this article that just giving those restrictions without proper context was unhelpful, so here is a small introduction so we’re all on the same page:

What is structured control flow?

Let’s look at some imaginary piece of hardware. You’re given, essentially, a processor that can execute the same instruction on N pieces of data at once, i.e. a SIMD machine. What we want to do is take our shader code, which describes what one instance of the program does, and somehow run it on our machine so that it processes N instances of the program (typically called threads) simultaneously; this is known as the SIMT model.

The easiest way to achieve this is to take all the instructions of our original program and make them wide, i.e. turn them into vector operations. You take an add r0 1 and turn it into an addv v0 <1 x 8>. This works brilliantly for programs without control flow, but problems arise at points in the program where some instances want to do different things, like a conditional branch or an indirect function call. This phenomenon is known as thread divergence: different logical SIMT threads executing on the same physical SIMD thread want to take different paths.

The classic strategy for dealing with divergence is known as “if-conversion”, also called predicated or masked execution: in short, we execute both sides of a branch and ignore the results we don’t care about. Most hardware makes this easy by providing a special mask register that disables some of the SIMD lanes for us.

So executing an if statement involves computing the condition; if any thread wants to take the true branch, setting the mask accordingly and executing that branch; doing the same for the false branch; and finally restoring the mask to what it was at the start of the branch.

// global lane mask: which of the 8 SIMD lanes are currently enabled
extern <bool x 8> mask;

void foo(<bool x 8> condition) {
    <bool x 8> old_mask = mask;
    // run the true side if at least one lane wants it
    if (condition != <false x 8>) {
        mask = old_mask & condition;
        statement_a;
    }
    // run the false side if at least one lane wants it
    if (condition != <true x 8>) {
        mask = old_mask & !condition;
        statement_b;
    }
    // reconverge: restore the lanes that were active on entry
    mask = old_mask;
}

In order to perform this transformation, we need reconvergence information. In plainer terms, for each point where divergence can occur, we need to know where the threads will join up again, so that we don’t end up running the entire rest of the program twice after diverging.

This information can be recovered automatically in some cases, but by far the simplest approach is to require the program to contain it explicitly, via the so-called structured control-flow discipline. GLSL and HLSL already satisfy those rules by construction: since they do not have a goto statement like C, you can simply look at what follows a loop or conditional statement in the AST.

But for SPIR-V this is troublesome, since we have a CFG, not an AST2. The OpenCL flavour just figures this out by itself using magic (complicated compiler analyses and transformations), but the Vulkan dialect handles it completely differently: shaders are effectively required to say explicitly where ifs and loops are, by providing that information in the form of an OpSelectionMerge or OpLoopMerge in front of the corresponding jumps, which effectively enforces more or less the same structural constraints that the syntax of the high-level shading languages has3.
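For illustration, here is roughly what a structured if looks like in SPIR-V assembly; this is a hand-written sketch (not validator-checked), with %flag assumed to be a pointer to a boolean declared elsewhere:

%cond  = OpLoad %bool %flag
         OpSelectionMerge %merge None          ; declares %merge as the reconvergence point
         OpBranchConditional %cond %then %merge
%then  = OpLabel
         ; ... the "true" side of the if ...
         OpBranch %merge
%merge = OpLabel                               ; threads join back up here

OpLoopMerge plays the analogous role for loops, with an extra operand naming the continue target.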

Problematic control flow

The main, conceptual issue with structured control flow is that not all code is naturally structured. Even worse, in the GPU niche there is a tension between structured and unstructured representations, because depending on the problem at hand, the easy canonical solution either relies on structure or on a plain graph representation.

Modern SSA and continuation-based compiler literature works strictly with unstructured graphs to represent control flow in programs. This is a good fit: on a scalar machine, control flow can be totally arbitrary, which is the most efficient way to run, and an unstructured graph is also the most natural and effective way to represent complex programs built around decision trees, finite state machines or multi-level breaks, all of which lack inherent structure.

Adding structure means changing the graph to get rid of the impossible paths: forcing a loop to always go through a single node, duplicating nodes that appear at different stages on different paths, and adding special “railway switch” variables to keep the program behaviour the same even though we changed the graph. Essentially, we turn problematic control flow into data flow, which makes the code bigger, less readable and less efficient.
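As a hypothetical illustration (not taken from any particular compiler; found, n, m and result are made-up names), here is a two-level break turned into data flow with one of those switch variables:

// intent: if (found(i, j)) { result = j; goto done; }   (an unstructured jump out of both loops)
// what a structurizer has to emit instead:
bool done = false;
for (int i = 0; i < n && !done; i++) {
    for (int j = 0; j < m && !done; j++) {
        if (found(i, j)) {
            result = j;
            done = true;    // the jump becomes a flag: control flow turned into data flow
        }
    }
}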

Yet there are reasons GPU-specific compilers would like to have structured information to work with. We’ve talked about if-conversion before; a similar thing needs to be done for loops, and both are much easier if you’re working with structured information, with which “where does X reconverge?” is an easy question to answer.
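In the same spirit as the masked-if pseudocode from earlier, a divergent loop can be lowered so that it keeps iterating until no lane wants another round; here is a rough sketch in the same made-up vector notation (loop_condition stands for whatever the source loop tests):

<bool x 8> old_mask = mask;
// keep going while at least one lane still wants another iteration
while (mask != <false x 8>) {
    loop_body;
    // lanes whose condition turned false drop out; the rest continue
    mask = mask & loop_condition;
}
// reconverge: restore the lanes that were active before the loop
mask = old_mask;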

Furthermore, as we will see in the next section, dedicated GPU shading languages expose instructions that are sensitive to the set of active threads, i.e. the threads that are not currently masked off (inactive threads are masked off and skip over instructions they don’t need to execute).

bb1:
%texel = texture_load_vec4(%tex, %coords)
br %cond bb2 bb3        ; %cond computed earlier, possibly divergent across threads

bb2:
%5 = max %texel.0 %texel.1
%6 = max %texel.2 %texel.3
%7 = max %5 %6
j bb3

bb3:
%color = phi (0, bb1) (%7, bb2)

In this example, you might want to move the texture load down into bb2, but doing so can actually break the program: a texture load with implicit LOD relies on neighbouring threads to compute derivatives, so it gives wrong results if it does not execute in uniform (i.e. all threads active) control flow! In the presence of such convergence-sensitive operations, a plain CFG representation is dangerous, because optimisations can move things in ways that break the program, since the information and invariants involved are exactly what purely unstructured representations fail to encode4!

Subgroups and divergence

In the discussion about structured versus unstructured control flow, and about what the underlying issues guiding our decision-making are (or rather, should be), we need to talk about subgroup-level operations. A reasonable approximation of a subgroup is one of those bags of N threads executing on a single SIMD processor, as described earlier.5

Subgroup intrinsics let us do things that affect the whole subgroup, such as (a GLSL sketch follows the list):

  • Query which threads are currently active
  • Ask whether any thread thinks a predicate is true
  • Ask whether all threads agree a predicate is true
  • Have threads vote on the value of something
  • Have threads exchange data with each other
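As a rough GLSL sketch of what these look like in practice (using the GL_KHR_shader_subgroup extensions; the shader body is made up purely for illustration):

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_vote : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_KHR_shader_subgroup_shuffle : enable

layout(local_size_x = 64) in;

void main() {
    float value = float(gl_SubgroupInvocationID);   // some per-thread value
    bool  pred  = value > 3.0;

    uvec4 active = subgroupBallot(true);            // which threads are currently active
    bool  any_p  = subgroupAny(pred);               // does any active thread say yes?
    bool  all_p  = subgroupAll(pred);               // do all active threads say yes?
    float first  = subgroupBroadcastFirst(value);   // take the first active thread's value
    float other  = subgroupShuffle(value, gl_SubgroupInvocationID ^ 1u);  // exchange with a neighbour
}

Inside divergent control flow, only the active threads participate in these operations, which is exactly where the trouble starts.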

The semantics of those operations are pretty self-explanatory when we’re in uniform control flow: you just ask all the threads in the subgroup. But if we use those operations inside divergent control flow, suddenly the set of active threads matters, and it matters in the most dramatic way, because it has been made part of a programming model that the programmer expects to behave in a predictable fashion!

If the language is structured, determining the set of active threads is a matter of following the structure of the program: if/else branches divide the set of active threads, and after the branch the sets merge again; it’s a similar story for loops.

But if all you have is a control-flow graph, things get tricky: the typical example is a statement inside a basic block that ultimately exits the loop. Is that block inside or outside the loop? According to the graph, it will always be outside, but syntactically it’s clear we want it to execute as part of the loop!

while (true) {
    ...
    if (blah) {
        please_execute_this_in_the_loop();
        break;
    }
}

It’s actually shocking just how ill-defined this problem area can be in state-of-the-art graphics stacks. Commercial compilers often get snippets like this wrong, even when the source language is strictly structured! I would say the fault lies mostly with the models we use: it turns out there is no standard, water-tight way to represent convergence information in unstructured GPU languages, and that’s a huge problem.

Quick survey

  • OpenGL and Vulkan use structured control flow, either directly via GLSL, or via the structure annotations SPIR-V offers on its control flow. There is no way to do unstructured jumps, even for uniform control flow where they are not a problem.
  • DirectX uses an LLVM-derived IL that is not structured. AFAIU, it tries to perform “maximal reconvergence”, meaning it suffers from not being able to represent the loop-break-path example we just discussed.6
  • OpenCL just doesn’t try: it makes divergent use of these operations undefined behaviour.7 Not all GPU programs use subgroup operations, so this may be acceptable sometimes, but it definitely feels like a big omission, and it is outright unworkable for graphics.
  • Modern CUDA uses explicit programmer-managed masks, which is powerful and takes advantage of their hardware specifics. But misusing the mask can cause a deadlock, as divergent threads could simply never participate in a subgroup operation that expects them to, leaving the other threads to block forever. I can see why this solution leaves something to be desired, as it just offloads the problem and the risk of misuse onto the user.

In conclusion

SPIR-V shaders are not unique in enforcing structural constraints; other representations, such as WebAssembly, have analogous limitations, and naturally people complain about those too. I don’t make it a secret that I dislike structured control flow, and I especially dislike it in WebAssembly, since Wasm is not meant to run on GPUs and the constraints seem to be mostly about technical debt in JavaScript VMs.

The original version of this article left it at me expressing my dislike for structured control flow from a compiler author’s standpoint, without giving a constructive alternative and without considering things like properly defining subgroup operations. It’s outside the scope of this post (and of my current knowledge) to survey all potential solutions, but we can probably address one right now:

One obvious “simple” model is to drop the pretense and expose the SIMD units as variable-width vector units in the GPU APIs, letting the shading language compiler become responsible for emitting masked code with no non-uniform jumps. But this is probably unrealistic to implement for a variety of reasons8, and I think there is a nicer workable solution somewhere that will eventually answer the question satisfactorily.

In the meantime, I feel that having to emit structured code from non-structured code is a drag on SPIR-V adoption: the transforms are hard to implement, lossy, and result in ugly code. On the other hand, entirely unstructured code would break major properties we rely on for correctness. I think this area deserves a better solution, ideally one that works for both consumer APIs.

Thanks to nanokatze, martty, madmann, juuso, L4, clepirelli and Jaker for giving feedback and pointing out typos in my definitely not flakey writing.


  1. As far as the SPIR-V spec is concerned, OpenGL and Vulkan compute shaders are the same thing, or at least the same execution mode. ↩︎

  2. Abstract Syntax Tree, a data structure representing the code as it is structured in the original source file. Here this is used as a somewhat handwavy way to mean “an IR with structural information equivalent to what is found in the AST”. See the footnote about NIR. ↩︎

  3. There are some differences: in SPIR-V you can use OpPhi to emulate basic block parameters, you can “skip” cases in switch fallthrough, and there are explicit OpUnreachable / OpKill instructions, etc. But generally you still need to obey the same spirit as if you had written the code using if/else. ↩︎

  4. It is such an issue that the NIR compiler (used by the Mesa project for implementing OpenGL and Vulkan drivers) actually enforces structural constraints as part of its internal representation. Jason Ekstrand wrote an excellent post about this. ↩︎

  5. This is only an approximation since vendors can and have made processors where the threads within the subgroup can execute independently. However such vendors are stuck effectively emulating the observable characteristics of the simplified model we described. ↩︎

  6. I believe this paper is what DirectX implements. ↩︎

  7. From the OpenCL spec, specifically: “These built-in functions must be encountered by all work items in a subgroup executing the kernel.” ↩︎

  8. Off the top of my head: this would force vendors to implement divergence in exactly one way (the way the upstream shader compiler picks), it would be a hard break from the model that has been used up to now, there is no migration path, and most “production” vectorisers sadly don’t work with variable vector length. ↩︎