Why your GPU can (and probably should) do real function calls

August 02, 2023

This is a post about function calls on GPUs. To be sure: we are talking about real function calls, the kind that involve jumping around in the binary back and forth, not the kind that gets systematically inlined away and effectively serves as a type-safe macro system.

TL;DR

Let’s not bury the lede: every graphics card that can run compute shaders is turing-complete¹, and therefore it can implement everything the host can run. Few people will dispute that fact, so the argument against function calls is usually some handwaving about performance. Starting from first principles, let’s see why GPUs can do, and probably should embrace real function calls.

GPUs are not special.

Much discourse treats GPUs like their own special thing, which they are, but mostly in the sense that they pack a bunch of specialised equipment to accelerate non-computational workloads and we don’t get to program them with nice general-purpose tools, in a classic case of a self-fulfilling prophecy.

Yet ever since we got programmable shading (that’s a story for another time), GPUs hardware rather quickly converged on being rather boring wide SIMD machines. That is to say, in terms of control-flow and assembly, programming a GPU is just like programming any remotely modern processor with a vector unit: you have vector registers and you have vector instructions.

Working directly with those is usually considered grading caveman work², and it just makes more sense to use a SIMT programming model where instead of writing code that controls the whole processor core, you write code meant to be executed by only one slice of the vector unit, and you leave the compiler to deal with packing multiple of those together to run on the same physical core. On a GPU you might do that using GLSL or CUDA, on a CPU you can do it using OpenMP or ISPC.

The hardware isn’t limited, but the software often is.

So why isn’t GLSL as capable as ISPC ? Why does ISPC get function pointers and GLSL does not ? Well let’s look at CUDA instead. CUDA has that stuff³. It has had it for a while ⁴. It should not be surprising actually, after all a function call is just a spicy jump.

The real question is, if this stuff is available on CUDA, and has been for so many years, how come GLSL doesn’t have it ? Well, the reason GLSL does not have true function calls nor function pointers are multiple … (with some speculation):

GLSL was initially designed at 3DLabs at the start of 2001 and was already enormously future-looking and disruptive. GPUs of the day could not run all the stuff GLSL was offering, and they had to stop somewhere.
While some of the hardware of the day could do arbitrary jumps just fine, some was too primitive to allow a practical implementation, or would have brought additional compilation challenges.
The scope of what was being done at the time was small enough that it did not matter much, programs were still very small. The maximum size of shader programs in hardware itself was rather tiny.

More puzzling is why this limitation never went away in the following two decades from graphics APIs, despite being ousted from the GPGPU world in its early days. I think that me having to write this post hints at the fact we might be looking at somewhat of a cultural problem, rather than a strictly technical one.

Function calls are not more expensive on GPUs

Immediately after conceding that GPUs can and do perform function calls, it’s usually brought up that function calls are big horrible performance-sucking slugs that will annihilate any resemblance of performance. It’s rarely backed up by a convincing argument as to how that works exactly.

To understand where that idea comes from, let’s look at the ISPC documentation. It mentions that varying function calls are implemented by first partitioning the set of callees and then calling each in turn. That means that the cost of that divergent call will be the cost of all the called functions, added together ! That’s really slow right ?!

Yes it is slow. But it’s expected: what’s really going on is that you are diverging, and the fact you did so using a function call or a switch statement makes little difference. Furthermore, even if such a divergent call is really slow, that slowness must be balanced with the expressiveness and utility of having the language take charge of supporting it for you.

The self-fulfilling prophecy

The default alternative to not doing true function calls is to just inline every function until you’re left with none. There are genuine performance benefits to inlining, because it allows specialisation of the callee’s body by the compiler. However it comes at a direct, rather unavoidable code size hit. Every optimising compiler makes liberal use of inlining, using clever heuristics to maximise the benefit/cost ratio.

The problem is that if you don’t implement real function calls, you are stuck having to inline everything. With most graphics APIs there is no such thing as inling heurisitcs, picking a good tradeoff or maximise the cost-benefit, it’s just the only path forwards.

This also means that in order to guarantee this strategy succeeds, you need to aggressively disallow certain patterns, effectively here that’s going to be banning both function pointers and recursion, as you cannot generally inline infinitely. With function pointers you can’t generally know what’s being called, or even a set of potential callees, so this is a losing game.

The unseen costs of not having true function calls

Why do functions matter anyways ? I could write a whole blogpost on that, and maybe that’s next. But the TL;DR is that functions are the single most important form of abstraction, and function calls are their runtime part. Runtime function calls are an amazingly versatile mechanism and enable large systems, that could never be built as a monolith, to work together.

Function pointers enable things like recursion, efficient closure-conversion and shared libraries, they make programs more expressive, more general, smaller and plain old better. They are essential to modularity. You could not have modern computing, with modern OSes, applications or games without them.

Yet within the field of graphics, that’s exactly what’s being asked of programmers. And it has real costs: the small scope of early shaders gave way to sprawling shader codebases, but we still handle shader programs much the same way. Which means it doesn’t scale !

And remember, GPUs are turing-complete, and not even GLSL manages to lock that away. So all this power is still within our grasps, it just cannot be achieved through sensible means. Which means that if you really need something like a function call, you can build your own, something that is difficult, error-prone and likely to actually be slow because you’re not given access to the proper hardware facilities.

The real cost of function calls and what we can do about it

I’ve often read hand waving about how GPUs strictly do all computation in registers and cannot support a stack. This is patently false, CUDA implements a stack and you can readily find mentions of that in NVidia’s documentation.

More broadly, every GPU driver is capable of spilling, spilling is what happens when you run out of registers ⁵, and while it is in fact very undesirable from a performance standpoint, a working program is infinitely faster than one that the compiler refused to compile because you exceeded some resource the compiler itself is supposed to manage for you. Internal compiler errors are not exactly user-friendly.

Still, the stack is expensive. Not just in term of slow memory accesses, but also the sheer number of individual stacks needed to support execution on hundreds of thousands of program instances. Here’s Olivier Giroux from NVidia talking about this.

It would be nice if there was a form of function calls that didn’t involve growing a stack… and there is! Tail-calls are function calls that happen as the last thing in a function. Because of that fact, and as long as you do not leak the addresses of local variables, you can implement a form of true function calls with no stack requirements at all.

Sure would be nice if we had that in Vulkan …

In fact it’s likely every OpenGL-capable graphics card full stop is in fact turing complete. It’s rather hard to make something incapable of being coaxed into doing computations. ↩︎
I’m sure I can expect angry comments from people for saying that. Maybe I’d be less foul-mouthed if vector intrinsics weren’t utterly awful. ↩︎
Make no mistake, while the link enforces some restrictions, everything else is allowed, ie you can recurse using __device__ functions! ↩︎
The fact it has had this stuff since CUDA 3.2 and sm_2x hardware should clarify to any reader that those capabilities have nothing to do with whatever magic special-sauce NVidia marketing is touting this week either. ↩︎
I know about the register pressure / number of waves tradeoffs. You can still run out of registers, partly because workgroup memory isn’t infinite, partly because you might pick spilling as the lesser evil and fit within fewer registers. You certainly would benefit from having the option. ↩︎

Gob's blog