Few topics prompt intense discussion like undefined behaviour. In it’s C++ form, UB is often reviled as something that is designed to hurt you and humiliate you in front of an audience of smug compiler writers. As a smug compiler writer I deny any involvement.
Unfortunately there is an important distinction often lost as everyone uses C++ specifically as the basis for the discussion 1. Which is, that undefined behaviour includes both things the designers would rather not define (for platform portability, compiler vendor portability or to allow more optimisations), and things they cannot define. Today I’d like to talk about such an undefinable thing, and try to explain in the simplest terms what the issue is, what little you can do about it.
The problem statement
Every useful programming language provides abstractions, it’s what they do. One of those common abstractions are variables 2, named things that refer to some value you can retrieve from them. Sometimes they’re immutable, sometimes you are allowed to change what that value is. Another very useful thing are arrays, special types that give you a flat data structure with constant random access, and happen to map really well to the hardware (flat memory, cache-friendly).
There is always a bit of danger with arrays though, because the data type you use to index into them (usually) has a much larger range than what the array can physically hold. A sane, easy solution is to simply check array accesses against a known size, throwing compile or runtime errors when attempting to access out of valid bounds. However that’s not always good enough, sometimes the cost of those extra checks is something we’d rather not pay. Enter the conundrum:
Out of bounds accesses, if uncaught, can corrupt the program memory and break the variables abstraction: they allow a variable to change without the program code explicitely referring to it, directly or indirectly. Which variable you do hit is entirely platform-determined and unknowable at the specification level. This completely breaks the logical framework in which you build your language specification, because now every single operation that reads variables has to accept a case where it behaves unpredictably:
for i in 0 .. Nmay not terminate
let x = true; if x then foo() else bar()might run
if x == 0 then 1 else 1/xmay cause a division by zero
let five = 5might not be in fact five when we look at it later.
This is, I hope anyone can agree, insane. It’s no-good, fuck-this-shit, nopetastic, nightmare material. It breaks the ability to reason about pretty much anything as it breaks the fundamental abstraction of variables, and in turn, anything you’d build on top, like functions or control flow constructs.
You can take this as a good reason to stick with verified bounds forever, and I would laud that choice, but there is another, darker path: you can simply refuse to think about the cases where memory corruption happens. Now to be logically sound you can’t just ignore the consequences of memory corruption, you have to assert its absence in text. In other words you need to not define what happens if out of bounds accesses happen. Here we are.
“But the hardware/implementation defines it !”
The immediate objection many have is to come back a couple paragraphs and argue than this is defined: the platform does ! Ignoring the whole discussion about relying on implementation details that aren’t even consistent run-to-run on a given machine, this does not solve anything. You still have broken the variables abstraction. It’s still impossible to meaningfully describe the behaviour of code fragments, or to guarantee reliable results at the macro level.
It’s also not desirable. I decided to make this blog post about undefinable things, but playing devil’s advocate for a second, if you define OOB accesses to be “whatever the platform does”: you have just made all those implementation details about the platform part of the language spec. The way everything is laid out into memory, where variables go and so on. This is the nuclear approach: a tautological “the behaviour is what the behaviour is”. You’ve essentially turned everything into the world’s biggest deterministic state machine, impossible to revise, fix or otherwise improve. And it gets worse.
It gets worse
The elephant in the room is people who refuse bounds checking are after some ideal of performance. I say ideal, because committing to any sort of physical memory layout is far more expensive than the single-digits perf hit of checking every array access. The reason for that is a peek inside the world of optimising compilers, and a stern reminder of basic performance facts.
One of the most important optimisations a compiler performs is placing things in registers, rather than on the stack. Memory these days is very slow, and a single cache miss can cost you hundreds of cycles. It is of the utmost importance to place variables in registers as much as possible. Where they won’t have addresses at all !
Furthermore the compiler will occasionally bounce variables to and from registers as things are computed and register pressure varies, trying to put as much work between spills, loads and usage of said loads to minimize potential stalling. It’s a beautiful, complex symphony, and it means the orthodox idea that a variable is just a slot in the stack is utterly inadequate to describe what’s truly going on.
Of course, as soon as someone takes the address to this stack slot and leaks it to unknown code, the fun ends. Since compilers can’t know what foreign code does, they have to throw the facade and keep the variable in memory. This isn’t very fun, and there are a few things that can trigger this. Now let’s think about what happens with out-of-bounds accesses, if you wanted them to be honored “in the obvious way”. That’s right - you need to turn all of that off, becuse any variables can be accessed from anywhere !
It’s still going to be broken mind you, RISC architectures don’t even allow you to use memory as operands directly, so what you’ll see in the stack will not necessarily reflect the true state of variables. This is just the logical conclusion of a broken mentality.
Correct and undefined
The intuition some have is that it is informal or lazy or even incorrect to simply leave certain behaviours undefined. In this case, the exact opposite is true:
If you have, in your language, a way to mutate unrelated variables by computing their addresses out of thin air, and you assume variables work like variables and don’t mutate unpredictably, you have a problem. If you do not carefully rule such behaviour out, your language specification is wrong and contradicts itself: it expects certain behaviour to never happen despite allowing for it. C++ is correct and consistent because it does not define out of bounds accesses as something legitimate.
Because you can’t. They’re undefinable.