Reflections on Trusting Trust, Ken Thompson, 1984.
Today we’re taking a look at Ken Thompson’s Turing award acceptance speech from 1984. Ken Thompson is a foundation of our computing world, and I think his work speaks for itself: he brought us Unix, B1, UTF-8, Golang, and probably a ton of other things that I don’t know about.
This is an interesting paper because Ken Thompson first starts out2 by saying that he would like to “present to you the cutest program I ever wrote”, but as we soon see, this cute little program has profound consequences. He does this in three stages.
To give a summary, he first shows us the concept of self-replicating programs, then how programs can “learn” something from previous versions of itself, and finally combines these two pieces to show how one may teach a compiler how to self-replicate a bug that isn’t visible from future source codes of the compiler. And thus, he concludes, we can’t really trust programs, we can only really trust people.
Stage I
In stage one, Ken Thompson recounts how, in the good old days, people amused themselves by writing self-reproducing programs. He also challenges the listener to try writing one for themselves, stating that “The discovery of how to do it is a revelation that far surpasses any benefit obtained by being told how to do it.”
Never one for having a puzzle being ruined, I decided to give it a go. As usual, thanks to my upbringing, I reached for C++ to do the job3. Here’s my attempt:
#include <iostream>
#include <fstream>
#include <string>
#include <map>
using namespace std;
map <int, string> m;
string escape (string input) {
string ret = "";
for (int i = 0; i < (int) input.length(); i++) {
if (m.find(input[i]) != m.end()) {
ret += m[input[i]];
} else {
ret += input[i];
}
}
return ret;
}
int main () {
m[92] = "\\\\";
m[10] = "\\n";
m[34] = "\\\"";
string rest_string = "\";\n ofstream file;\n file.open(\"trust.cpp\");\n file << init_string + escape(init_string) + rest_string;\n file.close();\n return 0;\n}\n";
string init_string = "#include <iostream>\n#include <fstream>\n#include <string>\n#include <map>\nusing namespace std;\n\nmap <int, string> m;\n\nstring escape (string input) {\n string ret = \"\";\n for (int i = 0; i < (int) input.length(); i++) {\n if (m.find(input[i]) != m.end()) {\n ret += m[input[i]];\n } else {\n ret += input[i];\n }\n }\n return ret;\n}\n\nint main () {\n m[92] = \"\\\\\\\\\";\n m[10] = \"\\\\n\";\n m[34] = \"\\\\\\\"\";\n string rest_string = \"\\\";\\n ofstream file;\\n file.open(\\\"trust.cpp\\\");\\n file << init_string + escape(init_string) + rest_string;\\n file.close();\\n return 0;\\n}\\n\";\n string init_string = \"";
ofstream file;
file.open("trust.cpp");
file << init_string + escape(init_string) + rest_string;
file.close();
return 0;
}
I believe the crux of the problem is that to write a program that reproduces itself, you need to embed the program within itself. If one isn’t careful, you can repeat this step indefinitely, like when you point two mirrors at each other. The trick is to short-circuit the process by allowing the program to produce two copies of itself. One copy to recreate the re-creation mechanism, and one copy to recreate the program that uses the recreation mechanism. Of course, the two outputs can’t be identical, so you also need a way to transform a copy into either the program itself, or the re-creation mechanism.
In my example above, init_string
is the re-creation mechanism. Outputting init_string
directly recreates the program, and outputting escape(init_string)
recreates the re-creation mechanism (i.e. the string itself). In the paper, Ken Thompson has a much simpler proposal that essentially does the same thing, but gets rid of some of the clunkiness in my program with the rest_string
business.
But this isn’t the main point of this stage. The point of stage one is this: it is possible to for code to replicate itself.
Stage II
In stage two, Ken Thompson talks about how a program comes to “learn” something4.
To demonstrate “learning”, he shows us that the C compiler “knows” that \n
represents a newline character5. Somewhere in the C compiler lives a piece of code that looks something like this:
...
c = next();
if (c != '\\')
return c;
c = next();
if (c == '\\')
return '\\';
if (c == 'n')
return '\n';
...
But notice that this could not possibly have been the source code of the compiler first used to introduce the knowledge that “\n
represents a newline character”. Before this knowledge was introduced, \n
is not a meaningful character, so the last line of code would not have made sense to the compiler. How did it come to be?
The answer is this: there must have been some older version of the compiler that was written like this:
...
c = next();
if (c != '\\')
return c;
c = next();
if (c == '\\')
return '\\';
if (c == 'n')
return 10;
...
Here, instead of returning the unknown character \n
, we return 10, which is the ASCII value for \n
. After compiling this source code with the C compiler however, the resulting compiler is now able to correctly interpret the sequence of characters “\” and “n” correctly as \n
, and thus we can abandon this nasty version of the source code with a mystical 10 and replace it with the first version with the more-readable \n
.
Thus, we see that the compiler has now learnt that “\n
represents a newline character”. But while this knowledge is able to perpetuate itself, it’s important to remember that the knowledge doesn’t come from this new source code (how could it learn from the future?), rather the knowledge was baked into its program binary.
Stage III
Ken Thompson now combines the first two stages to produce a devastating result. First he shows that one can write a C compiler that contains the following piece of code:
compile(string s):
if (s is code for the Linux kernel):
compile(bug that gives me admin privileges)
...
Using this compiler to compile the Linux kernel, we get a secret backdoor access built into the kernel.
Of course, this is a little too obvious, and someone could easily spot this in the source code and raise some alarms. So instead consider v2:
compile(string s):
if (s is code for the Linux kernel):
compile(bug that gives me admin privileges)
if (s is code for the C compiler):
compile(bug that inserts these two bugs)
...
We then compile this version of the C compiler, then remove these lines from the source code. Now nobody inspecting the source code will be able to spot the attack, and all C compilers in the future will replicate this bug. We have thus taught the C compiler knowledge on how to replicate a bug that lives on without its original source code.
Ken Thompson has a cheeky moment where he describes his invisible, self-perpetuating bug as follows: “If this were not deliberate, it would be called a compiler ‘bug’. Since it is deliberate, it should be called a ‘Trojan horse'”.
Reflections
Ken Thompson reflects that “No amount of source-level verification or scrutiny will protect you from using untrusted code.” Furthermore, these problems aren’t restricted to the C compiler. He continues by saying that “I could have picked on any program-handling program such as an assembler, a loader, or even hardware microcode. As the level of program gets lower, these bugs will be harder and harder to detect.”
And of course, the stack of program-handling programs that we use on a daily basis only grows. Nowadays, we have browsers running programs, and we rely on RPC frameworks to execute code over the network all the time.
Admittedly, the attack that Ken Thompson describes is only “invisible” in the sense that the behaviour isn’t visible from the source code. Given the means, it is possible to inspect the machine code of any program and spot the attack.
Possible, but not feasible.
At some point we have to give up — even if you verify your compiler, you must verify the operating system that it ran on; then you must verify the hardware that your operating system runs on. And you must do this for every version of, well, everything. Each piece is hard to verify on one’s own. Distributing the work among multiple people makes the task more plausible, but then we run into the same problem of having to trust other people. Can you even trust yourself to not let a bug slip by?
Ken Thompson proposed a daunting problem. And this isn’t just about Trojan horses, it’s about the everyday bugs that get embedded in programs that either become invisible when the source code changes (as in the example in this paper), or are effectively invisible by virtue of being drowned in the overwhelming amount of source code surrounding it.
But Ken Thompson also mentions the “solution” that we perhaps knew all along:
To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.
Ken Thompson
Reminding us that, at its heart, software is not a technical problem, it is a social one.
Notes
- A language which went on to influence the design of C.
- He also brings up a quote that I think is both endearing and full of wisdom: “Dance with the one that brought you”. In this case he was referring to his work on UNIX, but excuses himself for not talking about it since he hasn’t worked on it in many years.
- I tried, I really did, to do this in Python, like people have been encouraging me to do for small one-off programs. But the difference between a string’s internal representation and its displayed output kept throwing me off.
Too many things shift under your feet when using Python. The C++ version was much more pleasant, and quicker, to write.
- In some sense this is both different from and oddly reminiscent of machine learning. Ken Thompson described this as something “as close to a ‘learning’ program as I have seen”, but he also gave this talk in 1984. In both cases, the program “learns” things that exist outside of their source code. But in the case of this compiler trick, the “learning” is encoded in the program binary, and for some kinds of machine learning, the “learning” is encoded within the model’s parameters. One is embedded into the program, the other is separable from it.
- This is a quirk of compilers written in their own language. The knowledge has to exist within the compiler itself because there is no one else, other than itself, to interpret the compiler’s source code. How lonely.