Deadlocks are a classic (aka pain-in-the-butt) problem [1] in concurrent computing. As distributed systems take over the world, developers will spend many hours debugging deadlocks, and businesses will sink millions into this kind of work whether they realise it or not.
I’ve heard people advocate for lockless data structures or some form of optimistic concurrency control. I’m personally in favour of such mechanisms too, where they are appropriate. The reasoning is that such systems are faster and allow us to do away with locks (but then you need to work with rollbacks, and that’s not fun at all). However, this reasoning doesn’t always apply, so pessimistic concurrency control (i.e. using locks) isn’t going anywhere. Optimistic concurrency control is usually beneficial only when there’s low contention for certain resources or data items, and that assumption breaks down when, for example, some transactions access disproportionately popular data items, or transactions are long-lived [2]. Many of the systems we run today are also full of locks [3], so we really can’t run from them.
Given that pessimistic concurrency control remains both relevant and prevalent, we had students in CSCI1270 Database Management Systems implement fine-grained locking for B+ Trees and Hash Tables, and strict two-phase locking for data items. However, many folks ran into trouble with their databases stalling, and didn’t realise that these stalls were caused by deadlocks under the hood. After helping several students resolve these kinds of bugs, I wrote up a short primer on circular locking to explain the concept. Here’s a shortened and adapted version.
The primer
In large systems with many mutexes and spinlocks and other forms of concurrency control, we start to see something that we refer to as a lock hierarchy. For example, we might have the following call stack:
F():
    lock A
    G():
        lock B
        H():
            lock C
(Read this as “function F locks mutex A, then calls function G, which locks mutex B, which in turn calls function H, which locks mutex C”.)
In this case we have a lock hierarchy of A -> B -> C. In general, whenever we’re holding on to some mutex i and then grab mutex j, there is a lock hierarchy of i -> j.
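To make this concrete, here’s a minimal Go sketch of the same call stack. The function and mutex names simply mirror the pseudocode above; they aren’t from any particular codebase.

```go
package main

import "sync"

// Three mutexes; the call chain below always acquires them in the
// order A -> B -> C, which is what establishes the lock hierarchy.
var a, b, c sync.Mutex

func H() {
	c.Lock()
	defer c.Unlock()
	// ... do work that needs C ...
}

func G() {
	b.Lock()
	defer b.Unlock()
	H() // grabs C while still holding B, so B -> C
}

func F() {
	a.Lock()
	defer a.Unlock()
	G() // grabs B while still holding A, so A -> B
}

func main() {
	F()
}
```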
Now imagine if we have some other function that makes the following calls:
J():
    lock C
    K():
        lock A
This call trace inverts the lock hierarchy and results in a circular lock dependency, i.e.
A -> B -> C -> A
This can easily result in a deadlock:
Thread0                Thread1
----                   ----
lock A
lock B
                       lock C
                       lock A
lock C
*** DEADLOCK ***
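If you want to watch this happen on your own machine, here’s a self-contained Go sketch that combines the two call traces above. The sleeps aren’t part of the real scenario; they’re only there to make the unlucky interleaving likely, since on a real system the timing is up to the scheduler.

```go
package main

import (
	"sync"
	"time"
)

var a, b, c sync.Mutex

// Thread0's path: F -> G -> H, respecting the hierarchy A -> B -> C.
func fgh() {
	a.Lock()
	b.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	c.Lock()                          // blocks forever if jk holds C and is waiting on A
	c.Unlock()
	b.Unlock()
	a.Unlock()
}

// Thread1's path: J -> K, which inverts the hierarchy (C, then A).
func jk() {
	c.Lock()
	time.Sleep(10 * time.Millisecond) // widen the race window
	a.Lock()                          // blocks forever if fgh holds A and is waiting on C
	a.Unlock()
	c.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); fgh() }()
	go func() { defer wg.Done(); jk() }()
	wg.Wait()
	// With the sleeps in place, both goroutines (and main) almost always end up
	// blocked, and the Go runtime aborts with "all goroutines are asleep - deadlock!".
}
```

Remove the sleeps and the program will usually finish just fine, which is exactly what makes lock inversions so nasty to reproduce and debug.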
So how was this all relevant to our database course? One place where this popped up was in the transaction manager, which had three classes of locks: the transaction manager’s own lock, transaction-specific locks, and the lock manager’s lock (the lock manager handles logical locks on data items). Depending on each student’s implementation of the transaction manager, some lock hierarchy was inescapably established. Once that happened, students had to ensure that they didn’t invert the hierarchy elsewhere in the code, because doing so would cause their program to deadlock with itself and stall.
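Here’s a sketch of what “respecting the hierarchy” might look like, with made-up names standing in for the three lock classes; the real assignment code has richer types, but the acquisition order is the whole point.

```go
package main

import "sync"

// Hypothetical stand-ins for the three lock classes described above.
var (
	tmLock sync.Mutex // transaction manager lock
	txLock sync.Mutex // a transaction-specific lock
	lmLock sync.Mutex // lock manager lock
)

// Pick one order, e.g. transaction manager -> transaction -> lock manager,
// and make every code path that needs more than one of these follow it.
func lockDataItem() {
	tmLock.Lock()
	defer tmLock.Unlock()
	txLock.Lock()
	defer txLock.Unlock()
	lmLock.Lock()
	defer lmLock.Unlock()
	// ... record the logical lock on the data item ...
}

// Any path that grabbed lmLock first and then tmLock would invert the
// hierarchy and reintroduce the circular dependency from earlier.

func main() {
	lockDataItem()
}
```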
Avoiding this kind of bug is not trivial: I’ve seen very senior software engineers commit lock inversion errors in large concurrent systems, and supporting infrastructure is sometimes needed to help catch these kinds of mistakes. But hopefully this primer helps you reason about this class of errors, or at least develop a gut feeling for when and where you might encounter a deadlock.
Why is this simply a ‘primer’?
The deadlock scenario described here is likely sufficient for anyone working above the operating system layer, but there are other ways that deadlocks can appear. With operating systems, we don’t just have to think in terms of concurrent threads of execution, we also have to think with portals: processes can go to sleep, and threads can be interrupted. Here’s one scary [4], and very common [5], example that I’ll leave you with to end off this post.
CPU0
----
lock A
<Interrupt>
    lock A
*** DEADLOCK ***
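Go, which I’ve been using for the sketches above, has no interrupts in user space, but the closest analogy I can offer is a non-reentrant lock being re-acquired by code that runs while the lock is already held. All the names here are made up.

```go
package main

import "sync"

var stats sync.Mutex

// recordEvent takes the stats lock; think of it as running in an
// "interrupt-like" context such as a callback or handler.
func recordEvent() {
	stats.Lock()
	defer stats.Unlock()
	// ... update counters ...
}

func updateStats(onUpdate func()) {
	stats.Lock()
	defer stats.Unlock()
	onUpdate() // the callback fires while we still hold the lock
}

func main() {
	// sync.Mutex is not reentrant: recordEvent blocks forever waiting for a
	// lock held by its own caller, and the runtime reports
	// "all goroutines are asleep - deadlock!".
	updateStats(recordEvent)
}
```

In the kernel version of this bug the second “lock A” comes from an interrupt handler rather than a callback, which is why the fix is to disable interrupts around the non-interrupt uses of the lock (see note 4).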
Notes
1. For the opportunistic programmer, this should start ringing bells, because it means there’s a demand for engineers who are able and willing to deal with these kinds of problems.
2. Google’s globally-distributed database Spanner implements a lock table because it has to deal with these kinds of transactions.
3. See the Linux Kernel.
4. Scary, because you need to be aware of all the contexts that a lock is used in. For example, if a lock is used in an interrupt, then all uses of the lock outside interrupt context need to happen with interrupts disabled.
5. In case you doubt that it actually happens: I’ve fixed this kind of bug in a number of places in the Linux Kernel, and it’s a well-documented class of deadlocks.