How MutexGuard was Sync When It Should Not Have Been
A couple of weeks ago, our ongoing effort to formalize Rust’s type system lead to us actually discovering a bug in the Rust standard library:
MutexGuard
implemented Sync
in cases where it should not. This could lead to data races in safe programs. Ouch.
I have to admit that I was somewhat hoping that this would be the case, after all, our paper will “sell” much better if we can point to an actual bug that was found through our effort ;) . I was, however, not giving this a high chance; after all, many eyes have looked at the standard library. Well, it turned out the trouble was not some code that someone had written and others had overlooked, the trouble was some code that had not been written… but let me start in the beginning.
The Bug
Here’s a piece of code demonstrating the problem:
Let us carefully step through this code to see what is going on.
First, we create a Mutex<Cell<i32>>
.
Then we acquire the lock on this mutex, and store the MutexGuard<Cell<i32>>
.
This is guaranteed to succeed immediately, after all, there is no other thread that could even try to acquire this lock.
Next, we run two closures in parallel (that’s what rayon::join
does), both accessing the same MutexGuard
g
.
This is somewhat unusual: Typically, each thread acquires the lock for itself, performs some operation, and then unlocks the mutex.
Here, the main thread acquires the lock, and then it hands a shared reference to the guard to two other threads.
This may sound odd, but it can in fact be a perfectly safe and pretty powerful pattern if both threads only perform read operations.
The trouble is, in the above example, both threads use Cell::set
to change the content of the cell.
Cell
uses unsychronized accesses, so this is a classical example of a data race: Two threads accessing the same location, and at least one of them writing to it.
This is exactly the kind of concurrency bug that Rust set out to prevent.
How does it come that the compiler accepted this code?
As already mentioned, the two closures both capture a shared reference to g
.
In other words, they both capture something of type &MutexGuard<Cell<i32>>
.
Following the type of rayon::join
, the compiler checks whether that type is Send
.
This is exactly the check that is supposed to prevent data races.
Because g
is a shared reference, it proceeds by checking whether MutexGuard<Cell<i32>>
is Sync
, and this is where things go wrong.
If we look at the source code of Mutex
before the fix, we can see that there is no implementation of Sync
for MutexGuard<T>
.
Now, Sync
is an “auto trait” (also often called OIBIT, but that’s really not a great name).
This means that the compiler considers a type like MutexGuard<T>
to be Sync
if all its fields are Sync
.
It turns out that if you follow this through, the compiler considers MutexGuard<T>
to be Sync
if T
is Send
.
It then checks whether Cell<i32>
is Send
– which it is – and accepts our program.
How the Bug Was Found
I found this bug in the process of proving safety of Mutex
.
In particular, this involves proving that the unsafe impl {Send, Sync}
are actually justified.
Now, MutexGuard<T>
did not have an unsafe impl Sync
– but that does not mean that it doesn’t implement Sync
!
Because of the way auto traits work, it could still implement Sync
automatically, based on the types of its fields.
The trouble is, just because the compiler implements Sync
automatically for us (so we don’t have to write an unsafe impl
) doesn’t mean that this automatic Sync
is correct!
The compiler does not understand what MutexGuad<T>
actually means, which invariants it maintains, nor does it understand the precise guarantees provided by Send
and Sync
.
It cannot possibly check how these two interact.
So what I did was write some little test programs to check when the compiler considers MutexGuard<T>
to be Sync
, and determined that it is sufficient for T
to be Send
.
(It turns out that it is much easier to figure this out by experiments than by staring at the source, and it is absolutely impossible to determine this based on the documentation. I would argue that is a problem on its own.)
Then I looked at what I had to prove, and quickly determined that this can’t possibly work – I needed T
to be Sync
, not Send
!
Given this observation, it wasn’t hard to “weaponize” this bug and write the racy program that you can see above.
The Fix
Clearly, to fix this problem, we have to make it so that MutexGuard<Cell<i32>>
is not Sync
.
The question is: When, in general, can we make MutexGuard<T>
Sync
?
To answer this question, we have to consider which operations can be performed on &MutexGuard<T>
.
We can consider the type Sync
if all these operations are thread-safe.
In this case, the only interesting operation is that we can use Deref::deref
to obtain an &T
.
In other words, for &MutexGuard<T>
to be thread-safe, we need &T
to be thread-safe.
This is exactly what it means for T
to be Sync
!
To sum this all up, we want the implementation of Sync
for MutexGuard<T>
to look something like this:
Notice the unsafe
here: The compiler cannot actually check whether we made any mistake in the analysis that we performed above, so we have to tell it to trust us in our assessment.
The actual fix is somewhat more verbose because MutexGuard
supports unsized types and because there is a lifetime involved, but that doesn’t really matter for our discussion here.
So What Now?
This fix landed in Rust master a while ago and will be part of Rust 1.19. If you read this post in the future and try the code above, you are likely using a sufficiently recent compiler for the code to be rejected – which means you are safe from this particular bug! So far, I’ve seen no reports of code that stopped compiling because of this fix, and the pattern involved in abusing the bug is sufficiently obscure that it seems unlikely for any code to actually be broken because of this.
So all’s well that ends well?
Not really.
Notice how the bug is not caused by an incorrect piece of code somewhere in the implementation of Mutex
.
One could go over that entire file, and prove every single line of it correct, and all proofs would go through.
The bug is in a line of code that was not written.
I don’t think it is very surprising that this was overlooked.
Ideally, types like MutexGuard<T>
would not get these automatic implementations of Send
and Sync
.
That would make sure that, if some type incorrectly implements these traits, at least it comes from some piece of incorrect unsafe
somewhere.
This is exactly the point of unsafe
: Marking clearly the points in the code that have to be audited extra-carefully.
The reality is somewhat more subtle than this, but nevertheless, unsafe
does its job pretty well – except for when it comes to auto traits.
To make things worse, imagine what happens when auto traits are stabilized, so that crates can introduce their own auto traits.
Some of them may want to express additional guarantees that the compiler cannot check, so some of these auto traits may be unsafe.
It is impossible to predict how the guarantees of these unsafe auto traits interact with the invariants of types like MutexGuard<T>
– but still, the compiler will consider the trait to be implemented for MutexGuard<T>
if it is implemented for all fields.
The types of these fields are an implementation detail and not even visible to other crates.
Nothing outside of the standard library should care about such details.
However, as we have seen, with user-defined auto traits, we actually care a lot!
Again, I think the solution here is that types like MutexGuard<T>
should not automatically implement unsafe auto traits.
What do I mean by “types like MutexGuard<T>
”?
The point that makes MutexGuard<T>
special here is that there is an invariant attached with this type that all instances of MutexGuard<T>
satisfy.
(I describe this concept of types with additional invariants in more detail in one of my previous blog posts).
My proposal is to tell the compiler that this is the case.
If the compiler understands that a type is “more than what it seems to be”, we could know during compilation that we have to be more cautious when handling this type – and in particular, we could have such types not automatically implement unsafe auto traits.
In fact, I proposed such an annotation previously for different reasons, and so did others.1
This of course significantly weakens the power of user-defined auto traits. However, I hope I convinced you that if we don’t act, errors like the one described here are bound to happen. That said, such decisions are of course going to go through the usual RFC process. It’s certainly possible that someone comes up with a compromise that preserves some of the usefulness of auto traits, without putting safety at risk.
Footnotes
-
I think having a notion of unsafe types can also help solve some of the unsafe code guidelines questions. But that is content for another blog post. ↩
Posted on Ralf's Ramblings on Jun 9, 2017 and licensed under CC BY-SA 4.0.
Comments? Drop me a mail or leave a note on reddit!