Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: make AtomicWaker panic safe #3689

Merged
merged 15 commits into from
Nov 2, 2021

Conversation

udoprog
Copy link
Contributor

@udoprog udoprog commented Apr 10, 2021

Make AtomicWaker panic safe by making sure that unwinding panics caused in user code restores the state of the AtomicWaker.

Motivation

Sometimes people (like me) want to make use of type which under the hood uses AtomicWaker in combination with std::panic::catch_unwind. Since the type itself does not implement the UnwindSafe family of traits, someone (like me) might be tempted to simply hope for the best and wrap the type with the AssertUnwindSafe escape hatch.

This might work well for a while, until a user comes around using a runtime (or a sub-scheduler) which happens to provide a Waker that can panic. And when it inevitably does, the system ends up in an interesting deadlock where new wakers won't be registered. This is obviously the users (my) fault for incorrect use of AssertUnwindSafe. But we can make the situation marginally better by avoiding the more egregious instances of where panic safety might trip you up.

Solution

Trace all code paths which might execute out-of-tokio provided code that might panic (e.g. into_waker or wake) and make sure that if a panic occurs the AtomicWaker it's restored to a well-defined, functional state. This also means it can implement RefUnwindSafe and UnwindSafe which is nice.

A viable alternative would be to poison the AtomicWaker if any panics are raised which might actually be "more correct", This would cause any future calls by any thread using the AtomicWaker to error or panic.

@udoprog
Copy link
Contributor Author

udoprog commented Apr 10, 2021

Ugh. The build fails because Wake trait is not available in testing. I'll fix the new test if we do actually end up wanting this. Or maybe look into if we can't run test code on a newer Rust as discussed elsewhere.

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-sync Module: tokio/sync labels Apr 11, 2021
@Darksonn
Copy link
Contributor

You would need to upgrade clippy, miri and san. Feel free to upgrade them if you want, although do it in a separate PR.

@carllerche
Copy link
Member

This is correct as far as I can tell. I would like to see a loom test ideally to cover the misc concurrency cases.

@udoprog
Copy link
Contributor Author

udoprog commented Apr 23, 2021

So there's two new things to note that has change in the process of writing the loom test.

PanicGuard now simply unsets the REGISTERING bit, so that it can preserve any prior wake's. This means that any wakes called prior to it panicking will be propagated in future calls when registering new wakers. Instead of being clobbered by forcing the state back to WAITING. I realized this was possible when considering what combination of effects you could observe out of an AtomicWaker. We could revert it to simply force it back to WAITING. But this change feels more correct to me.

If the loom test ever fails in the future it will spam a million lines caused by the intentional panic - one for each iteration. And I do not know how to avoid this since the only way to prevent it is to install a panic hook. See https://github.com/tokio-rs/tokio/pull/3689/files#diff-7a5023d8b8cdb6f299f962201022b3934e6445806adb30b3f12c810cae4297a6R63

Comment on lines 261 to 268
impl Drop for PanicGuard<'_> {
fn drop(&mut self) {
// On panics, we want to unset the REGISTERING state which will
// preserve any potential WAKING state for future calls to
// register.
let _ = self.0.fetch_and(!REGISTERING, AcqRel);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. If the state becomes WAKING due to this, then who sets it back to WAITING? It seems like the wake lock would just be held indefinitely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, so there's a chance it would get stuck in a WAKING state instead due to this.

This is being discussed on Discord, but either revert to old behavior or cope with it somehow. But in the current state at least, no merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that correct behavior would be to:

  1. Replace the current waker with None. Store the old waker for now.
  2. Call waker.into_waker().
  3. Set the current waker to Some(new_waker).
  4. Do all of the other stuff in the function.
  5. Just before returning from do_register (even if by panic), drop the old waker, if any. Do nothing special if it panics.

In principle this can double-panic in step five, but I do not think that this is a problem since it is normal that things get their destructor called during panics.

Comment on lines 182 to 194
self.waker.with_mut(|t| *t = Some(waker.into_waker()));
self.waker.with_mut(|t| *t = Some(new_waker));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that this can panic too, since the old waker may have a destructor.

Copy link
Contributor Author

@udoprog udoprog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit and questions. Using catch_unwind feels better to me since there's less complexity to deal with unwind behavior. Looks good to me (note that I can't approve since I'm author of PR).

Ok(_) => {}
Ok(_) => {
// We don't want to give the caller the panic if it
// was someone else who put in that waker.
Copy link
Contributor Author

@udoprog udoprog Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this end in a situation where a panic in the Drop impl of a Waker is ignored, and if so is that OK?

My general stance towards panics is that they should be propagated somewhere - otherwise errors which cause them might end up going unnoticed and contribute to other unpredictable side effects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's the question really. There are certainly some places we don't want panics propagating, e.g. inside the runtime. The panic is printed to the console even if we drop it.

}

// We don't want to give the caller the panic if it
// was someone else who put in that waker.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Q here.

@@ -171,15 +175,35 @@ impl AtomicWaker {
where
W: WakerRef,
{
fn catch_unwind<F: FnOnce() -> R, R>(f: F) -> std::thread::Result<R> {
std::panic::catch_unwind(AssertUnwindSafe(f))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Waker's are UnwindSafe and the only things captured in closures below, so it might be possible to add that as a bound to F: UnwindSafe + FnOnce() -> R instead of AssertUnwindSafe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it might have been a previous version which was unhappy. I can try changing it to use catch_unwind directly.

@Darksonn Darksonn merged commit 09b770c into tokio-rs:master Nov 2, 2021
@udoprog udoprog deleted the panic-safe-atomic-waker branch November 2, 2021 12:44
hawkw added a commit that referenced this pull request Nov 15, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-sync Module: tokio/sync
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants