Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weak memory emulation using store buffers #1963

Merged
merged 46 commits into from
Jun 6, 2022
Merged

Conversation

cbeuw
Copy link
Contributor

@cbeuw cbeuw commented Jan 17, 2022

This implements the second half of the Lidbury & Donaldson paper: weak memory emulation using store buffers. A store buffer is created over a memory range on atomic access. Stores will push store elements into the buffer and loads will search through the buffer in reverse modification order, determine which store elements are valid for the current load, and pick one randomly.

This implementation will never generate weak memory behaviours forbidden by the C++11 model, but it is incapable of producing all possible weak behaviours allowed by the model. There are certain weak behaviours observable on real hardware but not while using this.

Note that this implementation does not take into account of C++20's memory model revision to SC accesses and fences introduced by P0668. This implementation is not fully correct under the revised C++20 model and may generate behaviours C++20 disallows.

Rust follows the C++20 memory model (except for the Consume ordering and some operations not performable through C++'s std::atomic API). It is therefore possible for this implementation to generate behaviours never observable when the same program is compiled and run natively. Unfortunately, no literature exists at the time of writing which proposes an implementable and C++20-compatible relaxed memory model that supports all atomic operation existing in Rust. The closest one is A Promising Semantics for Relaxed-Memory Concurrency by Jeehoon Kang et al. However, this model lacks SC accesses and is therefore unusable by Miri (SC accesses are everywhere in library code).

Safe/sound Rust allows for more operations on atomic locations than the C++20 atomic API was intended to allow, such as non-atomically accessing a previously atomically accessed location, or accessing previously atomically accessed locations with a differently sized operation (such as accessing the top 16 bits of an AtomicU32). These scenarios are generally left undefined in formalisations of C++ memory model, even though they became possible in C++20 with std::atomic_ref<T>. In Rust, these operations can only be done through a &mut AtomicFoo reference or one derived from it, therefore these operations can only happen after all previous accesses on the same locations. This implementation is adapted to accommodate these.


TODOs:

  • Add tests cases that actually demonstrate weak memory behaviour (even if they are scheduler dependent)
  • Change {mutex, rwlock, cond, srwlock}_get_or_create_id functions under src/shims to use atomic RMWs instead of separate read -> check if need to create a new one -> write steps
  • Make sure Crossbeam tests still pass (Add spin loop hints in tests for Miri crossbeam-rs/crossbeam#831)
  • Move as much weak-memory related code as possible into weak_memory.rs
  • Remove "weak memory effects are not emulated" warnings
  • Accommodate certain mixed size and mixed atomicity accesses Rust allows on top of the C++ model

@RalfJung
Copy link
Member

Oh wow, this is amazing! Thanks so much for working on this. :)

Unfortunately I will have to put off a detailed review until mid-February or maybe even March, since interview season is approaching rapidly for me. So for now I can just give some high-level comments:

  • Adding store buffers means that loads have a choice wrt which value they will return. How do you resolve that choice? It might even be reasonable to have flags that control this, I guess.
  • Unfortunately I don't know which effect the SCfix has on operational models, but the Promising Semantics v2 is newer than SCfix so maybe they incorporated that? We could also email the SCfix authors -- and at least Jeehoon has a GitHub account.

In any case it's impossible to explore all weak memory behaviours allowed by the standard for various reasons, such as Miri's interpreter not being preemptive.

  • I assume that would also require some form of load buffering / speculation / "promises" for relaxed behaviors?

@cbeuw
Copy link
Contributor Author

cbeuw commented Jan 18, 2022

Of course! My term has started now as well so I will have less time to work on this than I did over Christmas

Adding store buffers means that loads have a choice wrt which value they will return. How do you resolve that choice? It might even be reasonable to have flags that control this, I guess.

Right now it's a uniform random choice using the existing rng in MemoryExtra. The race detection paper's author Christopher implemented it as an exponential distribution with the earliest one in mo having the greatest probability.
https://github.com/ChrisLidbury/tsan11/blob/ecbd6b81e9b9454e01cba78eb9d88684168132c7/lib/tsan/rtl/tsan_relaxed.cc#L290-L297

I agree we could have a flag for this with possible strategies like last mo, uniform, earliest valid etc.

but the Promising Semantics v2 is newer than SCfix so maybe they incorporated that?

I saw Promising Semantics was used to handle SC fences over in loom tokio-rs/loom#220, I didn't realise there's a v2. Will have a look! Both the author (Christopher) and the supervisor (Alastair, who actually taught my concurrency course last term) of the race detector paper are on GitHub as well. If we could incorporate something that works with scfix (plus the already-implemented change on simplified release sequences) into the model I'm sure they'll be interested.

I assume that would also require some form of load buffering / speculation / "promises" for relaxed behaviors?

Yes. Although by non-preemptive here I meant the scheduler not context switching until an explicit yield (which rules out a lot of weak behaviours that requires certain interleavings), but yes, for speculative loads it will require speculative execution in the interpreter

Copy link

@PeterWrighten PeterWrighten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry because it seems that there are some mistakes occurred, which let me approve this request without review. I consider I am not capable now to review this snippet, so maybe it is a greater idea to seek others to review it. Thanks:)

@bors
Copy link
Contributor

bors commented Apr 1, 2022

☔ The latest upstream changes (presumably #2045) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry again for the long delay! It is hard to find time to review big complicated changes like this. This is some impressive work. :-)

I went over the code and left some comments. However, I honestly don't understand at least half of it, and I won't have time any time soon to dig into the paper and convince myself that the algorithm makes sense, or to meticulously check that you implemented the algorithm faithfully. This concerns me, because it means in case of any problems I have no way of debugging this. In that sense I quite appreciate that you put most of this into a new file, because it means if it all goes horribly wrong one day I can at least rip it out fairly easily. In fact it might be a good idea to further reduce the data_race.rs changes and move basically everything that works on AllocExtra::weak_memory into the new file (including the let ... if that even test whether that data is present).

So, under these caveats, I think I am okay with landing this. Ideally, if/when a problem comes up, I'll just ping you and you can help. :)

What is the situation with that C++20 revised definition? Is it fair to say that this simply implements C++11 semantics instead? That seems fine for me, it just needs documenting (including in the README). It's probably better to stick to a published algorithm rather than trying to adjust it... I certainly couldn't review those adjustments.
However you say

(CppMem still says it's consistent, but that's because it uses C++11 model)

so does what you implemented disagree with CppMem, or does "The test case RWC+syncs given in Lahav et al.'s paper fails" actually mean "it says it is consistent"? I am confused.
Is there a testcase you can add that should pass/fail but does the opposite, due to this?

It also would be good to have some test case that shows that store buffering is actually happening. Obviously this might have to run like 10 times and assert that it happens at least once, but still, at least a smoke test should be there -- right now, the test case you added would have already passed before your PR, so that's kind of a shame.

And finally, can you make sure that the crossbeam tests still pass with your patch? They have Miri on CI, and it'd be a shame if we broke their CI by landing this.

src/data_race.rs Outdated Show resolved Hide resolved
src/data_race.rs Outdated Show resolved Hide resolved
src/data_race.rs Show resolved Hide resolved
src/data_race.rs Outdated Show resolved Hide resolved
src/machine.rs Outdated Show resolved Hide resolved
src/weak_memory.rs Outdated Show resolved Hide resolved
src/weak_memory.rs Outdated Show resolved Hide resolved
tests/run-pass/concurrency/weak_memory.rs Outdated Show resolved Hide resolved
src/weak_memory.rs Outdated Show resolved Hide resolved
tests/run-pass/concurrency/weak_memory.rs Outdated Show resolved Hide resolved
@cbeuw
Copy link
Contributor Author

cbeuw commented Apr 14, 2022

Thanks for coming back to this!

In fact it might be a good idea to further reduce the data_race.rs changes and move basically everything that works on AllocExtra::weak_memory into the new file (including the let ... if that even test whether that data is present).

I'll look into this, though I suspect lifting code that accesses InterpCx, GlobalState, AllocExtra etc. simultaneously into separate functions may enrage the borrow checker.

What is the situation with that C++20 revised definition? Is it fair to say that this simply implements C++11 semantics instead?

All possible behaviours under the current implementation are allowed under C++11 semantics, this is what I would call correct or sound (albeit incomplete).

Very briefly, C++20 weakened SC loads and stores (accesses), and strengthened SC fences. This gives rises to 3 scenarios

  1. Pure SC accesses: since it's weakened in C++20, the permitted behaviours under C++11 is a strict subset of that of C++20. So the current implementation is still correct
  2. Pure SC fences: this is what Promising Semantics addressed and implemented in Implement Release, AcqRel, and SeqCst fences tokio-rs/loom#220. It would be very easy to do here as well.
  3. Interaction between SC accesses and SC fences: this is the part no one has figured out yet.

Note that loom doesn't support SC accesses (they are weakened to Acquire/Release). I don't think it's acceptable for us to do the same here because Miri as a dynamic checker shouldn't produce false positives (behaviours not allowed under the memory model), whereas Loom is a static checker and can favour coverage at the expense of overly-weak behaviours.

As such, we have to stick to what's given in the Lidbury & Donaldson paper and therefore C++11 model. I'm glad you are on board with this.

so does what you implemented disagree with CppMem, or does "The test case RWC+syncs given in Lahav et al.'s paper fails" actually mean "it says it is consistent"?

My implementation agrees with CppMem. The trouble is Rust's documentation specifies that Rust follows C++20 model, not C++11. Ideally we want to do C++20 (i.e. RWC+syncs passes), but we can only do C++11 so this will have to be documented as a caveat of Miri.

Is there a testcase you can add that should pass/fail but does the opposite, due to this?

test_rwc_syncs from the SCFix paper should pass if compiled by rustc using C++20 model, but fails with our implementation. When I added this I was thinking that I can make the implementation consistent with C++20, but now I'll treat it as known issues.

It also would be good to have some test case that shows that store buffering is actually happening.

The issue is that, whether a buffered store gets read at all is dependent on Miri's scheduler and the RNG that picks one store from potentially many candidates. The randomness can be dealt with by running the test many times, but the scheduler is completely deterministic and multiple runs make no difference. Of course, I can find an assertion that succeeds under the current version, but it's not robust against potentially subtle perturbations to the scheduler behaviour. Nonetheless I'm happy to add such test with extensive comments on why it may fail.

can you make sure that the crossbeam tests still pass with your patch

Will do


Quite a lot of UI tests are broken after rebasing. Looks like they're due to the same issue. Will investigate later.

src/weak_memory.rs Outdated Show resolved Hide resolved
@RalfJung
Copy link
Member

I'll look into this, though I suspect lifting code that accesses InterpCx, GlobalState, AllocExtra etc. simultaneously into separate functions may enrage the borrow checker.

Thanks! I was mostly imagining changes like moving this entire block into the other file, which I don't think should cause borrow checker trouble?

All possible behaviours under the current implementation are allowed under C++11 semantics, this is what I would call correct or sound (albeit incomplete).

So in contrast, the implementation is not 'correct' under C++20, in the sense of generating behaviors that C++20 disallows?

What about 'completeness' (generating all possible behaviors, given enough different random seeds)?

My implementation agrees with CppMem. The trouble is Rust's documentation specifies that Rust follows C++20 model, not C++11. Ideally we want to do C++20 (i.e. RWC+syncs passes), but we can only do C++11 so this will have to be documented as a caveat of Miri.

Seems fair, given that that's what other tools also do and that the entire weak memory support can be disabled.

test_rwc_syncs from the SCFix paper should pass if compiled by rustc using C++20 model, but fails with our implementation. When I added this I was thinking that I can make the implementation consistent with C++20, but now I'll treat it as known issues.

Okay. Please add this as a compile-fail testcase with comments explaining the situation.

The issue is that, whether a buffered store gets read at all is dependent on Miri's scheduler and the RNG that picks one store from potentially many candidates. The randomness can be dealt with by running the test many times, but the scheduler is completely deterministic and multiple runs make no difference. Of course, I can find an assertion that succeeds under the current version, but it's not robust against potentially subtle perturbations to the scheduler behaviour. Nonetheless I'm happy to add such test with extensive comments on why it may fail.

The scheduler can be 'nudged' by calling thread::yield_now. Is that sufficient?
Anyway we already have some other scheduler-dependent tests, so with sufficient comments that is fine. Thanks!

@cbeuw
Copy link
Contributor Author

cbeuw commented Apr 17, 2022

So in contrast, the implementation is not 'correct' under C++20, in the sense of generating behaviors that C++20 disallows?

Indeed it will generate behaviors that C++20 disallows.

What about 'completeness' (generating all possible behaviors, given enough different random seeds)?

It's not complete. There are categories of allowed weak behaviour this implementation will never produce, such as load buffering.

@RalfJung
Copy link
Member

Okay. That sounds like a good thing to include into the summary at the top of weak_memory.rs. :)

@cbeuw cbeuw force-pushed the weak-memory branch 2 times, most recently from c0be33b to 20db949 Compare May 7, 2022 00:46
@bors
Copy link
Contributor

bors commented May 10, 2022

☔ The latest upstream changes (presumably #2084) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comments; my understanding is a bunch of my previous ones are also still open.

This PR is getting huge! Do you think there is any way to stage this over multiple PRs, for ease of reviewing?

src/allocation_map.rs Outdated Show resolved Hide resolved
src/allocation_map.rs Outdated Show resolved Hide resolved
src/allocation_map.rs Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
src/machine.rs Outdated Show resolved Hide resolved
src/shims/posix/sync.rs Outdated Show resolved Hide resolved
src/sync.rs Outdated Show resolved Hide resolved
src/weak_memory.rs Outdated Show resolved Hide resolved
src/weak_memory.rs Outdated Show resolved Hide resolved
@cbeuw
Copy link
Contributor Author

cbeuw commented May 11, 2022

my understanding is a bunch of my previous ones are also still open.

Sorry I was thinking of adding my would-be responses as comments, though I tend to procrastinate them until after code changes...

Do you think there is any way to stage this over multiple PRs

Everything in 0b9c8b0 is standalone. I'll cherry-pick it into another PR.

All the changes to scheduler-dependent race detector tests are simply adding -Zmiri-disable-weak-memory-emulation. I could leave them (along with amending "thread support is experimental and incomplete: weak memory effects are not emulated." everywhere) until all other reviews are done if you find them noisy

@RalfJung
Copy link
Member

All the changes to scheduler-dependent race detector tests are simply adding -Zmiri-disable-weak-memory-emulation. I could leave them (along with amending "thread support is experimental and incomplete: weak memory effects are not emulated." everywhere) until all other reviews are done if you find them noisy

I don't mind them. The hard to review part is in src, not tests. :) But I did wonder why that flag is required there?

README.md Outdated Show resolved Hide resolved
@cbeuw
Copy link
Contributor Author

cbeuw commented May 11, 2022

But I did wonder why that flag is required there?

Take one for example

pub fn test_simple_release() {
    let mut a = 0u32;
    let b = &mut a as *mut u32;
    let c = EvilSend(b);

    unsafe {
        let j1 = spawn(move || {
            *c.0 = 1;
            SYNC.store(1, Ordering::Release);
        });

        let j2 = spawn(move || {
            if SYNC.load(Ordering::Acquire) == 1 {
                *c.0
            } else {
                0
            }
        });

        j1.join().unwrap();
        assert_eq!(j2.join().unwrap(),1);
    }
}

This test won't work on real hardware: nothing is guaranteeing SYNC.load(Ordering::Acquire) reads 1 on first go in the second thread. It works in Miri because the scheduler makes sure SYNC.store(1, Ordering::Release) is ran before thread 2 even started, PLUS the fact that everything reads the latest value without weak memory.

Although my PR doesn't change the scheduler, meaning SYNC.store(1, Ordering::Release) is still executed before SYNC.load(Ordering::Acquire), due to weak memory effects SYNC.load(Ordering::Acquire) can read many 0s before it observes the 1. This is why in my tests a synchronising load(Ordering::Acquire) is always in a loop until it reads the desired value.

@RalfJung
Copy link
Member

I see, makes sense.

That makes me wonder if we should let users control the probability of reading an old value, similar to the knob we have to control the probability of cmpxchg_weak spuriously failing.

src/data_race.rs Outdated Show resolved Hide resolved
bors added a commit that referenced this pull request May 13, 2022
Use atomic RMW for `{mutex, rwlock, cond, srwlock}_get_or_create_id` functions

This is required for #1963

`{mutex, rwlock, cond, srwlock}_get_or_create_id()` currently checks whether an ID field is 0 using an atomic read, allocate one and get a new ID if it is, then write it in a separate atomic write. This is fine without weak memory. For instance, in `pthread_mutex_lock` which may be called by two threads concurrently, only one thread can read 0, create and then write a new ID, the later-run thread will always see the newly created ID and never 0.
```rust
    fn pthread_mutex_lock(&mut self, mutex_op: &OpTy<'tcx, Tag>) -> InterpResult<'tcx, i32> {
        let this = self.eval_context_mut();

        let kind = mutex_get_kind(this, mutex_op)?.check_init()?;
        let id = mutex_get_or_create_id(this, mutex_op)?;
        let active_thread = this.get_active_thread();
```

However, with weak memory behaviour, both threads may read 0: the first thread has to see 0 because nothing else was written to it, and the second thread is not guaranteed to observe the latest value, causing a duplicate mutex to be created and both threads "successfully" acquiring the lock at the same time.

This is a pretty typical pattern requiring the use of atomic RMWs. RMW *always* reads the latest value in a location, so only one thread can create the new mutex and ID, all others scheduled later will see the new ID.
@bors
Copy link
Contributor

bors commented May 13, 2022

☔ The latest upstream changes (presumably #2114) made this pull request unmergeable. Please resolve the merge conflicts.

@cbeuw cbeuw force-pushed the weak-memory branch 2 times, most recently from 997e5fb to 1b32d14 Compare June 6, 2022 18:27
@cbeuw
Copy link
Contributor Author

cbeuw commented Jun 6, 2022

Squashed a bit and rebased

@RalfJung
Copy link
Member

RalfJung commented Jun 6, 2022

Awesome! Thank you so much for working on this and sticking with me through the many rounds of review. I feel like Miri will be an excellent tool for finding bugs in concurrent software. :)

@bors r+

@bors
Copy link
Contributor

bors commented Jun 6, 2022

📌 Commit 1b32d14 has been approved by RalfJung

@bors
Copy link
Contributor

bors commented Jun 6, 2022

⌛ Testing commit 1b32d14 with merge e6d3d98...

@cbeuw
Copy link
Contributor Author

cbeuw commented Jun 6, 2022

Thank you for reviewing this massive PR! I learnt quite a lot implementing this, and I'm glad we got to scrutinise Rust's atomics specification as a bonus

@bors
Copy link
Contributor

bors commented Jun 6, 2022

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing e6d3d98 to master...

@bors bors merged commit e6d3d98 into rust-lang:master Jun 6, 2022
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Jun 22, 2022
clarify how Rust atomics correspond to C++ atomics

`@cbeuw` noted in rust-lang/miri#1963 that the correspondence between C++ atomics and Rust atomics is not quite as obvious as one might think, since in Rust I can use `get_mut` to treat previously non-atomic data as atomic. However, I think using C++20 `atomic_ref`, we can establish a suitable relation between the two -- or do you see problems with that `@cbeuw?` (I recall you said there was some issue, but it was deep inside that PR and Github makes it impossible to find...)

Cc `@thomcc;` not sure whom else to ping for atomic memory model things.
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Jun 22, 2022
clarify how Rust atomics correspond to C++ atomics

``@cbeuw`` noted in rust-lang/miri#1963 that the correspondence between C++ atomics and Rust atomics is not quite as obvious as one might think, since in Rust I can use `get_mut` to treat previously non-atomic data as atomic. However, I think using C++20 `atomic_ref`, we can establish a suitable relation between the two -- or do you see problems with that ``@cbeuw?`` (I recall you said there was some issue, but it was deep inside that PR and Github makes it impossible to find...)

Cc ``@thomcc;`` not sure whom else to ping for atomic memory model things.
workingjubilee pushed a commit to tcdi/postgrestd that referenced this pull request Sep 15, 2022
clarify how Rust atomics correspond to C++ atomics

``@cbeuw`` noted in rust-lang/miri#1963 that the correspondence between C++ atomics and Rust atomics is not quite as obvious as one might think, since in Rust I can use `get_mut` to treat previously non-atomic data as atomic. However, I think using C++20 `atomic_ref`, we can establish a suitable relation between the two -- or do you see problems with that ``@cbeuw?`` (I recall you said there was some issue, but it was deep inside that PR and Github makes it impossible to find...)

Cc ``@thomcc;`` not sure whom else to ping for atomic memory model things.
@cbeuw cbeuw deleted the weak-memory branch June 27, 2023 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants