Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve thread scheduler memory ordering #43418

Merged
merged 4 commits into from
Jan 27, 2022
Merged

Improve thread scheduler memory ordering #43418

merged 4 commits into from
Jan 27, 2022

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Dec 13, 2021

Some of these should probably land separately, but collecting these as I go along with trying to improve thread-safety, and similar support.

@vtjnash vtjnash requested review from tkf, JeffBezanson and Keno December 13, 2021 20:31
@simeonschaub simeonschaub marked this pull request as ready for review December 13, 2021 20:39
@simeonschaub simeonschaub marked this pull request as draft December 13, 2021 20:40
src/threading.c Outdated Show resolved Hide resolved
@vchuravy
Copy link
Member

vchuravy commented Jan 6, 2022

bump! some of this looks very good to have.

@giordano
Copy link
Contributor

giordano commented Jan 20, 2022

Confirming also here that 6bba175 appears to fix #41820 for me.

vtjnash added a commit that referenced this pull request Jan 20, 2022
Pickup TSAN/ASAN and LLVM assertion build fixes from #43418
@vtjnash vtjnash marked this pull request as ready for review January 20, 2022 20:38
@vtjnash vtjnash changed the title Improve threads Improve thread scheduler memory ordering Jan 20, 2022
@vchuravy vchuravy added this to the 1.8 milestone Jan 20, 2022
@vchuravy vchuravy added io Involving the I/O subsystem: libuv, read, write, etc. multithreading Base.Threads and related functionality labels Jan 20, 2022
src/partr.c Outdated
if (!multiq_check_empty()) {
// acquire sleep-check lock
jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping);
jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

Suggested change
jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks
jl_fence(); // ensures sleep_check_state is sequentially-consistent with enqueuing the task

as you wrote in jl_wakeup_thread? Since sequenced-before is a single-thread property (and we get it "for free"), we don't need a fence and I don't think this is very informative here. Rather, we need this for the global acyclicity of SC. (I usually informally reason that this is for "avoiding store-load reordering" because it's a Dekker-like situation.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is annoying to describe properly. I probably want seq-cst happens-before here. But it is not with enqueuing of the task, since that might only be atomic release. And anyways, seq-cst stores only create a global ordering with other seq-cst operations and fences, but IIUC, does not necessarily enforce an ordering on other nearby operations.

Copy link
Member

@tkf tkf Jan 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IANAL but I think a somewhat concrete argument would be something like the following. Using the terminology by Lahav et al. (2017) (as standardese is hard...) we see that we can have two reads-before edges:

  • rb1: multiq_check_empty -> multiq_insert
  • rb2: jl_atomic_load_relaxed(&ptls->sleep_check_state) in jl_wakeup_thread -> jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)

(Since they are inter-thread edges that start from a read and end with a write, they do not contribute to the usual happens-before edges.)

(Since rb1 touches multiple memory locations, I guess it's more like a "bundle of edges".)

Let us also name the two fences sandwiching these edges:

  • F1: the first jl_fence in jl_wakeup_thread
  • F2: the jl_fence in jl_task_get_next that I quoted above

These fences do double duty of promoting the two reads-before edges to the partial SC relation edges:

  • sc1: F2 -> rb1 -> F1
  • sc2: F1 -> rb2 -> F2

Since both sc1 and sc2 are part of the partial SC relation that is acyclic under C+20 memory model, they can't form a cycle. We can use this to show that the following unwated execution is impossible:

  • Initial state:
    • queue is empty
    • sleep_check_state is not_sleeping
  • Thread 1 (dequeuer) does:
    • 1a: jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)
    • 1b: multiq_check_empty returns true
  • Thread 2 (enqueuer) does:
    • 2a: multiq_insert
    • 2b: jl_atomic_load_relaxed(&ptls->sleep_check_state) in jl_wakeup_thread returns not_sleeping

i.e., the dequeuer misses the enqueue and enqueuer misses the sleep state transition.

This is impossible because it forms a cycle

  • 2b -> 1a (due to the value read in 2b)
  • 1a -> 1b (sequenced-before)
  • 1b -> 2a (due to the value read in 1b)
  • 2a -> 2b (sequenced-before)

which is forbidden because both 1b -> 2a (rb1) and 2b -> 1a (rb2) are part of the partial SC relation, thanks to the SC fences.

But this argument is essentially a copy of the argument about the standard "store buffering litmus test" so I don't think it's helpful to expand the explanation this much.

Copy link
Member

@tkf tkf Jan 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think it's very helpful to document which fences "pair up." Maybe we can use markdown footnote-like comment like this

jl_fence(); // [^store_buffering_1]

in the first fence in jl_wakeup_thread and the fence of jl_task_get_next that I quoted above? We can then document which stores and loads are taken care of by this fence outside the function body (so that it's OK to write a somewhat lengthy comment).

...
}
// [^store_buffering_1]: These fences are used to avoid the cycle 2b -> 1a -> 1b -> 2a -> 2b where
// * Dequeuer:
//   * 1a: `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)`
//   * 1b: `multiq_check_empty` returns true
// * Enqueuer:
//   * 2a: `multiq_insert`
//   * 2b: `jl_atomic_load_relaxed(&ptls->sleep_check_state)` in `jl_wakeup_thread` returns `not_sleeping`
// i.e., the dequeuer misses the enqueue and enqueuer misses the sleep state transition.

src/partr.c Show resolved Hide resolved
src/partr.c Outdated Show resolved Hide resolved
src/partr.c Show resolved Hide resolved
uv_mutex_unlock(&sleep_locks[tid]);

if (jl_atomic_load_relaxed(&other->sleep_check_state) == sleeping) {
if (jl_atomic_cmpswap_relaxed(&other->sleep_check_state, &state, not_sleeping)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you technically need acq_rel ordering or a manual fence for the success path for a happens-before edge to jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) in jl_task_get_next. But it would be very counter-intuitive if you need one...

(I also wondered if it can disrupt the Dekker-like pattern in jl_task_get_next and jl_wakeup_thread. But IIUC this is fine since the RMW edge extends the release sequence edge.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not release this elsewhere, so I think that ordering would be invalid here. In fact, I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.

The relaxed swap is not permitted to fail spuriously, so if we get here, we know at least one thread will complete the transition from sleeping->not_sleeping, and that thread will also ensure that to wake the condition signal. (we might get more than one thread completing this transition, if we managed to be very slow here, and the thread was already awake for other reasons, finished processing the work, and then went back to sleep already. But we are not worried about spurious wake-ups.)

Copy link
Member

@tkf tkf Jan 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right..., I just realized that fence for jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) is on the opposite side so it can't start a synchronizes-with edge (unless we put a fence before jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) as well or strengthen its ordering).

I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.

I believe it's OK. We can get happens-before edges even with mixed memory orderings as long as they are at least relaxed (not non-atomic) and we put fences before the write and after the read. See fence-atomic and atomic-fence synchronizations in https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence

The relaxed swap is not permitted to fail spuriously,

I think the difference between strong and weak CASes is orthogonal to the memory ordering, at least for the memory model of the abstract machine. compare_exchange_weak and compare_exchange_strong both have memory ordering arguments. Also, in the formal declarative memory model (by Lahav et al. 2017, which is the basis of C++20 memory model), the distinction of strong and weak CAS do not exist; IIUC, failed weak CASes just appear in the execution graph as repeated read immediately followed by a write (i.e., failed writes are ignored).

But we are not worried about spurious wake-ups.

I was more worried about the case that somehow the wake-up is missed. Intuitively, I thought you need to send the signal "after" you know that jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping) has happened. But since relaxed memory read/write doesn't enforce any ordering, we technically can't assume what the other task (that is supposedly about to sleep) is actually doing. I cannot help wondering if I'm either taking C/C+ committee's FUD way too seriously and/or I'm still misunderstanding how C++20 memory model works. It's really bizarre that we can't rely on the "causality" through relaxed memory accesses alone. IIRC, Hans Boehm has a very intricate example involving GVN and speculative execution but I don't know if that applies here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That page on fences doesn't appear to mention seq_cst, and that is what we require here. Merely knowing acquire and release is insufficient to provide any interesting constraint here on the behavior of the program. What we need to prove is that the release on thread 1 (the store to sleep_check_state) has a global consistent ordering with the release on thread 2 (the store to the workqueue) relative to the subsequent checks on those threads for the contents of the workqueue or the sleep_check_state, respectively.

For the seq_cst properties, we have that the description of the fence on https://en.cppreference.com/w/cpp/atomic/memory_order:

2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)

I was more worried about the case that somehow the wake-up is missed

Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.cppreference.com/w/cpp/atomic/memory_order:

2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)

I think that's a bit stale info on cppreference side. That's actually one of the major improvements in the C++20 memory model. P0668R4: Revising the C++ memory model includes a patch to remove similar sentences from the standard. In particular, the first paragraphs in the section Strengthen SC fences provides nice background information.

[Notes for people who want to dig into the original resource; e.g., future me] The relevant definition in Lahav et al. is:

image

The [E^sc] prefix and suffix in psc_base indicate that arbitrary SC evens, including reads and writes, can start/end the partial SC relation, provided that read/write starts or ends the SC-before relation scb. On the other hand, fences ([F^sc]) can work in more general setting where scb is prefixed/suffixed by the happens-before relation hb (i.e., which is indicated by [F^sc]; hb^? and hb^?; [F^sc]). For example, this indicates that reads-before rb relation starting with a SC read and ends with a SC fence ([E^sc]; rb; [F^sc]) is a part of psc_base and hence psc.

Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.

Yes, I agree. I was just curious to know how do I reason about this if I want to do it rigorously.

src/jl_uv.c Show resolved Hide resolved
The svec set method needs to do a little bit more work (for asserts and
write barriers), which is not necessary if we know it was just allocated.
These were previously implied (on TSO platforms, such as x86) by the
atomic_store to sleeping and the sleep_locks acquire before the
wake_check loop, but this makes it more explicit. We might want to
consider in the future if it would be better (faster) to acquire each
possible lock on the sleeping path instead, so that we do each operation
with seq_cst, instead of using a fence to only order the operations we
care about directly.
Ensures better that we support nesting of threaded regions and access to
this variable from any thread. Previously, it appears that we might
toggle this variable on another thread (via thread migration), but would
then not successfully wake up the thread 0 and cause it to resume
sleeping on the event loop.
@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Jan 26, 2022
@tkf tkf merged commit 05e0cb9 into master Jan 27, 2022
@tkf tkf deleted the improve-threads branch January 27, 2022 01:42
@tkf tkf removed the merge me PR is reviewed. Merge when all tests are passing label Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants