# Internalize task distinction into TaskPool #4740

## Conversation
I had a slightly different idea on how to combine the task pools. I wanted to give tasks spawned on a scope the highest priority; give tasks spawned onto the task pool a lower priority; and then add a new spawn_blocking function that works like it does in tokio, where these tasks are spawned onto a different set of threads. It reuses an idle thread if there is one, or spawns a new thread when there isn't.
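For reference, a minimal sketch of that "reuse an idle thread, otherwise spawn a new one" strategy. The `BlockingPool` type, `Job` alias, and `spawn_blocking` signature here are made up for illustration; this is neither tokio's nor Bevy's implementation, and it omits shutdown handling and thread caps.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

#[derive(Clone, Default)]
struct BlockingPool {
    // Senders for worker threads that are currently idle and waiting for work.
    idle: Arc<Mutex<Vec<Sender<Job>>>>,
}

impl BlockingPool {
    fn spawn_blocking(&self, job: Job) {
        // Reuse an idle worker if one exists.
        let job = match self.idle.lock().unwrap().pop() {
            Some(tx) => match tx.send(job) {
                Ok(()) => return,  // an idle worker took the job
                Err(err) => err.0, // that worker exited; take the job back
            },
            None => job,
        };

        // No idle worker: spawn a fresh thread. After each job it parks on a
        // channel and registers itself as idle so later calls can reuse it.
        let idle = Arc::clone(&self.idle);
        thread::spawn(move || {
            let mut job = job;
            loop {
                job();
                let (tx, rx) = channel::<Job>();
                idle.lock().unwrap().push(tx);
                match rx.recv() {
                    Ok(next) => job = next,
                    Err(_) => break,
                }
            }
        });
    }
}

fn main() {
    let pool = BlockingPool::default();
    pool.spawn_blocking(Box::new(|| println!("blocking work")));
    // Give the detached worker a moment to run before the process exits.
    thread::sleep(std::time::Duration::from_millis(50));
}
```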
I think the first two parts are still reconcilable with the current proposed approach; however, the problem with
Isn't that a problem with this approach too? If a user starts too many blocking tasks, the compute task pool won't have any threads to work with either. Edit: thinking on this more, I guess the approach here prevents the user from running more than
If and only if they start it as a normal compute task, something that would require
bors try
try

Build failed:
# Objective

Right now, the `TaskPool` implementation allows panics to permanently kill worker threads upon panicking. This is currently non-recoverable without using a `std::panic::catch_unwind` in every scheduled task. This is poor ergonomics and even poorer developer experience. This is exacerbated by #2250 as these threads are global and cannot be replaced after initialization.

Removes the need for temporary fixes like #4998. Fixes #4996. Fixes #6081. Fixes #5285. Fixes #5054. Supersedes #2307.

## Solution

The current solution is to wrap `Executor::run` in `TaskPool` with a `catch_unwind`, discarding the potential panic. This was taken straight from [smol](https://github.com/smol-rs/smol/blob/404c7bcc0aea59b82d7347058043b8de7133241c/src/spawn.rs#L44)'s current implementation.

~~However, this is not entirely ideal as:~~

- ~~the panic is not signaled to the awaiting task. We would need to change `Task<T>` to use `async_task::FallibleTask` internally, and even then it doesn't signal *why* it panicked, just that it did.~~ (See below.)
- ~~no error is logged of any kind~~ (See below.)
- ~~it's unclear if it drops other tasks in the executor~~ (it does not)
- ~~This allows the ECS parallel executor to keep chugging even though a system's task has been dropped. This inevitably leads to deadlock in the executor.~~ Assuming we don't catch the unwind in `ParallelExecutor`, this will naturally kill the main thread.

### Alternatives

A final solution will likely incorporate elements of any or all of the following.

#### ~~Log and Ignore~~

~~Log the panic, drop the task, keep chugging. This only addresses the discoverability of the panic. The process will continue to run, probably deadlocking the executor. tokio's detached tasks operate in this fashion.~~ Panics already do this by default, even when caught by `catch_unwind`.

#### ~~`catch_unwind` in `ParallelExecutor`~~

~~Add another layer catching system-level panics into the `ParallelExecutor`. How the executor continues when a core dependency of many systems fails to run is up for debate.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### ~~Emulate/Copy `tokio::JoinHandle` with `Task<T>`~~

~~`tokio::JoinHandle<T>` bubbles up the panic from the underlying task when awaited. This can be transitively applied across other APIs that also use `Task<T>`, like `Query::par_for_each` and `TaskPool::scope`, bubbling up the panic until it's either caught or it reaches the main thread.~~ `async_task::Task` bubbles up panics already; this will transitively push panics all the way to the main thread.

#### Abort on Panic

The nuclear option. Log the error and abort the entire process when any thread in the task pool panics. This definitely avoids any additional infrastructure for passing the panic around, and might actually lead to more efficient code as any unwinding is optimized out. However, it gives the developer zero options for dealing with the issue, is a seemingly poor choice for debuggability, and prevents graceful shutdown of the process. Potentially an option for handling very low-level task management (a la #4740). Roughly takes the shape of:

```rust
struct AbortOnPanic;

impl Drop for AbortOnPanic {
    fn drop(&mut self) {
        // If this guard is dropped during an unwind, take down the process.
        std::process::abort();
    }
}

let guard = AbortOnPanic;
// Run task
std::mem::forget(guard);
```

---

## Changelog

Changed: `bevy_tasks::TaskPool`'s threads will no longer terminate permanently when a task scheduled onto them panics.

Changed: `bevy_tasks::Task` and `bevy_tasks::Scope` will propagate panics in the spawned tasks/scopes to the parent thread.
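For context, a minimal sketch of the kind of worker loop the Solution above describes, assuming the `async_executor`, `futures_lite`, and `async_channel` crates that `bevy_tasks` builds on; the function and channel names are illustrative, not the actual Bevy code.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

use async_executor::Executor;
use futures_lite::future;

// Each worker thread keeps ticking the executor even if a task panics: the
// unwind is caught and discarded here, while the panic itself still reaches
// whoever awaits the corresponding `Task`.
fn worker_loop(executor: &Executor<'static>, shutdown: async_channel::Receiver<()>) {
    loop {
        let tick = catch_unwind(AssertUnwindSafe(|| {
            future::block_on(executor.run(shutdown.recv()))
        }));
        match tick {
            // The shutdown channel resolved (or closed): exit the thread.
            Ok(_) => break,
            // A task panicked: loop around and keep running the other tasks.
            Err(_) => continue,
        }
    }
}
```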
@cart @hymm I've mostly updated the PR and this is once again up for review. #4466 added an unusual interaction since we need a temporary executor. Should we:
Option 2 would move us closer to a
We don't necessarily need a temporary executor. We just need a way of running tasks that only run on the thread the scope is running on. This could potentially be a thread-local storage queue that can't be stolen from. A consideration here is that in #6503, I removed the temporary executor in favor of a global main-thread executor that has runtime checks to only allow it to be ticked on the main thread. I'm not sure if that is going to get merged or not, though.
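As a rough illustration of that "thread-local queue that can't be stolen from" idea (this is not the code from #6503; the names and the plain-closure task type are simplifications for the sketch):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Each thread owns its own queue; no other thread can even name it, so no
// work stealing is possible.
thread_local! {
    static LOCAL_QUEUE: RefCell<VecDeque<Box<dyn FnOnce()>>> =
        RefCell::new(VecDeque::new());
}

/// Queue a task that must run on the current thread.
fn spawn_local(task: impl FnOnce() + 'static) {
    LOCAL_QUEUE.with(|q| q.borrow_mut().push_back(Box::new(task)));
}

/// Drain the current thread's queue; the scope would call this on its own
/// thread while waiting on the rest of its tasks.
fn run_local_tasks() {
    while let Some(task) = LOCAL_QUEUE.with(|q| q.borrow_mut().pop_front()) {
        task();
    }
}
```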
I'm leaning towards option 2. It seems like there isn't much need for users to construct their own task pools now that we're using global task pools. The one issue might be users wanting to reconfigure the number of threads or the thread distribution during execution. On another note, it might be a good idea to port the micro benchmarks over from async_executor to get a better idea of how the current executor compares to the original.
Will this completely block us off from using the tokio ecosystem on our task pools forever? Or would we still be able to swap the tokio executor into our task pools somehow, even if it doesn't have all the niceties (priorities)?
My biggest problem with reviewing this PR is that I have no idea how to understand the changes that were made to the async executor. I can look at the diff (https://github.com/james7132/bevy/compare/b5fc0d2..task-pool-internalization) and figure out what it's doing, but I have no context for why some things were changed. If you could port over the tests and benches from async_executor, I would have a lot more confidence that things are working the way they're supposed to. Bonus points if you can get loom working somehow.
Just something to note: this sort of PR, where we have very tricky performance implications due to multithreading, different CPU cores being involved, etc., can be very sensitive to CPU hardware boost behavior. I am worried that there is a high risk of benchmarks being wildly inaccurate (when it comes to judging how much more or less efficient Bevy's scheduling has become), because different thread utilization causes different boost clocks on different CPUs. The benchmark results may therefore be skewed by the particular CPU model of the computer they are run on.

I recommend, if possible, benchmarking this PR (and other multithreading/scheduling work) on a machine where all CPU cores can be forced to run at the same clock speed (via BIOS settings, etc., to disable boost and frequency scaling). Otherwise we can't really know how much we actually win or lose from better multithreading and thread utilization, and how much of it was due to the particular CPU hardware boosting differently because its cores were loaded differently.

This kind of thing is important in general, but for these sorts of PRs especially so. I would not trust benchmarks that were not done that way.
I have a machine with 272 threads where the cores all run at the same speed; as long as at least 3 cores are active, there is only a 100 MHz difference between base and boost clocks (1400 MHz vs 1500 MHz; 1600 MHz is possible with 2 or fewer cores active), and I may be able to disable boost entirely. I also have a simulation workload for that machine which is affected by #1907, with rather severe underutilization of threads compared to what I would hope for. I will test this PR with that machine/workload and share the results.
Going to return to the drawing board on this one. Taking on an entire fork of async_executor might not be in our interests. In the meantime, I'm potentially looking to experiment with
# Objective

Fixes #1907. IO and Async Compute threads are sitting idle even when the compute threads are at capacity. This results in poor thread utilization.

## Solution
Move the distinction between compute, async compute, and IO threads out of userspace, and internalize the distinction into `TaskPool`. This creates three tiers of threads and tasks, one for each group. Higher priority tasks can run on lower priority threads, if and only if those threads are currently idle, but the lower priority threads will prioritize their own workload over higher priority ones. Priority goes `compute > io > async compute`.

For example, with a heavy per-frame compute workload while async compute and IO are otherwise sitting idle, compute tasks will be scheduled onto async compute and IO threads whenever those threads are idle. Any IO task that is awoken will take precedence over the compute tasks.

In another case, a heavy IO workload will schedule onto idle async compute threads, but will never use a compute thread. This prevents lower priority tasks from starving out higher priority tasks.
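A tiny sketch of that priority rule from the point of view of an IO thread (illustrative only, not this PR's actual executor; the `Job` alias and plain queues are stand-ins for the real task and queue types):

```rust
use std::collections::VecDeque;

type Job = Box<dyn FnOnce() + Send>;

// An IO worker drains its own queue first, and only helps with compute work
// (higher priority than IO) when its own queue is empty. It never picks up
// async compute work, which is lower priority than IO.
fn next_task_for_io_thread(
    io_queue: &mut VecDeque<Job>,
    compute_queue: &mut VecDeque<Job>,
) -> Option<Job> {
    io_queue
        .pop_front()
        .or_else(|| compute_queue.pop_front())
}
```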
The priority scheme chosen assumes well-behaved scheduling that adheres to the following principles:
TODO:

- `TaskPool`
## Changelog

Changed: Newtypes of `TaskPool` (`ComputeTaskPool`, `AsyncComputeTaskPool`, and `IoTaskPool`) have been replaced with `TaskPool::spawn_as` calls that take the appropriate `TaskGroup`. Thread utilization across these segments has generally been increased via internal task prioritization. For convenience, `TaskPool::spawn` will always spawn on the compute task group.

## Migration Guide
`ComputeTaskPool`, `AsyncComputeTaskPool`, and `IoTaskPool` have been removed. These individual task pools have been merged together and can be accessed via `Res<TaskPool>`. To spawn a task on a specific segment of the threads, use `TaskPool::spawn_as` with the appropriate `TaskGroup`.

Before:

After:
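The original Before/After snippets are not reproduced above; the following is a minimal sketch of what the migration could look like, assuming a `TaskGroup::AsyncCompute` variant and that `spawn_as` otherwise mirrors `spawn`'s signature (both are proposals from this PR, not a released API):

```rust
// (Imports omitted: `Res`, the task pool types, and `TaskGroup` would come
// from Bevy and this PR's proposed bevy_tasks API.)

// Before (hypothetical): spawn onto the dedicated async compute pool.
fn old_system(pool: Res<AsyncComputeTaskPool>) {
    pool.spawn(async move {
        // long-running background work
    })
    .detach();
}

// After (hypothetical): one merged pool; the group is chosen per spawn.
fn new_system(pool: Res<TaskPool>) {
    pool.spawn_as(TaskGroup::AsyncCompute, async move {
        // long-running background work
    })
    .detach();
}
```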