Parallel queries #292
Conversation
I've added a small example for parallel iteration. Running the example revealed some bugs in my implementation, which I've fixed. Now the example produces the expected output. Obviously, running correctly for one example doesn't prove the ParallelIterator implementation is correct, but I've had enough chance to think about it now that I'm pretty confident it is. Note that I've `pub use`d rayon in bevy_ecs, so that the ParallelIterator traits are available, but I'm not actually sure what the best practice is here.
Force-pushed 808a242 to 52db1bd.
I'm reworking this to emulate the `std::iter::Iterator` API using bevy_tasks.
Force-pushed 11d7d3f to 464acd9.
I think this is ready to merge.
Force-pushed da74491 to c3057c2.
looking now
@aclysma @lachlansneff no pressure (I'm pretty comfortable reviewing this), but as the authors of bevy_tasks I would welcome your input here.
This is a very interesting PR. I appreciate how much work you've put into emulating the `std::iter::Iterator` API. My one worry is whether the mapped functions actually end up running on the task pool's worker threads.
I had that worry at first too, but actually, since it sends whole iterators to threads in the task pool, the mapped function is run on worker threads. You basically build up a computation description, then send it with batches of data to worker threads. That's also why the map function has to be `Send`, `Sync`, and `Clone`.
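Roughly, the shape is like this (a simplified sketch of the trait, not the exact code in this PR; the bounds and adaptors are approximate):

```rust
use bevy_tasks::TaskPool;

// Sketch: a parallel iterator hands out whole batches, each of which is
// itself a serial Iterator.
pub trait ParallelIterator<B>: Sized + Send
where
    B: Iterator + Send,
{
    // Hand out the next batch of work, or None when exhausted.
    fn next_batch(&mut self) -> Option<B>;

    // Consuming adaptors take the pool explicitly. Each batch is spawned
    // as its own task, so `f` runs on worker threads and must be
    // Send + Sync + Clone to be shared across them.
    fn for_each<F>(mut self, pool: &TaskPool, f: F)
    where
        F: FnMut(B::Item) + Send + Sync + Clone,
    {
        pool.scope(|s| {
            while let Some(batch) = self.next_batch() {
                let newf = f.clone();
                s.spawn(async move { batch.for_each(newf) });
            }
        });
    }
}
```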
Ah, I see, you have to call one of the consuming methods with the task pool before anything actually runs. Have you done any benchmarking yet?
No benchmarking yet. I'd like to try it on the 50000 boids example @TheJasonLessard was working on though. I could run a benchmark with a more synthetic test.
Keep me and @aclysma posted.
This looks really neat. I'd love to see some practical uses of it in action. The thing is, I don't find myself using iterator-heavy functional style programming that much as a personal preference. So I'm probably not a great person to give an opinion on this! :)
## Benchmarks

I've added three very basic benchmarks to the repo. They use the Criterion crate to run the benchmarks.

### Overhead of ParallelIterator

Benchmark description: for each element of a 10000 element Vec, do nothing. ParallelIterators use a batch size of 100.
Test | Average Duration
---|---
Iterator | 135.50 ms †
ParallelIterator 1 Thread | 136.07 ms
ParallelIterator 2 Threads | 76.61 ms
ParallelIterator 4 Threads | 48.38 ms
ParallelIterator 8 Threads | 29.99 ms
ParallelIterator 16 Threads | 30.14 ms
ParallelIterator 32 Threads | 30.22 ms

† I suspect rustc is optimizing this test out
### 10 map()s of an expensive operation over a long Vec

Benchmark description: for each element of a 10000 element Vec, map a function that doubles its input 1000 times; the map is applied 10 times over the Vec. ParallelIterators use a batch size of 100.
Test | Average Duration
---|---
Iterator | 146.95 ms
ParallelIterator 1 Thread | 146.85 ms
ParallelIterator 2 Threads | 81.17 ms
ParallelIterator 4 Threads | 51.20 ms
ParallelIterator 8 Threads | 33.83 ms
ParallelIterator 16 Threads | 33.80 ms
ParallelIterator 32 Threads | 33.96 ms
I think Criterion may be useful for reproducibly benchmarking code throughout bevy, but I can revert adding the benchmarks to the repo if we don't want to use Criterion, or don't want to store benchmarks in the repo in general.
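For reference, the benchmark skeleton looks roughly like this (an illustrative sketch of the serial baseline only; the function and benchmark names here are made up, not the ones in the repo):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// The "expensive operation" from the second benchmark: double the input
// 1000 times. wrapping_mul avoids overflow panics in debug builds.
fn expensive(mut x: u64) -> u64 {
    for _ in 0..1000 {
        x = black_box(x).wrapping_mul(2);
    }
    x
}

// Serial baseline (the "Iterator" row). The parallel rows run the same
// work through ParallelIterator::map plus a consuming call, on TaskPools
// built with the listed thread counts.
fn bench_serial_map(c: &mut Criterion) {
    let v: Vec<u64> = (0..10_000).collect();
    c.bench_function("Iterator", |b| {
        b.iter(|| {
            let out: Vec<u64> = v.iter().map(|&x| expensive(x)).collect();
            black_box(out)
        })
    });
}

criterion_group!(benches, bench_serial_map);
criterion_main!(benches);
```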
Yeah, adding benchmarks is definitely something we've been wanting to do for a while, and Criterion has my preference here. Adding Criterion benchmarks to this PR is definitely welcome. I'll try to wrap up my review today.
Fantastic work! I have no comments on the various iterator implementations; they all look correct and close to ideal.

Just a couple of superficial comments on the high level APIs, then I think this is ready to go.

Eventually I'd love to find a way to make this work (in a way that doesn't tank performance): `query.par_iter(10).for_each(&pool, |a| a.do_thing())`. But that's a separate issue that non-parallel queries suffer from too.
crates/bevy_ecs/src/system/query.rs (outdated)
///
/// Useful for distributing work over a threadpool using the
/// ParallelIterator interface.
pub fn iter_batched<'q>(&'q mut self, batch_size: u32) -> BatchedIter<'q, 'w, Q> {
I think it makes sense to rename this to `par_iter()` and rename `BatchedIter` to `ParIter` (or `BatchedParIter`). The returned iterator is only usable in a parallel context, so the new name better captures the intent of callers, improves discoverability, and would help differentiate this from the internal hecs `BatchedIter` (which implements `Iterator`).
examples/ecs/parallel_query.rs (outdated)
"Processing entity {}", | ||
i.fetch_add(1, atomic::Ordering::Relaxed) | ||
); | ||
thread::sleep(Duration::from_secs(1)); |
I like that this helps illustrate "expensive" tasks. This example helped prove to me that this works as expected, but I think I would prefer it if the example was a bit more practical. I don't want newbies thinking that they need to throw thread::sleep or AtomicUsize in their parallel iterators.
Can we make this more like a "minimal example showing how to set up parallel execution"?
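Something along these lines maybe (a hypothetical sketch; the component and system names are made up, and the exact import paths may differ):

```rust
use bevy::{prelude::*, tasks::prelude::*};

// Hypothetical component; any per-entity data works here.
struct Velocity(Vec3);

// Move every entity by its velocity, splitting the query into batches
// of 32 entities across the compute task pool.
fn move_system(pool: Res<ComputeTaskPool>, mut query: Query<(&mut Translation, &Velocity)>) {
    query.par_iter(32).for_each(&pool, |(mut translation, velocity)| {
        translation.0 += velocity.0;
    });
}
```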
Force-pushed 95eb439 to e15af00.
I renamed `iter_batched()` to `par_iter()` and `BatchedIter` to `ParIter`.
@GrantMoyer: Do you think it'd be possible to automatically figure out the optimum batch size based on the number of threads in the task pool, so users could just call `par_iter()` without specifying one?
I would avoid automatic batch sizing. It would be bad to split 32 extremely quick operations across 32 cores, for example. The end-user should make an active decision based on what the workload is like.
Took a deeper look. I think the high-level questions I have are:

- Quantity:
  - Do we want something like this at all? (Probably)
  - How much of it should we have? (20% of work that covers 80% of uses, or exhaustive?) I generally favor starting simple/minimal and growing it based on need and practical use rather than trying to anticipate all possible needs. This is maybe a little more than I personally would start with, but it's a matter of taste and this certainly doesn't seem over the top. :)
- Quality:
  - Is the high-level design appropriate as a first pass? (I think so)
  - Are the implementations correct for what they set out to do? Some of them I'm not sure necessarily "work" for a parallel iterator.
  - A short one-sentence summary, a link to the in-depth stdlib documentation, and short remarks on implementation/caveats would be helpful without being onerous to write. I wouldn't object to this being done in a follow-on PR.
///
/// Useful for distributing work over a threadpool using the
/// ParallelIterator interface.
pub fn par_iter<'q>(&'q mut self, batch_size: u32) -> ParIter<'q, 'w, Q> {
`#[inline]` would make this more consistent with some of the other functions in this file.
It's worth pointing out that inline sometimes regresses performance. Iterators are especially weird in that respect. It's worth testing perf for every inline/non-inline decision made.
I haven't profiled inline vs not inline here, so I'd prefer to let the compiler make the decision for now.
crates/bevy_ecs/src/system/query.rs (outdated)

}

unsafe impl<'q, Q: HecsQuery> Send for Batch<'q, Q> {}
unsafe impl<'q, Q: HecsQuery> Sync for Batch<'q, Q> {}
Intuitively, I'm not sure why Batch needs to be Sync. When do two threads need access to the same batch?
Yeah, let's remove these if we can.
Turns out only `unsafe impl Send for Batch {}` is needed.
pub(crate) iter: Option<P>, | ||
} | ||
|
||
impl<B, P> ParallelIterator<B> for Fuse<P> |
I'm not sure about fuse. It seems like changing your batch size would change the output in surprising ways.
`Fuse` just makes sure that once `next_batch()` returns `None`, it always returns `None` (see the sketch below).

For flatten, I think it's important to the iter-like API. `Flatten` necessarily changes the batch size, but note that the batch size isn't always fixed to begin with. In particular, `query::ParIter` doesn't have a fixed batch size.
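The whole adaptor is roughly (a sketch against the trait shape sketched earlier, not the exact PR code):

```rust
// Fuse: once the inner iterator yields None, drop it so that every
// later call to next_batch() also yields None.
pub struct Fuse<P> {
    pub(crate) iter: Option<P>,
}

impl<B, P> ParallelIterator<B> for Fuse<P>
where
    B: Iterator + Send,
    P: ParallelIterator<B>,
{
    fn next_batch(&mut self) -> Option<B> {
        match self.iter.as_mut().and_then(|iter| iter.next_batch()) {
            Some(batch) => Some(batch),
            None => {
                // Drop the inner iterator to stay fused.
                self.iter = None;
                None
            }
        }
    }
}
```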
crates/bevy_tasks/src/iter/mod.rs (outdated)
// TODO: Investigate optimizations for less copying
/// Note that find consumes the whole iterator, and returns a Vec of all results
fn find<F>(mut self, pool: &TaskPool, f: F) -> Vec<Self::Item>
is this the same as filter and collect?
Oops, yeah. It should probably be removed or changed to only return the first.
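A first-match version could look something like this (a sketch against the trait shape sketched earlier; `find_first` is a hypothetical name, not code from the PR):

```rust
use bevy_tasks::TaskPool;

// Run the predicate over each batch as its own task, then take the first
// hit in batch order. Batches are handed out in iteration order, so this
// returns the same item a serial find would.
fn find_first<P, B, F>(mut iter: P, pool: &TaskPool, f: F) -> Option<B::Item>
where
    P: ParallelIterator<B>,
    B: Iterator + Send,
    B::Item: Send,
    F: Fn(&B::Item) -> bool + Send + Sync + Clone,
{
    pool.scope(|s| {
        while let Some(mut batch) = iter.next_batch() {
            let newf = f.clone();
            // Each task returns the first match within its own batch.
            s.spawn(async move { batch.find(|item| newf(item)) });
        }
    })
    // scope() collects task results in spawn order.
    .into_iter()
    .flatten()
    .next()
}
```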
crates/bevy_tasks/src/iter/mod.rs (outdated)
let mut pos = None;
for item in batch {
    if newf(item) {
        pos = pos.or(Some(len));
should this break to short circuit?
It doesn't really make sense to short circuit, since the whole batch has already been consumed.
I misread this; I didn't catch the `.or()`. However, do we want to continue calling `newf` for every element if `pos` is not `None`?
Good point. I'll remove the unnecessary calls to the predicate.
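Something like this (a sketch of the fixed per-batch loop; the whole batch still has to be walked so `len` ends up as the full batch length, but the predicate is skipped once a match is found):

```rust
let mut len = 0;
let mut pos = None;
for item in batch {
    // Only call the predicate while no match has been found yet.
    if pos.is_none() && newf(item) {
        pos = Some(len);
    }
    len += 1;
}
```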
crates/bevy_ecs/src/system/query.rs (outdated)
}

unsafe impl<'q, 'w, Q: HecsQuery> Send for ParIter<'q, 'w, Q> {}
unsafe impl<'q, 'w, Q: HecsQuery> Sync for ParIter<'q, 'w, Q> {}
not sure about forcing sync for ParIter
ParIter is BatchedIter from hecs, but with ParallelIterator implemented instead of Iterator. Hecs implements Sync for BatchedIter, but I haven't thought much about whether that's sound.
If we can remove the impls and things still work, let's remove them 😄
Removed.
    (0, None)
}

fn count(mut self, pool: &TaskPool) -> usize {
I'm a little concerned people are going to end up putting extremely small units of work into some of these and blast it across threads and end up being a lot slower because of it. "Some people might misuse it" isn't a great reason to not do something but I feel kind of obligated to say it.
Maybe I should add advice to the docs saying not to use ParallelIterator without profiling your code, because it has significant overhead relative to some tasks.
Yeah, a quick doc calling that out is a good idea. It's also worth adding a comment to the example, as many people learn using those.
Ideally, I'd like the example to use much more than 128 sprites, since it's probably faster to just do the parallel_query example serially. However, for now, the limiting factor is how long it takes to draw a large number of sprites.
I've added documentation in a few places calling out the overhead concerns with ParallelIterator.
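The gist of the note (paraphrased here, not the exact doc text):

```rust
/// Note: ParallelIterator has significant per-batch overhead compared to a
/// plain Iterator. Prefer the serial version for cheap per-item work, and
/// profile before and after parallelizing.
```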
}

// TODO: Investigate optimizations for less copying
fn partition<C, F>(mut self, pool: &TaskPool, f: F) -> (C, C)
I definitely understand not wanting to duplicate the docs for the standard library, but a very short description for what it does and a link to the standard docs would be helpful. In fact, I think it's good to say a bit less so that the implementation-specific notes (for example all() not short-circuiting) are more likely to be seen
One potential improvement would be to partition by batch instead of item and return an iterator for each set of batches. (I feel like it's something that could be done in a follow-up change although it does change behavior in a non-trivial way.)
I've added short descriptions of each ParallelIterator method and linked each to the relevant Iterator docs.

For partition, I'd rather leave performance improvements for the future. I think small breaking changes are still acceptable.
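For context, the current item-wise strategy is roughly (a sketch against the trait shape sketched earlier, not the PR's exact code):

```rust
use bevy_tasks::TaskPool;

// Each task splits its own batch into two Vecs (this intermediate copy is
// what the TODO above wants to optimize away); the per-batch halves are
// then concatenated in batch order.
fn partition<P, B, C, F>(mut iter: P, pool: &TaskPool, f: F) -> (C, C)
where
    P: ParallelIterator<B>,
    B: Iterator + Send,
    B::Item: Send,
    C: Default + Extend<B::Item>,
    F: Fn(&B::Item) -> bool + Send + Sync + Clone,
{
    let mut left = C::default();
    let mut right = C::default();
    for (l, r) in pool.scope(|s| {
        while let Some(batch) = iter.next_batch() {
            let newf = f.clone();
            s.spawn(async move { batch.partition::<Vec<_>, _>(|item| newf(item)) });
        }
    }) {
        left.extend(l);
        right.extend(r);
    }
    (left, right)
}
```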
});
}

// Bounce sprties outside the window
nit: sprites
examples/ecs/parallel_query.rs (outdated)
for _ in 0..128 {
    commands
        .spawn(SpriteComponents {
            material: materials.add(texture_handle.into()),
I want to encourage users to reuse materials when they can (as different materials require GPU state changes during rendering). Can we hoist this above the for loop, like this?

let material = materials.add(texture_handle.into());
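...and then reuse the handle inside the loop (a sketch; the remaining fields are elided behind Default):

```rust
for _ in 0..128 {
    commands.spawn(SpriteComponents {
        // Reuse the same material handle for every sprite.
        material: material.clone(),
        ..Default::default()
    });
}
```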
Force-pushed e15af00 to 2580d3b.
I rebased on master to avoid a conflict with the recent ParallelExecutor improvement.
The CI failure was from a nightly clippy update that added new criteria. I've resolved the issues in master.
Force-pushed 2580d3b to cdfe1cd.
Ok, I think this is good to merge. I do share @aclysma's concern that we may have started with a "bigger" implementation than we should have, but I'm happy enough with each iterator impl that I'm comfortable merging this as-is. Thanks for your hard work @GrantMoyer!
Add support for Parallel Queries
~~Implements IntoParallelIterator for QueryBorrow for rayon integration. I didn't end up using hecs' built-in BatchedIter, since it doesn't really mesh well with rayon.~~

Implements a ParallelIterator trait on top of bevy_tasks, which operates on batches, then implements ParallelIterator for bevy_ecs::query's BatchedIterator.
~~Still needs to be tested.~~ Example included.
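A quick sketch of what this enables inside a system (illustrative only; the `Health` component and import paths are made up, while `par_iter()` and `count()` are from this PR):

```rust
use bevy::{prelude::*, tasks::prelude::*};

struct Health(u32);

// Count entities with a Health component, batching the query over the
// compute task pool. Consuming adaptors take the pool explicitly.
fn count_system(pool: Res<ComputeTaskPool>, mut query: Query<&Health>) {
    let n = query.par_iter(100).count(&pool);
    println!("{} entities have Health", n);
}
```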