Parallel queries #292

GrantMoyer · 2020-08-22T21:49:36Z

~~Implements IntoParallelIterator for QueryBorrow for rayon integration. I didn't end up using hecs built in BatchedIter, since it doesn't really mesh well with rayon.~~

Implements a ParallelIterator trait on top of bevy_task, which operates on batches. Then implements ParallelIterator for bevy_ecs::query's BatchedIterator.

~~Still needs to be tested.~~ Example included

GrantMoyer · 2020-08-22T23:12:12Z

I've added a small example for parallel iteration. Running the example revealed some bugs in my implementation, which I've fixed. Now the example produces the expected output.

Obviously, running correctly for one example doesn't prove the ParallelIterator implementation is correct, but I've had enough chance to think about it now that I'm pretty confident it is.

Note that I've pub used rayon in bevy_ecs, so that the ParallelIterator traits are available, but I'm not actually sure what the best practice is here.

GrantMoyer · 2020-08-31T02:53:29Z

I'm reworking this to emulate the std::iter::Iterator API using bevy_tasks.

GrantMoyer · 2020-09-02T00:18:31Z

I think this is ready to merge.

cart · 2020-09-02T00:20:50Z

looking now

cart · 2020-09-02T00:30:24Z

@aclysma @lachlansneff no pressure (i'm pretty comfortable reviewing this), but as the authors of bevy_tasks I would welcome your input here.

lachlansneff · 2020-09-02T00:42:19Z

This is a very interesting PR. I appreciate how much work you've put into emulating the ParallelIterator trait with bevy_tasks. My only worry is that it's not quite the same as something like rayon. .map won't run that closure on the task pool, so the performance characteristics are not quite the same.

GrantMoyer · 2020-09-02T00:44:51Z

I had that worry at first too, but actually, since it sends whole iterators to threads in the task pool, the mapped function is run on worker threads. You basically build up a computation description, then send it with batches of data to worker threads. That's also why the map function has to be Send, Sync and Clone.

lachlansneff · 2020-09-02T00:51:19Z

Ah, I see, you have to call iter_batched first which gives you an iterator over batches of a particular size. Have you done any benchmarking?

GrantMoyer · 2020-09-02T00:55:48Z

No benchmarking yet. I'd like to try it on the 50000 boids example @TheJasonLessard was working on though. I could run a benchmark with a more synthetic test.

lachlansneff · 2020-09-02T01:05:24Z

Keep me and @aclysma posted.

aclysma · 2020-09-02T04:19:26Z

This looks really neat. I'd love to see some practical uses of it in action. The thing is, I don't find myself using iterator-heavy functional style programming that much as a personal preference. So I'm probably not a great person to give an opinion on this! :)

GrantMoyer · 2020-09-03T00:58:12Z

Benchmarks

I've added three very basic benchmarks to the repo. They use the Criterion crate to run the benchmarks with cargo bench. The three benchmarks are detailed below. System specs for benchmark:

Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB

Overhead of ParallelIterator

Bencmark description: for each element of a 10000 element Vec, do nothing. ParallelIterators use a batch size of 100.

Test	Average Duration
Iterator	1.43 ns†
ParallelIterator 1 Threads	35,760. ns
ParallelIterator 2 Threads	45,060. ns
ParallelIterator 4 Threads	58,769. ns
ParallelIterator 8 Threads	63,104. ns
ParallelIterator 16 Threads	61,878. ns
ParallelIterator 32 Threads	62,084. ns

† I suspect rustc is optimizing this test out

`for_each()` of expensive operation over a long Vec

Benchmark description: for each element of a 10000 element Vec, run a 10000 iteration loop. ParallelIterators use a batch size of 100.

Test	Average Duration
Iterator	135.50 ms
ParallelIterator 1 Threads	136.07 ms
ParallelIterator 2 Threads	76.61 ms
ParallelIterator 4 Threads	48.38 ms
ParallelIterator 8 Threads	29.99 ms
ParallelIterator 16 Threads	30.14 ms
ParallelIterator 32 Threads	30.22 ms

10 `map()`s of an expensive operation over a long Vec

Benchmark description: For each element of a 10000 element Vec, map a function that doubles its input 1000 times over the Vec 10 times. ParallelIterators use a batch size of 100.

Test	Average Duration
Iterator	146.95 ms
ParallelIterator 1 Threads	146.85 ms
ParallelIterator 2 Threads	81.17 ms
ParallelIterator 4 Threads	51.20 ms
ParallelIterator 8 Threads	33.83 ms
ParallelIterator 16 Threads	33.80 ms
ParallelIterator 32 Threads	33.96 ms

GrantMoyer · 2020-09-03T22:56:06Z

I think Criterion may be useful for reproducibly benchmarking code throughout bevy, but I can revert adding the benchmark the repo if we don't want to use Criterion or store benchmarks in the repo in general.

cart · 2020-09-04T22:30:53Z

Yeah adding benchmarks is definitely something we've been wanting to do for awhile, and criterion has my preference here. Adding criterion benchmarks to this pr is definitely welcome.

I'll try to wrap up my review today.

cart

Fantastic work! I have no comments on the various iterator implementations, they all look correct and close to ideal.

Just a couple of superficial comments on the high level apis then I think this is ready to go.

Eventually I'd love to find a way to make this work (in a way that doesn't tank performance):

query.par_iter(10).for_each(&pool, |a| a.do_thing())

But thats a separate issue that non-parallel queries suffer from too.

cart · 2020-09-05T01:56:38Z

crates/bevy_ecs/src/system/query.rs

+    ///
+    /// Useful for distributing work over a threadpool using the
+    /// ParallelIterator interface.
+    pub fn iter_batched<'q>(&'q mut self, batch_size: u32) -> BatchedIter<'q, 'w, Q> {


I think it makes sense to rename this to par_iter() and rename BatchedIter to ParIter (or BatchedParIter). The returned iterator is only usable in a parallel context, it better captures the intent of callers, improves discoverability, and would help differentiate this from the internal hecs BatchedIter (which implements Iterator).

cart · 2020-09-05T02:06:33Z

examples/ecs/parallel_query.rs

+            "Processing entity {}",
+            i.fetch_add(1, atomic::Ordering::Relaxed)
+        );
+        thread::sleep(Duration::from_secs(1));


I like that this helps illustrate "expensive" tasks. This example helped prove to me that this works as expected, but I think I would prefer it if the example was a bit more practical. I don't want newbies thinking that they need to throw thread::sleep or AtomicUsize in their parallel iterators.

Can we make this more like a "minimal example showing how to set up parallel execution"?

GrantMoyer · 2020-09-05T16:17:24Z

I renamed BatchedIter to ParIter and I changed the example to be more practical. The new example is much more beautiful, and also demonstrates .filter() nicely.

lachlansneff · 2020-09-05T18:03:35Z

@GrantMoyer: Do you think it'd be possible to automatically figure out the optimum batch size based on the number of task pools, so users can just call .par_iter() instead of .par_iter(size_of_batches)?

aclysma · 2020-09-06T02:05:18Z

I would avoid automatic batch sizing. It would be bad to split 32 extremely quick operations across 32 cores, for example. The end-user should make an active decision based on what the workload is like.

aclysma

Took a deeper look, I think the high-level questions I have are:

Quantity:
- Do we want something like this at all? (Probably)
- How much of it should we have? (20% of work that covers 80% of uses, or exhaustive?). I generally favor starting simple/minimal and growing it based on need and practical use than trying to anticipate all possible needs. This is maybe a little more than I personally would start with, but it's a matter of taste and this certainly doesn't seem over the top. :)
Quality
- Is the high-level design appropriate as a first-pass? (I think so)
- Are the implementations correct for what they set out to do?
  - Some of them I'm not sure if they necessarily "work" for parallel iterator
  - A short one-sentence summary, link to in-depth stdlib documentation, and short remarks on implementation/caveats would be helpful without being onerous to write. I wouldn't object to this being done in a follow-on PR.

aclysma · 2020-09-06T02:14:42Z

crates/bevy_ecs/src/system/query.rs

+    ///
+    /// Useful for distributing work over a threadpool using the
+    /// ParallelIterator interface.
+    pub fn par_iter<'q>(&'q mut self, batch_size: u32) -> ParIter<'q, 'w, Q> {


#[inline] would make this more consistent with some of the other functions in this file

its worth pointing out that inline sometimes regresses performance. iterators are especially weird in that respect. its worth testing perf for every inline/non-inline decision made.

I haven't profiled inline vs not inline here, so I'd prefer to let the compiler make the decision for now.

aclysma · 2020-09-06T02:20:11Z

crates/bevy_ecs/src/system/query.rs

+}
+
+unsafe impl<'q, Q: HecsQuery> Send for Batch<'q, Q> {}
+unsafe impl<'q, Q: HecsQuery> Sync for Batch<'q, Q> {}


Intuitively I'm not sure why Batch needs to be Sync? When do two threads need to have access to the same batch?

Yeah lets remove these if we can.

Turns out only unsafe impl Send for Batch {} is needed.

aclysma · 2020-09-06T02:51:35Z

crates/bevy_tasks/src/iter/adapters.rs

+    pub(crate) iter: Option<P>,
+}
+
+impl<B, P> ParallelIterator<B> for Fuse<P>


I'm not sure about fuse. It seems like changing your batch size would change the output in surprising ways.

Fuse just makes sure that once next_batch() returns None, it always returns None.

For flatten, I think it's important to the iter-like API. Flatten necessarily changes the batch size, but note that the batch size isn't always fixed to begin with. In particular, query::ParIter doesn't have a fixed batch size.

aclysma · 2020-09-06T02:55:26Z

crates/bevy_tasks/src/iter/mod.rs

+
+    // TODO: Investigate optimizations for less copying
+    /// Note that find consumes the whole iterator, and return a Vec of all results
+    fn find<F>(mut self, pool: &TaskPool, f: F) -> Vec<Self::Item>


is this the same as filter and collect?

Oops, yeah. It should probably be removed or changed to only return the first.

aclysma · 2020-09-06T02:58:24Z

crates/bevy_tasks/src/iter/mod.rs

+                    let mut pos = None;
+                    for item in batch {
+                        if newf(item) {
+                            pos = pos.or(Some(len));


should this break to short circuit?

It doesn't really make sense to short circuit, since the whole batch has already been consumed.

I misread this, didn't catch the .or().

However, do we want to continue calling newf for every element if pos is not None?

Good point. I'll remove the unnecessary calls to the predicate.

aclysma · 2020-09-06T03:01:46Z

crates/bevy_ecs/src/system/query.rs

+}
+
+unsafe impl<'q, 'w, Q: HecsQuery> Send for ParIter<'q, 'w, Q> {}
+unsafe impl<'q, 'w, Q: HecsQuery> Sync for ParIter<'q, 'w, Q> {}


not sure about forcing sync for ParIter

ParIter is BatchedIter from hecs, but with ParallelIterator implemented instead of Iterator. Hecs implements Sync for BatchedIter, but I haven't thought much about if that's sound.

If we can remove the impls and things still work, lets remove them 😄

aclysma · 2020-09-06T03:05:10Z

crates/bevy_tasks/src/iter/mod.rs

+        (0, None)
+    }
+
+    fn count(mut self, pool: &TaskPool) -> usize {


I'm a little concerned people are going to end up putting extremely small units of work into some of these and blast it across threads and end up being a lot slower because of it. "Some people might misuse it" isn't a great reason to not do something but I feel kind of obligated to say it.

Maybe I should add advice to the docs saying not to use ParallelIterator without profiling your code, because it has significant overhead relative to some tasks.

Yeah a quick doc calling that out is a good idea. Its also worth adding a comment to the example, as many people learn using those.

Ideally, I'd like the example to use much more than 128 sprites, since it's probably faster to just do the parallel_query example serially. However, for now, the limiting factor is how long it takes to draw a large number of sprites.

I've added documentation in a few places calling out the overhead concerns with ParallelIterator.

aclysma · 2020-09-06T03:09:06Z

crates/bevy_tasks/src/iter/mod.rs

+    }
+
+    // TODO: Investigate optimizations for less copying
+    fn partition<C, F>(mut self, pool: &TaskPool, f: F) -> (C, C)


I definitely understand not wanting to duplicate the docs for the standard library, but a very short description for what it does and a link to the standard docs would be helpful. In fact, I think it's good to say a bit less so that the implementation-specific notes (for example all() not short-circuiting) are more likely to be seen

One potential improvement would be to partition by batch instead of item and return an iterator for each set of batches. (I feel like it's something that could be done in a follow-up change although it does change behavior in a non-trivial way.)

I've added short decriptions of each ParallelIterator method and linked each to the relevant Iterator docs.

For partition, I'd rather leave performance improvements for the future. I think small breaking changes are still acceptable.

cart · 2020-09-06T05:25:01Z

examples/ecs/parallel_query.rs

+    });
+}
+
+// Bounce sprties outside the window


nit: sprites

cart · 2020-09-06T05:27:28Z

examples/ecs/parallel_query.rs

+    for _ in 0..128 {
+        commands
+            .spawn(SpriteComponents {
+                material: materials.add(texture_handle.into()),


i want to encourage users to reuse materials when they can (as different materials require gpu state changes during rendering). can we hoist this above the for loop like this?

let material = materials.add(texture_handle.into());

GrantMoyer · 2020-09-07T15:45:56Z

I rebased on master to avoid a conflict with the recent ParallelExecutor improvement.

cart · 2020-09-07T22:07:02Z

CI failure was from a nightly clippy update that added new criteria. I've resolved the issues in master.

cart · 2020-09-08T19:17:48Z

Ok I think this is good to merge. I do share @aclysma's concern that we may have started with a "bigger" implementation than we should have, but I'm happy enough with each iterator impl that I'm comfortable merging this as-is.

Thanks for your hard work @GrantMoyer!

Add support for Parallel Queries

GrantMoyer force-pushed the parallel_queries branch from 808a242 to 52db1bd Compare August 23, 2020 03:56

karroffel added A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible labels Aug 23, 2020

GrantMoyer marked this pull request as ready for review August 25, 2020 22:44

GrantMoyer marked this pull request as draft August 31, 2020 02:52

GrantMoyer force-pushed the parallel_queries branch from 11d7d3f to 464acd9 Compare August 31, 2020 02:57

GrantMoyer marked this pull request as ready for review September 1, 2020 04:05

GrantMoyer force-pushed the parallel_queries branch from da74491 to c3057c2 Compare September 2, 2020 00:19

cart reviewed Sep 5, 2020

View reviewed changes

GrantMoyer force-pushed the parallel_queries branch from 95eb439 to e15af00 Compare September 5, 2020 16:15

aclysma reviewed Sep 6, 2020

View reviewed changes

cart reviewed Sep 6, 2020

View reviewed changes

GrantMoyer force-pushed the parallel_queries branch from e15af00 to 2580d3b Compare September 7, 2020 15:36

GrantMoyer added 22 commits September 7, 2020 18:13

Partial ParallelIterator interface

c04ada8

Refactor into modules

66bdc4a

Take task pool as an argument

15119a0

Cleanup code style and use statements

31c0c67

Remove zip method implementation for now

9ba35a1

Add Iterator method implementations

4251572

Fix doc typos

2525d00

Implement ParallelIterator for Query

e7c1d21

Unimplement Iterator for BatchedIter to prevent method name collision

089f59e

Export ParallelIterator in prelude

bccdb47

Add parallel query example

54b0907

Add basic ParallelIterator benchmarks

8e8fd3a

Rename BatchedIter to ParIter

25a50a5

Make more practical parallel query example

028f94d

Remove unnessesary impls for Send + Sync

751bd3a

Remove find method for now

709728f

Remove unessesary calls to predicate from position()

632604c

Add documentation about when to use or not use ParallelIterator

9958389

Take a &TaskPool for future parallelization

f795532

Document ParallelIterator methods

b134391

Typo

43ce660

Reuse material

cdfe1cd

GrantMoyer force-pushed the parallel_queries branch from 2580d3b to cdfe1cd Compare September 7, 2020 22:13

cart merged commit 586303f into bevyengine:master Sep 8, 2020

mrk-its pushed a commit to mrk-its/bevy that referenced this pull request Oct 6, 2020

Parallel queries (bevyengine#292)

b5318d0

Add support for Parallel Queries

Parallel queries #292

Parallel queries #292

Conversation

GrantMoyer commented Aug 22, 2020 • edited Loading

GrantMoyer commented Aug 22, 2020

GrantMoyer commented Aug 31, 2020

GrantMoyer commented Sep 2, 2020

cart commented Sep 2, 2020

cart commented Sep 2, 2020

lachlansneff commented Sep 2, 2020

GrantMoyer commented Sep 2, 2020 • edited Loading

lachlansneff commented Sep 2, 2020

GrantMoyer commented Sep 2, 2020

lachlansneff commented Sep 2, 2020

aclysma commented Sep 2, 2020

GrantMoyer commented Sep 3, 2020 • edited Loading

Benchmarks

Overhead of ParallelIterator

for_each() of expensive operation over a long Vec

10 map()s of an expensive operation over a long Vec

GrantMoyer commented Sep 3, 2020

cart commented Sep 4, 2020

cart left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrantMoyer commented Sep 5, 2020

lachlansneff commented Sep 5, 2020

aclysma commented Sep 6, 2020

aclysma left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrantMoyer Sep 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrantMoyer Sep 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrantMoyer Sep 6, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GrantMoyer commented Sep 7, 2020

cart commented Sep 7, 2020

cart commented Sep 8, 2020

GrantMoyer commented Aug 22, 2020 •

edited

Loading

GrantMoyer commented Sep 2, 2020 •

edited

Loading

GrantMoyer commented Sep 3, 2020 •

edited

Loading

`for_each()` of expensive operation over a long Vec

10 `map()`s of an expensive operation over a long Vec

cart left a comment •

edited

Loading

GrantMoyer Sep 6, 2020 •

edited

Loading

GrantMoyer Sep 6, 2020 •

edited

Loading

GrantMoyer Sep 6, 2020 •

edited

Loading