RFC: Demonstrate new GroupHashAggregate stream approach (runs more than 2x faster!) #6800

Conversation
/// The actual group by values, stored in arrow Row format
/// the index of group_by_values is the index
/// https://github.com/apache/arrow-rs/issues/4466
group_by_values: Vec<OwnedRow>,
This should probably be a buffer of some sort? OwnedRow has a copy of the RowConfig per value. If we want to keep using rows(?), something like the following would do:
pub struct AppendableRows {
    /// Underlying row bytes
    buffer: Vec<u8>,
    /// Row `i` has data `&buffer[offsets[i]..offsets[i+1]]`
    offsets: Vec<usize>,
    /// The config for these rows
    config: RowConfig,
}
Thanks @Dandandan -- that is an excellent point. That is what I was trying to get at with apache/arrow-rs#4466.

Note that I don't think the formulation in this PR is any worse than what is on master (which also has an OwnedRow per group).
Ah, I saw you mentioned the need for it in the feature request.
And interestingly, it already does this with an OwnedRow per group -- I didn't realize that.
(I am feeling very good about the ability to make the code faster 🚀 )
(BTW @tustvold is being a hero. Here is a PR to help apache/arrow-rs#4470)
I like this change. It is important to reduce the memory size of group rows/keys. One further optimization: when the group keys are fixed length, we can avoid the offsets vec as well.
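For illustration, a minimal sketch of that idea (FixedWidthRows is a hypothetical name; it assumes every group key encodes to the same number of bytes):

pub struct FixedWidthRows {
    /// Underlying row bytes; the length is always a multiple of `row_width`
    buffer: Vec<u8>,
    /// Encoded width of every row (fixed-length keys only)
    row_width: usize,
}

impl FixedWidthRows {
    /// Row `i` has data `&buffer[i * row_width..(i + 1) * row_width]`
    fn row(&self, i: usize) -> &[u8] {
        &self.buffer[i * self.row_width..(i + 1) * self.row_width]
    }
}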
create_hashes(group_values, &self.random_state, &mut batch_hashes)?;

for (row, hash) in batch_hashes.into_iter().enumerate() {
    let entry = self.map.get_mut(hash, |(_hash, group_idx)| {
I wonder if we could get this more in line with the hash join, with the following steps (see the sketch after this list):
- Create candidates (possible matches) based on hash-equality
- Compare keys (column-wise) in a vectorized fashion (take + eq + and)
- Filter candidates based on filter (filter)
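A rough sketch of what the vectorized key check could look like (a hypothetical verify_candidates function; it assumes an arrow version providing the take, eq_dyn, and and kernels, and elides how the candidate pairs are produced from the hash table):

use arrow::array::{ArrayRef, BooleanArray, UInt32Array};
use arrow::compute::kernels::comparison::eq_dyn;
use arrow::compute::{and, take};
use arrow::error::ArrowError;

/// Verify hash-equality candidates column by column with vectorized kernels.
/// `candidate_groups[i]` / `candidate_rows[i]` pair a stored group with a
/// probe row; the result marks which candidates have truly equal keys.
fn verify_candidates(
    stored_keys: &[ArrayRef], // key columns as stored for existing groups
    probe_keys: &[ArrayRef],  // key columns of the incoming batch
    candidate_groups: &UInt32Array,
    candidate_rows: &UInt32Array,
) -> Result<BooleanArray, ArrowError> {
    let mut matches: Option<BooleanArray> = None;
    for (stored, probe) in stored_keys.iter().zip(probe_keys) {
        // gather both sides into candidate order, then compare vectorized
        let left = take(stored.as_ref(), candidate_groups, None)?;
        let right = take(probe.as_ref(), candidate_rows, None)?;
        let col_eq = eq_dyn(left.as_ref(), right.as_ref())?;
        matches = Some(match matches {
            Some(acc) => and(&acc, &col_eq)?,
            None => col_eq,
        });
    }
    Ok(matches.expect("at least one key column"))
}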
That is interesting 🤔 This is basically what the existing grouping operator does. I'll try and check out the join code at some point and see if I can transfer any of the learnings over here.
From our experience, convert_columns is also quite expensive. It may be worth considering comparing directly column by column, and only doing the row conversion when spilling is required.
I wonder if we can special case single column grouping like we do for SortPreservingMerge; the row format is only really beneficial when dealing with multiple columns, as it avoids per-field type dispatch.

FWIW row conversion should have similar performance to the take kernel, with the exception of dictionaries. I would be interested if this is not the case, as that is a bug.
> I wonder if we can special case single column grouping like we do for SortPreservingMerge; the row format is only really beneficial when dealing with multiple columns, as it avoids per-field type dispatch.

Yes, I think this would be an excellent idea.

Basically @sunchao I think we have seen that for single column sorting (in this case grouping), keeping the native representation is better than converting to row format. However, once there are multiple sort (or group) columns involved, the dynamic dispatch logic for comparisons quickly dominates the row conversion costs.

I am a bit concerned about "boiling the ocean" when improving grouping. Any work will take a significant amount of time, so keeping the scope down is important to make the change practical.

That being said, if we go with the formulation in this PR, we'll be in a much better place to try to specialize group storage -- it may not be obvious, but the actual operator / stream code in this PR is quite a bit simpler than the existing row_hash even though it has all the same features. This difference is largely due to not tracking parallel sets of aggregators (row-based and Accumulator-based).
100% agree, I think we should focus on getting a consistent accumulator representation and interface, before undertaking additional optimisation work of the hash table machinery
Yes, I totally agree with the approach. Getting the other changes ironed out is definitely more important for now.
Also agree we should finish the other changes first, as it will get too big otherwise 👍

I might do some experiments in the future with a similar approach to the one I mentioned above. I think the conversion might be relatively fast, but it will make other operations (e.g. equality) slower, as it is not specialized on fixed size types and not as well vectorized.
> From our experience, convert_columns is also quite expensive. It may be worth considering comparing directly column by column, and only doing the row conversion when spilling is required.

I think the encoder implemented by @tustvold is very efficient. In the past I did some tests on this code path and it took almost no time.
(force-pushed from 31335b4 to e02c35d)
use super::AggregateExec;

/// Grouping aggregate
This code follows the basic structure of row_hash, but the aggregate state management is different.
match self.mode {
    AggregateMode::Partial | AggregateMode::Single => {
        acc.update_batch(
Here is one key difference -- each accumulator is called once per input batch (not once per group)
let avg_fn =
    move |sum: i128, count: u64| decimal_averager.avg(sum, count as i128);

Ok(Box::new(AvgGroupsAccumulator::<Decimal128Type, _>::new(
Here is a specialized accumulator -- it will be instantiated once per native type (or other type) we need to support, resulting in a specialized accumulator for each native type. 👨‍🍳 👌
This also serves the purpose of allowing us to eventually deprecate the ScalarValue binary operations - #6842
// TODO combine the null mask from values and opt_filter
let valids = values.nulls();

// This is based on (ahem, COPY/PASTA) arrow::compute::aggregate::sum
This particular code is likely to be very common across most accumulators so I would hope to find some way to generalize it into its own function / macro
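For illustration, a minimal sketch of what such a shared helper could look like (a hypothetical accumulate function; the names and the "filter nulls drop the row" convention are assumptions, not this PR's actual code):

use arrow_array::types::ArrowPrimitiveType;
use arrow_array::{Array, BooleanArray, PrimitiveArray};

/// Invoke `value_fn(group_index, value)` for every input row that is
/// non-null and passes the optional filter.
fn accumulate<T, F>(
    group_indices: &[usize],
    values: &PrimitiveArray<T>,
    opt_filter: Option<&BooleanArray>,
    mut value_fn: F,
) where
    T: ArrowPrimitiveType,
    F: FnMut(usize, T::Native),
{
    if values.null_count() == 0 && opt_filter.is_none() {
        // fast path: no nulls and no filter, iterate the raw values
        for (&group_index, &value) in group_indices.iter().zip(values.values().iter()) {
            value_fn(group_index, value);
        }
    } else {
        // slow path: consult the null mask and/or the filter per row
        for (row, &group_index) in group_indices.iter().enumerate() {
            let keep = opt_filter
                .map(|f| f.is_valid(row) && f.value(row))
                .unwrap_or(true);
            if keep && values.is_valid(row) {
                value_fn(group_index, values.value(row));
            }
        }
    }
}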
use arrow_array::{ArrayRef, BooleanArray};
use datafusion_common::Result;

/// An implementation of GroupAccumulator is for a single aggregate
Here is the new GroupsAccumulator trait that all accumulators would have to implement.
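A hedged sketch of the shape such a trait could take (method names and signatures are approximations, not quotes from this PR):

use arrow_array::{ArrayRef, BooleanArray};
use datafusion_common::Result;

/// State lives per-group inside the accumulator, and every method sees an
/// entire input batch at once rather than one group's rows at a time.
pub trait GroupsAccumulator: Send {
    /// Update the state of all touched groups from one input batch.
    fn update_batch(
        &mut self,
        values: &[ArrayRef],               // the aggregate's input column(s)
        group_indices: &[usize],           // group index of each input row
        opt_filter: Option<&BooleanArray>, // optional FILTER clause result
        total_num_groups: usize,           // groups seen so far, for resizing
    ) -> Result<()>;

    /// Produce the final aggregate values, one element per group.
    fn evaluate(&mut self) -> Result<ArrayRef>;

    /// Produce the intermediate state as columns, one row per group.
    fn state(&mut self) -> Result<Vec<ArrayRef>>;
}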
I would also plan to create a struct that implements this trait for aggregates based on Accumulators:
struct GroupsAdapter {
    groups: Vec<Box<dyn Accumulator>>,
}

impl GroupsAccumulator for GroupsAdapter {
    ...
}
So in that way we can start with simpler (but slower) Accumulator implementations for aggregates, and provide a fast GroupsAccumulator for the aggregates / types that need the specialization.
(changed the title from "GroupHashAggregate stream approach" to "GroupHashAggregate stream approach (runs more than 2x faster!)")
I did some profiling of the current version on query 17: it seems that a portion (at least 10%, but could be more) of the time is now spent around
@alamb do you plan to continue this PR on your own, or would some form of assistance help? E.g. writing some of those accumulators?
group_indicies,
values,
opt_filter,
|group_index, _new_value| {
I wonder if this compiles into the same code as with only iterating over group_indicies
It would be super helpful if you could test that / figure out if it is worth specializing -- the original version didn't handle input nulls correctly
if values.null_count() == 0 {
    accumulate_all(
        group_indicies,
-    group_indicies,
+    group_indices,

?
🤦
git commit -a -m 'fix spelling of indices'
[alamb/hash_agg_spike d760a5f115] fix spelling of indices
4 files changed, 24 insertions(+), 24 deletions(-)
I just have the last two accumulators to complete, and then I think I'll be ready to create a PR for review.
Found time for a small optimization (to reuse the buffer to create the hashes).
@@ -111,6 +111,8 @@ pub(crate) struct GroupedHashAggregateStream {
     /// first element in the array corresponds to normal accumulators
     /// second element in the array corresponds to row accumulators
     indices: [Vec<Range<usize>>; 2],
+    // buffer to be reused to store hashes
+    hashes_buffer: Vec<u64>,
❤️ this is a good change -- thanks @Dandandan. Pretty soon there will be no allocations while processing each batch (aka the hot loop) 🥳 -- I think with #6888 we can get rid of the counts in the sum accumulator.
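For reference, the reuse pattern amounts to something like this (a sketch assuming datafusion_physical_expr::hash_utils::create_hashes and ahash::RandomState; the helper name is hypothetical):

use ahash::RandomState;
use arrow_array::ArrayRef;
use datafusion_common::Result;
use datafusion_physical_expr::hash_utils::create_hashes;

/// Hash one batch's group values into `hashes_buffer`, reusing its
/// allocation across batches so the hot loop does not allocate.
fn hash_batch(
    group_values: &[ArrayRef],
    random_state: &RandomState,
    hashes_buffer: &mut Vec<u64>,
    n_rows: usize,
) -> Result<()> {
    hashes_buffer.clear(); // keeps the allocation from the previous batch
    hashes_buffer.resize(n_rows, 0); // room for this batch's rows
    create_hashes(group_values, random_state, hashes_buffer)?;
    Ok(())
}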
Note that this change was made to the existing row_hash (not the new one). I will port the change to the new one as part of #6904
Ok, here are some numbers (TPC-H SF1). I am quite pleased. My next plan is to turn this into a PR.
Tracking my plan in #6889
I also tried some ClickBench queries and I got a similar speedup (3x) -- I am feeling good about this one.

Main:

This branch

🚀
Amazing 🚀 I think for this query we should also consider avoiding the conversion to the row-format as this likely will be one of the more expensive things now.
That is a good idea -- it worked well for sorting as well. I put a note on #6889 to track writing up a real ticket.
counts: Vec<u64>,

/// Sums per group, stored as the native type
sums: Vec<T::Native>,
Is it possible to combine the counts and sums into one property, like avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are always used together, I think it's better to put them together for better cache locality.
FYI @alamb sounds like a useful suggestion
> Is it possible to combine the counts and sums into one property, like avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are always used together, I think it's better to put them together for better cache locality.

Thank you for the comment @yahoNanJing.

The reason the sums and counts are stored separately is to minimize copying when forming the final output -- since the final output is columnar (two columns), keeping the data as two Vecs allows the final ArrayRefs to be created directly from that data.

It would be an interesting experiment to see if keeping them together and improving cache locality outweighed the extra copy.
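To make that concrete, here is a minimal sketch (a hypothetical helper, using f64 sums for simplicity) of why the two-Vec layout makes producing the output cheap -- each Vec is turned into an arrow array by handing over its buffer, with no per-group copy:

use std::sync::Arc;
use arrow_array::{ArrayRef, Float64Array, UInt64Array};

/// Emit the two-column (sum, count) state directly from the group state.
fn state_columns(sums: Vec<f64>, counts: Vec<u64>) -> Vec<ArrayRef> {
    vec![
        Arc::new(Float64Array::from(sums)) as ArrayRef,
        Arc::new(UInt64Array::from(counts)) as ArrayRef,
    ]
}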
BTW if people are looking to optimize the inner loops more, I think removing the bounds checks with unsafe might also help (but I don't plan to pursue it until I find the need to optimize more).

So instead of

let sum = &mut self.sums[group_index];
*sum = sum.add_wrapping(new_value);

one could write

unsafe {
    let sum = sums.get_unchecked_mut(group_index);
    *sum = sum.add_wrapping(new_value);
}
Is it possible to make a tuple (T::Native, u64) a primitive type on the arrow-rs side, so that we can create an array of tuples? Then we don't need to return two arrays for the state().
Ah -- I see what you are saying -- I think we could potentially use a StructArray for the state (which would be a single "column" in arrow), but the underlying storage is still two separate contiguous arrays.

Maybe we could use FixedSizeBinaryArray 🤔 and pack/unpack the tuples to the appropriate size.

It would be an interesting experiment.
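One way the packing half of that experiment could look (a hypothetical pack_states helper; the unpack side and null handling are elided):

use arrow_array::FixedSizeBinaryArray;

/// Pack each (sum: i128, count: u64) state pair into 24 little-endian
/// bytes of a FixedSizeBinaryArray, one value per group.
fn pack_states(sums: &[i128], counts: &[u64]) -> FixedSizeBinaryArray {
    let mut bytes = Vec::with_capacity(sums.len() * 24);
    for (sum, count) in sums.iter().zip(counts) {
        bytes.extend_from_slice(&sum.to_le_bytes()); // 16 bytes
        bytes.extend_from_slice(&count.to_le_bytes()); // 8 bytes
    }
    FixedSizeBinaryArray::try_from_iter(bytes.chunks(24))
        .expect("at least one group")
}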
I'm afraid both StructArray and FixedSizeBinaryArray may have additional overhead.

If T::Native could be a tuple, then we could provide a new array type, called TupleArray, whose element type is a tuple (T::Native, T::Native). Since the tuple elements can themselves be tuples, this new TupleArray could cover any nested tuple case.
It would definitely be a cool thing to try
For anyone following along, I have created a proposed PR with these changes that is ready for review: #6904
TLDR

This branch executes Q17 (which has a high cardinality grouping on a Decimal128) in less than half (44%) of the time as main. 🥳

Which issue does this PR close?

Related to #4973

This PR contains a technical spike / proof of concept that the hash aggregate approach described in #4973 (comment) will improve performance dramatically.

I do not intend to ever merge this PR, but rather, if it proves promising, I will break it up and incrementally merge it into the existing code (steps TBD).

Rationale for this change

We want faster grouping behavior, especially when there are large numbers of distinct groups.

What changes are included in this PR?

- GroupedHashAggregateStream2 operator that implements vectorized / multi-group updates
- GroupsAccumulator trait with a proposed vectorized API for managing and updating group state
- GroupsAccumulator for AVG for PrimitiveArray (including decimal)

Stuff I plan to complete in this PR:

- accumulate function
- opt_filter in accumulate functions
- GroupsAccumulator in terms of Accumulator (for slower, but simpler accumulators)

I am very pleased with how the code looks.
Things not done:

Performance Results:

This branch runs Q17 in less than half (44%) of the time as main. 🥳

main: Query 17 avg time: 1789.73 ms
This branch: Query 17 avg time: 766.31 ms

Details

Correctness

Both main and this branch produce the same answer.

This branch:
Query 17 iteration 0 took 876.5 ms and returned 1 rows
Query 17 iteration 1 took 757.5 ms and returned 1 rows
Query 17 iteration 2 took 737.6 ms and returned 1 rows
Query 17 iteration 3 took 728.6 ms and returned 1 rows
Query 17 iteration 4 took 731.3 ms and returned 1 rows
Query 17 avg time: 766.31 ms
Main:
Query 17 iteration 0 took 1794.5 ms and returned 1 rows
Query 17 iteration 1 took 1825.9 ms and returned 1 rows
Query 17 iteration 2 took 1799.1 ms and returned 1 rows
Query 17 iteration 3 took 1793.4 ms and returned 1 rows
Query 17 iteration 4 took 1735.7 ms and returned 1 rows
Query 17 avg time: 1789.73 ms
Methodology
Run this command
Query:
Here is the original plan:
Next Steps

Stuff I would do after the above is done:
- RowAccumulators (see list below)
- BoundedAggregateStream and GroupedHashAggregateStream #6798

Here is the list of RowAccumulators (aka accumulators that have specialized implementations). I think Avg is the trickiest to implement (and it is already done):
CountRowAccumulator
MaxRowAccumulator
MinRowAccumulator
AvgRowAccumulator
SumRowAccumulator
BitAndRowAccumulator
BitOrRowAccumulator
BitXorRowAccumulator
BoolAndRowAccumulator
BoolOrRowAccumulator