Optimize hash_aggregate when there are no null group keys #850
Comments
I think this is a reasonable first issue for someone if they are interested. The trick will be finding some benchmark where the null check matters.
@alamb I am interested in picking this up if this is an appropriate way to begin contributing to this project.
What is Perhaps is this the same as checking that for each
Thanks @novemberkilo
Yes, I think that is right. So rather than

    group_values
        .iter()
        .zip(group_state.group_by_values.iter())
        .all(|(array, scalar)| scalar.eq_array(array, row))

the idea would be to write something like this:

    group_values
        .iter()
        .zip(group_state.group_by_values.iter())
        .all(|(array, scalar)| {
            if array.null_count() > 0 {
                scalar.eq_array(array, row)
            } else {
                scalar.eq_array_no_nulls(array, row)
            }
        })

Although, on second thought, I think the null check should be hoisted out of the closure:

    // Check once whether any group-key column contains nulls
    if group_values.iter().any(|array| array.null_count() > 0) {
        group_values
            .iter()
            .zip(group_state.group_by_values.iter())
            .all(|(array, scalar)| scalar.eq_array(array, row))
    } else {
        // special case no null values
        group_values
            .iter()
            .zip(group_state.group_by_values.iter())
            .all(|(array, scalar)| scalar.eq_array_no_nulls(array, row))
    }
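The hoisting idea above can be tried outside DataFusion with a small stand-in. This is only a sketch under stated assumptions: `Column`, `eq_at`, `eq_at_no_nulls`, and `matching_rows` are hypothetical names invented for illustration, the column holds plain `i64` values with a boolean validity mask instead of a real Arrow array, and NULL keys are assumed to match only NULL slots.

```rust
/// Hypothetical stand-in for an Arrow array: values plus a validity mask.
struct Column {
    values: Vec<i64>,
    valid: Vec<bool>,
    null_count: usize,
}

impl Column {
    fn new(data: &[Option<i64>]) -> Self {
        Column {
            // NULL slots get a placeholder value of 0; the mask marks them invalid.
            values: data.iter().map(|v| v.unwrap_or(0)).collect(),
            valid: data.iter().map(|v| v.is_some()).collect(),
            null_count: data.iter().filter(|v| v.is_none()).count(),
        }
    }

    /// General comparison: consults the validity mask on every call.
    fn eq_at(&self, key: Option<i64>, row: usize) -> bool {
        match key {
            Some(k) => self.valid[row] && self.values[row] == k,
            None => !self.valid[row],
        }
    }

    /// Fast path: the caller guarantees `null_count == 0`, so the mask is skipped.
    fn eq_at_no_nulls(&self, key: Option<i64>, row: usize) -> bool {
        key == Some(self.values[row])
    }
}

/// Returns the rows whose group-key columns all match `key`.
/// The null check runs once per batch instead of once per row.
fn matching_rows(columns: &[Column], key: &[Option<i64>], num_rows: usize) -> Vec<usize> {
    let any_nulls = columns.iter().any(|c| c.null_count > 0);
    if any_nulls {
        (0..num_rows)
            .filter(|&row| columns.iter().zip(key).all(|(c, k)| c.eq_at(*k, row)))
            .collect()
    } else {
        (0..num_rows)
            .filter(|&row| columns.iter().zip(key).all(|(c, k)| c.eq_at_no_nulls(*k, row)))
            .collect()
    }
}
```

Both branches return the same rows whenever no column contains nulls; the only difference is whether the validity mask is read in the inner loop.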
Thanks - that was the direction I was headed in too. Am keen to pick this up so please assign to me as appropriate?
@novemberkilo assigned
I think given the experience of @novemberkilo on this issue, we can conclude this is not an easy issue (and maybe not worth doing at all) so removing the label
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The code in hash_aggregate.rs is general and works for data with and without nulls. However, there are optimizations that can be done. One such optimization is suggested by @andygrove and @Dandandan on #844 (comment), namely adding an optimized code path for when there are no NULL values in the input groups, which avoids the cost of checking for null on each group.
While this might sound trivial, the null check is on the hot path (done for every single row that is grouped), so removing it may improve performance by a measurable amount.
Describe the solution you'd like
Add a variant of ScalarValue::eq_array (e.g. ScalarValue::eq_array_non_null) that assumes the input has no nulls and does not check Array::is_valid
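To make the proposal concrete, here is a minimal sketch of the two comparison functions on a simplified type. It is not the DataFusion implementation: `Int64Column` and `Int64Scalar` are hypothetical stand-ins for an Arrow array and `ScalarValue`, and the sketch assumes a NULL scalar matches only a NULL slot.

```rust
/// Hypothetical stand-in for an Arrow Int64 array.
pub struct Int64Column {
    values: Vec<i64>,
    /// `None` means every slot is valid (the column has no nulls).
    validity: Option<Vec<bool>>,
}

impl Int64Column {
    pub fn new(data: Vec<Option<i64>>) -> Self {
        let has_nulls = data.iter().any(|v| v.is_none());
        let validity = if has_nulls {
            Some(data.iter().map(|v| v.is_some()).collect())
        } else {
            None
        };
        // NULL slots get a placeholder value of 0.
        let values = data.iter().map(|v| v.unwrap_or(0)).collect();
        Self { values, validity }
    }

    pub fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map_or(0, |mask| mask.iter().filter(|ok| !**ok).count())
    }

    pub fn is_valid(&self, row: usize) -> bool {
        self.validity.as_ref().map_or(true, |mask| mask[row])
    }
}

/// Hypothetical stand-in for a nullable scalar group key.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Int64Scalar(pub Option<i64>);

impl Int64Scalar {
    /// General version: checks validity for every row.
    pub fn eq_array(&self, array: &Int64Column, row: usize) -> bool {
        match self.0 {
            Some(v) => array.is_valid(row) && array.values[row] == v,
            None => !array.is_valid(row),
        }
    }

    /// Variant that assumes the input has no nulls and never calls `is_valid`.
    /// The caller is responsible for checking `null_count() == 0` beforehand.
    pub fn eq_array_non_null(&self, array: &Int64Column, row: usize) -> bool {
        debug_assert_eq!(array.null_count(), 0);
        self.0 == Some(array.values[row])
    }
}
```

On a column without nulls the two methods agree for every row and key, which is the property the optimized path would rely on.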
Describe alternatives you've considered
The performance benefit may not be worth the additional code complexity, but we won't know until we try
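One rough way to "try" is a micro-benchmark that times the comparison with and without the validity lookup. The sketch below uses only the standard library; the function names and the synthetic data are assumptions made for illustration, and it measures a simplified loop over plain slices, not the real hash_aggregate code, so any numbers it produces are at best a hint.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Counts rows equal to `key`, consulting the validity mask for every row.
fn count_matches_checked(values: &[i64], valid: &[bool], key: i64) -> usize {
    values
        .iter()
        .zip(valid)
        .filter(|(v, ok)| **ok && **v == key)
        .count()
}

/// Counts rows equal to `key`, assuming the caller already verified there are no NULLs.
fn count_matches_unchecked(values: &[i64], key: i64) -> usize {
    values.iter().filter(|v| **v == key).count()
}

/// Runs `f` `iters` times; returns the last result and the total elapsed time.
fn time_it<F: FnMut() -> usize>(iters: u32, mut f: F) -> (usize, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // black_box discourages the optimizer from removing the call.
        last = black_box(f());
    }
    (last, start.elapsed())
}

/// Builds synthetic all-valid data and times both paths on it.
fn compare_paths(num_rows: usize, iters: u32) -> (usize, Duration, usize, Duration) {
    let values: Vec<i64> = (0..num_rows as i64).map(|i| i % 100).collect();
    let valid = vec![true; num_rows];
    let (checked, t_checked) = time_it(iters, || {
        count_matches_checked(black_box(&values), black_box(&valid), 42)
    });
    let (unchecked, t_unchecked) = time_it(iters, || {
        count_matches_unchecked(black_box(&values), 42)
    });
    (checked, t_checked, unchecked, t_unchecked)
}
```

Calling `compare_paths` with a large row count in a release build and printing the two durations gives a first impression; a proper harness and realistic group-by queries would be needed before drawing conclusions.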