Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize hash_aggregate when there are no null group keys #850

Closed
alamb opened this issue Aug 10, 2021 · 8 comments
Closed

Optimize hash_aggregate when there are no null group keys #850

alamb opened this issue Aug 10, 2021 · 8 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Aug 10, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The code in hash_aggregate.rs is general and works for data with and without nulls. However there are optimizations that can be done. One such optimization is suggested by @andygrove and @Dandandan on #844 (comment), namely add an optimized code path when there are no NULL values in the input groups that will avoid the cost of checking for null on each group.

While this might sound trivial the null check is on the hot path (done for every single row that is grouped) so removing it may improve performance by a measurable amount.

Describe the solution you'd like

  1. A new function or parameter in ScalarVaue::eq_array (e.g. ScalarValue::eq_array_non_null) that assumes the input has no nulls and does not check Array::is_valid
  2. A check in hash_aggregate if the null count in all group columns is 0 and invokes the specialized version of ScalarValue::eq_array_non_null if so
  3. Some sort of performance benchmark results showing that it improves grouping performance (there is a list of benchmarks on Rework GroupByHash to for faster performance and support grouping by nulls #808 that might be able to inspire you)

Describe alternatives you've considered
The performance benefit may not be worth the additional code complexity, but we won't know until we try

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added the enhancement New feature or request label Aug 10, 2021
@alamb alamb added the good first issue Good for newcomers label Aug 10, 2021
@alamb
Copy link
Contributor Author

alamb commented Aug 10, 2021

I think this is a reasonable first issue for someone if they are interested. The trick will be finding some benchmark where the null check matters

@novemberkilo
Copy link

@alamb I am interested in picking this up if this is an appropriate way to begin contributing to this project.

@novemberkilo
Copy link

A check in hash_aggregate if the null count in all group columns is 0

What is all group columns referring to here please?

Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0

@alamb
Copy link
Contributor Author

alamb commented Aug 16, 2021

Thanks @novemberkilo

Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0

Yes, I think that is right. So rather than

            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))

The idea would be to write something like this

            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| { 
                  if array.null_count > 0 { 
                    scalar.eq_array(array, row))
                  } else  { 
                    scalar.eq_array_no_nulls(array, row))
                  } 
                }) 

But ScalarValue::eq_array_no_nulls does not exist yet -- you would have to write it / test it

Although now on second thought I think the if needs to be hoisted out of the loop:

          if (array.null_count() > 0) {
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))
         } else {
            // special case no null values
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array_no_nulls(array, row))
         }

@novemberkilo
Copy link

Thanks - that was the direction I was headed in too. Am keen to pick this up so please assign to me as appropriate?

@alamb
Copy link
Contributor Author

alamb commented Aug 17, 2021

@novemberkilo assigned

@alamb
Copy link
Contributor Author

alamb commented Oct 2, 2021

I think given the experience of @novemberkilo on this issue, we can conclude this is not an easy issue (and maybe not worth doing at all) so removing the label

@alamb
Copy link
Contributor Author

alamb commented Jul 21, 2023

I think this is done in #6904 and #7043

@alamb alamb closed this as completed Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants