Optimize hash_aggregate when there are no null group keys #850

alamb · 2021-08-10T21:23:42Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The code in hash_aggregate.rs is general and works for data with and without nulls. However there are optimizations that can be done. One such optimization is suggested by @andygrove and @Dandandan on #844 (comment), namely add an optimized code path when there are no NULL values in the input groups that will avoid the cost of checking for null on each group.

While this might sound trivial, the null check is on the hot path (it runs for every single row that is grouped), so removing it may measurably improve performance.

Describe the solution you'd like

A new function or parameter in ScalarValue::eq_array (e.g. ScalarValue::eq_array_non_null) that assumes the input has no nulls and does not check Array::is_valid
A check in hash_aggregate if the null count in all group columns is 0 and invokes the specialized version of ScalarValue::eq_array_non_null if so
Some sort of performance benchmark results showing that it improves grouping performance (there is a list of benchmarks on Rework GroupByHash to for faster performance and support grouping by nulls #808 that might be able to inspire you)
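
To make the proposal concrete, here is a minimal sketch of the two comparison paths. It uses a toy Int64Column type instead of the real Arrow array types, so Int64Column, its fields, and both free functions are illustrative assumptions and not the actual ScalarValue API.

```rust
/// Toy stand-in for an Arrow array of i64 values (hypothetical, not the arrow crate).
struct Int64Column {
    values: Vec<i64>,
    /// `validity[i] == false` means row `i` is NULL; `None` means the column has no nulls.
    validity: Option<Vec<bool>>,
}

impl Int64Column {
    /// Number of NULL rows in the column.
    fn null_count(&self) -> usize {
        match &self.validity {
            Some(v) => v.iter().filter(|valid| !**valid).count(),
            None => 0,
        }
    }

    /// Whether row `row` holds a non-NULL value.
    fn is_valid(&self, row: usize) -> bool {
        self.validity.as_ref().map_or(true, |v| v[row])
    }
}

/// General comparison: checks validity on every call.
/// A `None` key stands for a NULL group key.
fn eq_array(key: Option<i64>, column: &Int64Column, row: usize) -> bool {
    match key {
        Some(k) => column.is_valid(row) && column.values[row] == k,
        None => !column.is_valid(row),
    }
}

/// Specialized comparison: the caller guarantees `column.null_count() == 0`,
/// so the validity lookup is skipped entirely.
fn eq_array_non_null(key: Option<i64>, column: &Int64Column, row: usize) -> bool {
    key == Some(column.values[row])
}
```

The specialized function is only safe to call after checking null_count() == 0 on the column. On such a column both functions return the same answer for every key and row.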

Describe alternatives you've considered
The performance benefit may not be worth the additional code complexity, but we won't know until we try



alamb · 2021-08-10T21:24:27Z

I think this is a reasonable first issue for someone if they are interested. The trick will be finding some benchmark where the null check matters

novemberkilo · 2021-08-15T05:17:35Z

@alamb I am interested in picking this up if this is an appropriate way to begin contributing to this project.

novemberkilo · 2021-08-15T06:16:57Z

A check in hash_aggregate if the null count in all group columns is 0

What is all group columns referring to here please?

Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0

alamb · 2021-08-16T20:56:21Z

Thanks @novemberkilo

Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0

Yes, I think that is right. So rather than

            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))

The idea would be to write something like this

            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| {
                  if array.null_count() > 0 {
                    scalar.eq_array(array, row)
                  } else {
                    scalar.eq_array_no_nulls(array, row)
                  }
                })

But ScalarValue::eq_array_no_nulls does not exist yet -- you would have to write it / test it

Although now on second thought I think the if needs to be hoisted out of the loop:

          // check once whether any group column contains nulls
          if group_values.iter().any(|array| array.null_count() > 0) {
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))
          } else {
            // special case no null values
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array_no_nulls(array, row))
          }
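
As a rough, self-contained illustration of the hoisting idea, the sketch below decides once per batch whether any group column has nulls and then runs one of two row loops. Column, matching_rows, and both comparison functions are hypothetical toy names for this example, not the DataFusion or Arrow API.

```rust
/// Toy group column of i64 values with a validity mask (hypothetical type).
struct Column {
    values: Vec<i64>,
    valid: Vec<bool>,
}

impl Column {
    /// Builds a column from optional values; `None` models a NULL.
    fn new(data: Vec<Option<i64>>) -> Self {
        let valid = data.iter().map(|v| v.is_some()).collect();
        let values = data.iter().map(|&v| v.unwrap_or(0)).collect();
        Column { values, valid }
    }

    /// Number of NULL rows in the column.
    fn null_count(&self) -> usize {
        self.valid.iter().filter(|v| !**v).count()
    }
}

/// General path: consults the validity mask for every row.
fn eq_array(key: Option<i64>, col: &Column, row: usize) -> bool {
    match key {
        Some(k) => col.valid[row] && col.values[row] == k,
        None => !col.valid[row],
    }
}

/// Fast path: assumes the column has no nulls and never reads the mask.
fn eq_array_no_nulls(key: Option<i64>, col: &Column, row: usize) -> bool {
    key == Some(col.values[row])
}

/// For each row, reports whether it matches `group_key` across all group columns.
fn matching_rows(group_values: &[Column], group_key: &[Option<i64>], num_rows: usize) -> Vec<bool> {
    // Hoisted check: evaluated once per batch, not once per row or per column.
    let any_nulls = group_values.iter().any(|col| col.null_count() > 0);
    if any_nulls {
        (0..num_rows)
            .map(|row| {
                group_values
                    .iter()
                    .zip(group_key.iter())
                    .all(|(col, key)| eq_array(*key, col, row))
            })
            .collect()
    } else {
        // special case: no null values in any group column
        (0..num_rows)
            .map(|row| {
                group_values
                    .iter()
                    .zip(group_key.iter())
                    .all(|(col, key)| eq_array_no_nulls(*key, col, row))
            })
            .collect()
    }
}
```

Whether the fast path is measurably faster in practice is exactly what the benchmark requested in this issue would need to show; the sketch only demonstrates that both paths give the same answers.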

novemberkilo · 2021-08-16T22:08:43Z

Thanks - that was the direction I was headed in too. Am keen to pick this up so please assign to me as appropriate?

alamb · 2021-08-17T10:39:08Z

@novemberkilo assigned

alamb · 2021-10-02T10:18:06Z

I think given the experience of @novemberkilo on this issue, we can conclude this is not an easy issue (and maybe not worth doing at all) so removing the label

alamb · 2023-07-21T13:00:29Z

I think this is done in #6904 and #7043

alamb added the enhancement New feature or request label Aug 10, 2021

alamb mentioned this issue Aug 10, 2021

Add ScalarValue::eq_array optimized comparison function #844

Merged

alamb added the good first issue Good for newcomers label Aug 10, 2021

alamb assigned novemberkilo Aug 17, 2021

novemberkilo mentioned this issue Aug 22, 2021

WIP Optimize hash_aggregate when there are no null group keys #922

Closed

alamb removed the good first issue Good for newcomers label Oct 2, 2021

novemberkilo removed their assignment May 5, 2022

alamb mentioned this issue Mar 10, 2023

[EPIC] A list of performance improvement tickets #5546

Open

29 tasks

alamb closed this as completed Jul 21, 2023
