Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream #6798

Closed
Tracked by #6800 ...
alamb opened this issue Jun 29, 2023 · 6 comments · Fixed by #6932
Closed
Tracked by #6800 ...

Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream #6798

alamb opened this issue Jun 29, 2023 · 6 comments · Fixed by #6932
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 29, 2023

Is your feature request related to a problem or challenge?

We are trying to make hash based aggregation significantly faster -- see #4973

This will require some non trivial changes to the organization of how hash aggregation works. At the moment BoundedAggregateStream and GroupedHashAggregateStream both share significant amounts of code and so either we will have to duplicate the work to make hashing aggregation faster or else BoundedAggregateStream will not get the benefits.

Here is a visual depiction of the common code:

 meld datafusion/core/src/physical_plan/aggregates/bounded_aggregate_stream.rs datafusion/core/src/physical_plan/aggregates/row_hash.rs 

Screenshot 2023-06-29 at 9 16 18 AM

Describe the solution you'd like

Reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream

The major differences are:

  1. Choice of when output can be emitted
  2. Clearing previous group state when groups have been emitted

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 29, 2023
@alamb
Copy link
Contributor Author

alamb commented Jun 29, 2023

I wonder if you have any thoughts on this proposal @metesynnada and @mustafasrepo ?

@mustafasrepo
Copy link
Contributor

During streaming support for aggregation I tried to reduce duplication between BoundedAggregateStream and GroupedHashAggregateStream. However, it reduced performance for some queries as seen in discussion

However, I think to benefit both implementations with the following major change. It is better to reduce duplication as much as possible even if it introduces minor overhead during benchmarks.

@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2023

Thanks for the comments @mustafasrepo -- I have been studying the code and I am confident I can reduce the duplication. I hope to have a PR ready for review that brings us in this direction in a few days.

@alamb alamb self-assigned this Jul 3, 2023
@alamb
Copy link
Contributor Author

alamb commented Jul 3, 2023

I plan to do this

@alamb
Copy link
Contributor Author

alamb commented Jul 7, 2023

I have completed my study of the code -- and I added some comments here #6881

Rather than try and consolidate the BoundedAggregateStream on main, what I think I would like to do is to merge in the new GroupByHash implementation - #4973 and then remove BoundedAggregateStream by porting its functionality to the new hash stream

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2023

I have a PR ready that consolidates the logic which I am quite pleased with: #6932

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants