-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve aggregate performance by special casing single group keys #6969
Comments
@yahoNanJing reports that with apache/arrow-rs#4523 / apache/arrow-rs#4524 and yahoNanJing@a18ac07 they see a significant performance improvement in TPHC queries Thus, I think this idea holds significant promise |
Just to be clear, what I was imagining for the group storage is not to change the contents of the But instead of storing group_values using the arrow
We would instead store the group values using a native type like
Handling null values would need some care, but since this would only be for single columns (where there can be at most one null value) I think we could figure out some way to handle it |
I ran some some tests of the various approaches on
EBAY-KYLIN-4003-5 (inlined group values): mixed results
Unsafe Row Access: mixed results
|
Is your feature request related to a problem or challenge?
This is another observation to make aggregation queries go faster
After #6904 the speed of grouping is much faster as much of the per-aggregate state overhead is removed.
Now, for queries with a single primitive type (e.g. Int64) a significant portion of their execution time is the converting the grouping key into the the arrow
Row
format: https://docs.rs/arrow-row/43.0.0/arrow_row/index.html to compareAn example of such a query is to find the number of interactions with some user id:
Describe the solution you'd like
As @Dandandan suggested on #6800 (comment)
So that would mean that for group by with a single column, we could use the (native) representation of that column to store the group values rather than using the Arrow Row format.
This is the same approach taken today by Sort / SortPreservingMerge. It is implemented via a generic
FieldCursor
:https://github.com/apache/arrow-datafusion/blob/d316702722e6c301fdb23a9698f7ec415ef548e9/datafusion/core/src/physical_plan/sorts/cursor.rs#L180-L191
Describe alternatives you've considered
@yahoNanJing has also been exploring a fixed width row cursor implementation that is optimized for comparison rather than ordering: apache/arrow-rs#4524
Additional context
There is some discussion in the ASF slack channel as well: https://the-asf.slack.com/archives/C01QUFS30TD/p1689301661826909
The text was updated successfully, but these errors were encountered: