-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Materialize Dictionaries in Group Keys #7647
Comments
You can great dictionary encoded columns using
Here is a better example (use string dictionaries): |
@tustvold I've spent some time getting familiar with the code for aggregations and I have a couple of questions. Maybe you could help me here? |
@alamb maybe you could help me here? |
You should not need to do any conversion, the conversion is already being done by RowConverter. Currently there is a cast annotated with a TODO linking back to this ticket. It should just be a case of removing this cast and updating the schema logic within the plan to account for this |
@tustvold I saw the conversion in RowConverter but as I understand it happens multiple times: i.e. we convert dictionary into utf8 in AggregatePartial, then cast results back to dictionary and then we do the same in AggregateFinal. Please let me know if I get your idea right: in the case of AggregatePartial followed by AggregateFinal, the schema for the first one should have Utf8 as its output and the second aggregate will then use Utf8 both as its input and as output. |
Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns
Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns
Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns
TLDR I recommend we revert this change and reopen this ticket while we reconsider how to handle this case better. BackgroundThis ticket caused a functional regression for us downstream in #8738 (a query that used to run started erroring). The cause of the issue is that the However while working on that issue, I thought more about this change and I am not sure the change described in this issue is good for multi column groupings at all. RationaleDictionary Encoding on single column group keys is bad because there is no repetition in the data and therefore the Dictionary encoding is pure over head. For example, given SELECT ... GROUP BY a Dictionary Encoding is worse than native encoding:
However, the story is different when there is a multi column group key, as in that case, dictionary encoding each column can be a significant performance improvement as they are applied to each column individually and each column may have substantial redundancy. For example, given this query SELECT ... GROUP BY a,b
This could especially impact multi-phase grouping where dictionary encoding will save significant time hashing values for low cardinality string columns. In fact we think we may have seen a performance regression when picking up this change downstream as well, which could also be explained by the observation above Thus I recommend we revert this change via #8740 while we reconsider how to handle this case (maybe just for single column group by? Maybe do the dictionary encoding within the |
If anything this is precisely the case where the old behaviour was especially inefficient. It will compute the rows, expanding the dictionary, perform the grouping, convert the rows back to strings, convert the strings back to dictionaries, convert back to strings in the row format, again expanding the dictionaries, and repeat for each stage |
I want to be clear that I have no particular evidence one way or the other about the performance implications of this particular change (and I probably confused the issue with speculation in #7647 (comment)) So what I would like to do is:
|
@alamb what could be some evidence for 3.2? Is there anything in the code base or maybe in some other ticket? The plan you show makes total sense |
I filed #8791 to track adding additional test coverage, and I plan to work on that in the next few days |
Is your feature request related to a problem or challenge?
Currently grouping on a dictionary column will return dictionary-encoded group keys. Given that group keys inherently have few repeated values, especially when grouping on a single column, the use of dictionary encoding is unlikely to be yielding significant returns. Additionally following #7587 computing the dictionary is a non-trivial operation that could be eliminated
Describe the solution you'd like
When grouping on a dictionary column, e.g.
Dictionary(DataType::Int32, DataType::Utf8)
, the returned schema should be the underlying value type, i.e.DataType::Utf8
.Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: