-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of COUNT (distinct x) for dictionary columns #258
Comments
I've made a little PR for this. But I'm not sure about how to go about measuring the performance improvements... @alamb do you know of any existing benches in the codebase that would measure this? alternatively i guess i could add a new bench but i remember you saying you were weary about adding too many new benches 😅 |
To my knowledge, the current TPC-H and TPC-DS benchmarks do not include this scenario. While the data may be dictionary encoded in storage format, they are expanded (IIRC) to normal arrays after being read into memory. |
I agree with @waynexia that this scenario is not covered by any existing datafusion benchmarks I know of Clickbench has several queries that include count distinct (see for example #5276 (comment)) but I am not sure if the input is dictionary encoded.
However, I think with #5166 you could now create a dictionary encoded version with a command like the following (untested as I don't not to have the data downloaded -- data is here https://github.com/ClickHouse/ClickBench/tree/main#data-loading) CREATE TABLE hits_dictionary as
select
arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') as "RegionID",
"ResolutionWidth",
"UserID",
FROM hits; |
I have one question for the parquet stored data. When the arrow parquet reader read data from parquet files, I remember even in parquet the files, the data is dictionary encoded(string types with low cardinality), the arrow parquet reader will not convert it to Arrow Dictionary type. |
I believe @tustvold changed this a while ago so that the Dictionary is preserved (and then I think datafusion hydrates them) -- apache/arrow-rs#1180 |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have large amounts of low cardinality string data (for example, 200 M rows, but only 20 distinct values). DictionaryArrays are very good for such data as they are space efficient.
#256 adds basic query support for distinct dictionary columns but it is not a very computationally efficient imlementation. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set to deduplicate it again. That is a lot of extra hashing work.
Describe the solution you'd like
It would likely be much more efficient (especially for arrays that have a small number of distinct values in their dictionary) to look at the values from the dictionary directly, first checking that each entry in the dictionary was actually used.
The text was updated successfully, but these errors were encountered: