Support "A column is known to be entirely NULL" in PruningPredicate
#9171
Thank you @alamb! One question for this line:
Do you actually mean: |
Yes, I have updated the ticket. Thank you |
Thanks for verifying. I plan to work on this issue |
One thought I had about this feature was to consider writing predicates like y = 10 Instead of
To something like
CASE
WHEN y_null_count == y_row_count THEN false
ELSE y_min <= 10 AND y_max >= 10
END
where I think this would only be valid for top level expressions in the |
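The guard in the CASE expression above can be sketched in plain Rust. This is a minimal illustration and not DataFusion code: `ColumnStats` and `may_contain_eq` are hypothetical names, and the statistics are modeled as plain integers rather than Arrow arrays.

```rust
/// Hypothetical per-container statistics for one integer column `y`.
/// `min` and `max` are `None` when nothing is known about the values.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: u64,
    pub row_count: u64,
}

/// Mirrors the shape of:
///   CASE WHEN y_null_count = y_row_count THEN false
///        ELSE y_min <= v AND y_max >= v END
/// Returns `false` only when no row in the container can satisfy `y = v`.
pub fn may_contain_eq(stats: &ColumnStats, v: i64) -> bool {
    if stats.null_count == stats.row_count {
        // Every value is NULL, so `y = v` is never true for any row.
        return false;
    }
    match (stats.min, stats.max) {
        (Some(min), Some(max)) => min <= v && v <= max,
        // Unknown bounds: stay conservative and keep the container.
        _ => true,
    }
}
```

A container whose statistics report `null_count == row_count` is skipped, while a container with unknown bounds is kept, which reflects the difference between "entirely NULL" and "nothing is known".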
It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔 That would be the more elegant (but likely much more substantial) solution in my opinion |
Do you mean something like this? For example, rewriting a complicated predicate like instead of x_min < 5 AND 0 < x_max To something like (a)
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_min < 5 AND 0 < x_max
END
I don't think we want it to be (b)
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_min < 5
END
AND
CASE
WHEN x_null_count = x_row_count THEN false
ELSE 0 < x_max
END |
I think the case expression can be wrapped around the
CASE
WHEN x_null_count = x_row_count THEN false
ELSE (x_min <= 8 AND 8 <= x_max) OR x_min < 0
END
If we know that a container has column |
For a predicate with more than one column, we might want to have a WHEN clause for each column. For example, rewriting a predicate like
CASE
WHEN x_null_count = x_row_count THEN false
WHEN y_null_count = y_row_count THEN false
ELSE <rewrite for x < 5 AND x > 0 AND y = 10>
END |
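As a rough sketch of the multi-column CASE above, the following Rust models one WHEN clause per column for the conjunction x < 5 AND x > 0 AND y = 10. The `Stats` struct and `keep_container` function are hypothetical names for illustration only. The range checks `x_min < 5` and `0 < x_max` are an assumption about how the ELSE branch would be filled in: they test whether some row may match.

```rust
/// Hypothetical per-container statistics for one integer column.
#[derive(Debug, Clone, Copy)]
pub struct Stats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
    pub row_count: u64,
}

impl Stats {
    /// True when every value of the column in this container is NULL.
    pub fn all_null(&self) -> bool {
        self.null_count == self.row_count
    }
}

/// Keep-or-skip decision for `x < 5 AND x > 0 AND y = 10`.
/// Returns `false` when the container can be skipped.
pub fn keep_container(x: &Stats, y: &Stats) -> bool {
    // WHEN x_null_count = x_row_count THEN false
    if x.all_null() {
        return false;
    }
    // WHEN y_null_count = y_row_count THEN false
    if y.all_null() {
        return false;
    }
    // ELSE: range checks for each conjunct.
    x.min < 5 && 0 < x.max && y.min <= 10 && 10 <= y.max
}
```

Returning early for any all-NULL column is sound here only because every sub expression is joined with AND: one conjunct that can never be true makes the whole predicate never true.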
I'm interested in rewriting pruning predicate to use the range analysis! For now, I'm still writing in |
After discussion with @alamb, we plan to do the implementation in two phases (i.e. two PRs):
1. Turn each sub expression into a case expression
Each sub expression will be rewritten into a case expression instead of wrapping the entire expression into one case expression. Giving each sub expression its own case expression makes sure the pruning predicate rewrite logic is correct. For example, will be rewritten into
# x < 5
CASE
WHEN x_null_count = x_row_count THEN false
ELSE x_min < 5
END
AND
# x > 0
CASE
WHEN x_null_count = x_row_count THEN false
ELSE 0 < x_max
END
OR
# y = 10
CASE
WHEN y_null_count = y_row_count THEN false
ELSE y_min <= 10 AND 10 <= y_max
END
2. Simplify the case expression and make it easy to read
The above example is formatted in a way that's easy to read. In the actual pruning predicate string, there are no newlines or indentation. The above example looks like this in the query explain:
As you can see, the final pruning predicate rewrite can be long and hard to read. Therefore, we need phase 2 to improve the readability. Probably add format, like |
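One way to see why each sub expression gets its own guard is to compare it with a single guard over all columns once an OR is involved. The sketch below is illustrative only: `Stats`, `guarded`, `keep_per_sub_expression` and `keep_single_guard` are hypothetical names, and the range checks (`x_min < 5`, `0 < x_max`, `y_min <= 10 AND 10 <= y_max`) are assumed "some row may match" conditions for the sub expressions x < 5, x > 0 and y = 10.

```rust
/// Hypothetical per-container statistics for one integer column.
#[derive(Debug, Clone, Copy)]
pub struct Stats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
    pub row_count: u64,
}

/// CASE WHEN null_count = row_count THEN false ELSE <range_check> END
fn guarded(stats: &Stats, range_check: bool) -> bool {
    if stats.null_count == stats.row_count {
        false
    } else {
        range_check
    }
}

/// One guard per sub expression: (x < 5 AND x > 0) OR y = 10.
pub fn keep_per_sub_expression(x: &Stats, y: &Stats) -> bool {
    (guarded(x, x.min < 5) && guarded(x, 0 < x.max))
        || guarded(y, y.min <= 10 && 10 <= y.max)
}

/// A single guard wrapped around the whole expression.
/// This is too aggressive when the predicate contains OR.
pub fn keep_single_guard(x: &Stats, y: &Stats) -> bool {
    if x.null_count == x.row_count || y.null_count == y.row_count {
        return false;
    }
    (x.min < 5 && 0 < x.max) || (y.min <= 10 && 10 <= y.max)
}
```

For a container where `x` is entirely NULL but `y` ranges over 5..=15, `keep_per_sub_expression` keeps the container because the `y = 10` branch may still match, while `keep_single_guard` would skip it and could drop matching rows.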
Here is an update of the plan
1. Turn each sub expression into a case expression
Implemented in #9223
2. Simplify the case expression and make it easy to read
To be implemented in a follow-up PR:
|
I think this feature is done now that #9223 (comment) is merged, so I am resolving this ticket. @appletreeisyellow, if you plan additional work, let's track it in follow-on tickets
Is your feature request related to a problem or challenge?
This is broken out from #7869 which is describing a slightly different problem
PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via PruningStatistics::null_counts()).
Columns that contain only NULL occur in tables that have "schema evolution". For example, if you have two files such as
File 1: col_a
File 2: col_a, col_b (col_b was added later)
A predicate like col_a != A AND col_b='bananas' can not be true for File 1 (as col_b is logically NULL for all rows)
This is subtly, but importantly, different from the case when nothing is known about the column, which confusingly is encoded by returning NULL from PruningStatistics::min_values()
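The reason the predicate can never be true for File 1 is SQL's three-valued logic: a comparison with NULL yields NULL, and a filter keeps a row only when the predicate is exactly true. A minimal sketch in plain Rust, where `SqlBool`, `sql_and`, `sql_eq`, `sql_ne` and `row_passes` are hypothetical helpers (not DataFusion APIs) and `A` is assumed to be the string literal 'A':

```rust
/// SQL boolean: `Some(true)`, `Some(false)`, or `None` for NULL.
pub type SqlBool = Option<bool>;

/// Three-valued AND: false dominates, otherwise NULL is contagious.
pub fn sql_and(a: SqlBool, b: SqlBool) -> SqlBool {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// `a = b`, which is NULL when either side is NULL.
pub fn sql_eq(a: Option<&str>, b: Option<&str>) -> SqlBool {
    match (a, b) {
        (Some(a), Some(b)) => Some(a == b),
        _ => None,
    }
}

/// `a != b`, which is NULL when either side is NULL.
pub fn sql_ne(a: Option<&str>, b: Option<&str>) -> SqlBool {
    sql_eq(a, b).map(|equal| !equal)
}

/// Evaluates `col_a != 'A' AND col_b = 'bananas'` for one row.
/// A row passes the filter only when the result is exactly TRUE.
pub fn row_passes(col_a: Option<&str>, col_b: Option<&str>) -> bool {
    let predicate = sql_and(sql_ne(col_a, Some("A")), sql_eq(col_b, Some("bananas")));
    predicate == Some(true)
}
```

With `col_b` passed as `None` (as it would be for every row of File 1), `row_passes` returns false regardless of `col_a`, because the conjunction evaluates to either false or NULL.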
Describe the solution you'd like
1. Add PruningStatistics::row_counts() to get the total row counts in each container.
2. Use PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
3. Replace references to columns that are entirely NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)
For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,
1. PruningStatistics would return col_b: {null_count = 100, row_count = 100}
2. PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
3. The rewritten predicate would no longer refer to col_b and thus could be proven to be not true.
Describe alternatives you've considered
No response
Additional context
No response