Support "A column is known to be entirely NULL" in `PruningPredicate` #9171

alamb · 2024-02-09T02:08:35Z

Is your feature request related to a problem or challenge?

This is broken out from #7869, which describes a slightly different problem.

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

Columns that contain only NULL occur in tables that have "schema evolution" -- for example if you have two files such as

File 1: col_a
File 2: col_a, col_b (col_b was added later)

A predicate like col_a != A AND col_b='bananas' cannot be true for File 1 (as col_b is logically NULL for all rows).

This is subtly but importantly different from the case when nothing is known about the column, which, confusingly, is encoded by returning NULL from PruningStatistics::min_values().

Describe the solution you'd like

Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.
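A minimal sketch of the idea above, in plain Rust with no DataFusion types. `ColumnStats`, `is_entirely_null`, `and3`, and `can_be_true` are hypothetical names made up for illustration and are not part of the PruningStatistics API. `None` stands for an unknown count in the struct and for SQL NULL in the logic helpers.

```rust
/// Hypothetical per-container statistics for one column.
/// `None` means the count is unknown.
#[derive(Clone, Copy)]
struct ColumnStats {
    null_count: Option<u64>,
    row_count: Option<u64>,
}

/// True only when both counts are known and every row is NULL.
fn is_entirely_null(stats: ColumnStats) -> bool {
    match (stats.null_count, stats.row_count) {
        (Some(nulls), Some(rows)) => nulls == rows,
        // With a missing count nothing can be concluded.
        _ => false,
    }
}

/// SQL three-valued AND, where `None` stands for NULL.
fn and3(left: Option<bool>, right: Option<bool>) -> Option<bool> {
    match (left, right) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// A row passes a filter only when the predicate is exactly true.
fn can_be_true(result: Option<bool>) -> bool {
    result == Some(true)
}
```

With col_b: {null_count = 100, row_count = 100}, `is_entirely_null` returns true, so `col_b = 'bananas'` becomes `NULL = 'bananas'`, which is NULL. Whatever `col_a != A` evaluates to for a row, `and3(_, None)` is either false or NULL and never true, so the container can be skipped.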

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2024-02-09T02:09:23Z

cc @appletreeisyellow

appletreeisyellow · 2024-02-09T19:34:58Z

Thank you @alamb!

One question for this line:

File 2: col_b, col_B (col_b was added later)

Do you actually mean:
File 2: col_a, col_b (col_b was added later)

alamb · 2024-02-09T20:03:16Z

Do you actually mean:

Yes, I have updated the ticket. Thank you

appletreeisyellow · 2024-02-09T21:26:15Z

Thanks for verifying. I plan to work on this issue

alamb · 2024-02-11T10:54:32Z

One thought I had about this feature was to consider rewriting predicates like

y = 10

Instead of

y_min <= 10 AND y_max >=10

To something like

CASE 
  WHEN y_null_count == y_row_count THEN false
  ELSE y_min <= 10 AND y_max >=10
END

where y_row_count is a new column based on the value of PruningStatistics::row_counts

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)
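To make the CASE rewrite concrete, here is a rough sketch of how it would evaluate for a single container. `YStats` and `may_match_y_eq_10` are hypothetical names, and treating unknown min/max as "keep the container" is an assumption of this sketch rather than something stated in this ticket.

```rust
/// Hypothetical per-container statistics for column `y`.
/// `min` and `max` are `None` when unknown.
struct YStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
    row_count: u64,
}

/// Mirrors
///   CASE WHEN y_null_count == y_row_count THEN false
///        ELSE y_min <= 10 AND y_max >= 10 END
/// Returns true when the container may contain rows with `y = 10`.
fn may_match_y_eq_10(stats: &YStats) -> bool {
    // Every row is NULL, so `y = 10` can never be true here.
    if stats.null_count == stats.row_count {
        return false;
    }
    match (stats.min, stats.max) {
        (Some(min), Some(max)) => min <= 10 && max >= 10,
        // Assumption: unknown bounds mean the container cannot be ruled out.
        _ => true,
    }
}
```

The guard is what separates "entirely NULL" (prune) from "nothing is known" (keep): both may report no min/max, but only the first has null_count equal to row_count.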

alamb · 2024-02-11T10:55:28Z

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

appletreeisyellow · 2024-02-13T23:11:14Z

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)

Do you mean something like this?

For example, rewriting a complicated predicate like x < 5 AND x > 0

instead of

x_min < 5 AND 0 < x_max

To something like (a)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min < 5 AND 0 < x_max
END

I don't think we want it to be (b)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min < 5
END
AND
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_max
END
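For what it's worth, when both conjuncts refer to the same column, forms (a) and (b) should give the same answer for every container, so the choice is about size and readability rather than correctness. A small sketch that checks this exhaustively, treating the guard and the two min/max comparisons as plain booleans (an assumption; NULL results are not modelled). `form_a`, `form_b`, and `forms_agree` are hypothetical names.

```rust
/// Form (a): one guard around the whole conjunction.
fn form_a(all_null: bool, c1: bool, c2: bool) -> bool {
    if all_null { false } else { c1 && c2 }
}

/// Form (b): one guard per conjunct, combined with AND.
fn form_b(all_null: bool, c1: bool, c2: bool) -> bool {
    let left = if all_null { false } else { c1 };
    let right = if all_null { false } else { c2 };
    left && right
}

/// Compares the two forms on all eight input combinations.
fn forms_agree() -> bool {
    let bools = [false, true];
    bools.iter().all(|&n| {
        bools
            .iter()
            .all(|&a| bools.iter().all(|&b| form_a(n, a, b) == form_b(n, a, b)))
    })
}
```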

appletreeisyellow · 2024-02-13T23:18:16Z

I think the case expression can be wrapped around an OR clause as well. For example, x = 8 OR x < 0 can be rewritten into

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE (x_min <= 8 AND 8 <= x_max) OR x_min < 0
END

If we know that column x in a container is entirely NULL, the container can be skipped regardless of whether the OR clause evaluates to true or false.

appletreeisyellow · 2024-02-13T23:23:25Z

For a predicate with more than one column, we might want to have a WHEN clause for each column.

For example, rewriting a predicate like x < 5 AND x > 0 AND y = 10 into

CASE
  WHEN x_null_count = x_row_count THEN false
  WHEN y_null_count = y_row_count THEN false
  ELSE <rewrite for x < 5 AND x > 0 AND y = 10>
END

appletreeisyellow · 2024-02-13T23:28:45Z

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

I'm interested in rewriting the pruning predicate to use the range analysis! For now, I'm still writing in PhysicalExpr just to see how the code works. It would be nice to refactor into range analysis after I sort out the basic logic.

appletreeisyellow · 2024-02-14T16:54:34Z

After discussion with @alamb, we plan to do the implementation in two phases (i.e. two PRs):

Turn each sub expression into a case expression
Simplify the case expression and make it easy to read

1. Turn each sub expression into a case expression

Each sub expression will be rewritten into its own case expression, instead of wrapping the entire expression in a single case expression. Giving each sub expression its own case expression ensures that the pruning predicate rewrite logic is correct.

For example, x < 5 AND x > 0 OR y = 10

will be rewritten into

# x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_min < 5
END
AND
#  x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_max
END
OR
# y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END
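A rough sketch of why the guard is applied per sub expression: with an OR across different columns, an entirely NULL x must not cause the container to be skipped if y can still match. `ColStats`, `guarded`, and `keep_container` are hypothetical names. The comparisons use the usual min/max pruning bounds (x < 5 can match only if x_min < 5, and x > 0 only if x_max > 0), which is an assumption of this sketch.

```rust
/// Hypothetical per-container statistics for one integer column.
struct ColStats {
    min: i64,
    max: i64,
    null_count: u64,
    row_count: u64,
}

/// The CASE guard: an entirely NULL column can never satisfy a comparison,
/// otherwise fall through to the min/max check.
fn guarded(stats: &ColStats, rewritten: impl Fn(&ColStats) -> bool) -> bool {
    if stats.null_count == stats.row_count {
        false
    } else {
        rewritten(stats)
    }
}

/// Evaluates the rewrite of `x < 5 AND x > 0 OR y = 10` for one container.
/// Returns true when the container may hold matching rows and must be read.
fn keep_container(x: &ColStats, y: &ColStats) -> bool {
    let x_lt_5 = guarded(x, |s| s.min < 5);
    let x_gt_0 = guarded(x, |s| 0 < s.max);
    let y_eq_10 = guarded(y, |s| s.min <= 10 && 10 <= s.max);
    // AND binds tighter than OR, as in SQL.
    (x_lt_5 && x_gt_0) || y_eq_10
}
```

A container where x is entirely NULL but y ranges over 5..15 is still kept, because the y = 10 branch may match. A single CASE around the whole expression with a WHEN for x would have skipped it.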

2. Simplify the case expression and make it easy to read

The above example is formatted in a way that's easy to read. In the actual pruning predicate string, there are no new lines or indentation. The above example looks like this in the query explain output:

CASE WHEN x_null_count = x_row_count THEN false ELSE x_min < 5 END AND CASE WHEN x_null_count = x_row_count THEN false ELSE 0 < x_max END OR CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END

As you can see, the final pruning predicate rewrite can be long and hard to read. Therefore, we need phase 2 to improve the readability.

This will probably involve adding formatting, such as parentheses and new lines, to the expression string. I will have a better idea after the phase 1 PR is done.
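To illustrate how quickly the string grows, here is a small string-only sketch that builds a one-line predicate in the same shape as the explain output above. `wrap_case_str` and `example_predicate` are hypothetical helpers that work on plain strings; they are not the functions used in the PR, and the min/max comparisons passed in are assumed to be already rewritten.

```rust
/// Wraps an already rewritten min/max comparison in the all-NULL guard.
fn wrap_case_str(column: &str, rewritten: &str) -> String {
    format!(
        "CASE WHEN {c}_null_count = {c}_row_count THEN false ELSE {e} END",
        c = column,
        e = rewritten
    )
}

/// Builds the flat string for `x < 5 AND x > 0 OR y = 10`.
fn example_predicate() -> String {
    let x_lt_5 = wrap_case_str("x", "x_min < 5");
    let x_gt_0 = wrap_case_str("x", "0 < x_max");
    let y_eq_10 = wrap_case_str("y", "y_min <= 10 AND 10 <= y_max");
    format!("{} AND {} OR {}", x_lt_5, x_gt_0, y_eq_10)
}
```

Each leaf repeats the guard text in full, which is the readability problem phase 2 is meant to address.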

appletreeisyellow · 2024-02-26T15:50:58Z

There is a draft PR: #9223. The major feature is implemented, but there are two things I plan to do before I open it for review (see details).

I plan to continue the work on the week of March 4, 2024.

appletreeisyellow · 2024-03-14T21:32:43Z

Here is an update of the plan

1. Turn each sub expression into a case expression

Implemented in #9223

2. Simplify the case expression and make it easy to read

To be implemented in a follow-up PR:

Combine repeated CASE expressions (#9223 (comment))
By default, show only a truncated pruning predicate so users can tell that one exists, and add a config option to turn on the full display (#9223 (comment))
Consider moving wrap_case_expr into PruningExpressionBuilder (#9223 (comment))

alamb · 2024-03-22T13:30:57Z

I think this feature is done now that #9223 (comment) is merged so resolving this ticket

@appletreeisyellow if you plan additional work, let's track it in follow-on tickets

alamb added the enhancement New feature or request label Feb 9, 2024

alamb changed the title ~~Support~~ Support "A column is known to be entirely NULL" in PruningPredicate Feb 9, 2024

This was referenced Feb 9, 2024

Don't error on unknown column when pruning if predicate can still be proven false #7869

Open

Add example of using PruningPredicate to datafusion-examples #9183

Merged

Docs: Extend PruningPredicate with background and implementation info #9184

Merged

alamb mentioned this issue Feb 12, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 12, 2024 #9200

Closed

8 tasks

alamb assigned appletreeisyellow Feb 12, 2024

This was referenced Feb 13, 2024

chore(pruning): Support IS NOT NULL predicates in PruningPredicate #9208

Merged

Support "A column is known to be entirely NULL" in PruningPredicate #9223

Merged

alamb mentioned this issue Feb 14, 2024

Consolidate interval analysies from Interval and PruningPredicate #7887

Open

This was referenced Feb 14, 2024

IS NOT NULL predicate rewrite is incorrect #9230

Closed

Support IS NOT NULL predicates in PruningPredicate #9231

Closed

alamb mentioned this issue Feb 26, 2024

DataFusion weekly project plan (Andrew Lamb) - Feb 26, 2024 #9345

Closed

9 tasks

This was referenced Mar 4, 2024

DataFusion weekly project plan (Andrew Lamb) - March 4, 2024 #9453

Closed

DataFusion weekly project plan (Andrew Lamb) - March 11, 2024 #9555

Closed

alamb mentioned this issue Mar 18, 2024

DataFusion weekly project plan (Andrew Lamb) - March 18, 2024 #9675

Closed

7 tasks

alamb closed this as completed Mar 22, 2024
