replace CASE expressions in predicate pruning with boolean algebra #13795
Conversation
cc @appletreeisyellow @alamb to understand if there was a reason to make these CASE statements instead of boolean algebra that I am not picking up on

Things that would need to be done before merging this:

@adriangb can you please expand the PR description explaining or exemplifying what's replaced with what?
Thanks @adriangb -- this change makes lots of sense to me
I think the reason @appletreeisyellow and I used `CASE` before was to avoid introducing a null into the `PruningPredicate`. The pruning predicate should always evaluate to non-null.

Outputs: A (non-null) boolean value for each container:
- true: There MAY be rows that match the predicate
- false: There are no rows that could possibly match the predicate (the predicate can never possibly be true). The container can be pruned (skipped) entirely.
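As a sketch of that contract (a hypothetical helper, not DataFusion's actual API), the container decision looks like this; note that for the rewritten expression discussed below, a `NULL` result must also keep the container:

```rust
/// Hypothetical helper (not DataFusion's API) capturing the container
/// decision: only a definite `false` allows pruning. `true` keeps the
/// container, and a NULL result from the rewritten expression (modeled
/// here as `None`) must also keep it, since rows MAY still match.
fn keep_container(result: Option<bool>) -> bool {
    result != Some(false)
}

fn main() {
    assert!(keep_container(Some(true)));   // rows MAY match: keep
    assert!(keep_container(None));         // unknown: must keep
    assert!(!keep_container(Some(false))); // provably no matches: prune
}
```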
To compare the `CASE` construct with `NOT`, you get:
| `null_count = row_count` | `CASE ...` | `NOT null_count = row_count ...` |
|---|---|---|
| true | false | false |
| false | `<rest of expr>` | `<rest of expr>` |
| null | `<rest of expr>` | null |
If you use `CASE ...` and there is no row count information, `null_count = row_count` is null and the expression simplifies to whatever the rest of the expression evaluates to.

I think if you use `NOT` and there is no row count information, the expression simplifies to `null AND <rest>`. But `null AND <rest>` is not always non-null:
```sql
postgres=# select null AND true;
 ?column?
----------

(1 row)

postgres=# select null AND false;
 ?column?
----------
 f
(1 row)

postgres=# select null AND null;
 ?column?
----------

(1 row)
```
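To make the three-valued logic above concrete, here is a small illustrative sketch (not DataFusion code) of SQL's Kleene `AND` using Rust's `Option<bool>`, with `None` standing in for SQL `NULL`:

```rust
/// Illustrative sketch (not DataFusion's implementation) of SQL's
/// three-valued "Kleene" AND, with `None` standing in for SQL NULL.
fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // A definite false on either side dominates, even if the other
        // side is NULL: `null AND false` is `false`.
        (Some(false), _) | (_, Some(false)) => Some(false),
        // Both sides are definitely true.
        (Some(true), Some(true)) => Some(true),
        // Otherwise at least one side is NULL and the truth is unknown.
        _ => None,
    }
}

fn main() {
    // Mirrors the psql session above.
    assert_eq!(kleene_and(None, Some(true)), None);         // null AND true  -> null
    assert_eq!(kleene_and(None, Some(false)), Some(false)); // null AND false -> false
    assert_eq!(kleene_and(None, None), None);               // null AND null  -> null
}
```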
Suggestions

- It is concerning that there is no test case that fails in this case (so maybe my analysis is wrong or something else is missing). We should try to add some coverage.
- Can you use `coalesce` instead of `case` to substitute in the `true` (might be matching rows)? For example `COALESCE(NOT null_count = row_count, true)`
I am concerned that the purported benefit may be a result of overly aggressive pruning.
For some backstory (that I happen to know): I believe @adriangb's use case is to use the PruningPredicate rewriter to rewrite expressions into range analysis expressions that are then pushed into Postgres for execution (aka he is effectively executing the rewritten predicate expression on a Postgres database that stores the min/maxes). I agree it would be useful to understand why case is causing a problem.
Thank you for filling in the backstory Andrew! Yes, that's right: I basically have the statistics in a Postgres table and query them with something like:

```sql
select file_id, row_group, <some other metadata>
from stats
where <pruning predicate> is not false
```

Note that I add `is not false` so that a `null` result keeps the container. The point is that even for 100k row groups / several thousand files it's very easy for Postgres to scan this table and return the matching row groups (easy as in <1s), while it would be a lot of work to do a LIST on a bucket, download footers, decode, get stats, etc. And this is better than storing the data in Postgres, copying it to the application and doing the pruning there, because it saves moving the data over the wire.

Now, about the `CASE` expressions. Consider:

```sql
postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# insert into stats values (1, 0, null, null);
INSERT 0 1
postgres=# select case when row_count = null_count then false else min > 1 end from stats;
 case
------

(1 row)
```

Maybe this can never happen with Parquet stats, but it certainly can happen with my setup: you add a new column and all "old" rows for files that don't have that column are null.

datafusion/datafusion/physical-optimizer/src/pruning.rs Lines 727 to 730 in f4e65d2
Regarding Postgres' ability to handle `CASE` in a filter:

```sql
postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# INSERT INTO stats (row_count, null_count, min, max)
SELECT
  1,
  0,
  g.id as min,
  g.id + 1 as max
FROM generate_series(0, 100000000) g(id);
INSERT 0 100000001
postgres=# create index on stats (min);
CREATE INDEX
postgres=# explain select * from stats where case when row_count = null_count then false else min > 599 end;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on stats  (cost=10000000000.00..10002040541.00 rows=50000000 width=16)
   Filter: CASE WHEN (row_count = null_count) THEN false ELSE (min > 599) END
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(5 rows)

postgres=# explain select * from stats where not row_count = null_count and min > 599;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on stats  (cost=623884.23..2603804.49 rows=33166667 width=16)
   Recheck Cond: (min > 599)
   Filter: (row_count <> null_count)
   ->  Bitmap Index Scan on stats_min_idx  (cost=0.00..615592.56 rows=33333333 width=0)
         Index Cond: (min > 599)
```

Two other general notes:
That should work:

```sql
postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# INSERT INTO stats (row_count, null_count, min, max)
SELECT
  1,
  0,
  g.id as min,
  g.id + 1 as max
FROM generate_series(0, 100000000) g(id);
INSERT 0 100000001
postgres=# create index on stats (coalesce(not row_count = null_count, true), min);
CREATE INDEX
postgres=# explain select * from stats where coalesce(not row_count = null_count, true) and min > 599;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on stats  (cost=353633.90..2318398.97 rows=16666667 width=16)
   Recheck Cond: (COALESCE((row_count <> null_count), true) AND (min > 599))
   ->  Bitmap Index Scan on stats_coalesce_min_idx  (cost=0.00..349467.24 rows=16666667 width=0)
         Index Cond: ((COALESCE((row_count <> null_count), true) = true) AND (min > 599))
```

It may be slower than without the coalesce but it's certainly better than not being able to have an index at all.
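The `COALESCE(NOT null_count = row_count, true)` guard can be sketched like this (a hypothetical helper, not DataFusion's API): any unknown count falls back to `true`, i.e. "the container MAY have matching rows".

```rust
/// Hypothetical sketch of `COALESCE(NOT null_count = row_count, true)`:
/// if either count is unknown (NULL), COALESCE substitutes `true`,
/// so the container cannot be pruned on the all-null check alone.
fn not_all_null_guard(null_count: Option<u64>, row_count: Option<u64>) -> bool {
    match (null_count, row_count) {
        // Both counts known: the guard holds unless every row is null.
        (Some(n), Some(r)) => n != r,
        // Any unknown count falls back to `true` (cannot prove all-null).
        _ => true,
    }
}

fn main() {
    assert!(!not_all_null_guard(Some(10), Some(10))); // all rows null: can prune
    assert!(not_all_null_guard(Some(0), Some(10)));   // non-null rows present
    assert!(not_all_null_guard(None, Some(10)));      // unknown: must keep
}
```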
I thought more about this in the 🚿 and double-checked the logic. TL;DR: I think using boolean algebra works because of

datafusion/datafusion/physical-optimizer/src/pruning.rs Lines 723 to 726 in e665115

So the TL;DR is that upon more thought I think the theory behind this PR is sound ✅. I need to review the code more carefully and ensure we have a test that has an unknown (unspecified) column count but where the value can be proven true by other min/max ranges, but otherwise it should be good to go.
```rust
// If the null counts themselves are missing we should be able to fall back to the stats
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
let container_stats = ContainerStats {
    min: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    max: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    null_counts: Some(Arc::new(UInt64Array::from(vec![None]))),
    row_counts: Some(Arc::new(UInt64Array::from(vec![Some(1)]))),
    ..ContainerStats::default()
};
let statistics = TestStatistics::new().with("i", container_stats);
let expected_ret = &[true];
prune_with_expr(col("i").eq(lit(0)), &schema, &statistics, expected_ret);
let expected_ret = &[false];
prune_with_expr(col("i").gt(lit(0)), &schema, &statistics, expected_ret);

// Same for the row counts
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
let container_stats = ContainerStats {
    min: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    max: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    null_counts: Some(Arc::new(UInt64Array::from(vec![Some(1)]))),
    row_counts: Some(Arc::new(UInt64Array::from(vec![None]))),
    ..ContainerStats::default()
};
let statistics = TestStatistics::new().with("i", container_stats);
let expected_ret = &[true];
prune_with_expr(col("i").eq(lit(0)), &schema, &statistics, expected_ret);
let expected_ret = &[false];
prune_with_expr(col("i").gt(lit(0)), &schema, &statistics, expected_ret);
```
@alamb see test case here for #13795 (comment)
Can we go one step further and use `!=` instead? (The only test failures I see from making this change are the ones that assert on the generated SQL, so I think it works, but I wanted to ask before I spend 15 minutes updating all of the strings.)
AFAICT those two are equivalent unless the operands are IEEE 754 floating point numbers. Are they?
In this case we know that they're always integers :)
Okay, I've swapped to use `!=`. I also updated all of the comments and docstrings.
```rust
/// While `PruningPredicate` will never return a `NULL` value, the
/// rewritten predicate (as returned by `build_predicate_expression` and used internally
/// by `PruningPredicate`) may evaluate to `NULL` when some of the min/max values
/// or null / row counts are not known.
```
Because of

datafusion/datafusion/physical-optimizer/src/pruning.rs Lines 727 to 730 in f4e65d2

```rust
_ => {
    // Null or true means the rows in container may pass this
    // conjunct so we can't prune any containers based on that
}
```
This has always been true and is also clarified in the same docstring lower down, I just wanted to add it here again since it's caused confusion in the past (even for @alamb!):

datafusion/datafusion/physical-optimizer/src/pruning.rs Lines 300 to 316 in f4e65d2

```rust
/// Then, the `PruningPredicate` rewrites the original predicate into an
/// expression that references the min/max values of each column in the original
/// predicate.
///
/// When the min/max values are actually substituted in to this expression and
/// evaluated, the result means
///
/// * `true`: there MAY be rows that pass the predicate, **KEEPS** the container
///
/// * `NULL`: there MAY be rows that pass the predicate, **KEEPS** the container
///   Note that rewritten predicate can evaluate to NULL when some of
///   the min/max values are not known. *Note that this is different than
///   the SQL filter semantics where `NULL` means the row is filtered
///   out.*
///
/// * `false`: there are no rows that could possibly match the predicate,
///   **PRUNES** the container
```
The difference now is that if the null or row count is null we will also return null in the case where we can't use the min/max stats to prove that the file can be pruned.
Thank you. Yes I find reasoning about tri-state boolean logic (and what can be proven about what) confusing!
Here's a breakdown of how the clauses evaluate before and after this PR.

Before

After

The only change is for the case where the null or row count is unknown.
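That divergence can be sketched as follows (illustrative helpers, not DataFusion code): when the counts are unknown, the `CASE` form falls through to the rest of the expression while the boolean form yields `NULL` — but both results still keep the container, since only a definite `false` prunes.

```rust
/// Illustrative sketches (not DataFusion code) of the two rewrites,
/// applied when `null_count = row_count` may be NULL (`None`).
fn case_form(counts_equal: Option<bool>, rest: Option<bool>) -> Option<bool> {
    // CASE WHEN counts_equal THEN false ELSE rest END:
    // a NULL condition is "not true", so it falls through to ELSE.
    if counts_equal == Some(true) { Some(false) } else { rest }
}

fn bool_form(counts_equal: Option<bool>, rest: Option<bool>) -> Option<bool> {
    // NOT counts_equal AND rest, with SQL (Kleene) AND semantics.
    let not_eq = counts_equal.map(|b| !b);
    match (not_eq, rest) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

fn main() {
    // Counts unknown, rest of the expression true: the CASE form yields
    // true, the boolean form yields NULL, but both KEEP the container.
    assert_eq!(case_form(None, Some(true)), Some(true));
    assert_eq!(bool_form(None, Some(true)), None);
    // When the rest is provably false, both forms prune.
    assert_eq!(case_form(None, Some(false)), Some(false));
    assert_eq!(bool_form(None, Some(false)), Some(false));
}
```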
I wanted to check that the assumption that falling back on min/max stats when row / null counts are missing/unknown is safe even when all rows are null and we don't know that all of the rows are null. I was afraid that writers might populate the statistics with a default value (e.g. `0`).
I believe both the previous implementation and this implementation would decide to exclude the container for the predicate, rewritten as:

```sql
CASE WHEN null = null THEN false ELSE col_min > 1 END
```

```sql
null != null AND col_min > 1
```

Both of these cases return `false`. I tested using our writer:

```rust
use std::sync::Arc;
use parquet::arrow::ArrowWriter;

#[tokio::main]
async fn main() {
    let array = arrow::array::Int32Array::from(vec![None]);
    let array = Arc::new(array);
    let schema = arrow::datatypes::Schema::new(vec![arrow::datatypes::Field::new(
        "a",
        arrow::datatypes::DataType::Int32,
        true,
    )]);
    let schema = Arc::new(schema);
    let batch = arrow::record_batch::RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
    let mut file = std::fs::File::create("test.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(&mut file, schema.clone(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
```
It does what I think is correct and sets min / max to `null`. So I think this is not a realistic scenario to hit, even if it theoretically could cause issues.
If all values are null, then all non-null values are in the [0, 0] range, so min=0 and max=0 would be valid.
So is that assumption incorrect? Then it's a bug in the current implementation.
Thinking about this again you're right. The worst that can happen is that we return `null` and keep a container that could have been pruned.
If we wanted to eliminate this change from this PR we can add a condition
Turns out not that small of a step: there are cases that previously pruned the row that no longer would, in particular when one statistic (e.g. the null count) is missing.
I agree with this assessment. Thank you for the analysis @adriangb
Thank you @adriangb -- I went through this PR carefully and I think it is a very nice improvement. The expressions are simpler and will be faster to evaluate 👌
Thanks again for all the careful review work
```rust
/// * `null = null` is `null` which is not true, so the `CASE` expression will use the `ELSE` clause
/// * `1 <= 5 AND 5 <= 100 AND 4 <= 10 AND 10 <= 7`
/// * `true AND true AND true AND false`
/// * `null != null AND (1 <= 5 AND 5 <= 100) AND null != null AND (4 <= 10 AND 10 <= 7)`
```
I double-checked this and it looks good to me
Thank you also @findepi for your reviews. I'll plan to merge this tomorrow to allow time for other comments, unless we find issues or others would like time to review.
Co-authored-by: Andrew Lamb <[email protected]>
Thank you @adriangb for the improvement! 🚀
I also double-checked the logic, and it works as expected. The rewritten pruning predicate is now shorter and easier to read with your change 👏 I also tested this PR against InfluxData IOx code and all the tests pass!
Co-authored-by: Chunchun Ye <[email protected]>
I plan to merge this tomorrow morning unless I hear otherwise
Thanks again @adriangb and @appletreeisyellow
Boolean algebra is simpler to evaluate, and if the metadata is being stored in something that supports indexes, e.g. Postgres, then this new version can be pushed down to the index, while the `CASE` expressions cannot.

This replaces the produced predicate pruning expressions, which will change from something along the lines of:

To: