
replace CASE expressions in predicate pruning with boolean algebra #13795

Merged

merged 15 commits into apache:main on Dec 20, 2024

Conversation


@adriangb adriangb commented Dec 16, 2024

Boolean algebra is simpler to evaluate, and if the metadata is stored in something that supports indexes (e.g. Postgres), the new version can be pushed down to an index while the CASE expressions cannot.

This changes the produced predicate pruning expressions from something along the lines of:

case when row_count = null_count then false else col_min > 1 end

To:

row_count != null_count and col_min > 1

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 16, 2024
@adriangb adriangb marked this pull request as ready for review December 16, 2024 16:47
@adriangb (author):

cc @appletreeisyellow @alamb to understand if there was a reason to make these CASE statements instead of boolean algebra that I am not picking up on

@adriangb (author):

Things that would need to be done before merging this:

  • Replace a lot of comments referencing CASE
  • Reformat the test assertions that are now very long lines (manual work 😢 )


findepi commented Dec 17, 2024

@adriangb can you please expand the PR description, explaining or exemplifying what's replaced with what?

@alamb left a comment:

Thanks @adriangb -- this change makes lots of sense to me

I think the reason @appletreeisyellow and I used CASE before was to avoid introducing a null into the PruningPredicate.

According to
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#pruningpredicate-implementation

The Pruning predicate should always evaluate to non-null

Outputs: A (non null) boolean value for each container:

  • true: There MAY be rows that match the predicate
  • false: There are no rows that could possibly match the predicate (the predicate can never possibly be true). The container can be pruned (skipped) entirely.

To compare the `CASE` construct with `NOT` you get:

| `null_count = row_count` | `CASE ...` | `NOT null_count = row_count ...` |
| --- | --- | --- |
| true | false | false |
| false | `<rest of expr>` | `<rest of expr>` |
| null | `<rest of expr>` | null |

If you use CASE .. and there is no row count information, null_count = row_count is null and the expression simplifies to whatever the rest of the expression evaluates to.

I think if you use NOT and there is no row count information, the expression simplifies to null AND <rest>

But `null AND <rest>` behaves like this:

postgres=# select null AND true;
 ?column?
----------

(1 row)

postgres=# select null AND false;
 ?column?
----------
 f
(1 row)

postgres=# select null AND null;
 ?column?
----------

(1 row)
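For reference, the three-valued AND behavior shown in the psql session above can be modeled with `Option<bool>` (using `None` for SQL NULL). This is a sketch of Kleene logic, not DataFusion's actual evaluation kernel:

```rust
/// SQL (Kleene) three-valued AND, with None modeling NULL.
/// `NULL AND false` is false because the result can never be true,
/// while `NULL AND true` stays NULL (unknown).
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None, // at least one NULL and no definite false
    }
}

fn main() {
    // Mirrors the three psql queries above.
    assert_eq!(and3(None, Some(true)), None);
    assert_eq!(and3(None, Some(false)), Some(false));
    assert_eq!(and3(None, None), None);
}
```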

Suggestions

  1. It is concerning that there is no test case that fails in this case (so maybe my analysis is wrong or something else is missing). We should try to add some coverage
  2. Can you use coalesce instead of case to substitute in the true (might be matching rows)? For example:
COALESCE(NOT null_count = row_count, true)
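Suggestion 2 can be sketched the same way: `COALESCE` replaces a NULL comparison with `true`, so unknown counts keep the container. A minimal model with a hypothetical helper name, not DataFusion API:

```rust
/// Models COALESCE(NOT null_count = row_count, true), where `eq` is the
/// (possibly NULL, i.e. None) result of `null_count = row_count`.
fn coalesce_not(eq: Option<bool>) -> bool {
    eq.map(|v| !v).unwrap_or(true)
}

fn main() {
    assert!(coalesce_not(None));         // counts unknown: keep the container
    assert!(coalesce_not(Some(false)));  // counts differ: keep
    assert!(!coalesce_not(Some(true)));  // all rows null: this conjunct prunes
}
```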


findepi commented Dec 17, 2024

I am concerned that the purported benefit

something that supports indexes e.g. Postgres then this new version can be pushed down to the index while the CASE expressions cannot.

may be a result of overly aggressive pruning.
I would assume PostgreSQL can handle and optimize CASE expressions the same way or better than we can (but I may be wrong), so CASE vs coalesce shouldn't make a difference.


alamb commented Dec 17, 2024

I am concerned that the purported benefit

something that supports indexes e.g. Postgres then this new version can be pushed down to the index while the CASE expressions cannot.

may be a result of overly aggressive pruning. I would assume PostgreSQL can handle and optimize CASE expressions the same way or better than we can (but I may be wrong), so CASE vs coalesce shouldn't make a difference.

For some backstory (that I happen to know): I believe @adriangb's use case is to use the PruningPredicate rewriter to rewrite expressions into range analysis expressions that are then pushed into Postgres for execution (aka he is effectively executing the rewritten predicate expression on a Postgres database that stores the min/maxes)

I agree it would be useful to understand why case is causing a problem


adriangb commented Dec 17, 2024

Thank you for filling in the backstory Andrew!

Yes, that's right: I basically have the statistics in a Postgres table (with file_id and row_group columns), so I end up running a query along the lines of:

select file_id, row_group, <some other metadata>
from stats
where <pruning predicate> is not false

Note that I add `is not false`; I will come back to that later.

The point is that even for 100k row groups / several thousand files it's very easy for Postgres to scan this table and return the matching row groups (easy as in <1s) while it would be a lot of work to do a LIST on a bucket, download footers, decode, get stats, etc. And this is better than storing the data in postgres, copying it to the application and doing the pruning there because it saves moving the data over the wire.
I want to make it fast even for millions of row groups, which requires an index on some of the stats columns (for us in particular temporal or partitioning columns make sense, so something like start_timestamp_min gets an index because pretty much every query filters by it).

Now about the `is not false`: I found that PruningPredicate can return null values if `_max` or `_min` are null.

Consider:

postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# insert into stats values (1, 0, null, null);
INSERT 0 1
postgres=# select case when row_count = null_count then false else min > 1 end from stats;
 case
------

(1 row)

Maybe this can never happen with Parquet stats, but it certainly can happen with my setup: you add a new column and all "old" rows for files that don't have that column are null.
So I don't think using a CASE statement is enough to guarantee that the predicate can never return null. There must be some further guarantees about Parquet metadata specifically e.g. that the min/max stats can only be null if all of the rows are null or something.
In fact, I think there is code to handle that here:

_ => {
    // Null or true means the rows in container may pass this
    // conjunct so we can't prune any containers based on that
}
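In other words, only a definite `false` prunes a container; NULL and true both keep it. A simplified model of that interpretation (not the actual `PruningPredicate` code):

```rust
/// Simplified model of how the rewritten predicate's per-container result
/// is interpreted: NULL and true both mean "rows MAY match", so keep;
/// only a proven false prunes.
fn keep_container(result: Option<bool>) -> bool {
    !matches!(result, Some(false))
}

fn main() {
    assert!(keep_container(Some(true)));   // may match: keep
    assert!(keep_container(None));         // unknown stats: keep, never prune
    assert!(!keep_container(Some(false))); // proven no match: prune
}
```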

Regarding Postgres' ability to handle CASE, I was a bit surprised as well but indeed it does not use an index with CASE expressions:

postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# INSERT INTO stats (row_count, null_count, min, max)
SELECT 
    1,
    0,
    g.id as min,
    g.id + 1 as max
FROM generate_series(0, 100000000) g(id);
INSERT 0 100000001
postgres=#
postgres=# create index on stats (min);
CREATE INDEX
postgres=# explain select * from stats where case when row_count = null_count then false else min > 599 end;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on stats  (cost=10000000000.00..10002040541.00 rows=50000000 width=16)
   Filter: CASE WHEN (row_count = null_count) THEN false ELSE (min > 599) END
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(5 rows)

postgres=# explain select * from stats where not row_count = null_count and min > 599;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on stats  (cost=623884.23..2603804.49 rows=33166667 width=16)
   Recheck Cond: (min > 599)
   Filter: (row_count <> null_count)
   ->  Bitmap Index Scan on stats_min_idx  (cost=0.00..615592.56 rows=33333333 width=0)
         Index Cond: (min > 599)

Two other general notes:

  • It would be nice if PruningPredicate could take into account which columns are non-nullable and skip the row_count = null_count condition for those (it will always evaluate to false).
  • I would have flipped the boolean logic so that if the predicate returns false(y) we keep the container. That way null and false mean keep and only true means prune.


adriangb commented Dec 17, 2024

2. Can you use coalesce instead of case to substitute in the true (might be matching rows)?

That should work:

postgres=# create table stats (row_count int, null_count int, min int, max int);
CREATE TABLE
postgres=# INSERT INTO stats (row_count, null_count, min, max)
SELECT 
    1,
    0,
    g.id as min,
    g.id + 1 as max
FROM generate_series(0, 100000000) g(id);
INSERT 0 100000001
postgres=# create index on stats (coalesce(not row_count = null_count, true), min);
CREATE INDEX
postgres=# explain select * from stats where coalesce(not row_count = null_count, true) and min > 599;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on stats  (cost=353633.90..2318398.97 rows=16666667 width=16)
   Recheck Cond: (COALESCE((row_count <> null_count), true) AND (min > 599))
   ->  Bitmap Index Scan on stats_coalesce_min_idx  (cost=0.00..349467.24 rows=16666667 width=0)
         Index Cond: ((COALESCE((row_count <> null_count), true) = true) AND (min > 599))

It may be slower than without the coalesce but it's certainly better than not being able to have an index at all.


alamb commented Dec 17, 2024

I thought more about this in the 🚿 and double-checked the logic:

TL;DR: I think using NOT (as in this PR) is OK. The rationale is that if evaluating `column_count = null_count` is null, it means nothing is known about the null counts. However, since `null AND ...` will still resolve to false if the `...` is false (aka we can prove the predicate is not true by other means), the requirements of the pruning predicate will be satisfied.

ColumnarValue::Scalar(ScalarValue::Boolean(Some(false))) => {
    // False means all containers can not pass the predicate
    self.inner = vec![false; self.inner.len()];
}

So, TL;DR: upon more thought I think the theory behind this PR is sound ✅

I need to review the code more carefully and ensure we have a test that has an unknown (unspecified) null count but where the outcome can still be proven by other min/max ranges, but otherwise it should be good to go.

Comment on lines +2085 to +2109
// If the null counts themselves are missing we should be able to fall back to the stats
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
let container_stats = ContainerStats {
    min: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    max: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    null_counts: Some(Arc::new(UInt64Array::from(vec![None]))),
    row_counts: Some(Arc::new(UInt64Array::from(vec![Some(1)]))),
    ..ContainerStats::default()
};
let statistics = TestStatistics::new().with("i", container_stats);
let expected_ret = &[true];
prune_with_expr(col("i").eq(lit(0)), &schema, &statistics, expected_ret);
let expected_ret = &[false];
prune_with_expr(col("i").gt(lit(0)), &schema, &statistics, expected_ret);

// Same for the row counts
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
let container_stats = ContainerStats {
    min: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    max: Some(Arc::new(Int32Array::from(vec![Some(0)]))),
    null_counts: Some(Arc::new(UInt64Array::from(vec![Some(1)]))),
    row_counts: Some(Arc::new(UInt64Array::from(vec![None]))),
    ..ContainerStats::default()
};
let statistics = TestStatistics::new().with("i", container_stats);
let expected_ret = &[true];
prune_with_expr(col("i").eq(lit(0)), &schema, &statistics, expected_ret);
let expected_ret = &[false];
prune_with_expr(col("i").gt(lit(0)), &schema, &statistics, expected_ret);
@adriangb (author):

@alamb see test case here for #13795 (comment)


adriangb commented Dec 17, 2024

Can we go one step further and instead of not row_count = null_count use row_count != null_count?

(the only test failures I see from making this change are the ones that assert on the generated sql so I think it works but I wanted to ask before I spend 15 minutes updating all of the strings)


findepi commented Dec 17, 2024

not row_count = null_count use row_count != null_count?

AFAICT those two are equivalent unless operands are IEEE 754 floating point numbers. are they?

@adriangb (author):

not row_count = null_count use row_count != null_count?

AFAICT those two are equivalent unless operands are IEEE 754 floating point numbers. are they?

In this case we know that they're always integers :)

@adriangb (author):

Okay I've swapped to use row_count != null_count.

I also updated all of the comments and docstrings.

Comment on lines +290 to +293
/// While `PruningPredicate` will never return a `NULL` value, the
/// rewritten predicate (as returned by `build_predicate_expression` and used internally
/// by `PruningPredicate`) may evaluate to `NULL` when some of the min/max values
/// or null / row counts are not known.
@adriangb (author):

Because of

_ => {
    // Null or true means the rows in container may pass this
    // conjunct so we can't prune any containers based on that
}

@adriangb (author):

This has always been true and is also clarified lower down in the same docstring; I just wanted to add it here again since it has caused confusion in the past (even for @alamb!):

/// Then, the `PruningPredicate` rewrites the original predicate into an
/// expression that references the min/max values of each column in the original
/// predicate.
///
/// When the min/max values are actually substituted in to this expression and
/// evaluated, the result means
///
/// * `true`: there MAY be rows that pass the predicate, **KEEPS** the container
///
/// * `NULL`: there MAY be rows that pass the predicate, **KEEPS** the container
/// Note that rewritten predicate can evaluate to NULL when some of
/// the min/max values are not known. *Note that this is different than
/// the SQL filter semantics where `NULL` means the row is filtered
/// out.*
///
/// * `false`: there are no rows that could possibly match the predicate,
/// **PRUNES** the container

The difference now is that if the null or row count is null we will also return null in the case where we can't use the min/max stats to prove that the file can be pruned.

@alamb:

Thank you. Yes I find reasoning about tri-state boolean logic (and what can be proven about what) confusing!

@adriangb (author):

Here's a breakdown of how the clauses evaluate before and after this PR.

Before

| `row_count = null_count` | predicate | outcome |
| --- | --- | --- |
| null | null | null |
| null | true | true |
| null | false | false |
| true | null | false |
| true | true | false |
| true | false | false |
| false | null | null |
| false | true | true |
| false | false | false |

After

| `row_count = null_count` | `row_count != null_count` | predicate | outcome |
| --- | --- | --- | --- |
| null | null | null | null |
| null | null | true | null |
| null | null | false | false |
| true | false | null | false |
| true | false | true | false |
| true | false | false | false |
| false | true | null | null |
| false | true | true | true |
| false | true | false | false |

The only change is for the case where row_count = null_count is null and the predicate is true: previously true would be returned but now null is returned. Since we interpret null as true the end result is the same.
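The equivalence claimed above can be checked mechanically. A sketch that enumerates all nine combinations under SQL three-valued logic, with `None` modeling NULL (assumed helper functions, not DataFusion code):

```rust
/// SQL (Kleene) three-valued AND over Option<bool> (None = NULL).
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// CASE WHEN cond THEN false ELSE pred END; a NULL condition is not true,
/// so it falls through to the ELSE branch.
fn case_when(cond: Option<bool>, pred: Option<bool>) -> Option<bool> {
    match cond {
        Some(true) => Some(false),
        _ => pred,
    }
}

fn main() {
    let vals = [None, Some(true), Some(false)];
    for &eq in &vals {
        for &pred in &vals {
            let before = case_when(eq, pred); // old CASE form
            let after = and3(eq.map(|v| !v), pred); // new NOT ... AND form
            // Since NULL and true both mean "keep", the two forms prune
            // exactly the same containers in all nine cases:
            assert_eq!(before == Some(false), after == Some(false));
        }
    }
}
```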


adriangb commented Dec 17, 2024

I wanted to check that the assumption that falling back on min/max stats when row / null counts are missing or unknown is safe, even when all rows are null and we don't know that all of the rows are null.

I was afraid that writers might populate the statistics with a default value (e.g. 0) if all of the rows are null instead of null.
And that some other pass might then remove the information on row count / null counts resulting in:

| row_count | null_count | min | max |
| --- | --- | --- | --- |
| null | null | 0 | 0 |

I believe both the previous implementation and this implementation would decide to exclude the container for the predicate col > 1:

CASE WHEN null = null THEN false ELSE col_min > 1 END
null != null AND col_min > 1

Both of these expressions return false, indicating that we know we can prune the container.
I guess that's right, but it's possibly confusing since we essentially have no real information on the container.

I tested using our writer:

use std::sync::Arc;

use parquet::arrow::ArrowWriter;

#[tokio::main]
async fn main() {
    let array = arrow::array::Int32Array::from(vec![None]);
    let array = Arc::new(array);
    let schema = arrow::datatypes::Schema::new(vec![arrow::datatypes::Field::new(
        "a",
        arrow::datatypes::DataType::Int32,
        true,
    )]);
    let schema = Arc::new(schema);
    let batch = arrow::record_batch::RecordBatch::try_new(schema.clone(), vec![array]).unwrap();

    let mut file = std::fs::File::create("test.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(&mut file, schema.clone(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
❯ parquet meta test.parquet 
Row group 0:  count: 1  38.00 B records  start: 4  total(compressed): 38 B total(uncompressed):38 B 
--------------------------------------------------------------------------------
   type      encodings count     avg size   nulls   min / max
a  INT32     _ RB_     1         38.00 B    1 

It does what I think is correct and sets min / max to null.
If any writer were to set the min / max values to defaults instead of nulls I think that would be a bug in the writer.
And there would have to be a further bug in the writer or other part of the system to discard or never record the fact that null_count = row_count.

So I think this is not a realistic scenario to hit, even if it theoretically could cause issues.

@adriangb adriangb requested a review from alamb December 17, 2024 16:40

findepi commented Dec 18, 2024

I was afraid that writers might populate the statistics with a default value (e.g. 0) if all of the rows are null instead of null. And that some other pass might then remove the information on row count / null counts resulting in:

| row_count | null_count | min | max |
| --- | --- | --- | --- |
| null | null | 0 | 0 |

if all values are null, then all non-null values are in [0, 0] range, so min=0 and max=0 would be valid.

@adriangb (author):

So is that assumption incorrect? Then it's a bug in the current implementation.

@adriangb (author):

So is that assumption incorrect? Then it's a bug in the current implementation.

Thinking about this again, you're right. The worst that can happen is that we return true and include the file. If we return false, then all of the rows are null, so we didn't need to scan the container anyway 👍🏻.

@adriangb adriangb force-pushed the predicate-pruning-remove-case branch from f92442e to 403bde8 Compare December 19, 2024 06:27
@adriangb (author):

The only change is for the case where row_count = null_count is null and the predicate is true: previously true would be returned but now null is returned. Since we interpret null as true the end result is the same.

If we wanted to eliminate this change from this PR we could add a condition `and row_count is not null and null_count is not null`. If we went one small step further and added `and col_a_min is not null and col_b_min is not null`, then we would ensure that the predicates always return true or false and never null. IMO that would help minimize confusion, especially since the interpretation of null here is so confusing and opposes the interpretation of null in a WHERE clause.

@adriangb (author):

If we go one small step further and add and col_a_min is not null and col_b_min is not null then we actually ensure that the predicates always return true or false and never null which IMO would be helpful just to minimize confusion especially since the interpretation of null here is so confusing and opposes the interpretation of null in a where clause.

Turns out it's not that small a step: there are cases that previously pruned the container but no longer would, in particular when one statistic (min or max) is missing but the other one isn't and we only use the non-missing statistic.


alamb commented Dec 19, 2024

The only change is for the case where row_count = null_count is null and the predicate is true: previously true would be returned but now null is returned. Since we interpret null as true the end result is the same.

I agree with this assessment. Thank you for the analysis @adriangb

@alamb left a comment:

Thank you @adriangb -- I went through this PR carefully and I think it is a very nice improvement. The expressions are simpler and will be faster to evaluate 👌

Thanks again for all the careful review work

Comment on lines +290 to +293
/// While `PruningPredicate` will never return a `NULL` value, the
/// rewritten predicate (as returned by `build_predicate_expression` and used internally
/// by `PruningPredicate`) may evaluate to `NULL` when some of the min/max values
/// or null / row counts are not known.
@alamb:

Thank you. Yes I find reasoning about tri-state boolean logic (and what can be proven about what) confusing!

/// * `null = null` is `null` which is not true, so the `CASE` expression will use the `ELSE` clause
/// * `1 <= 5 AND 5 <= 100 AND 4 <= 10 AND 10 <= 7`
/// * `true AND true AND true AND false`
/// * `null != null AND (1 <= 5 AND 5 <= 100) AND null != null AND (4 <= 10 AND 10 <= 7)`
@alamb:

I double checked this and looks good to me


alamb commented Dec 19, 2024

Thank you also @findepi for your reviews

I'll plan to merge this tomorrow to allow time for other comments unless we find issues or others would like time to review

@appletreeisyellow left a comment:

Thank you @adriangb for the improvement! 🚀

I also double-checked the logic, and it works as expected. The rewritten pruning predicate is now shorter and easier to read with your change 👏 I also tested this PR against InfluxData IOx code and all the tests pass!

adriangb and others added 2 commits December 19, 2024 17:11
Co-authored-by: Chunchun Ye <[email protected]>
Co-authored-by: Chunchun Ye <[email protected]>

alamb commented Dec 20, 2024

I plan to merge this tomorrow morning unless I hear otherwise

@alamb alamb merged commit 079f621 into apache:main Dec 20, 2024
24 checks passed

alamb commented Dec 20, 2024

Thanks again @adriangb and @appletreeisyellow
