Define LogicalPlan invariants, delineated by invariant type. #13651
I love the idea of having additional testing like this! It's definitely out of my area of expertise but a few questions:
// verify invariant: equivalent schema across union inputs
// assert_unions_are_valid(check_name, plan)?;

// TODO: trait API and provide extension on the Optimizer to define own validations?
Here is a mention of the extensibility of invariants. Options include:
- for general invariants:
  - defined as being checked before/after each OptimizerRule, and applied here in check_plan() (or equivalent code)
  - we could provide Optimizer.invariants = Vec<Arc<dyn InvariantCheck>> for user-defined invariants
- for invariants specific for a given OptimizerRule:
  - we could provide OptimizerRule::check_invariants() such that certain invariants are only checked for a given rule (instead of all rules)
  - for a user-defined OptimizerRule, users can also check their own invariants
Ditto for the AnalyzerRule passes. Although I wasn't sure how much complexity and planning time overhead this would add. As @Omega359 mentions, we could make it configurable (e.g. run for CI and debugging in downstream projects).
This WIP is about proposing different ideas of what we could do. 🤔
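To make the two options concrete, here is a minimal, std-only sketch. It is not DataFusion code: `ToyPlan`, `ToyRule`, `ToyOptimizer`, `NonEmptyOutput` and `DropAllColumns` are hypothetical stand-ins, and the `InvariantCheck` trait is only assumed to look roughly like this. General invariants live in a `Vec<Arc<dyn InvariantCheck>>` and run around every rule; a rule can also override a `check_invariants()` hook for checks that only apply to it.

```rust
use std::sync::Arc;

/// Toy stand-in for a logical plan: just its output column names.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlan {
    pub columns: Vec<String>,
}

/// Hypothetical trait for a general (possibly user-defined) invariant.
pub trait InvariantCheck {
    fn name(&self) -> &str;
    fn check(&self, plan: &ToyPlan) -> Result<(), String>;
}

/// Hypothetical rule trait with an optional rule-specific invariant hook.
pub trait ToyRule {
    fn rewrite(&self, plan: ToyPlan) -> ToyPlan;

    /// Default: the rule has no invariants of its own.
    fn check_invariants(&self, _plan: &ToyPlan) -> Result<(), String> {
        Ok(())
    }
}

pub struct ToyOptimizer {
    pub rules: Vec<Arc<dyn ToyRule>>,
    pub invariants: Vec<Arc<dyn InvariantCheck>>,
}

impl ToyOptimizer {
    /// Run every general invariant against the plan.
    fn check_plan(&self, stage: &str, plan: &ToyPlan) -> Result<(), String> {
        for inv in &self.invariants {
            inv.check(plan)
                .map_err(|e| format!("{}: invariant '{}' failed: {}", stage, inv.name(), e))?;
        }
        Ok(())
    }

    pub fn optimize(&self, mut plan: ToyPlan) -> Result<ToyPlan, String> {
        self.check_plan("before optimization", &plan)?;
        for rule in &self.rules {
            plan = rule.rewrite(plan);
            // General invariants after each rule, then the rule's own.
            self.check_plan("after rule", &plan)?;
            rule.check_invariants(&plan)?;
        }
        Ok(plan)
    }
}

/// Example general invariant: a plan must expose at least one column.
pub struct NonEmptyOutput;

impl InvariantCheck for NonEmptyOutput {
    fn name(&self) -> &str {
        "non_empty_output"
    }

    fn check(&self, plan: &ToyPlan) -> Result<(), String> {
        if plan.columns.is_empty() {
            Err("plan has no output columns".to_string())
        } else {
            Ok(())
        }
    }
}

/// Example of a buggy rule that drops every column.
pub struct DropAllColumns;

impl ToyRule for DropAllColumns {
    fn rewrite(&self, _plan: ToyPlan) -> ToyPlan {
        ToyPlan { columns: vec![] }
    }
}
```

With this shape, a buggy rule is reported by the name of the invariant it broke, and the overhead is one pass over the invariant list per rule, which is the planning-time cost raised above.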
Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.
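A rough sketch of what environment-variable gating could look like, using only the standard library. The variable name `DATAFUSION_CHECK_INVARIANTS` and the helpers `parse_flag`, `invariant_checks_enabled` and `maybe_check` are made up for illustration; nothing in this PR defines them.

```rust
use std::env;

/// Hypothetical variable name, chosen only for this example.
pub const INVARIANT_ENV_VAR: &str = "DATAFUSION_CHECK_INVARIANTS";

/// Interpret the raw value of the variable.
/// Unset, empty, "0" and "false" (any case) leave the checks off.
pub fn parse_flag(raw: Option<&str>) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()) {
        None => false,
        Some(v) => !(v.is_empty() || v == "0" || v == "false"),
    }
}

/// Read the flag from the process environment.
pub fn invariant_checks_enabled() -> bool {
    parse_flag(env::var(INVARIANT_ENV_VAR).ok().as_deref())
}

/// Run `check` only when enabled, so a disabled check costs a single branch.
pub fn maybe_check<F>(enabled: bool, check: F) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    if enabled {
        check()
    } else {
        Ok(())
    }
}
```

Reading the variable once at startup and passing the resulting `bool` around would avoid a repeated environment lookup on every optimizer pass.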
Do you have an example use case for user-defined plan invariants?
We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files), that would be great to be able to encode.
Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.
We could also add it as a debug_assert! after each optimizer pass and call the real validation:
- After analyze
- After all the optimizer passes
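A small sketch of that layout, with toy types rather than the real planner. `ToyPlan`, `Pass`, `check_invariants` and `plan_query` are hypothetical names, and the invariant itself (unique column names) is just a placeholder. The `debug_assert!` after each optimizer pass is compiled out of release builds, while the two `?` checks always run.

```rust
use std::collections::HashSet;

/// Toy stand-in for a logical plan: just its output column names.
#[derive(Debug, Clone)]
pub struct ToyPlan {
    pub columns: Vec<String>,
}

/// Placeholder invariant: column names must be unique.
pub fn check_invariants(plan: &ToyPlan) -> Result<(), String> {
    let mut seen = HashSet::new();
    for c in &plan.columns {
        if !seen.insert(c.as_str()) {
            return Err(format!("duplicate column name: {c}"));
        }
    }
    Ok(())
}

/// A pass is modelled as a plain function from plan to plan.
pub type Pass = fn(ToyPlan) -> ToyPlan;

pub fn plan_query(
    plan: ToyPlan,
    analyzer: &[Pass],
    optimizer: &[Pass],
) -> Result<ToyPlan, String> {
    let mut plan = plan;
    for pass in analyzer {
        plan = pass(plan);
    }
    // Real validation: after analyze.
    check_invariants(&plan)?;

    for pass in optimizer {
        plan = pass(plan);
        // Debug builds only: pinpoints the offending pass by panicking.
        debug_assert!(
            check_invariants(&plan).is_ok(),
            "invariant broken by an optimizer pass"
        );
    }
    // Real validation: after all the optimizer passes.
    check_invariants(&plan)?;
    Ok(plan)
}
```

The trade-off is that release builds still detect a broken plan at the end, but only debug builds can tell which optimizer pass introduced the problem.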
We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files) that would be great to be able to encode
Is this about LogicalPlan::Extension? I agree it makes sense to support validation of these if we validate the overall plan.
I've added this to the followup items list: #13652 (comment)
/// This invariant is subject to change.
/// refer: <https://github.com/apache/datafusion/issues/13525#issuecomment-2494046463>
fn assert_unique_field_names(plan: &LogicalPlan) -> Result<()> {
    plan.schema().check_names()?;
Is check_names also called whenever a new DFSchema is created?
Yes, on every creation, but not after every merge. That should be OK as long as no bug is introduced in the merge, although I would prefer to add the check there.
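To illustrate the "check on creation and also after merge" idea, here is a simplified, std-only sketch. `ToyField` and `ToySchema` are hypothetical and much simpler than DFSchema: uniqueness is assumed to be keyed on the (qualifier, name) pair only, which may not match everything the real check_names does.

```rust
use std::collections::HashSet;

/// Toy field identified by an optional relation qualifier and a name.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ToyField {
    pub qualifier: Option<String>,
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToySchema {
    fields: Vec<ToyField>,
}

impl ToySchema {
    /// Every construction goes through the name check.
    pub fn new(fields: Vec<ToyField>) -> Result<Self, String> {
        let schema = ToySchema { fields };
        schema.check_names()?;
        Ok(schema)
    }

    /// Simplified uniqueness check on (qualifier, name) pairs.
    pub fn check_names(&self) -> Result<(), String> {
        let mut seen = HashSet::new();
        for f in &self.fields {
            if !seen.insert((f.qualifier.as_deref(), f.name.as_str())) {
                return Err(format!("duplicate field: {:?}.{}", f.qualifier, f.name));
            }
        }
        Ok(())
    }

    /// Merge by building a new schema, so the result is re-validated too.
    pub fn merge(&self, other: &ToySchema) -> Result<ToySchema, String> {
        let mut fields = self.fields.clone();
        fields.extend(other.fields.iter().cloned());
        ToySchema::new(fields)
    }
}
```

Routing merge through the constructor means there is no path that produces a schema without the check, at the cost of one extra pass over the merged fields.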
I like the verification. The only concern is the overhead of doing it, especially for large plans: every optimizer pass triggers a new verification of all plan nodes. I hope one day we move towards more iterative optimization, and then we should be able to verify plan invariants locally, which would guarantee that the verification overhead is linearly proportional to the number of modifications applied to the plan. This would require, e.g., that get_type is a constant-time operation (#12604), since plan verification should also check that the types match.
I think it would also be great if we could consider moving the invariant check directly into This would make this function easier to discover and for others to understand what the invariants are.
…sus assertions made after a given analyzer or optimizer pass
The updates focus on separating LP invariants from analyzer pass checks and from optimizer pass checks (including invariants listed as the analyzer's responsibility). I added reference links to the invariants in the docs, in an attempt to delineate an invariant vs a valid plan, but I'm muddy on this boundary. I'll make it configurable (or debug only) after I get feedback on this. 🙏🏼
This is looking great @wiedld -- thank you
I think the basic structure of assert_invariants is looking good.
I had a suggestion for how to combine the two different invariant checks into a single API.
In terms of planning:
- Perhaps we can begin creating an epic for "Enforcing Invariants" and list items there to follow up on (like "Define the invariants for Union plans" for example)
In terms of this PR, once we solidify the API the only remaining concern I have is with potential performance slowdown of re-checking the invariants. When we have the code ready we can run the sql_planning benchmark to make sure this PR doesn't slow down planning. If it does, we can perhaps only run the invariant checks in debug builds.
// Refer to <https://datafusion.apache.org/contributor-guide/specification/invariants.html#relation-name-tuples-in-logical-fields-and-logical-columns-are-unique>
assert_unique_field_names(self)?;
I had forgotten about that documentation 🤔
Maybe it would be a good thing to inline that information into the code rather than in a separate document that might be harder to find (not for you, I am just musing here for my own benefit)
I've added this to the followup items list: #13652 (comment)
…e (valid semantic plan) vs basic LP invariants
…o a TableScan's filter clause
651f5d3 to e52187e
| LogicalPlan::TableScan(_)
| LogicalPlan::Window(_)
| LogicalPlan::Aggregate(_)
| LogicalPlan::Join(_) => Ok(()),
_ => plan_err!(
    "In/Exist subquery can only be used in \
    Projection, Filter, Window functions, Aggregate and Join plan nodes, \
    Projection, Filter, TableScan, Window functions, Aggregate and Join plan nodes, \
The push_down_filter optimization pass can push an IN(<subquery>) into a TableScan. This change was required for push_down_filter::tests::test_in_subquery_with_alias to pass.
Ran all benchmarks (c3d-standard-8, debian-12)
confirmation run
The performance change was minimized because we didn't add that many extra timepoints of checking. Specifically:
I propose that we add another full I can also make some of the checks listed above debug-only, if the current performance change is unacceptable.
We could also reduce the time on this one by making it compile time with Arc::ptr_eq, but that means (a) a stricter contract (metadata & nullability cannot change) and (b) it requires code changes to chase down the current pointer_eq failures. 🤔
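For reference, a sketch of the trade-off, assuming the check in question compares two schemas held behind `Arc`. `ToyField`, `ToySchema`, `same_schema_strict` and `same_schema_lenient` are hypothetical. `Arc::ptr_eq` is a constant-time pointer comparison, but it only succeeds when both sides share the same allocation, so any rewrite that rebuilds the schema (e.g. to relax nullability) fails the strict version.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq)]
pub struct ToyField {
    pub name: String,
    pub nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToySchema {
    pub fields: Vec<ToyField>,
}

/// Strict contract: both handles must point at the same allocation.
/// O(1), but any rebuilt schema counts as a mismatch.
pub fn same_schema_strict(a: &Arc<ToySchema>, b: &Arc<ToySchema>) -> bool {
    Arc::ptr_eq(a, b)
}

/// Lenient contract: pointer equality as a fast path, otherwise compare
/// field names only, so nullability is allowed to differ.
pub fn same_schema_lenient(a: &Arc<ToySchema>, b: &Arc<ToySchema>) -> bool {
    Arc::ptr_eq(a, b)
        || (a.fields.len() == b.fields.len()
            && a.fields.iter().zip(&b.fields).all(|(x, y)| x.name == y.name))
}
```

The lenient variant keeps the cheap path for the common case where a pass leaves the schema untouched, and falls back to a linear comparison otherwise.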
Which issue does this PR close?
Part of #13652
Rationale for this change
The original discussion included implicit changes which can cause problems when trying to upgrade DF. One class of bugs is related to user-constructed LPs, and the mutations of these LPs. This PR is a first step to programmatically enforce the rules of what should, and should not, be done.
What changes are included in this PR?
Are these changes tested?
By existing tests, and are benchmarked for impact to planning time.
Are there any user-facing changes?
There is a new LogicalPlan::check_invariants public API.