-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685
Conversation
…ring the common-expr elimination optimization
…te), as seen in other test case
f190b24
to
256a701
Compare
let expr_set_item = self.expr_set.get(id).ok_or_else(|| { | ||
internal_datafusion_err!("expr_set invalid state") | ||
})?; | ||
if *series_number < self.max_series_number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks are already done earlier. Remove.
@@ -719,23 +741,10 @@ impl TreeNodeRewriter for CommonSubexprRewriter<'_> { | |||
)); | |||
} | |||
|
|||
let (series_number, id) = &self.id_array[self.curr_index]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was one of the bugs (see the regression test). It uses a self.curr_index
which has already been incr, when it really needs the original self.curr_index
for the current expr node being visited. The fix is to grab all of the IdArray[self.curr_index]
values once (see new line 717), before all the self.curr_index
incr shenanigans.
Ok(TreeNodeRecursion::Continue) | ||
} | ||
|
||
fn f_up(&mut self, expr: &Expr) -> Result<TreeNodeRecursion> { | ||
self.series_number += 1; | ||
|
||
let Some((idx, sub_expr_desc)) = self.pop_enter_mark() else { | ||
return Ok(TreeNodeRecursion::Continue); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was one of the bugs, which occurred for short-circuited expr. The behavior was changed with the TreeNode refactor. Previously, it was always returning a sub_expr_desc
=> then executing the remaining function block.
The changes I made here was to (1) bring it back closer to the original code, while (2) incorporating the jump contract from the consolidated visitor.
The actual process to figure this out was done over several commits. The bug was fixed by build the proper visited_stack, then later I tweaked the fix as I figured out a bit more.
} | ||
} | ||
} | ||
None | ||
unreachable!("Enter mark should paired with node number"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bring back the pop_enter_mark() contract to the original (before the TreeNode refactor), while also adding in a new JumpMark.
/// | ||
/// - series_number. increased in fn_up, start from 1. | ||
/// - Identifier. is empty ("") if expr should not be considered for common elimation. | ||
type IdArray = Vec<(usize, Identifier)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This abstraction really helps to understand how the 2 visitors work together.
The first visitor builds the IdArray
and ExprSet
. The first visitor sets the vector idx used on f_down, whereas the series_number is set on f_up. An empty string for the expr_id means that this expression is not considered for rewrite (is skipped) in the second visitor.
e.g. [(3, expr_id),(2, expr_id),(1, expr_id)]
The second visitor then performs the mutation, based upon the IdArray
and ExprSet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THank you -- I agree introducing a new typedef helps to make the code eaiser to understand
…elationship between the two visitors
…idx, before the various incr/decr
…add code comments
b08acf7
to
a55ced4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I went through this PR carefully and I think it looks really nice. I really liked how you both fixed the bug, but also left the code better than when you found it (better comments, typedef, etc) 🏆
I had some minor comment / style suggestions, but we could do them as a follow on PRs.
Here is a PR that targets your branch with some suggested comment refinements: wiedld#2
cc @waynexia (CommonSubExpr expert), @peter-toth (TreeNodeRefactor), and @crepererum who has also worked on this code in the past
datafusion/sqllogictest/test_files/common_subexpr_eliminate.slt
Outdated
Show resolved
Hide resolved
datafusion/sqllogictest/test_files/common_subexpr_eliminate.slt
Outdated
Show resolved
Hide resolved
@@ -571,66 +584,73 @@ enum VisitRecord { | |||
/// `usize` is the monotone increasing series number assigned in pre_visit(). | |||
/// Starts from 0. Is used to index the identifier array `id_array` in post_visit(). | |||
EnterMark(usize), | |||
/// the node's children were skipped => jump to f_up on same node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key fix, as I understand it -- the TreeNodeVisitor rewrite removed the notion of skipping sibling nodes during recursion, so this notion must be explicitly encoded in the subexpression rewrite pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding (please correct here 🙏🏼 ), prior to refactoring the 2 different visitors (TreeNodeVisitor
and TreeNodeRewriter
) had different transversal rules, and the semantics of skip vs stop mapped in weird ways (as nicely described here in the TreeNode refactor's "Rationale for Change").
Prior to the refactor:
- the first visitor used skips (
VisitRecursion::Skip
), which never calls f_up afterwards. - the second visitor used
RewriteRecursion::Stop
which is functionally the same asVisitRecursion::Skip
Fix for the first visitor, for the short-circuited nodes:
Prior to the refactor, the first tree walk did not add short-circuited nodes to the stack:
- short-circuited nodes were skipped, without adding to the stack
- the old semantics meant skipping the post-visit (up) call
- it never looked in the stack (since no post-visit on the old semantics)
Whereas with the new TreeNode refactor the jump became a skip-children-but-call-f_up, which meant:
- prior to the fix in this PR, it was using the old control flow. So it still didn't add to visited_stack.
- refactor means that it now does perform the f_up on jump
- f_up checks the stack (in
self.pop_enter_mark()
) and would not find it - the original refactor got around this by removing the panic and adding an option
Edited: For the changes done in the TreeNodeVisitor
for the short-circuited nodes, I basically made it work with a jump that does the post-visit f_up:
- add a placeholder to the id_array so that it exists on f_up
- keep the idx counts correct, for that IdArray abstraction
- fixing the stack management:
- push into the stack so it exists on the f_up
- before this bugfix, it was either not finding it in the stack (None), or occasionally matching another/wrong entry in the stack (such as in our bug test case).
- make a
VisitRecord::JumpMark
to be used in the stack- therefore the stack entry should always exist, including on short-circuited jumps.
- push into the stack so it exists on the f_up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's how the API worked before and after #8891. I think the issue came in with this peter-toth#1 adjustment.
Thanks @wiedld for fixing it!
/// | ||
/// - series_number. increased in fn_up, start from 1. | ||
/// - Identifier. is empty ("") if expr should not be considered for common elimation. | ||
type IdArray = Vec<(usize, Identifier)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THank you -- I agree introducing a new typedef helps to make the code eaiser to understand
…nation test with other expr test
I plan to merge this later today unless someone else would like more time to review |
I plan to give it a review later tomorrow. We can merge this first as there is one approval if the release progress is blocked. |
I merged up to main and pushed a commit to this branch
Given I would like to unblock the release I do plan to merge it in later today. We'll plan to address any comments as follow on PRs. Thank you very much @waynexia |
…r common subexpr elimination optimization (apache#9685) * test(9678): reproducer of short-circuiting causing expr elimination to error * fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization * test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case * chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors * refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier) * fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr * chore: remove unneeded conditional check (already done earlier), and add code comments * Refine documentation in common_subexpr_eliminate.rs * chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test --------- Co-authored-by: Andrew Lamb <[email protected]>
…r common subexpr elimination optimization (apache#9685) * test(9678): reproducer of short-circuiting causing expr elimination to error * fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization * test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case * chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors * refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier) * fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr * chore: remove unneeded conditional check (already done earlier), and add code comments * Refine documentation in common_subexpr_eliminate.rs * chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test --------- Co-authored-by: Andrew Lamb <[email protected]>
…r common subexpr elimination optimization (apache#9685) * test(9678): reproducer of short-circuiting causing expr elimination to error * fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization * test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case * chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors * refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier) * fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr * chore: remove unneeded conditional check (already done earlier), and add code comments * Refine documentation in common_subexpr_eliminate.rs * chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #9678
Rationale for this change
The TreeNode refactor is awesome, but it seemed to introduce a bug in common_subexpr_eliminate, specifically for use cases containing short-circuited expressions. The first 2 commits in this PR are the reproducer, and the initial fix.
After the initial fix, we started seeing an error in another test. Reproducer is the third commit.
The remaining commits are a combination of chore, refactor, and fix. The
chore
is where I was figuring out how the parts fit together, and documenting in code & abstractions. Therefactor
is a tweak to the fix for the first bug. Then there is a newfix
for the second bug (which was introduced in the TreeNode refactor).What changes are included in this PR?
Are these changes tested?
Yes. Both fixes are covered by new tests.
Are there any user-facing changes?
No API changes.