Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

Merged
merged 11 commits into from
Mar 21, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Mar 18, 2024

Which issue does this PR close?

Closes #9678

Rationale for this change

The TreeNode refactor is awesome, but it seemed to introduce a bug in common_subexpr_eliminate, specifically for use cases containing short-circuited expressions. The first 2 commits in this PR are the reproducer, and the initial fix.

After the initial fix, we started seeing an error in another test. Reproducer is the third commit.

The remaining commits are a combination of chore, refactor, and fix. The chore is where I was figuring out how the parts fit together, and documenting in code & abstractions. The refactor is a tweak to the fix for the first bug. Then there is a new fix for the second bug (which was introduced in the TreeNode refactor).

What changes are included in this PR?

  1. Fix the bug
  2. New regression tests

Are these changes tested?

Yes. Both fixes are covered by new tests.

Are there any user-facing changes?

No API changes.

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 18, 2024
@wiedld wiedld force-pushed the 9678/common-subexpr-eliminate branch from f190b24 to 256a701 Compare March 19, 2024 18:30
@github-actions github-actions bot added logical-expr Logical plan and expressions and removed logical-expr Logical plan and expressions labels Mar 19, 2024
let expr_set_item = self.expr_set.get(id).ok_or_else(|| {
internal_datafusion_err!("expr_set invalid state")
})?;
if *series_number < self.max_series_number
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks are already done earlier. Remove.

@@ -719,23 +741,10 @@ impl TreeNodeRewriter for CommonSubexprRewriter<'_> {
));
}

let (series_number, id) = &self.id_array[self.curr_index];
Copy link
Contributor Author

@wiedld wiedld Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was one of the bugs (see the regression test). It uses a self.curr_index which has already been incr, when it really needs the original self.curr_index for the current expr node being visited. The fix is to grab all of the IdArray[self.curr_index] values once (see new line 717), before all the self.curr_index incr shenanigans.

Ok(TreeNodeRecursion::Continue)
}

fn f_up(&mut self, expr: &Expr) -> Result<TreeNodeRecursion> {
self.series_number += 1;

let Some((idx, sub_expr_desc)) = self.pop_enter_mark() else {
return Ok(TreeNodeRecursion::Continue);
};
Copy link
Contributor Author

@wiedld wiedld Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was one of the bugs, which occurred for short-circuited expr. The behavior was changed with the TreeNode refactor. Previously, it was always returning a sub_expr_desc => then executing the remaining function block.

The changes I made here was to (1) bring it back closer to the original code, while (2) incorporating the jump contract from the consolidated visitor.

The actual process to figure this out was done over several commits. The bug was fixed by build the proper visited_stack, then later I tweaked the fix as I figured out a bit more.

}
}
}
None
unreachable!("Enter mark should paired with node number");
Copy link
Contributor Author

@wiedld wiedld Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bring back the pop_enter_mark() contract to the original (before the TreeNode refactor), while also adding in a new JumpMark.

///
/// - series_number. increased in fn_up, start from 1.
/// - Identifier. is empty ("") if expr should not be considered for common elimation.
type IdArray = Vec<(usize, Identifier)>;
Copy link
Contributor Author

@wiedld wiedld Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction really helps to understand how the 2 visitors work together.

The first visitor builds the IdArray and ExprSet. The first visitor sets the vector idx used on f_down, whereas the series_number is set on f_up. An empty string for the expr_id means that this expression is not considered for rewrite (is skipped) in the second visitor.
e.g. [(3, expr_id),(2, expr_id),(1, expr_id)]

The second visitor then performs the mutation, based upon the IdArray and ExprSet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THank you -- I agree introducing a new typedef helps to make the code eaiser to understand

@wiedld wiedld force-pushed the 9678/common-subexpr-eliminate branch from b08acf7 to a55ced4 Compare March 20, 2024 07:15
@alamb alamb marked this pull request as ready for review March 20, 2024 13:42
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I went through this PR carefully and I think it looks really nice. I really liked how you both fixed the bug, but also left the code better than when you found it (better comments, typedef, etc) 🏆

I had some minor comment / style suggestions, but we could do them as a follow on PRs.

Here is a PR that targets your branch with some suggested comment refinements: wiedld#2

cc @waynexia (CommonSubExpr expert), @peter-toth (TreeNodeRefactor), and @crepererum who has also worked on this code in the past

datafusion/optimizer/src/common_subexpr_eliminate.rs Outdated Show resolved Hide resolved
@@ -571,66 +584,73 @@ enum VisitRecord {
/// `usize` is the monotone increasing series number assigned in pre_visit().
/// Starts from 0. Is used to index the identifier array `id_array` in post_visit().
EnterMark(usize),
/// the node's children were skipped => jump to f_up on same node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key fix, as I understand it -- the TreeNodeVisitor rewrite removed the notion of skipping sibling nodes during recursion, so this notion must be explicitly encoded in the subexpression rewrite pass.

Copy link
Contributor Author

@wiedld wiedld Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding (please correct here 🙏🏼 ), prior to refactoring the 2 different visitors (TreeNodeVisitor and TreeNodeRewriter) had different transversal rules, and the semantics of skip vs stop mapped in weird ways (as nicely described here in the TreeNode refactor's "Rationale for Change").

Prior to the refactor:

  • the first visitor used skips (VisitRecursion::Skip), which never calls f_up afterwards.
  • the second visitor used RewriteRecursion::Stop which is functionally the same as VisitRecursion::Skip

Fix for the first visitor, for the short-circuited nodes:

Prior to the refactor, the first tree walk did not add short-circuited nodes to the stack:

  1. short-circuited nodes were skipped, without adding to the stack
  2. the old semantics meant skipping the post-visit (up) call
  3. it never looked in the stack (since no post-visit on the old semantics)

Whereas with the new TreeNode refactor the jump became a skip-children-but-call-f_up, which meant:

  1. prior to the fix in this PR, it was using the old control flow. So it still didn't add to visited_stack.
  2. refactor means that it now does perform the f_up on jump
  3. f_up checks the stack (in self.pop_enter_mark()) and would not find it
  4. the original refactor got around this by removing the panic and adding an option

Edited: For the changes done in the TreeNodeVisitor for the short-circuited nodes, I basically made it work with a jump that does the post-visit f_up:

  • add a placeholder to the id_array so that it exists on f_up
    • keep the idx counts correct, for that IdArray abstraction
  • fixing the stack management:
    • push into the stack so it exists on the f_up
      • before this bugfix, it was either not finding it in the stack (None), or occasionally matching another/wrong entry in the stack (such as in our bug test case).
    • make a VisitRecord::JumpMark to be used in the stack
      • therefore the stack entry should always exist, including on short-circuited jumps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's how the API worked before and after #8891. I think the issue came in with this peter-toth#1 adjustment.
Thanks @wiedld for fixing it!

///
/// - series_number. increased in fn_up, start from 1.
/// - Identifier. is empty ("") if expr should not be considered for common elimation.
type IdArray = Vec<(usize, Identifier)>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THank you -- I agree introducing a new typedef helps to make the code eaiser to understand

@alamb alamb mentioned this pull request Mar 20, 2024
8 tasks
@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

I plan to merge this later today unless someone else would like more time to review

@waynexia
Copy link
Member

I plan to give it a review later tomorrow. We can merge this first as there is one approval if the release progress is blocked.

@alamb
Copy link
Contributor

alamb commented Mar 21, 2024

I merged up to main and pushed a commit to this branch

I plan to give it a review later tomorrow. We can merge this first as there is one approval if the release progress is blocked.

Given I would like to unblock the release I do plan to merge it in later today. We'll plan to address any comments as follow on PRs.

Thank you very much @waynexia

@alamb alamb merged commit 5f0cb49 into apache:main Mar 21, 2024
23 checks passed
alamb added a commit to alamb/datafusion that referenced this pull request Mar 22, 2024
…r common subexpr elimination optimization (apache#9685)

* test(9678): reproducer of short-circuiting causing expr elimination to error

* fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization

* test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case

* chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors

* refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier)

* fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr

* chore: remove unneeded conditional check (already done earlier), and add code comments

* Refine documentation in common_subexpr_eliminate.rs

* chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test

---------

Co-authored-by: Andrew Lamb <[email protected]>
wiedld added a commit to wiedld/arrow-datafusion that referenced this pull request Mar 28, 2024
…r common subexpr elimination optimization (apache#9685)

* test(9678): reproducer of short-circuiting causing expr elimination to error

* fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization

* test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case

* chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors

* refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier)

* fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr

* chore: remove unneeded conditional check (already done earlier), and add code comments

* Refine documentation in common_subexpr_eliminate.rs

* chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test

---------

Co-authored-by: Andrew Lamb <[email protected]>
wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Apr 1, 2024
…r common subexpr elimination optimization (apache#9685)

* test(9678): reproducer of short-circuiting causing expr elimination to error

* fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization

* test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case

* chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors

* refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier)

* fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr

* chore: remove unneeded conditional check (already done earlier), and add code comments

* Refine documentation in common_subexpr_eliminate.rs

* chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: incorrect result for aliased arguments in select list after TreeNode rewrite
4 participants