fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

wiedld · 2024-03-18T22:47:00Z

Which issue does this PR close?

Rationale for this change

The TreeNode refactor is awesome, but it seemed to introduce a bug in common_subexpr_eliminate, specifically for use cases containing short-circuited expressions. The first 2 commits in this PR are the reproducer, and the initial fix.

After the initial fix, we started seeing an error in another test. Reproducer is the third commit.

The remaining commits are a combination of chore, refactor, and fix. The chore is where I was figuring out how the parts fit together, and documenting in code & abstractions. The refactor is a tweak to the fix for the first bug. Then there is a new fix for the second bug (which was introduced in the TreeNode refactor).

What changes are included in this PR?

Fix the bug
New regression tests

Are these changes tested?

Yes. Both fixes are covered by new tests.

Are there any user-facing changes?

No API changes.

…o error

…ring the common-expr elimination optimization

…te), as seen in other test case

wiedld · 2024-03-20T06:42:34Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

-                    let expr_set_item = self.expr_set.get(id).ok_or_else(|| {
-                        internal_datafusion_err!("expr_set invalid state")
-                    })?;
-                    if *series_number < self.max_series_number


These checks are already done earlier. Remove.

wiedld · 2024-03-20T06:44:27Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

@@ -719,23 +741,10 @@ impl TreeNodeRewriter for CommonSubexprRewriter<'_> {
                        ));
                    }

-                    let (series_number, id) = &self.id_array[self.curr_index];


This was one of the bugs (see the regression test). It uses a self.curr_index which has already been incr, when it really needs the original self.curr_index for the current expr node being visited. The fix is to grab all of the IdArray[self.curr_index] values once (see new line 717), before all the self.curr_index incr shenanigans.

wiedld · 2024-03-20T06:50:49Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

        Ok(TreeNodeRecursion::Continue)
    }

    fn f_up(&mut self, expr: &Expr) -> Result<TreeNodeRecursion> {
        self.series_number += 1;

-        let Some((idx, sub_expr_desc)) = self.pop_enter_mark() else {
-            return Ok(TreeNodeRecursion::Continue);
-        };


This was one of the bugs, which occurred for short-circuited expr. The behavior was changed with the TreeNode refactor. Previously, it was always returning a sub_expr_desc => then executing the remaining function block.

The changes I made here was to (1) bring it back closer to the original code, while (2) incorporating the jump contract from the consolidated visitor.

The actual process to figure this out was done over several commits. The bug was fixed by build the proper visited_stack, then later I tweaked the fix as I figured out a bit more.

wiedld · 2024-03-20T06:52:17Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

                }
            }
        }
-        None
+        unreachable!("Enter mark should paired with node number");


Bring back the pop_enter_mark() contract to the original (before the TreeNode refactor), while also adding in a new JumpMark.

wiedld · 2024-03-20T06:57:16Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+///
+/// - series_number. increased in fn_up, start from 1.
+/// - Identifier. is empty ("") if expr should not be considered for common elimation.
+type IdArray = Vec<(usize, Identifier)>;


This abstraction really helps to understand how the 2 visitors work together.

The first visitor builds the IdArray and ExprSet. The first visitor sets the vector idx used on f_down, whereas the series_number is set on f_up. An empty string for the expr_id means that this expression is not considered for rewrite (is skipped) in the second visitor.
e.g. [(3, expr_id),(2, expr_id),(1, expr_id)]

The second visitor then performs the mutation, based upon the IdArray and ExprSet.

THank you -- I agree introducing a new typedef helps to make the code eaiser to understand

…elationship between the two visitors

…o_identifier)

…idx, before the various incr/decr

…add code comments

alamb

Thank you @wiedld -- I went through this PR carefully and I think it looks really nice. I really liked how you both fixed the bug, but also left the code better than when you found it (better comments, typedef, etc) 🏆

I had some minor comment / style suggestions, but we could do them as a follow on PRs.

Here is a PR that targets your branch with some suggested comment refinements: wiedld#2

cc @waynexia (CommonSubExpr expert), @peter-toth (TreeNodeRefactor), and @crepererum who has also worked on this code in the past

datafusion/sqllogictest/test_files/common_subexpr_eliminate.slt

datafusion/optimizer/src/common_subexpr_eliminate.rs

alamb · 2024-03-20T13:36:32Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

@@ -571,66 +584,73 @@ enum VisitRecord {
    /// `usize` is the monotone increasing series number assigned in pre_visit().
    /// Starts from 0. Is used to index the identifier array `id_array` in post_visit().
    EnterMark(usize),
+    /// the node's children were skipped => jump to f_up on same node


This is the key fix, as I understand it -- the TreeNodeVisitor rewrite removed the notion of skipping sibling nodes during recursion, so this notion must be explicitly encoded in the subexpression rewrite pass.

For my understanding (please correct here 🙏🏼 ), prior to refactoring the 2 different visitors (TreeNodeVisitor and TreeNodeRewriter) had different transversal rules, and the semantics of skip vs stop mapped in weird ways (as nicely described here in the TreeNode refactor's "Rationale for Change").

Prior to the refactor:

the first visitor used skips (VisitRecursion::Skip), which never calls f_up afterwards.

the second visitor used RewriteRecursion::Stop which is functionally the same as VisitRecursion::Skip

Fix for the first visitor, for the short-circuited nodes:

Prior to the refactor, the first tree walk did not add short-circuited nodes to the stack:

short-circuited nodes were skipped, without adding to the stack

the old semantics meant skipping the post-visit (up) call

it never looked in the stack (since no post-visit on the old semantics)

Whereas with the new TreeNode refactor the jump became a skip-children-but-call-f_up, which meant:

prior to the fix in this PR, it was using the old control flow. So it still didn't add to visited_stack.

refactor means that it now does perform the f_up on jump

f_up checks the stack (in self.pop_enter_mark()) and would not find it

the original refactor got around this by removing the panic and adding an option

Edited: For the changes done in the TreeNodeVisitor for the short-circuited nodes, I basically made it work with a jump that does the post-visit f_up:

add a placeholder to the id_array so that it exists on f_up

keep the idx counts correct, for that IdArray abstraction

fixing the stack management:

push into the stack so it exists on the f_up

before this bugfix, it was either not finding it in the stack (None), or occasionally matching another/wrong entry in the stack (such as in our bug test case).

make a VisitRecord::JumpMark to be used in the stack

therefore the stack entry should always exist, including on short-circuited jumps.

Yes, that's how the API worked before and after #8891. I think the issue came in with this peter-toth#1 adjustment.
Thanks @wiedld for fixing it!

alamb · 2024-03-20T13:38:55Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+///
+/// - series_number. increased in fn_up, start from 1.
+/// - Identifier. is empty ("") if expr should not be considered for common elimation.
+type IdArray = Vec<(usize, Identifier)>;


THank you -- I agree introducing a new typedef helps to make the code eaiser to understand

datafusion/optimizer/src/common_subexpr_eliminate.rs

…nation test with other expr test

alamb · 2024-03-21T09:59:43Z

I plan to merge this later today unless someone else would like more time to review

waynexia · 2024-03-21T14:35:19Z

I plan to give it a review later tomorrow. We can merge this first as there is one approval if the release progress is blocked.

…liminate

alamb · 2024-03-21T18:19:55Z

I merged up to main and pushed a commit to this branch

I plan to give it a review later tomorrow. We can merge this first as there is one approval if the release progress is blocked.

Given I would like to unblock the release I do plan to merge it in later today. We'll plan to address any comments as follow on PRs.

Thank you very much @waynexia

…r common subexpr elimination optimization (apache#9685) * test(9678): reproducer of short-circuiting causing expr elimination to error * fix(9678): populate visited stack for short-circuited expressions, during the common-expr elimination optimization * test(9678): reproducer for optimizer error (in common_subexpr_eliminate), as seen in other test case * chore: extract id_array into abstraction, to make it more clear the relationship between the two visitors * refactor: tweak the fix and make code more explicit (JumpMark, node_to_identifier) * fix: get the series_number and curr_id with the correct self.current_idx, before the various incr/decr * chore: remove unneeded conditional check (already done earlier), and add code comments * Refine documentation in common_subexpr_eliminate.rs * chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimination test with other expr test --------- Co-authored-by: Andrew Lamb <[email protected]>

wiedld added 2 commits March 18, 2024 15:31

test(9678): reproducer of short-circuiting causing expr elimination t…

f1a01eb

…o error

fix(9678): populate visited stack for short-circuited expressions, du…

ff46614

…ring the common-expr elimination optimization

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 18, 2024

test(9678): reproducer for optimizer error (in common_subexpr_elimina…

de09f72

…te), as seen in other test case

wiedld force-pushed the 9678/common-subexpr-eliminate branch from f190b24 to 256a701 Compare March 19, 2024 18:30

github-actions bot added logical-expr Logical plan and expressions and removed logical-expr Logical plan and expressions labels Mar 19, 2024

wiedld commented Mar 20, 2024

View reviewed changes

wiedld added 4 commits March 20, 2024 00:14

chore: extract id_array into abstraction, to make it more clear the r…

97938ad

…elationship between the two visitors

refactor: tweak the fix and make code more explicit (JumpMark, node_t…

1931b3b

…o_identifier)

fix: get the series_number and curr_id with the correct self.current_…

552e0f4

…idx, before the various incr/decr

chore: remove unneeded conditional check (already done earlier), and …

a55ced4

…add code comments

wiedld force-pushed the 9678/common-subexpr-eliminate branch from b08acf7 to a55ced4 Compare March 20, 2024 07:15

alamb marked this pull request as ready for review March 20, 2024 13:42

Refine documentation in common_subexpr_eliminate.rs

04ddb6e

alamb mentioned this pull request Mar 20, 2024

Refine documentation in common_subexpr_eliminate.rs wiedld/arrow-datafusion#2

Merged

alamb approved these changes Mar 20, 2024

View reviewed changes

wiedld and others added 2 commits March 20, 2024 09:19

chore: Refine documentation in common_subexpr_eliminate

104eeb7

chore: cleanup -- fix 1 doc comment and consolidate common-expr-elimi…

6223239

…nation test with other expr test

alamb mentioned this pull request Mar 20, 2024

Release DataFusion 37.0.0 #9682

Closed

8 tasks

Merge remote-tracking branch 'apache/main' into 9678/common-subexpr-e…

8013ca6

…liminate

alamb merged commit 5f0cb49 into apache:main Mar 21, 2024
23 checks passed

alamb mentioned this pull request Mar 22, 2024

Branch for upgrade to DataFusion March 5 Upgrade alamb/datafusion#18

Open

This was referenced Mar 28, 2024

WIP: df patched upgrade to 2024-03-05, requiring new DF fixes wiedld/arrow-datafusion#3

Closed

WIP: df patched upgrade to 2024-03-05, requiring new DF fixes #9901

Closed

WIP: df patched upgrade to 2024-03-05 influxdata/arrow-datafusion#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

wiedld commented Mar 18, 2024 •

edited by alamb

Loading

wiedld Mar 20, 2024

wiedld Mar 20, 2024 •

edited

Loading

wiedld Mar 20, 2024 •

edited

Loading

wiedld Mar 20, 2024 •

edited by alamb

Loading

wiedld Mar 20, 2024 •

edited

Loading

alamb Mar 20, 2024

alamb left a comment

alamb Mar 20, 2024

wiedld Mar 20, 2024 •

edited

Loading

peter-toth Mar 21, 2024

alamb Mar 20, 2024

alamb commented Mar 21, 2024

waynexia commented Mar 21, 2024

alamb commented Mar 21, 2024

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

fix(9678): short circuiting prevented population of visited stack, for common subexpr elimination optimization #9685

Conversation

wiedld commented Mar 18, 2024 • edited by alamb Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

wiedld Mar 20, 2024

Choose a reason for hiding this comment

wiedld Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

wiedld Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

wiedld Mar 20, 2024 • edited by alamb Loading

Choose a reason for hiding this comment

wiedld Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

alamb Mar 20, 2024

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

alamb Mar 20, 2024

Choose a reason for hiding this comment

wiedld Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

Fix for the first visitor, for the short-circuited nodes:

peter-toth Mar 21, 2024

Choose a reason for hiding this comment

alamb Mar 20, 2024

Choose a reason for hiding this comment

alamb commented Mar 21, 2024

waynexia commented Mar 21, 2024

alamb commented Mar 21, 2024

wiedld commented Mar 18, 2024 •

edited by alamb

Loading

wiedld Mar 20, 2024 •

edited

Loading

wiedld Mar 20, 2024 •

edited

Loading

wiedld Mar 20, 2024 •

edited by alamb

Loading

wiedld Mar 20, 2024 •

edited

Loading

wiedld Mar 20, 2024 •

edited

Loading