Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not add redundant subquery ordering into plan #12003

Merged
merged 47 commits into from
Aug 19, 2024

Conversation

mertak-synnada
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR solves an issue with the subquery sortings. If a main query does not support the same sorting with the subquery and the subquery has no limit, then the ordering is redundant (unless used in an Aggregation but it's being handled in a different plan type anyway).

While implementing this behavior we discovered that it's only possible to use WITH ORDER syntax with EXTERNAL tables. That's why we needed to add LIMIT xx conditions to subqueries so that the system can understand they're valid orderings. The sql-parser issue will be handled in a different PR to the sql-parser project and those tests will be changed back to their original forms

Other than that, we discovered there are some redundant LimitExec plans even if the child plan has fetch support. Also, fixed this issue and replaced GlobalLimitExec plans with skip=0 with LocalLimitExec plan type.

Also added Limit push down support to the UnionExec plan type.

What changes are included in this PR?

Explained above

Are these changes tested?

Yes

Are there any user-facing changes?

No

mertak-synnada and others added 30 commits August 6, 2024 09:38
…ting' into feature/Sort-For-Subqueries

# Conflicts:
#	datafusion/sqllogictest/test_files/window.slt
remove unnecessary limit plans when used with sort + fetch
add test case for Sort and Limit with offset
push down limit even if a child with no fetch appears when the child supports push down
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
mertak-synnada and others added 12 commits August 14, 2024 11:07
…eature/Sort-For-Subqueries

# Conflicts:
#	datafusion/core/src/physical_optimizer/enforce_distribution.rs
#	datafusion/core/src/physical_optimizer/enforce_sorting.rs
#	datafusion/core/src/physical_optimizer/limit_pushdown.rs
#	datafusion/core/src/physical_optimizer/sort_pushdown.rs
#	datafusion/physical-plan/src/coalesce_batches.rs
#	datafusion/physical-plan/src/limit.rs
#	datafusion/physical-plan/src/sorts/sort.rs
#	datafusion/sqllogictest/test_files/order.slt
add pushes_global_limit_into_multiple_fetch_plans test case
change limit_pushdown.rs as manual top down operator and simplify algorithm by supporting most parent node remove and other pushdown cases
@github-actions github-actions bot added sql SQL Planner physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Aug 15, 2024
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I collaborated with @mertak-synnada on this and it LGTM. A lot of plan simplifications/improvements! 🚀

cc @alamb

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

I plan to review this PR later today or tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @melgenek and @ozankabak

I had a small request related to some of the tests but otherwise looks good to me.

I didn't review the code in detail, but I did review the plan changes carefully and they looked good to me. Thank you again -- these plans are looking really quite good now

\n TableScan: aggregate_test_100 projection=[c1]\
\n SubqueryAlias: t2\
\n TableScan: aggregate_test_100 projection=[c1]",
\n Sort: t1.c1 ASC NULLS FIRST, fetch=1\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This plan looks better to me because since the sort already has a fetch there is no reason to also apply a limit afterwards

07)------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
08)--------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)]
09)----------------MemoryExec: partitions=1, partition_sizes=[1]
01)SortPreservingMergeExec: [trace_id@0 ASC NULLS LAST], fetch=4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sure is getting easier to understand what the plans are doing here

@@ -3898,6 +3898,7 @@ SELECT * FROM (
) as lhs RIGHT JOIN (
SELECT * from right_table_no_nulls
ORDER BY b
LIMIT 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than modifying the existing test, can you please add a new one ? If we modify the existing test we may also inadvertently be changing the coverage of this test

Likewise for the other tests in this file that now have limits

Copy link
Contributor

@ozankabak ozankabak Aug 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not leave the test as is because the sort in the subquery gets optimized away (rightfully), breaking the intent of the test. These kind of tests seems to be originally written with the assumption that the sort wouldn't be optimized away, and the subquery with ORDER BY was mimicking a table with order.

Placing the limit prevents the optimizer from optimizing the sort away, keeping the intent of the test intact. Keeping the test as is gives us a degenerate test with our new, more powerful optimizer.

Once we get WITH ORDER support for memory tables, we will simply use a memory table with ordering in these tests, and they will be reverted back fully (i.e. the LIMIT will be removed).

a 5 -38
a 5 -54
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this chnage looks like it is because the test itself is not deterministic

@ozankabak ozankabak merged commit 02eab80 into apache:main Aug 19, 2024
24 checks passed
wiedld added a commit to influxdata/arrow-datafusion that referenced this pull request Aug 28, 2024
@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

This PR seems to have introduced a bug #12423

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants