
feat: optimize CoalesceBatches in limit #11983

Closed

Conversation

acking-you (Contributor):

Which issue does this PR close?

Closes #11980.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 14, 2024
@acking-you acking-you force-pushed the feat/optimize_coalesce_batches branch from ea419f5 to 6b6e031 Compare August 15, 2024 03:29
Review comments on the following code:

```rust
    plan: Arc<dyn crate::physical_plan::ExecutionPlan>,
) -> Result<Arc<dyn crate::physical_plan::ExecutionPlan>> {
    // If the entire table needs to be scanned, the limit at the upper level does not take effect
    if need_scan_all(plan.as_any()) {
```
Contributor (reviewer):
I think we should turn this around: allow any approved plan nodes instead of disallowing some.
Otherwise this will be wrong for any added/forgotten nodes or user-defined nodes.
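
For illustration, here is a minimal sketch of the allow-list approach suggested above, assuming a hypothetical `is_allowed` helper and an arbitrary choice of vetted nodes (not necessarily the set the PR ended up using):

```rust
// Hypothetical allow-list check: only plan nodes that have been explicitly
// vetted are treated as compatible with the optimization; anything unknown,
// including user-defined nodes, conservatively disables it.
use datafusion::physical_plan::{
    coalesce_batches::CoalesceBatchesExec, filter::FilterExec,
    projection::ProjectionExec, ExecutionPlan,
};

fn is_allowed(plan: &dyn ExecutionPlan) -> bool {
    let any = plan.as_any();
    any.is::<ProjectionExec>()
        || any.is::<FilterExec>()
        || any.is::<CoalesceBatchesExec>()
}
```

The advantage over a deny-list like `need_scan_all` is that a node the optimizer has never seen defaults to "not optimized" rather than silently producing wrong results.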

acking-you (Contributor, Author):

Ok, I get it. I'll give it a try.

acking-you (Contributor, Author), Aug 15, 2024:

> I think we should turn this around: allow any approved plan nodes instead of disallowing some. Otherwise this will be wrong for any added/forgotten nodes or user-defined nodes.

After carefully considering the revised plan, I realized that identifying the operators that require a full table scan is still necessary regardless of the changes, because this optimization always depends on whether the plan contains a full-table-scan operator and whether it contains a limit operator.
Here are the new changes: https://github.com/acking-you/arrow-datafusion/blob/feat/optimize_coalesce_batches/datafusion/core/src/physical_optimizer/coalesce_batches.rs#L101-L134
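
For illustration, a rough sketch of the limit-detection half of that check (a hypothetical helper; the linked code above is the authoritative version):

```rust
// Hypothetical sketch: the optimization applies only when the plan tree
// contains a limit node (GlobalLimitExec here), combined with a check such
// as need_scan_all for full-table-scan operators.
use datafusion::physical_plan::{limit::GlobalLimitExec, ExecutionPlan};

fn contains_limit(plan: &dyn ExecutionPlan) -> bool {
    plan.as_any().is::<GlobalLimitExec>()
        || plan.children().iter().any(|child| contains_limit(child))
}
```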

berkaysynnada (Contributor):

The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.

With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.

It seems that these recent improvements already address the objective you're aiming for, without the need to define constant thresholds. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.

I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?
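
For context, a minimal sketch of the idea behind pushing `fetch` into the coalescer (the `FetchAwareCoalescer` type is invented for illustration, not DataFusion's actual implementation): once the remaining row budget is exhausted, batches are truncated and emitted immediately instead of waiting for `target_batch_size` rows to accumulate.

```rust
use arrow::record_batch::RecordBatch;

/// Invented illustration type, not DataFusion's implementation.
struct FetchAwareCoalescer {
    /// Rows still owed to the pushed-down limit (the `fetch` value).
    remaining: usize,
}

impl FetchAwareCoalescer {
    /// Returns the (possibly truncated) batch to emit, or None once the
    /// limit is satisfied, letting the operator stop early instead of
    /// buffering until target_batch_size rows have accumulated.
    fn push(&mut self, batch: RecordBatch) -> Option<RecordBatch> {
        if self.remaining == 0 {
            return None;
        }
        let take = batch.num_rows().min(self.remaining);
        self.remaining -= take;
        Some(batch.slice(0, take))
    }
}
```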

acking-you (Contributor, Author):

> The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.
>
> With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.
>
> It seems that these recent improvements already address the objective you're aiming for, without the need to define constant thresholds. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.
>
> I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?

Thanks for providing the background on this optimization. I looked into the issues you mentioned, and it seems they've been resolved exactly as I hoped. Great job! I'll reference the information you compiled in my issue.

@acking-you acking-you closed this Aug 15, 2024
Labels
core Core DataFusion crate

Successfully merging this pull request may close these issues:
The batch_size selection for CoalesceBatches doesn't account for cases with a limit (#11980)