
feat: optimize CoalesceBatches in limit #11983

Closed

Conversation

acking-you (Contributor):

Which issue does this PR close?

Closes #11980.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 14, 2024
@acking-you acking-you force-pushed the feat/optimize_coalesce_batches branch from ea419f5 to 6b6e031 Compare August 15, 2024 03:29
Review comments on the following code:

```rust
    plan: Arc<dyn crate::physical_plan::ExecutionPlan>,
) -> Result<Arc<dyn crate::physical_plan::ExecutionPlan>> {
    // If the entire table needs to be scanned, the limit at the upper level does not take effect
    if need_scan_all(plan.as_any()) {
```
Contributor (reviewer):
I think we should turn this around: allow any approved plan nodes instead of disallowing some.
Otherwise this will be wrong for any added/forgotten nodes or user-defined nodes.
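
For illustration, here is a minimal sketch of the allow-list approach suggested above, assuming a hypothetical `is_allowed` helper and an arbitrary choice of vetted nodes (not necessarily the set the PR ended up using):

```rust
// Hypothetical allow-list check: only plan nodes that have been explicitly
// vetted are treated as compatible with the optimization; anything unknown,
// including user-defined nodes, conservatively disables it.
use datafusion::physical_plan::{
    coalesce_batches::CoalesceBatchesExec, filter::FilterExec,
    projection::ProjectionExec, ExecutionPlan,
};

fn is_allowed(plan: &dyn ExecutionPlan) -> bool {
    let any = plan.as_any();
    any.is::<ProjectionExec>()
        || any.is::<FilterExec>()
        || any.is::<CoalesceBatchesExec>()
}
```

The advantage over a deny-list like `need_scan_all` is that a node the optimizer has never seen defaults to "not optimized" rather than silently producing wrong results.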

acking-you (Contributor, Author):

Ok, I get it. I'll give it a try.

acking-you (Contributor, Author), Aug 15, 2024:

> I think we should turn this around: allow any approved plan nodes instead of disallowing some. Otherwise this will be wrong for any added/forgotten nodes or user-defined nodes.

After carefully considering the revised plan, I realized that identifying the operators that require a full table scan is still necessary regardless of the changes, because this optimization always depends on whether the plan contains a full-table-scan operator and whether it contains a limit operator.
Here are the new changes: https://github.com/acking-you/arrow-datafusion/blob/feat/optimize_coalesce_batches/datafusion/core/src/physical_optimizer/coalesce_batches.rs#L101-L134
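
For illustration, a rough sketch of the limit-detection half of that check (a hypothetical helper; the linked code above is the authoritative version):

```rust
// Hypothetical sketch: the optimization applies only when the plan tree
// contains a limit node (GlobalLimitExec here), combined with a check such
// as need_scan_all for full-table-scan operators.
use datafusion::physical_plan::{limit::GlobalLimitExec, ExecutionPlan};

fn contains_limit(plan: &dyn ExecutionPlan) -> bool {
    plan.as_any().is::<GlobalLimitExec>()
        || plan.children().iter().any(|child| contains_limit(child))
}
```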

berkaysynnada (Contributor):

The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.

With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.

It seems that these recent improvements already address the objective you're aiming for, without the need to define constant thresholds. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.

I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?
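
For context, a minimal sketch of the idea behind pushing `fetch` into the coalescer (the `FetchAwareCoalescer` type is invented for illustration, not DataFusion's actual implementation): once the remaining row budget is exhausted, batches are truncated and emitted immediately instead of waiting for `target_batch_size` rows to accumulate.

```rust
use arrow::record_batch::RecordBatch;

/// Invented illustration type, not DataFusion's implementation.
struct FetchAwareCoalescer {
    /// Rows still owed to the pushed-down limit (the `fetch` value).
    remaining: usize,
}

impl FetchAwareCoalescer {
    /// Returns the (possibly truncated) batch to emit, or None once the
    /// limit is satisfied, letting the operator stop early instead of
    /// buffering until target_batch_size rows have accumulated.
    fn push(&mut self, batch: RecordBatch) -> Option<RecordBatch> {
        if self.remaining == 0 {
            return None;
        }
        let take = batch.num_rows().min(self.remaining);
        self.remaining -= take;
        Some(batch.slice(0, take))
    }
}
```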

acking-you (Contributor, Author):

> The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.
>
> With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.
>
> It seems that these recent improvements already address the objective you're aiming for, without the need to define constant thresholds. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.
>
> I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?

Thanks for providing the background on this optimization. I looked into the issues you mentioned, and it seems they've been resolved exactly as I hoped. Great job! I'll reference the information you compiled in my issue.

@acking-you acking-you closed this Aug 15, 2024
Labels
core Core DataFusion crate

Successfully merging this pull request may close these issues:
The batch_size selection for CoalesceBatches doesn't account for cases with a limit (#11980)