
The batch_size selection for CoalesceBatches doesn't account for cases with a limit #11980

Closed
acking-you opened this issue Aug 14, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@acking-you
Contributor

acking-you commented Aug 14, 2024

Is your feature request related to a problem or challenge?

The current CoalesceBatches optimization rule only creates CoalesceBatchesExec based on the batch_size configured in the config struct, which can cause issues in cases involving limit operators.

Consider the following scenario:
When a plan this rule applies to has a limit operator above CoalesceBatchesExec, and the limit value is less than batch_size, the entire computation can block until a full batch is collected, even though the limit has already been satisfied.

A possible operator tree:

SortExec: TopK(fetch=10), expr=[event_time@3 DESC]
  LocalLimitExec: fetch=100
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: event_time@3 = 10
        TableScanExec
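
To make the blocking concrete, here is a minimal, self-contained sketch in plain Rust. It is a simplified model of the buffering behavior, not the actual CoalesceBatchesExec code: the coalescer keeps accumulating filtered rows until target_batch_size is reached, so the LocalLimitExec above it sees nothing until 8192 rows have been buffered, even though it only needs 100.

```rust
// Simplified model of coalescing (not the real DataFusion code): small
// filtered batches are buffered until `target_batch_size` rows exist, and
// only then is a batch emitted to the operator above.
struct Coalescer {
    target_batch_size: usize,
    buffer: Vec<u64>, // stands in for buffered Arrow rows
}

impl Coalescer {
    fn push(&mut self, batch: Vec<u64>) -> Option<Vec<u64>> {
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target_batch_size {
            // Only now does a batch flow up to the limit operator.
            Some(std::mem::take(&mut self.buffer))
        } else {
            None // keep buffering: the limit above still sees nothing
        }
    }
}

fn main() {
    let mut coalescer = Coalescer { target_batch_size: 8192, buffer: Vec::new() };
    // Each filtered input batch yields only a handful of matching rows, so
    // thousands of input batches are consumed before anything is emitted,
    // even though the limit only needs 100 rows.
    for i in 0..10_000u64 {
        if let Some(batch) = coalescer.push(vec![i; 4]) {
            println!("rows buffered before the limit saw anything: {}", batch.len());
            break;
        }
    }
}
```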

My idea is:

  1. When a limit is present, change the fetch of CoalesceBatchesExec to limit / partition_count (see the sketch after this list).
  2. If the limit is small enough, skip the optimization entirely.

Of course, we also need to consider special cases: if the limit operator sits above a SortExec, for example, the limit shouldn't affect the batch_size value.
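
Here is a hedged sketch of that rule change. The helper name (effective_target_batch_size) is hypothetical, and the "small enough" cutoff is an assumption, since the issue doesn't fix a threshold:

```rust
/// Pick an effective batch size per partition given an upstream limit
/// (hypothetical helper, not the actual DataFusion optimizer code).
/// Returns `None` when coalescing should be skipped entirely because the
/// per-partition limit is too small to be worth buffering for.
fn effective_target_batch_size(
    target_batch_size: usize,
    limit: Option<usize>,
    partition_count: usize,
) -> Option<usize> {
    match limit {
        // No limit above: keep the configured batch size.
        None => Some(target_batch_size),
        Some(limit) => {
            // Spread the limit across partitions (idea 1 above), rounding up.
            let per_partition = (limit + partition_count - 1) / partition_count;
            if per_partition <= 1 {
                // Limit is tiny: coalescing buys nothing (idea 2 above).
                None
            } else {
                Some(per_partition.min(target_batch_size))
            }
        }
    }
}

fn main() {
    // LocalLimitExec: fetch=100 over 4 partitions with target_batch_size=8192
    // would wait for at most 25 rows per partition instead of 8192.
    assert_eq!(effective_target_batch_size(8192, Some(100), 4), Some(25));
    // No limit (or a limit above a blocking SortExec): leave the config alone.
    assert_eq!(effective_target_batch_size(8192, None, 4), Some(8192));
}
```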

Describe the solution you'd like

The fetch is determined based on the limit operator's value and the current parallelism.

Describe alternatives you've considered

When an operator between the limit and the scan requires consuming its entire input (e.g., SortExec), batch_size is not handled specially.

Additional context

No response

@acking-you
Contributor Author

acking-you commented Aug 14, 2024
As @berkaysynnada mentioned, this issue has been resolved.

The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.
With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.
It seems that these recent improvements already address the objective you're aiming for, without the need to define a constant threshold. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.
I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?
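
For illustration, here is a minimal sketch of the fetch-aware coalescing behavior described above. It is a simplified standalone model, not the code from #11652 or #12003: once the pushed-down fetch is satisfied, the coalescer truncates its buffer and finishes, so nothing waits for the buffer to reach target_batch_size.

```rust
// Simplified model of a coalescer that honors a pushed-down fetch
// (illustrative only, not the actual DataFusion implementation).
struct LimitedCoalescer {
    target_batch_size: usize,
    fetch: Option<usize>, // limit pushed down from an upstream Limit operator
    emitted: usize,
    buffer: Vec<u64>, // stands in for buffered Arrow rows
}

enum Poll {
    Batch(Vec<u64>),
    Pending,
    Finished(Vec<u64>),
}

impl LimitedCoalescer {
    fn push(&mut self, batch: Vec<u64>) -> Poll {
        self.buffer.extend(batch);
        // If the fetch is satisfied, flush the (truncated) buffer and stop
        // early instead of waiting for target_batch_size rows.
        if let Some(fetch) = self.fetch {
            let remaining = fetch - self.emitted;
            if self.buffer.len() >= remaining {
                self.buffer.truncate(remaining);
                self.emitted = fetch;
                return Poll::Finished(std::mem::take(&mut self.buffer));
            }
        }
        if self.buffer.len() >= self.target_batch_size {
            let out = std::mem::take(&mut self.buffer);
            self.emitted += out.len();
            Poll::Batch(out)
        } else {
            Poll::Pending
        }
    }
}

fn main() {
    let mut c = LimitedCoalescer {
        target_batch_size: 8192,
        fetch: Some(100),
        emitted: 0,
        buffer: Vec::new(),
    };
    // Small filtered batches arrive; the stream ends as soon as 100 rows exist.
    for i in 0..1_000u64 {
        if let Poll::Finished(batch) = c.push(vec![i; 8]) {
            println!("finished after emitting {} rows", batch.len());
            break;
        }
    }
}
```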
