
The batch_size selection for CoalesceBatches doesn't account for cases with a limit #11980

Closed
acking-you opened this issue Aug 14, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@acking-you
Contributor

acking-you commented Aug 14, 2024

Is your feature request related to a problem or challenge?

The current CoalesceBatches optimization rule only creates CoalesceBatchesExec based on the batch_size configured in the config struct, which can cause issues in cases involving limit operators.

Consider the following scenario:
When a plan this rule applies to has a limit operator above CoalesceBatchesExec, and the limit value is less than batch_size, the entire computation can block until a full batch is collected, even though the limit has already been satisfied.

A possible operator tree:

SortExec: TopK(fetch=10), expr=[event_time@3 DESC]
  LocalLimitExec: fetch=100
    CoalesceBatchesExec: target_batch_size=8192
      FilterExec: event_time@3 = 10
        TableScanExec
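
To make the blocking concrete, here is a minimal, self-contained sketch in plain Rust. It is a simplified model of the buffering behavior, not the actual CoalesceBatchesExec code: the coalescer keeps accumulating filtered rows until target_batch_size is reached, so the LocalLimitExec above it sees nothing until 8192 rows have been buffered, even though it only needs 100.

```rust
// Simplified model of coalescing (not the real DataFusion code): small
// filtered batches are buffered until `target_batch_size` rows exist, and
// only then is a batch emitted to the operator above.
struct Coalescer {
    target_batch_size: usize,
    buffer: Vec<u64>, // stands in for buffered Arrow rows
}

impl Coalescer {
    fn push(&mut self, batch: Vec<u64>) -> Option<Vec<u64>> {
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target_batch_size {
            // Only now does a batch flow up to the limit operator.
            Some(std::mem::take(&mut self.buffer))
        } else {
            None // keep buffering: the limit above still sees nothing
        }
    }
}

fn main() {
    let mut coalescer = Coalescer { target_batch_size: 8192, buffer: Vec::new() };
    // Each filtered input batch yields only a handful of matching rows, so
    // thousands of input batches are consumed before anything is emitted,
    // even though the limit only needs 100 rows.
    for i in 0..10_000u64 {
        if let Some(batch) = coalescer.push(vec![i; 4]) {
            println!("rows buffered before the limit saw anything: {}", batch.len());
            break;
        }
    }
}
```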

My idea is:

  1. When a limit is present, change the fetch of CoalesceBatchesExec to limit / partition_count (see the sketch after this list).
  2. If the limit is small enough, skip the optimization entirely.

Of course, we also need to consider special cases: if the limit operator sits above a SortExec, for example, the limit shouldn't affect the batch_size value.
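
Here is a hedged sketch of that rule change. The helper name (effective_target_batch_size) is hypothetical, and the "small enough" cutoff is an assumption, since the issue doesn't fix a threshold:

```rust
/// Pick an effective batch size per partition given an upstream limit
/// (hypothetical helper, not the actual DataFusion optimizer code).
/// Returns `None` when coalescing should be skipped entirely because the
/// per-partition limit is too small to be worth buffering for.
fn effective_target_batch_size(
    target_batch_size: usize,
    limit: Option<usize>,
    partition_count: usize,
) -> Option<usize> {
    match limit {
        // No limit above: keep the configured batch size.
        None => Some(target_batch_size),
        Some(limit) => {
            // Spread the limit across partitions (idea 1 above), rounding up.
            let per_partition = (limit + partition_count - 1) / partition_count;
            if per_partition <= 1 {
                // Limit is tiny: coalescing buys nothing (idea 2 above).
                None
            } else {
                Some(per_partition.min(target_batch_size))
            }
        }
    }
}

fn main() {
    // LocalLimitExec: fetch=100 over 4 partitions with target_batch_size=8192
    // would wait for at most 25 rows per partition instead of 8192.
    assert_eq!(effective_target_batch_size(8192, Some(100), 4), Some(25));
    // No limit (or a limit above a blocking SortExec): leave the config alone.
    assert_eq!(effective_target_batch_size(8192, None, 4), Some(8192));
}
```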

Describe the solution you'd like

The fetch is determined based on the limit operator's value and the current parallelism.

Describe alternatives you've considered

When an operator between the limit and the scan requires consuming its entire input (e.g., SortExec), batch_size is not handled specially.

Additional context

No response

@acking-you
Contributor Author

acking-you commented Aug 14, 2024
As @berkaysynnada mentioned, this issue has been resolved.

The issue explained in #9792 was resolved with the implementation of #11652. This fix handles the problem related to waiting for the coalescer buffer to fill when a Limit -> ... -> CoalesceBatches pattern exists. The approach was to push down the limit (fetch + skip) into CoalesceBatches and eliminate the limit when it was no longer needed.
With #12003, it appears that additional corner cases are being addressed. It further refines the process by pushing limits as far down the execution plan as possible and removing any redundant limits.
It seems that these recent improvements already address the objective you're aiming for, without the need to define a constant threshold. I think there is no difference between using a limit without coalescing and using a coalesce that can internally handle limits.
I am curious about your thoughts. Do you still see a need for additional optimization? If so, could you provide an example scenario or a test case that would help us discuss this further?
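
For illustration, here is a minimal sketch of the fetch-aware coalescing behavior described above. It is a simplified standalone model, not the code from #11652 or #12003: once the pushed-down fetch is satisfied, the coalescer truncates its buffer and finishes, so nothing waits for the buffer to reach target_batch_size.

```rust
// Simplified model of a coalescer that honors a pushed-down fetch
// (illustrative only, not the actual DataFusion implementation).
struct LimitedCoalescer {
    target_batch_size: usize,
    fetch: Option<usize>, // limit pushed down from an upstream Limit operator
    emitted: usize,
    buffer: Vec<u64>, // stands in for buffered Arrow rows
}

enum Poll {
    Batch(Vec<u64>),
    Pending,
    Finished(Vec<u64>),
}

impl LimitedCoalescer {
    fn push(&mut self, batch: Vec<u64>) -> Poll {
        self.buffer.extend(batch);
        // If the fetch is satisfied, flush the (truncated) buffer and stop
        // early instead of waiting for target_batch_size rows.
        if let Some(fetch) = self.fetch {
            let remaining = fetch - self.emitted;
            if self.buffer.len() >= remaining {
                self.buffer.truncate(remaining);
                self.emitted = fetch;
                return Poll::Finished(std::mem::take(&mut self.buffer));
            }
        }
        if self.buffer.len() >= self.target_batch_size {
            let out = std::mem::take(&mut self.buffer);
            self.emitted += out.len();
            Poll::Batch(out)
        } else {
            Poll::Pending
        }
    }
}

fn main() {
    let mut c = LimitedCoalescer {
        target_batch_size: 8192,
        fetch: Some(100),
        emitted: 0,
        buffer: Vec::new(),
    };
    // Small filtered batches arrive; the stream ends as soon as 100 rows exist.
    for i in 0..1_000u64 {
        if let Poll::Finished(batch) = c.push(vec![i; 8]) {
            println!("finished after emitting {} rows", batch.len());
            break;
        }
    }
}
```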
