The batch_size selection for CoalesceBatches doesn't account for cases with a limit
#11980
Labels: enhancement (New feature or request)
Is your feature request related to a problem or challenge?
The current CoalesceBatches optimization rule creates CoalesceBatchesExec using only the batch_size configured in the config struct, which can cause problems in plans that involve limit operators.
Consider the following scenario:
When a rule-compliant operation includes a limit operator on top of CoalesceBatchesExec, and the limit value is less than the batch_size, the entire computation might be blocked until a full Batch is collected, even though the limit has already been reached.
A possible operator tree:
My idea is: set the batch size used by CoalesceBatchesExec to limit/partition.
Of course, we also need to consider special cases: for example, if the limit operator is above SortExec, the limit shouldn't affect the batch_size value.
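To make the idea concrete, here is a rough sketch on a toy plan tree. It does not use DataFusion's actual ExecutionPlan or optimizer-rule APIs; the Plan enum, target_for_limit, and push_limit are hypothetical names invented for illustration. It assumes the coalesce target should be ceil(limit / partitions), capped at the configured batch_size, and that SortExec acts as a barrier because it has to read all of its input anyway.

```rust
/// Toy stand-in for a physical plan node (hypothetical, not DataFusion's types).
#[derive(Debug, PartialEq)]
enum Plan {
    Scan,
    CoalesceBatches { target: usize, input: Box<Plan> },
    Sort { input: Box<Plan> },
    Limit { fetch: usize, input: Box<Plan> },
}

/// Rows to coalesce per partition: ceil(limit / partitions),
/// at least 1 and never more than the configured batch_size.
fn target_for_limit(batch_size: usize, limit: usize, partitions: usize) -> usize {
    let p = partitions.max(1);
    let per_partition = (limit + p - 1) / p;
    per_partition.min(batch_size).max(1)
}

/// Walk the plan top-down, carrying the tightest limit seen so far.
/// `pending` is None when no limit applies to the current subtree.
fn push_limit(plan: Plan, pending: Option<usize>, partitions: usize) -> Plan {
    match plan {
        Plan::Limit { fetch, input } => {
            // Nested limits: the smaller fetch wins.
            let tightest = pending.map_or(fetch, |p| p.min(fetch));
            Plan::Limit {
                fetch,
                input: Box::new(push_limit(*input, Some(tightest), partitions)),
            }
        }
        // A sort must consume its whole input, so the limit stops here.
        Plan::Sort { input } => Plan::Sort {
            input: Box::new(push_limit(*input, None, partitions)),
        },
        Plan::CoalesceBatches { target, input } => {
            let target = match pending {
                Some(limit) => target_for_limit(target, limit, partitions),
                None => target,
            };
            Plan::CoalesceBatches {
                target,
                input: Box::new(push_limit(*input, pending, partitions)),
            }
        }
        Plan::Scan => Plan::Scan,
    }
}
```

With batch_size = 8192, limit = 10 and 4 partitions, the coalesce target under these assumptions would drop to 3 rows, while the same plan with a Sort between the limit and the coalesce node keeps 8192.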
Describe the solution you'd like
The fetch is determined based on the limit operator's value and the current parallelism.
Describe alternatives you've considered
When operators downstream of the limit operator require a full table scan (e.g., SortExec), batch_size is not handled specially.
Additional context
No response