-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce copying in CoalesceBatchesExec
for StringViews
#11628
Comments
This makes sense to me and I think your analysis is very clear. Thank you
I think this is what we should pursue and I think what is covered by #7957. As you say it is likely the thing that will perform the best. Maybe we could explore a solution that builds an the output while let Some(batch) = input.read_batch() {
// append new rows to inprogress output, producing a complete batch if ready
if let Some(output_batch) = coalescer.push_batch(batch) {
output.emit(output_batch)
}
} The idea would be that struct Coalescer {
in_progress: StringViewBuilder
// and similiar things for other types 🤔
}
impl Coalescer {
fn push_bach(&mut self, batch: RecordBatch) -> Option<RecordBatch> {
// copy relevant values to self.in_progress
// if in_progress.len is greater than threshold emit a batch
}
} You might recognize this high level structure from #11610 :)
100% agree |
This benchmark is a great inspiration, I think this query has low selectivity and processed strings are longer, so it's preferred to do early |
Is your feature request related to a problem or challenge?
In pictures, what #11587 does is like
this (to ensure lots of unreachable "garbage" does not accumulate in the output batch)
However, as @2010YOUY01 pointed out in https://github.com/apache/datafusion/pull/11587/files#r1686678665
This implementation will effectively copy the data twice -- once for the call to
gc
and once for the call coalsece batches.Due to the nature of
StringView
the actual strings vaules are only copied once, but theu128
view value will be copied twiceDescribe the solution you'd like
Somehow structure the code to avoid copying the views again. Like this
Describe alternatives you've considered
https://github.com/apache/datafusion/pull/11587/files#r1687099239
Additional context
#7957 is another related idea for avoding copies
The text was updated successfully, but these errors were encountered: