Replace `execution_mode` with `emission_type` and `boundedness` #13823

jayzhan-synnada · 2024-12-18T08:24:39Z

Which issue does this PR close?

Closes #.

Rationale for this change

The existing `execution_mode` is not enough to represent the different combinations of execution properties. There are different emission types, such as emitting incrementally or emitting at the end. We might also make different decisions based on the combination of the source's boundedness and the emission type, for example, whether a plan is pipeline breaking (it cannot generate a result if it has unbounded input and must emit at the end).

In this PR we introduce `emission_type` and `boundedness` to replace the existing execution mode and to represent more execution properties.
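To make the combination concrete, here is a minimal sketch using simplified stand-ins for the two properties. The enum and variant names follow the ones quoted later in this conversation (`Boundedness::Unbounded { requires_infinite_memory }`, `EmissionType::Incremental`, `Final`, `Both`), but the definitions and the free function `is_pipeline_breaking` below are illustrative assumptions, not DataFusion's actual code.

```rust
/// Simplified stand-in for the PR's `Boundedness` (not the real definition).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Boundedness {
    /// The stream eventually ends.
    Bounded,
    /// The stream may run forever.
    Unbounded { requires_infinite_memory: bool },
}

impl Boundedness {
    pub fn is_unbounded(&self) -> bool {
        matches!(self, Boundedness::Unbounded { .. })
    }
}

/// Simplified stand-in for the PR's `EmissionType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmissionType {
    /// Results are produced while input is still arriving.
    Incremental,
    /// Results are produced only after all input has been consumed.
    Final,
    /// Some results are produced incrementally, the rest at the end.
    Both,
}

/// Hypothetical helper: a plan can never produce a result when its input
/// never ends but it only emits once the input is exhausted.
pub fn is_pipeline_breaking(input: Boundedness, emission: EmissionType) -> bool {
    input.is_unbounded() && emission == EmissionType::Final
}
```

Under these assumptions, only the combination of an unbounded input with `Final` emission is flagged; a bounded input with `Final` emission simply delays output until the input ends.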

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

- Introduced `Incremental` execution mode alongside existing modes in the DataFusion execution plan.
- Updated various execution plans to utilize the new `Incremental` mode where applicable, enhancing streaming capabilities.
- Added `bitflags` dependency to `Cargo.toml` for better management of execution modes.
- Adjusted execution mode handling in multiple files to ensure compatibility with the new structure.

Signed-off-by: Jay Zhan <[email protected]>

- Removed the `ExecutionMode` parameter from `PlanProperties` across multiple physical plan implementations.
- Updated related functions to utilize the new structure, ensuring compatibility with the changes.
- Adjusted comments and cleaned up imports to reflect the removal of execution mode handling.

This refactor simplifies the execution plan properties and enhances maintainability.

…sionType`

- Removed the `ExecutionMode` parameter from `PlanProperties` and related implementations across multiple files.
- Introduced `EmissionType` to better represent the output characteristics of execution plans.
- Updated functions and tests to reflect the new structure, ensuring compatibility and enhancing maintainability.
- Cleaned up imports and adjusted comments accordingly.

This refactor simplifies the execution plan properties and improves the clarity of memory handling in execution plans.

Signed-off-by: Jay Zhan <[email protected]>

- Updated test cases in `sanity_checker.rs` to reflect changes in expected outcomes for bounded and unbounded joins, ensuring accurate test coverage.
- Simplified the `is_pipeline_breaking` method in `execution_plan.rs` to clarify the conditions under which a plan is considered pipeline-breaking.
- Enhanced the emission type determination logic in `execution_plan.rs` to prioritize `Final` over `Both` and `Incremental`, improving clarity in execution plan behavior.
- Adjusted join type handling in `hash_join.rs` to classify `Right` joins as `Incremental`, allowing for immediate row emission.

These changes improve the accuracy of tests and the clarity of execution plan properties.

- Updated multiple execution plan implementations to replace `unimplemented!()` with `EmissionType::Incremental`, ensuring that the emission type is correctly defined for various plans.
- This change enhances the clarity and functionality of the execution plans by explicitly specifying their emission behavior.

These updates contribute to a more robust execution plan framework within the DataFusion project.

- Updated the `JoinType` enum in `join_type.rs` to include detailed descriptions for each join type, improving clarity on their behavior and expected results.
- Modified the emission type logic in `hash_join.rs` to ensure that `Right` and `RightAnti` joins are classified as `Incremental`, allowing for immediate row emission when applicable.

These changes improve the documentation and functionality of join operations within the DataFusion project.

- Updated the emission type determination in `SortMergeJoinExec` and `SymmetricHashJoinExec` to utilize the `emission_type_from_children` function, enhancing the accuracy of emission behavior based on input characteristics.
- Clarified comments in `sort.rs` regarding the conditions under which results are emitted, emphasizing the relationship between input sorting and emission type.
- These changes improve the clarity and functionality of the execution plans within the DataFusion project, ensuring more robust handling of emission types.

- Updated the `emission_type_from_children` function to accept an iterator instead of a slice, enhancing flexibility in how child execution plans are passed.
- Modified the `SymmetricHashJoinExec` implementation to utilize the new function signature, improving code clarity and maintainability.

These changes streamline the emission type determination process within the DataFusion project, contributing to a more robust execution plan framework.
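The commit messages above describe two properties of `emission_type_from_children`: it prioritizes `Final` over `Both` over `Incremental`, and it accepts an iterator rather than a slice. The sketch below illustrates both under a simplifying assumption: it takes the children's emission types directly, whereas the real function operates on child execution plans. The body is a guess at the described behavior, not a copy of the PR's code.

```rust
/// Simplified stand-in for the PR's `EmissionType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmissionType {
    Incremental,
    Final,
    Both,
}

/// Sketch: combine the children's emission types, preferring `Final`,
/// then `Both`, then `Incremental`. Accepting `impl IntoIterator` lets
/// callers pass a Vec, an array, or a mapped iterator without collecting.
pub fn emission_type_from_children(
    children: impl IntoIterator<Item = EmissionType>,
) -> EmissionType {
    let mut saw_both = false;
    for child in children {
        match child {
            // One child that only emits at the end dominates the result.
            EmissionType::Final => return EmissionType::Final,
            EmissionType::Both => saw_both = true,
            EmissionType::Incremental => {}
        }
    }
    if saw_both {
        EmissionType::Both
    } else {
        // Assumption: no children (or all incremental) yields `Incremental`.
        EmissionType::Incremental
    }
}
```

With this shape, a two-input operator can pass its children's emission types straight through without first building a slice.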

- Introduced `boundedness` and `pipeline_behavior` methods to the `ExecutionPlanProperties` trait, improving the handling of execution plan characteristics.
- Updated the `CsvExec`, `SortExec`, and related implementations to utilize the new methods for determining boundedness and emission behavior.
- Refactored the `ensure_distribution` function to use the new boundedness logic, enhancing clarity in distribution decisions.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

…dling

- Updated multiple execution plan implementations to incorporate `Boundedness` and `EmissionType`, improving the clarity and functionality of execution plans.
- Replaced instances of `unimplemented!()` with appropriate emission types, ensuring that plans correctly define their output behavior.
- Refactored the `PlanProperties` structure to utilize the new boundedness logic, enhancing decision-making in execution plans.
- These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project.

- Updated the condition for checking memory requirements in execution plans from `has_finite_memory()` to `boundedness().requires_finite_memory()`, improving clarity in memory management.
- This change enhances the robustness of execution plans within the DataFusion project by ensuring more accurate assessments of memory constraints.

- Updated conditions for checking boundedness in various execution plans to use `is_unbounded()` instead of `requires_finite_memory()`, enhancing clarity in memory management.
- Adjusted the `PlanProperties` structure to reflect these changes, ensuring more accurate assessments of memory constraints across the DataFusion project.
- These modifications contribute to a more robust and maintainable execution plan framework, improving the handling of boundedness in execution strategies.

…Exec` implementation

- Eliminated the outdated comment suggesting a switch to unbounded execution with finite memory, streamlining the code and improving clarity.
- This change contributes to a cleaner and more maintainable codebase within the DataFusion project.

- Updated the `is_pipeline_breaking` method to use `requires_finite_memory()` for improved clarity in determining pipeline behavior.
- Enhanced the `Boundedness` enum to include detailed documentation on memory requirements for unbounded streams.
- Refactored `compute_properties` methods in `GlobalLimitExec` and `LocalLimitExec` to directly use the input's boundedness, simplifying the logic.
- Adjusted emission type determination in `NestedLoopJoinExec` to utilize the `emission_type_from_children` function, ensuring accurate output behavior based on input characteristics.

These changes contribute to a more robust and maintainable execution plan framework within the DataFusion project, improving clarity and functionality in handling boundedness and emission types.

- Removed the `OptionalEmissionType` struct from `plan_properties.rs`, simplifying the codebase.
- Updated the `is_pipeline_breaking` function in `execution_plan.rs` for improved readability by formatting the condition across multiple lines.
- Adjusted the `GlobalLimitExec` implementation in `limit.rs` to directly use the input's boundedness, enhancing clarity in memory management.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, improving the handling of emission types and boundedness.

…ndling

- Updated the `compute_properties` methods in both `GlobalLimitExec` and `LocalLimitExec` to replace `EmissionType::Final` with `Boundedness::Bounded`, reflecting that limit operations always produce a finite number of rows.
- Changed the input's boundedness reference to `pipeline_behavior()` for improved clarity in execution plan properties.

These changes contribute to a more streamlined and maintainable execution plan framework within the DataFusion project, enhancing the handling of boundedness in limit operations.

ozankabak

Both @berkaysynnada and I collaborated extensively with @jayzhan-synnada on this and reviewed this API change carefully. I am quite happy with this improvement, as DataFusion now exposes much more information on the nature of operators' output streams, leaving more room for optimizations and innovative uses in downstream projects.

Given that it is an API change, I think it would be great if @alamb and other interested community members can take a look and review if possible.

With this PR, we will have the following future work:

- There are some minor TODOs which will enable DataFusion to expose emission and boundedness information less conservatively for certain operators when suitable conditions exist.
- The API implicitly assumes that all output streams of an operator have the same emission and boundedness characteristics (which covers the vast majority of practical use cases). In the future, if we discover that we need to expose this information in a per-partition manner, we can extend the design to accommodate that.

findepi · 2024-12-18T08:42:44Z

datafusion/ffi/src/plan_properties.rs

    Bounded,
-    Unbounded,


Was the purpose of FFI to provide a stable ABI? cc @timsaucer

I think so too -- Maybe we can leave the FFI_ExecutionMode present and mark it deprecated or something 🤔

Yes, the idea is to be stable. I don't think the real problem here is the change from FFI_ExecutionMode to FFI_Boundedness, but rather the way it breaks libraries working across the boundary, which will expect one and receive the other. I think the current impact is minimal since the FFI interfaces have just been released and I don't think many people have had an opportunity to use them. However, I will open an issue about adding a feature to validate the versioning across the boundary. I think that will be necessary going forward.

findepi · 2024-12-18T08:45:48Z

datafusion/common/src/join_type.rs

@@ -28,21 +28,26 @@ use crate::{DataFusionError, Result};
 /// Join type
 #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Hash)]
 pub enum JoinType {
-    /// Inner Join
+    /// Inner Join - Returns rows where there is a match in both tables.


if we want to get precise, this should also say that all matches are returned (n-to-m)

findepi · 2024-12-18T08:47:02Z

datafusion/common/src/join_type.rs

+    /// Full Join - Returns all rows when there is a match in either table.
+    /// Rows without a match in one table will have NULL values for columns from that table.


> there is a match in either table

"a match" implies a matching row in left and right tables.

also see above, this should also imply cardinality

maybe

/// Full Join - returns all matches between left and right table,
/// plus left table rows without a match (with NULL values for right table columns)
/// plus right table rows without a match (with NULL values for left table columns)
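The cardinality point (all n-to-m matches are returned, plus unmatched rows from each side) can be checked with a small nested-loop sketch. The function `full_join` and its row representation are hypothetical and only illustrate the semantics described in the suggested docstring; they are unrelated to DataFusion's join operators.

```rust
/// Illustrative full join over `(key, payload)` rows. Returns every pair of
/// rows whose keys match (n-to-m), plus unmatched rows from either side with
/// `None` standing in for the NULL columns of the other table.
pub fn full_join(
    left: &[(i32, &'static str)],
    right: &[(i32, &'static str)],
) -> Vec<(Option<&'static str>, Option<&'static str>)> {
    let mut out = Vec::new();
    // Tracks which right rows found at least one partner.
    let mut right_matched = vec![false; right.len()];

    for (lk, lv) in left {
        let mut matched = false;
        for (i, (rk, rv)) in right.iter().enumerate() {
            if lk == rk {
                out.push((Some(*lv), Some(*rv)));
                right_matched[i] = true;
                matched = true;
            }
        }
        if !matched {
            // Left row without a match: right columns are NULL.
            out.push((Some(*lv), None));
        }
    }

    for (i, (_, rv)) in right.iter().enumerate() {
        if !right_matched[i] {
            // Right row without a match: left columns are NULL.
            out.push((None, Some(*rv)));
        }
    }
    out
}
```

For example, two left rows and two right rows sharing one key produce four matched rows, which is the n-to-m behavior the review comment asks the docstring to spell out.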

findepi · 2024-12-18T08:48:30Z

Thanks for expanding docstrings in JoinType.
Can you please move them to separate PR? This would help with review and merging.

ozankabak · 2024-12-18T10:33:01Z

> Thanks for expanding docstrings in JoinType. Can you please move them to separate PR? This would help with review and merging.

Makes sense, we can merge the docstrings quickly ahead of the main PR while it is under review and reduce the diff

alamb

Thank you @jayzhan-synnada and @findepi for the review.

I think this PR makes a lot of sense and is well documented, reasoned, and a nice improvement.

I have two concerns about this PR:

1. The potentially breaking changes to FFI (I am sure we can find some way around this, but I am still new to the area)
2. The accumulation of API changes we already have on main before Release DataFusion 44.0.0 #13334

So maybe we can figure out the FFI stuff, and then wait to merge this PR until 44 is released. I will try hard to focus on the 44 release later this week.

alamb · 2024-12-18T12:25:11Z

datafusion/ffi/src/plan_properties.rs

    Bounded,
-    Unbounded,


I think so too -- Maybe we can leave the FFI_ExecutionMode present and mark it deprecated or something 🤔

alamb · 2024-12-18T12:27:44Z

datafusion/physical-plan/src/execution_plan.rs

-    /// In this case the stream will eventually return `None` to indicate that
-    /// there are no more records to process.
+/// For unbounded streams, it also tracks whether the operator requires finite memory
+/// to process the stream or if memory usage could grow unbounded.


Can we clarify in these comments how Boundedness is related to the input stream?

For example, if the input stream is unbounded, does that imply the output is also unbounded?

Maybe some examples would help to clarify this

Limit and TopK can be given as examples

Sure, we can expand the comments. Basically, the boundedness of the output stream is a function of the boundedness of the input stream and the nature of the operator. For example, a simple SISO operator that transforms its data preserves the boundedness of its input, while an operator with fetch can have an unbounded input stream but a bounded output stream.
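As a rough illustration of "boundedness of the output is a function of the input and the operator", the sketch below models the two cases mentioned in the reply. `OperatorKind` and `output_boundedness` are hypothetical names introduced only for this example; `Boundedness` is a simplified stand-in for the enum under review.

```rust
/// Simplified stand-in for the PR's `Boundedness`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Boundedness {
    Bounded,
    Unbounded { requires_infinite_memory: bool },
}

/// Hypothetical operator categories, for illustration only.
#[derive(Debug, Clone, Copy)]
pub enum OperatorKind {
    /// Single-input, single-output transform (e.g. a projection or filter).
    Transform,
    /// Operator that stops after `fetch` rows (e.g. a limit).
    Fetch(usize),
}

/// Sketch: derive the output boundedness from the operator and its input.
pub fn output_boundedness(op: OperatorKind, input: Boundedness) -> Boundedness {
    match op {
        // A plain transform preserves whatever its input is.
        OperatorKind::Transform => input,
        // A fetch produces finitely many rows even from an endless input.
        OperatorKind::Fetch(_) => Boundedness::Bounded,
    }
}
```

This matches the Limit example raised in the thread: the input may be unbounded, yet the output is bounded because the operator stops after a fixed number of rows.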

alamb · 2024-12-18T12:28:56Z

datafusion/physical-plan/src/execution_plan.rs

+    /// The data stream is unbounded (infinite) and could run forever
+    Unbounded {
+        /// Whether this operator requires infinite memory to process the unbounded stream.
+        /// If false, the operator can process an infinite stream with bounded memory.


Maybe we can give an example of each type?

Like is it the case that

Unbounded { requires_infinite_memory: true, }

Would describe a min/max aggregate?

min/max over unordered keys is `requires_infinite_memory: true` by default, but if the key cardinality is finite, then it is `false`. That could be added as an example.

Makes sense, we will add some examples. In your case, MEDIAN would be an example (but we can do MIN/MAX with bounded memory when sorted).
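The difference between the two cases comes down to how much state an operator must keep as the stream grows. The sketch below contrasts a running minimum, which keeps a single value, with an exact median, which has to retain every value it has seen. It deliberately ignores grouping keys (the source of unbounded state for MIN/MAX over unordered keys), and the type names `RunningMin` and `ExactMedian` are hypothetical.

```rust
/// Running minimum: one value of state regardless of stream length.
#[derive(Default)]
pub struct RunningMin {
    current: Option<i64>,
}

impl RunningMin {
    pub fn update(&mut self, v: i64) {
        self.current = Some(match self.current {
            Some(m) => m.min(v),
            None => v,
        });
    }
    pub fn value(&self) -> Option<i64> {
        self.current
    }
    /// Number of values held in memory (0 or 1).
    pub fn state_len(&self) -> usize {
        usize::from(self.current.is_some())
    }
}

/// Exact median: must retain every value, so state grows with the stream.
#[derive(Default)]
pub struct ExactMedian {
    values: Vec<i64>,
}

impl ExactMedian {
    pub fn update(&mut self, v: i64) {
        self.values.push(v);
    }
    /// Returns the lower median of the values seen so far.
    pub fn value(&self) -> Option<i64> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_unstable();
        Some(sorted[(sorted.len() - 1) / 2])
    }
    /// Number of values held in memory (one per input row).
    pub fn state_len(&self) -> usize {
        self.values.len()
    }
}
```

On an unbounded stream the first accumulator stays at constant size, while the second grows without limit, which is the situation `requires_infinite_memory: true` is meant to flag.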

alamb · 2024-12-18T12:32:38Z

datafusion/physical-plan/src/execution_plan.rs

-    }
+/// Represents how an operator emits its output records.
+///
+/// This is used to determine whether an operator emits records incrementally as they arrive,


I suggest documenting when someone would expect incremental output -- I think it is basically when the operator can emit RecordBatches of batch_size rows

Specifically, I think it would help to say something like "EmissionType::Incremental generates output incrementally, but the operator may still buffer data internally until batch_size records arrive"

Yes, you have the right idea. We will add some comments to clarify.
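The idea that an incremental operator may still buffer internally can be sketched with a small row buffer that releases output only once `batch_size` rows have accumulated. `BatchingBuffer` is a hypothetical type that uses plain `i64` rows instead of Arrow record batches; it illustrates the behavior described above, not any DataFusion operator.

```rust
/// Illustrative buffer: collects rows and releases them in chunks of
/// `batch_size`, so output is incremental but not row-by-row.
pub struct BatchingBuffer {
    batch_size: usize,
    pending: Vec<i64>,
}

impl BatchingBuffer {
    pub fn new(batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { batch_size, pending: Vec::new() }
    }

    /// Accept one row; return a full batch as soon as one is available.
    pub fn push(&mut self, row: i64) -> Option<Vec<i64>> {
        self.pending.push(row);
        if self.pending.len() == self.batch_size {
            // Hand out the full batch and start a fresh, empty buffer.
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }

    /// Flush whatever is left when the input ends.
    pub fn finish(&mut self) -> Option<Vec<i64>> {
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}
```

The buffer never holds more than `batch_size` rows, so output keeps flowing on an endless input even though individual rows are delayed until their batch fills.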

ozankabak · 2024-12-18T13:03:25Z

> So maybe we can figure out the FFI stuff, and then wait to merge this PR until 44 is released. I will try hard to focus on the 44 release later this week

Sure, it makes sense to wait and only proceed with merging after all the PRs that are slated for 44 get merged before this.

One alternative to consider would be making 44 a major release with all the API changes (including this) so that we can proceed with a few "stable" releases after 44. It is also not a great dev-ex to deal with some API change in every single release, albeit small.

alamb

Thank you @jayzhan-synnada and @ozankabak for this PR

From my perspective this PR is fine to merge, and since the FFI feature is so new, as @ozankabak pointed out, maybe it would make sense to get it in now rather than delay it until after the release.

I don't have a strong preference, though I hope to have dug out of my backlog later today and to help get a release candidate in shape over the next few days.

- Improved descriptions for the Inner and Full join types in join_type.rs to clarify their behavior and examples.
- Added explanations regarding the boundedness of output streams and memory requirements in execution_plan.rs, including specific examples for operators like Median and Min/Max.

jayzhan-synnada · 2024-12-20T03:27:16Z

If we can merge this soon, I think we don't need to separate Join doc, just get in together. Thanks all for the review!

ozankabak · 2024-12-20T07:02:26Z

Great, then let's wait for a few hours in case any other feedback arrives and then merge this. This way, we will avoid API churn on FFI between DataFusion 44 and 45.

alamb · 2024-12-20T14:12:32Z

🚀

alamb · 2024-12-20T14:20:39Z

> One alternative to consider would be making 44 a major release with all the API changes (including this) so that we can proceed with a few "stable" releases after 44. It is also not a great dev-ex to deal with some API change in every single release, albeit small.

Indeed -- it may be time to consider making the API more stable. Here is a related discussion:

[Discuss] Release cadence / patch releases / Long Term Supported (lts) minor releases #5269

jayzhan-synnada and others added 27 commits December 10, 2024 08:44

add exec API

ac3cb45

Signed-off-by: Jay Zhan <[email protected]>

replace done but has stackoverflow

a669ce1

Signed-off-by: Jay Zhan <[email protected]>

exec API done

7e353da

Signed-off-by: Jay Zhan <[email protected]>

fix test

4c4c636

Signed-off-by: Jay Zhan <[email protected]>

Merge branch 'apache_main' into exec-mode

3557182

Review Part1

bd5f6e4

Update sanity_checker.rs

7e8ab0a

addressing reviews

0613416

Review Part 1

754201b

Update datafusion/physical-plan/src/execution_plan.rs

5f10f98

Update datafusion/physical-plan/src/execution_plan.rs

af83ce8

github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Dec 18, 2024

github-actions bot added the common Related to common crate label Dec 18, 2024

Shorten imports

b0d7139

berkaysynnada added the api change Changes the API exposed to users of the crate label Dec 18, 2024

ozankabak approved these changes Dec 18, 2024

findepi reviewed Dec 18, 2024

alamb reviewed Dec 18, 2024

alamb mentioned this pull request Dec 18, 2024

Release DataFusion 44.0.0 #13334

Open

10 tasks

timsaucer mentioned this pull request Dec 18, 2024

Add version checking to FFI crate #13827

Open

alamb approved these changes Dec 19, 2024

ozankabak merged commit f3b1141 into apache:main Dec 20, 2024
24 checks passed
