refactor: move `aggregate_statistics` to `datafusion-physical-optimizer` #11798
Thank you very much @Weijun-H -- I think you are doing the hard work of figuring out the patterns as we start to break out the physical optimizers from the core
In this case, I think the challenge is where to put the tests for `physical-optimizer` passes that rely on code that is not available to the `physical-optimizer` crate (like the functions library or the parquet exec)
We had a similar challenge with the logical optimizer passes
I think the two main options are:
- Move the tests into "integration" tests in https://github.com/apache/datafusion/tree/main/datafusion/core/tests
- Rewrite the tests to avoid the dependencies (e.g. use function stubs and other things)
I personally suggest we move the tests to https://github.com/apache/datafusion/tree/main/datafusion/core/tests so they retain the same level of coverage
The downside of this is that the tests will then not be in the same directory as the code. I think we can manage this with some good comments / documentation (e.g. at the end of the physical optimizer module saying where the tests were)
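For comparison, the second option (function stubs) could look roughly like the sketch below. Everything in it is hypothetical and invented for illustration: `AggregateInfo`, `StubCount`, and `count_from_statistics` are not DataFusion APIs, and the toy function only mimics the idea of answering a count from exact statistics. The point is that a pass which depends only on a small interface can be tested with a stand-in instead of the real functions crate.

```rust
/// Hypothetical minimal interface a pass might need from an aggregate
/// (an assumption for this sketch, not a DataFusion trait).
trait AggregateInfo {
    fn name(&self) -> &str;
    /// Whether the aggregate counts rows regardless of column values.
    fn is_count_star(&self) -> bool;
}

/// Test-only stub standing in for a real `count` implementation.
struct StubCount {
    count_star: bool,
}

impl AggregateInfo for StubCount {
    fn name(&self) -> &str {
        "count"
    }

    fn is_count_star(&self) -> bool {
        self.count_star
    }
}

/// Toy stand-in for the optimizer pass: returns the row count taken from
/// exact statistics when the aggregate is a `COUNT(*)`-style count,
/// and `None` when the data would have to be scanned.
fn count_from_statistics(
    agg: &dyn AggregateInfo,
    exact_num_rows: Option<usize>,
) -> Option<usize> {
    if agg.name() == "count" && agg.is_count_star() {
        exact_num_rows
    } else {
        None
    }
}

fn main() {
    let count_star = StubCount { count_star: true };
    // Exact statistics are available, so no scan is needed.
    assert_eq!(count_from_statistics(&count_star, Some(3)), Some(3));
    // Without exact statistics the pass cannot answer.
    assert_eq!(count_from_statistics(&count_star, None), None);
}
```

The trade-off matches the one described above: stubs keep the tests next to the code, but they exercise a stand-in rather than the real aggregate functions, so coverage is weaker than with integration tests.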
@@ -34,5 +34,6 @@ workspace = true
[dependencies]
datafusion-common = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
🤔 Why is this needed?
If it is only for `use datafusion_expr::utils::COUNT_STAR_EXPANSION;` I think we could potentially move `COUNT_STAR_EXPANSION` to `datafusion_common` (and leave a backwards compatible `pub use`) and avoid this dependency
I think in general we are trying to keep the division between logical and physical exprs separate when possible -- this makes it easier for systems that only need physical plans (e.g. DataFusion Comet)
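A minimal sketch of the "move it and leave a `pub use`" idea, with local modules standing in for the two crates. The module names `common_stub` and `expr_stub` and the cut-down `ScalarValue` enum are assumptions made so the example compiles on its own; only the constant's name and value come from the diff in this PR.

```rust
/// Stand-in for the `datafusion_common` crate (hypothetical module).
mod common_stub {
    /// Cut-down stand-in for the real `ScalarValue` type.
    #[derive(Debug, Clone, PartialEq)]
    pub enum ScalarValue {
        Int64(Option<i64>),
    }

    /// The constant now lives in the "common" crate.
    pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));
}

/// Stand-in for the `datafusion_expr` crate (hypothetical module).
mod expr_stub {
    pub mod utils {
        // Backwards-compatible re-export: the old path keeps resolving,
        // but the definition is no longer owned by this module.
        pub use super::super::common_stub::COUNT_STAR_EXPANSION;
    }
}

fn main() {
    // Old and new paths name the same constant.
    assert_eq!(
        expr_stub::utils::COUNT_STAR_EXPANSION,
        common_stub::COUNT_STAR_EXPANSION
    );
}
```

With this layout, code that only needs physical plans can import the constant from the common module and skip the dependency on the expr module, while existing callers of the old path keep compiling.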
@@ -15,292 +15,19 @@
// specific language governing permissions and limitations
// under the License.

//! Utilizing exact statistics from sources to avoid scanning data
I worry that putting these tests in `datafusion/core/src/physical_optimizer/tests/aggregate_statistics.rs` will make them harder to find because:
- They are in a different crate than the code being tested
- They are in a non-standard testing location
As I understand it, these tests use code in other modules that the physical optimizer doesn't depend on (like `functions-aggregate`)
Thus, I would like to suggest we put these tests in https://github.com/apache/datafusion/tree/main/datafusion/core/tests
Perhaps something like `physical_optimizer_integration.rs` to mirror https://github.com/apache/datafusion/blob/main/datafusion/core/tests/optimizer_integration.rs
/// The value to which `COUNT(*)` is expanded to in
/// `COUNT(<constant>)` expressions
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));
move `COUNT_STAR_EXPANSION` here
/// Describe the type of aggregate being tested
pub enum TestAggregate {
    /// Testing COUNT(*) type aggregates
    CountStar,

    /// Testing for COUNT(column) aggregate
    ColumnA(Arc<Schema>),
}
This enum is used by `aggregate_statistics` and `limited_distinct_aggregate`, so move it for convenience.
Thank you @Weijun-H -- this looks great.
Which issue does this PR close?
Part of #11502
Rationale for this change
`aggregate_statistics` to `datafusion-physical-optimizer`
datafusion/core/tests/physical_optimizer_integration.rs
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?