Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: move aggregate_statistics to datafusion-physical-optimizer #11798

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Aug 4, 2024

Which issue does this PR close?

Parts #11502

Rationale for this change

  • move aggregate_statistics to datafusion-physical-optimizer
  • leave the tests to datafusion/core/tests/physical_optimizer_integration.rs

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2024
@Weijun-H Weijun-H changed the title refactor: move aggregate statistics to datafusion-physical-optimizer refactor: move aggregate_statistics to datafusion-physical-optimizer Aug 4, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @Weijun-H -- I think you are doing the hard work of figuring out the patterns as we start to break out the physical optimizers from the core

In this case, I think the challenge is where to put the tests for physical-optimizer passes that rely on code that is not available to the physical-optimizer crate (like the functions library or the parquet exec)

We had a similar challenge with the logical optimizer passes

I think the two main options are:

  1. Move the tests into "integration" tests in https://github.com/apache/datafusion/tree/main/datafusion/core/tests
  2. Rewrite the tests to avoid the dependencies (e.g. use function stubs and other things

I personally suggest we move the tests to https://github.com/apache/datafusion/tree/main/datafusion/core/tests so they retain the same level of coverage

The downside of this is that then the tests will not be in the same directory as code. I think we can manage this with some good comments / documentation (e.g. at the end of the physical optimizer module saying where the tests were

cc @lewiszlw @mustafasrepo @jayzhan211

@@ -34,5 +34,6 @@ workspace = true
[dependencies]
datafusion-common = { workspace = true, default-features = true }
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Why is this needed?

If it is only for

use datafusion_expr::utils::COUNT_STAR_EXPANSION;

I think we could potentially move COUNT_STAR_EXPANSION to datafusion_common (and leave a backwards compatible pub use) and avoid this dependency

I think in general we are trying to keep the division between logical and physical exprs separate when possible -- this makes it easier for systems that only need physical plans (e.g. DataFusion Comet)

@@ -15,292 +15,19 @@
// specific language governing permissions and limitations
// under the License.

//! Utilizing exact statistics from sources to avoid scanning data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that putting these tests in datafusion/core/src/physical_optimizer/tests/aggregate_statistics.rs will make them harder to find becase:

  1. It is in a different crate than being tested
  2. It is in a non standard testing location

As I understand it, these tests use code in other modules that physical optimizer doesn't depend on (like functions-aggregate

Thus, I would like to suggest we put these tests https://github.com/apache/datafusion/tree/main/datafusion/core/tests

Perhaps something like physical_optimizer_integration.rs to mirror https://github.com/apache/datafusion/blob/main/datafusion/core/tests/optimizer_integration.rs

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Aug 5, 2024

/// The value to which `COUNT(*)` is expanded to in
/// `COUNT(<constant>)` expressions
pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move COUNT_STAR_EXPANSION here

Comment on lines +411 to +418
/// Describe the type of aggregate being tested
pub enum TestAggregate {
/// Testing COUNT(*) type aggregates
CountStar,

/// Testing for COUNT(column) aggregate
ColumnA(Arc<Schema>),
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum is used by aggregate_statistics and limited_distinct_aggregate, so move it for convenience.

@Weijun-H Weijun-H requested a review from alamb August 5, 2024 08:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Weijun-H -- this looks great.

@alamb alamb merged commit fcd907d into apache:main Aug 5, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants