-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move min and max to user defined aggregate function, remove AggregateFunction
/ AggregateFunctionDefinition::BuiltIn
#11013
Conversation
I do have something that's starting to look reasonable, but some tests on the optimizer now are failing for some reasons I can't understand
|
I guess you skip the aggregate statistic optimization for min/max datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs Lines 177 to 224 in 18042fd
You might need to check if the |
I fixed this but now I have a test that doesn't pass on the optimizer (there are two actually)
That suggests that the optimizer cannot use the existing aliases / doesn't understand the existing aliases that provide DISTINCT test.b . Looking, any tip would be highly appreciated |
I think we should add distinct for MIN/MAX so we can get the But I think there is no difference between MIN and Distinct Min, maybe we could remove distinct for MIN/MAX beforehand? Introduce EliminateDistinct optimize rule for MIN/MAX. |
Is this a part of the optimizer i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs ? Thank your for your help btw |
I don't think so, Distinct/Distinct On is different from distinct in the function. |
@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct result in such an error:
Do I need to change also the equivalence rules? |
You can take |
Thanks. I guess I wasn't clear in my comment here #11013 (comment) . How should that test failure be addressed? It seems that min/max udaf uses other aliases and is not reusing the intermediate results already available |
If we eliminate distinct of min/max prior to |
Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value? |
Yes, I suggest we update the test like #[test]
fn one_distinct_and_two_common() -> Result<()> {
let table_scan = test_table_scan()?;
let plan = LogicalPlanBuilder::from(table_scan)
.aggregate(
vec![col("a")],
vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
)?
.build()?;
// Should work
let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n TableScan: test [a:UInt32, b:UInt32, c:UInt32]";
assert_optimized_plan_equal(plan, expected)
} |
There seems to be a column added to the Aggregate node in the logical plan, can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place This is a subset of the new plan
while this is the subset from the previous plan
there is an alias3:UInt64 that gets added |
Remove the Min/Max matching in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.
It needs:
- to remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
- resolve some merge conflicts
There is also a follow on issue / PR I would like to make regarding the optimizer check
Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up
I think as long as you can explain me how to resolve the current test failure I should be fine. Agree using names for min and max unwrapping is not very robust |
@jayzhan211 fyi all the tests in scalar subquery failed with the stubs, I restored import that used the real implementation |
wait, we should use stubs instead of the real function, otherwise we import the |
You use the name "min" in Max, and "max" in Min for stubs |
It is available, I hadn't have to do anything / update the Cargo |
I see. We don't need stubs. |
Do you expect any additional changes ? I fixed the stub function name anyways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
self.nullable, | ||
)) | ||
fn name(&self) -> &str { | ||
"MIN" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up PR is to lowercase the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I logged this #11779 I am not sure we should lowercase this or make the other UDF uppercase
|
||
impl std::fmt::Debug for Max { | ||
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result { | ||
f.debug_struct("Min") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be "Max"
AggregateFunction
AggregateFunction
AggregateFunction
/ AggregateFunctionDefinition::BuiltIn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @edmondop and @jayzhan211 -- this is epic work and I think the codebase is that much better for it 🙏
After this PR is merged i think we can also remove the AggregateFunctionDefition
enum that now only has a single variant as a follow on cleanup
println!( | ||
"Copying {} to {}", | ||
prost.clone().display(), | ||
proto_dir.join("src/generated/prost.rs").display() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to print out some status reporting while regenerating protos 👍
@@ -466,51 +464,6 @@ message InListNode { | |||
bool negated = 3; | |||
} | |||
|
|||
enum AggregateFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -477,30 +477,10 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode { | |||
ExprType::AggregateExpr(agg_node) => { | |||
let input_phy_expr: Vec<Arc<dyn PhysicalExpr>> = agg_node.expr.iter() | |||
.map(|e| parse_physical_expr(e, registry, &physical_schema, extension_codec)).collect::<Result<Vec<_>>>()?; | |||
let ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter() | |||
let _ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting change -- does it mean ordering is not carried into the udf?
Or maybe it is redundant and now is entirely determined by the udf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ordering are not supported for udaf yet. I had leave a TODO comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO:
|
@jayzhan211 should we file tickets to track this work (I think @edmondop has a PR for the lower case name). The I am happy to file the tickets if you would like |
Sure, thanks |
#11805 <-- comment PR |
Which issue does this PR close?
Closes #10943 .
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?