Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Simplify + document EliminateCrossJoin better #10427

Merged
merged 1 commit into from
May 9, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented May 8, 2024

Which issue does this PR close?

Part of #10287

Rationale for this change

I am trying to avoid copying as much in EliminateCrossJoin and I noticed a few
things that can be simpler. While this avoids some copies I don't think it will really impact performance much.
Instead this is a step towards a larger refactor to avoid the actual plan copies

What changes are included in this PR?

  1. Add some additional documentation
  2. Remove unnecessary &mut to make it clear what is being changed
  3. Update some apis to avoid needing an owned DFSchemaRef

Are these changes tested?

By existing CI

Are there any user-facing changes?

No

@@ -885,7 +885,7 @@ pub fn can_hash(data_type: &DataType) -> bool {
/// Check whether all columns are from the schema.
pub fn check_all_columns_from_schema(
columns: &HashSet<Column>,
schema: DFSchemaRef,
schema: &DFSchema,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes the schema by reference rather than value now.

Since DFSchemaRef is an Arc I don't expect this to make any measurable performance improvement, but I think it makes it clearer what is going on and avoids a bunch of clones

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -107,7 +107,7 @@ impl OptimizerRule for EliminateCrossJoin {
left = find_inner_join(
&left,
&mut all_inputs,
&mut possible_join_keys,
&possible_join_keys,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_inner_join does not modify possible_join_keys so update the signature to make that clear

}
}
Ok(true)
}

/// Finds the next to join with the left input plan,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time studying this code so wanted to document what it is doing

@alamb
Copy link
Contributor Author

alamb commented May 9, 2024

@jackwener I wonder if you have time to review this PR?

@@ -909,8 +909,8 @@ pub fn check_all_columns_from_schema(
pub fn find_valid_equijoin_key_pair(
left_key: &Expr,
right_key: &Expr,
left_schema: DFSchemaRef,
right_schema: DFSchemaRef,
left_schema: &DFSchema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats interesting, should we a reference instead of Arc whenever possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general we should use a reference when the underlying callsite doesn't actually need a owned copy

So in this case, the callsite just needs a &DFSchema so there is no need to go through the overhead of creating the Arc

I think the runtime overhead of creating another Arc is likely to be unmeasurable in all but the most extreme circumstances. However I think the code is often easier to reason about

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb nice clean up

Copy link
Member

@jackwener jackwener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice job to me, thanks @alamb

@alamb
Copy link
Contributor Author

alamb commented May 9, 2024

Thanks for the reviews @jackwener and @comphead ❤️ -- this now opens up a few more PRs I have queued up to make this code faster (by reducing cloning) #10430

@alamb alamb merged commit 7219744 into apache:main May 9, 2024
24 checks passed
@alamb alamb deleted the alamb/cross_join_simpify branch May 9, 2024 17:01
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants