Add Expr::column_refs to find column references without copying #10948

Merged
merged 4 commits into from
Jun 22, 2024

@alamb alamb commented Jun 17, 2024

Which issue does this PR close?

Part of #10505

Rationale for this change

Code that uses Expr::to_columns or expr_to_columns copies the column name strings, often simply to check whether a column is present. A non-copying API would be more efficient, and thanks to the improvements in the TreeNode API (from #10543, thanks @peter-toth!) it is now straightforward to implement.

What changes are included in this PR?

  1. Add Expr::column_refs and Expr::add_column_refs
  2. Update some places in the code to use this new API. I will make a follow-on ticket to deprecate the old APIs and update the remaining code to use the new one
  3. Deprecate exprlist_to_columns which is not used anywhere in the datafusion codebase

I plan to deprecate Expr::to_columns and expr_to_columns in follow-on PRs.
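To illustrate the difference between the copying and non-copying styles, here is a minimal sketch using only the standard library. The `Column` and `Expr` types below are toy stand-ins, not DataFusion's real types, and the helpers `col` and `example_expr` are hypothetical; only the method names `column_refs`, `add_column_refs` and `to_columns` come from this PR, and their bodies here are an assumption about the general shape, not the merged implementation.

```rust
use std::collections::HashSet;

/// Toy stand-in for a column reference; NOT DataFusion's `Column` type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Column {
    name: String,
}

/// Toy expression tree with just enough variants to show the traversal.
#[derive(Debug, Clone)]
enum Expr {
    Column(Column),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Adds a borrowed reference to every column in `self` to `set`.
    /// Nothing is cloned; the references live as long as `self`.
    fn add_column_refs<'a>(&'a self, set: &mut HashSet<&'a Column>) {
        match self {
            Expr::Column(c) => {
                set.insert(c);
            }
            Expr::Literal(_) => {}
            Expr::Add(left, right) => {
                left.add_column_refs(set);
                right.add_column_refs(set);
            }
        }
    }

    /// Non-copying style: the returned set borrows from `self`.
    fn column_refs(&self) -> HashSet<&Column> {
        let mut set = HashSet::new();
        self.add_column_refs(&mut set);
        set
    }

    /// Copying style: every column (including its name `String`) is cloned.
    fn to_columns(&self) -> HashSet<Column> {
        self.column_refs().into_iter().cloned().collect()
    }
}

/// Hypothetical helper: builds an unqualified column expression.
fn col(name: &str) -> Expr {
    Expr::Column(Column { name: name.to_string() })
}

/// Hypothetical helper: builds the expression `a + (b + 1)`.
fn example_expr() -> Expr {
    Expr::Add(
        Box::new(col("a")),
        Box::new(Expr::Add(Box::new(col("b")), Box::new(Expr::Literal(1)))),
    )
}
```

In this sketch, calling `example_expr().column_refs()` allocates only the set itself, whereas `to_columns()` additionally clones one `String` per distinct column. A caller that only needs a membership check can use `refs.contains(&Column { name: "a".to_string() })` on the borrowed set.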

Are these changes tested?

Yes, by existing tests and new doc examples

Performance results show basically no improvement (planning may get faster for some of the queries with larger numbers of columns), but the differences may also be noise.

Details

++ critcmp main less_to_column
group                                         less_to_column                         main
-----                                         --------------                         ----
logical_aggregate_with_join                   1.01  1013.4±56.98µs        ? ?/sec    1.00  1002.7±41.73µs        ? ?/sec
logical_plan_tpcds_all                        1.00    152.8±0.76ms        ? ?/sec    1.00    153.0±1.61ms        ? ?/sec
logical_plan_tpch_all                         1.00     17.0±0.21ms        ? ?/sec    1.00     16.9±0.19ms        ? ?/sec
logical_select_all_from_1000                  1.00     18.1±0.14ms        ? ?/sec    1.04     18.9±0.08ms        ? ?/sec
logical_select_one_from_700                   1.01    820.3±9.09µs        ? ?/sec    1.00    810.2±9.31µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.01   772.3±17.02µs        ? ?/sec    1.00    762.2±6.46µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.01   762.8±20.29µs        ? ?/sec    1.00   755.9±19.28µs        ? ?/sec
physical_plan_tpcds_all                       1.00   1226.8±5.71ms        ? ?/sec    1.00   1225.2±8.60ms        ? ?/sec
physical_plan_tpch_all                        1.00     83.6±1.50ms        ? ?/sec    1.00     83.8±1.04ms        ? ?/sec
physical_plan_tpch_q1                         1.04      4.6±0.27ms        ? ?/sec    1.00      4.5±0.04ms        ? ?/sec
physical_plan_tpch_q10                        1.00      4.0±0.04ms        ? ?/sec    1.00      4.0±0.04ms        ? ?/sec
physical_plan_tpch_q11                        1.01      3.6±0.05ms        ? ?/sec    1.00      3.5±0.04ms        ? ?/sec
physical_plan_tpch_q12                        1.01      2.7±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.0±0.02ms        ? ?/sec    1.00      2.0±0.02ms        ? ?/sec
physical_plan_tpch_q14                        1.01      2.4±0.02ms        ? ?/sec    1.00      2.4±0.03ms        ? ?/sec
physical_plan_tpch_q16                        1.01      3.4±0.05ms        ? ?/sec    1.00      3.4±0.03ms        ? ?/sec
physical_plan_tpch_q17                        1.00      3.3±0.04ms        ? ?/sec    1.00      3.3±0.04ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.7±0.11ms        ? ?/sec    1.00      3.7±0.03ms        ? ?/sec
physical_plan_tpch_q19                        1.00      5.4±0.07ms        ? ?/sec    1.00      5.4±0.05ms        ? ?/sec
physical_plan_tpch_q2                         1.00      7.3±0.07ms        ? ?/sec    1.00      7.2±0.08ms        ? ?/sec
physical_plan_tpch_q20                        1.00      4.2±0.07ms        ? ?/sec    1.00      4.2±0.06ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.8±0.07ms        ? ?/sec    1.00      5.8±0.07ms        ? ?/sec
physical_plan_tpch_q22                        1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.9±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.01      2.2±0.02ms        ? ?/sec    1.00      2.2±0.01ms        ? ?/sec
physical_plan_tpch_q5                         1.01      4.1±0.05ms        ? ?/sec    1.00      4.1±0.03ms        ? ?/sec
physical_plan_tpch_q6                         1.01  1446.6±63.18µs        ? ?/sec    1.00   1434.8±8.77µs        ? ?/sec
physical_plan_tpch_q7                         1.01      5.2±0.06ms        ? ?/sec    1.00      5.2±0.06ms        ? ?/sec
physical_plan_tpch_q8                         1.01      6.7±0.08ms        ? ?/sec    1.00      6.6±0.10ms        ? ?/sec
physical_plan_tpch_q9                         1.00      5.1±0.06ms        ? ?/sec    1.00      5.1±0.04ms        ? ?/sec
physical_select_all_from_1000                 1.00     59.0±0.52ms        ? ?/sec    1.04     61.3±0.32ms        ? ?/sec
physical_select_one_from_700                  1.03      3.6±0.03ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec

Are there any user-facing changes?

There is a new API (Expr::column_refs and Expr::add_column_refs).

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules labels Jun 17, 2024
/// assert!(refs.contains(&Column::new_unqualified("a")));
/// assert!(refs.contains(&Column::new_unqualified("b")));
/// ```
pub fn column_refs(&self) -> HashSet<&Column> {

The implementation of this function is quite nice now compared to the expr_to_columns one: https://github.com/alamb/datafusion/blob/58d0c34d77c9a5202e62b9281cdbf1046abaa096/datafusion/expr/src/utils.rs#L264-L309

@@ -46,6 +46,7 @@ pub const COUNT_STAR_EXPANSION: ScalarValue = ScalarValue::Int64(Some(1));

/// Recursively walk a list of expression trees, collecting the unique set of columns
/// referenced in the expression
#[deprecated(since = "40.0.0", note = "Expr::add_column_refs instead")]

This function is not used anywhere in the DataFusion codebase.

@@ -785,7 +782,7 @@ impl OptimizerRule for PushDownFilter {
let mut keep_predicates = vec![];
let mut push_predicates = vec![];
for expr in predicates {
- let cols = expr.to_columns()?;
+ let cols = expr.column_refs();

This is a pretty good example where there is no need to copy Columns simply to check if they are referenced.
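As a rough illustration of this point, the sketch below splits a list of predicates into "keep" and "push" groups using only borrowed column references. It is a simplified model under stated assumptions: `Column` and `Expr` are toy types rather than DataFusion's, `split_predicates` and `gt` are hypothetical helpers, and the rule "push a predicate when all of its columns are in an allowed set" is an assumed condition, since the actual check in `PushDownFilter` is not shown in this diff.

```rust
use std::collections::HashSet;

/// Toy stand-in for a column reference; NOT DataFusion's `Column` type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Column {
    name: String,
}

/// Toy expression: a column, an integer literal, or `left > right`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(Column),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Recursively collects borrowed column references into `set`.
    fn add_column_refs<'a>(&'a self, set: &mut HashSet<&'a Column>) {
        match self {
            Expr::Column(c) => {
                set.insert(c);
            }
            Expr::Literal(_) => {}
            Expr::Gt(left, right) => {
                left.add_column_refs(set);
                right.add_column_refs(set);
            }
        }
    }

    /// Returns the set of columns referenced by `self`, without cloning.
    fn column_refs(&self) -> HashSet<&Column> {
        let mut set = HashSet::new();
        self.add_column_refs(&mut set);
        set
    }
}

/// Hypothetical helper: returns `(keep, push)`. A predicate goes into
/// `push` when every column it references is in `pushable`; a predicate
/// with no columns at all also lands in `push`.
fn split_predicates<'a>(
    predicates: &'a [Expr],
    pushable: &HashSet<Column>,
) -> (Vec<&'a Expr>, Vec<&'a Expr>) {
    let mut keep = vec![];
    let mut push = vec![];
    for expr in predicates {
        // Only references are collected here; no column names are copied.
        let cols = expr.column_refs();
        if cols.iter().all(|c| pushable.contains(*c)) {
            push.push(expr);
        } else {
            keep.push(expr);
        }
    }
    (keep, push)
}

/// Hypothetical helper: builds an unqualified column expression.
fn col(name: &str) -> Expr {
    Expr::Column(Column { name: name.to_string() })
}

/// Hypothetical helper: builds `left > value`.
fn gt(left: Expr, value: i64) -> Expr {
    Expr::Gt(Box::new(left), Box::new(Expr::Literal(value)))
}
```

With predicates `a > 1` and `b > 2` and a pushable set containing only `a`, this sketch puts `a > 1` in the push group and `b > 2` in the keep group. The set of references is dropped at the end of each loop iteration, so the check costs no string allocations.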

@alamb alamb marked this pull request as ready for review June 17, 2024 10:52
@alamb alamb marked this pull request as draft June 17, 2024 10:52
@peter-toth

This PR looks really nice, let me know when it is ready for review.

@alamb alamb marked this pull request as ready for review June 17, 2024 16:29
@alamb

alamb commented Jun 17, 2024

I think this is now ready for review. I ran benchmarks and they show some slight improvement

@peter-toth peter-toth left a comment

LGTM, I have only a minor suggestion.

@comphead comphead left a comment

lgtm thanks @alamb

@alamb

alamb commented Jun 20, 2024

Thanks @comphead and @peter-toth

@alamb alamb merged commit 98373ab into apache:main Jun 22, 2024
23 checks passed
@alamb alamb deleted the alamb/less_to_column branch June 22, 2024 12:51
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jun 22, 2024
…he#10948)

* Add Expr::column_refs to find column references without copying

migrate some uses of to_column

* Simplify condition
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10948)

* Add Expr::column_refs to find column references without copying

migrate some uses of to_column

* Simplify condition