fix: When consuming Substrait, temporarily rename clashing duplicate columns #11329
Looks good to me -- thanks @Blizzara
new_name = format!("{}__temp__{}", name, i);
i += 1;
}
names.insert(new_name.clone());
You could probably save a clone by putting the names.insert call below the exprs.push calls
I made this change in #11337 as I had the code open anyways
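To illustrate the suggestion, here is a minimal standalone sketch of the renaming loop that works on plain strings instead of DataFusion expressions. The `dedup_names` helper is hypothetical and not part of the PR; only the `__temp__` format string is taken from the diff, and resetting `i` for each name is an assumption. Because the name is pushed first and inserted into the set last, the `String` can be moved into the set instead of cloned a second time.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not from the PR): returns one unique name per input
/// name, in order. Clashing names get a `__temp__{i}` suffix, using the same
/// format string as the diff above.
fn dedup_names(input: &[&str]) -> Vec<String> {
    let mut names: HashSet<String> = HashSet::new();
    let mut out: Vec<String> = Vec::with_capacity(input.len());
    for &name in input {
        let mut new_name = name.to_string();
        // Assumption: the counter restarts for every name.
        let mut i = 0;
        // Try suffixes until the candidate name is unused.
        while names.contains(&new_name) {
            new_name = format!("{}__temp__{}", name, i);
            i += 1;
        }
        // Push first; this use needs its own copy of the name.
        out.push(new_name.clone());
        // Insert last: the String is moved into the set, so the insertion
        // itself needs no clone.
        names.insert(new_name);
    }
    out
}

fn main() {
    let renamed = dedup_names(&["a", "a", "b", "a"]);
    println!("{:?}", renamed);
}
```

In the real consumer the pushed value would be an aliased expression rather than a string, but the ordering argument is the same.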
if new_name != name {
    exprs.push(x.as_ref().clone().alias(new_name.clone()));
} else {
    exprs.push(x.as_ref().clone());
I was confused at first about why as_ref().clone() was needed here.
When I investigated, it seems to be because from_substrait_rex returns an Arc<Expr> rather than an Expr, which is somewhat confusing to me.
Turns out I can remove them: #11337
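As a rough illustration of the point discussed in this thread, the sketch below uses a stand-in `Expr` type and two hypothetical producer functions (`make_shared`, `make_owned`); none of these are DataFusion's real definitions. It shows why a function returning `Arc<Expr>` pushes callers towards `as_ref().clone()` when they need an owned value, and why that conversion disappears once the function returns `Expr` directly.

```rust
use std::sync::Arc;

// Stand-in for DataFusion's `Expr`; the real type is far richer.
#[derive(Clone, Debug, PartialEq)]
struct Expr(String);

// Hypothetical producer returning a shared pointer, as the review comment
// says `from_substrait_rex` did at the time.
fn make_shared(name: &str) -> Arc<Expr> {
    Arc::new(Expr(name.to_string()))
}

// Hypothetical producer returning the value directly.
fn make_owned(name: &str) -> Expr {
    Expr(name.to_string())
}

// Turning a borrowed `Arc<Expr>` into an owned `Expr` copies the pointee.
fn to_owned_expr(shared: &Arc<Expr>) -> Expr {
    shared.as_ref().clone()
}

fn main() {
    let shared = make_shared("a");
    // A copy is required to get an owned value out of the Arc.
    let copied: Expr = to_owned_expr(&shared);
    // No conversion is required when the value is returned directly.
    let direct: Expr = make_owned("a");
    println!("{:?} {:?}", copied, direct);
}
```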
…columns (apache#11329) * cleanup project internals * alias intermediate duplicate columns * fix test * fix clippy
Which issue does this PR close?
Related to #10815, and follows up from #11049 (comment). There may still be some cases left uncovered, but this does handle some more.
Closes #.
Rationale for this change
DataFusion (DF) requires column names to be unique, while Substrait doesn't care about names at all (apart from the first and last levels of the plan). This causes clashes in the intermediate nodes that prevent DF from executing plans.
A simple example of a plan that would not work is:
This fails because the Substrait consumer first creates a project to add the columns, and only then creates the final project (or modifies the existing project) to rename them. But when the first project is created, it has two identical expressions without aliases, meaning their names collide and validate_unique_names() fails.
What changes are included in this PR?
When consuming Substrait Projects, rename expressions as required to prevent duplicates.
Are these changes tested?
Added unit test + existing tests
Are there any user-facing changes?