Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: When consuming Substrait, temporarily rename clashing duplicate columns #11329

Merged
merged 4 commits into from
Jul 8, 2024

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jul 8, 2024

Which issue does this PR close?

Related to #10815, and follows up from #11049 (comment). There may still be some cases left uncovered, but this does handle some more.

Closes #.

Rationale for this change

As DF requires column name uniqueness while Substrait doesn't care about names at all (apart from first & last levels of the plan), that causes clashes in the intermediate nodes that prevent DF from executing plans.

A simple example of a plan that would not work is:

SELECT a + 1 as sum_a, a + 1 as sum_a_2 FROM data

This fails as Substrait consumer first creates a project to add the columns, and only then creates the final project (or modifies the existing project) to rename them. But when the first project is created, it has two identical expressions without aliases, meaning their names are collide and validate_unique_names() fails.

What changes are included in this PR?

When consuming Substrait Projects, rename expressions as required to prevent duplicates.

Are these changes tested?

Added unit test + existing tests

Are there any user-facing changes?

@Blizzara Blizzara marked this pull request as ready for review July 8, 2024 08:59
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks @Blizzara

new_name = format!("{}__temp__{}", name, i);
i += 1;
}
names.insert(new_name.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably save a clone by putting the names.insert call below the exprs.push calls

Copy link
Contributor

@alamb alamb Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change in #11337 as I had the code open anyways

if new_name != name {
exprs.push(x.as_ref().clone().alias(new_name.clone()));
} else {
exprs.push(x.as_ref().clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused at first why this was as_ref().clone() was needed

When I investigated, it seems that it is due to the fact that from_substrait_rex returns an Arc<Expr> rather than an Expr which is somewhat confusing to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out I can remove them: #11337

@alamb alamb merged commit 894a879 into apache:main Jul 8, 2024
23 checks passed
@Blizzara Blizzara deleted the avo/substrait-duplicate-columns branch July 9, 2024 07:16
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…columns (apache#11329)

* cleanup project internals

* alias intermediate duplicate columns

* fix test

* fix clippy
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…columns (apache#11329)

* cleanup project internals

* alias intermediate duplicate columns

* fix test

* fix clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants