Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(substrait): modular substrait consumer #13803

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

vbarua
Copy link
Contributor

@vbarua vbarua commented Dec 17, 2024

Note: When reviewing this PR I would recommend hiding whitespace changes.

Which issue does this PR close?

Initial work for #13318

Rationale for this change

Improves the reusability of the Substrait consumer for users that utilize user-defined extensions and types.

I should note that this PR represents an approach to the problem, but is not necessarily the only way we could handle it. I'm very open to feedback on the approach.

What changes are included in this PR?

  • Relation and expression handling code has been extracted into a series of from_substrait_* functions (i.e. from_filter_rel, from_scalar_function) to aid re-use.
  • A SubstraitConsumer trait has been introduced with default methods that handle relations and expression using the above functions.
  • The SubstraitConsumer trait also includes methods for handling extension relations (i.e consume_extension_single) along with user-defined types (consume_user_defined_type, consume_user_defined_literal).
  • All conversion methods take as their first argument an &impl SubstraitConsumer.
  • The role of SubstraitPlanningState has been subsumed into the SubstraitConsumer.

Are these changes tested?

These changes refactor existing code and leverage their tests.

A doc comment has been added to show how the SubstraitConsumer trait can be implemented by users.

Are there any user-facing changes?

Yes.

The top-level from_substrait_plan(state, plan) function now takes a &SessionState as its first argument, instead of &dyn SubstraitPlanningState. The same is true for from_substrait_extended_expr. These two functions are the primary external facing APIs for intiation a conversion.

A number of functions like from_substrait_rel, from_substrait_rex, etc have had their API changed to make the first argument &impl SubstraitConsumer, and remove all state: &dyn SubstraitPlanningState and extensions: &Extensions arguments.

datafusion/substrait/tests/cases/roundtrip_logical_plan.rs Outdated Show resolved Hide resolved
datafusion/substrait/src/logical_plan/consumer.rs Outdated Show resolved Hide resolved
/// Can be used to consume standard Substrait without user-defined extensions
pub struct DefaultSubstraitConsumer {
extensions: Arc<Extensions>,
state: Arc<SessionState>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if Arc is actually needed here, I just got tired of fighting the borrow-checker. I'm open to suggestions, as all we really need is to be able to get references to Extensions and SessionState.

rels: &[Rel],
state: &dyn SubstraitPlanningState,
extensions: &Extensions,
is_all: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting from here, a lot of the changes look like this. Replacing state and extensions with a single consumer argument that subsumes them both, and then threading the consumer into calls.

state: &dyn SubstraitPlanningState,
rel: &Rel,
extensions: &Extensions,
pub async fn from_project_rel(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the from_* are extractions of existing functionality. It is easier to see this if you hide whitespaces changes in the diff.

@vbarua vbarua force-pushed the vbarua/modular-substrait-consumer branch from ad28bec to 04491a6 Compare December 17, 2024 03:00
@vbarua vbarua marked this pull request as ready for review December 17, 2024 03:01
@vbarua
Copy link
Contributor Author

vbarua commented Dec 17, 2024

@Blizzara I would appreciate if you could take a look and let me know if this approach seems reasonable, or if I've architecture astronauted myself.

@vbarua vbarua force-pushed the vbarua/modular-substrait-consumer branch from 04491a6 to b8e435f Compare December 17, 2024 03:29
@Blizzara
Copy link
Contributor

@Blizzara I would appreciate if you could take a look and let me know if this approach seems reasonable, or if I've architecture astronauted myself.

Thanks for the ping - I took a look, and overall looks really good to me! I left couple thoughts (I see you already saw some), but nothing major. I already have usecase in mind for this internally, so looking forward to getting it merged 😄

One thing is we should consider splitting the consumer.rs into smaller files. It could be a separate PR to keep the diff smaller, but maybe moving all the static "consumer_X" functions into like consumer/relations.rs, consumer/expressions.rs, consumer/literals.rs and consumer/types.rs (and one more for utils etc) or similar? Feels like now could be a good moment to do that.

let plan = self
.state
.serializer_registry()
.deserialize_logical_plan(&ext_detail.type_url, &ext_detail.value)?;
Copy link

@ccciudatu ccciudatu Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One drawback of the current implementation is that it's abusing the type_url field in proto::Any (which is quite clearly specified). Working with Substrait objects directly would eliminate the need for hacks like encoding node names as Any::type_url.

As a side note, this may sound hypocritical, given that I made it even worse with my PR by setting it to table names as well. But I only did it for consistency. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, this may sound hypocritical, given that I made it even worse #13772

I don't think it's hypocritical, you were aiming for consistency and I'm just being much more aggressive with my changes and breaking the API.

Include SerializerRegistry based handlers for Extension Relations in the
DefaultSubstraitConsumer
@vbarua
Copy link
Contributor Author

vbarua commented Dec 17, 2024

@Blizzara

One thing is we should consider splitting the consumer.rs into smaller files. It could be a separate PR to keep the diff smaller, but maybe moving all the static "consumer_X" functions into like consumer/relations.rs, consumer/expressions.rs, consumer/literals.rs and consumer/types.rs (and one more for utils etc) or similar? Feels like now could be a good moment to do that.

I think it makes sense to do that, and I'm happy to own it, but I would prefer to do that as a followup. I structured the consumer.rs changes in order to make it easy to see that the functions I added were extracted from the big from_substrait_rel and from_substrait_rex switch statements.

@Blizzara
Copy link
Contributor

I think it makes sense to do that, and I'm happy to own it, but I would prefer to do that as a followup. I structured the consumer.rs changes in order to make it easy to see that the functions I added were extracted from the big from_substrait_rel and from_substrait_rex switch statements.

Yup, I totally agree!

I looked through the latest changes, looks even better now, just some more comments re SubstraitPlanningState but otherwise this looks very good by me!

@vbarua vbarua requested a review from Blizzara December 18, 2024 17:22
Copy link
Contributor

@Blizzara Blizzara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀 Thanks @vbarua, and FYI @alamb :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants