Initial start of a yaml-based declarative way of building pipelines. #24667
Codecov Report
@@ Coverage Diff @@
## master #24667 +/- ##
==========================================
- Coverage 73.35% 73.15% -0.21%
==========================================
Files 719 732 +13
Lines 97137 98093 +956
==========================================
+ Hits 71257 71762 +505
- Misses 24532 24983 +451
Partials 1348 1348
Thanks. Just adding some initial fly by comments.
    self._service = service
    self._schema_transforms = None

  def provided_transforms(self):
Should this be "provided_transform_urns"?
No, these aren't the URNs but the "friendly" names.
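To illustrate the distinction, here is a minimal sketch, assuming the provider keeps a dict from friendly names to URNs (as the `self._urns[type]` lookup quoted later in this review suggests). The class name `SketchProvider`, the `urn_for` helper, and the example URN string are hypothetical and not taken from the PR.

```python
class SketchProvider:
  """Hypothetical stand-in for a provider; not the class from this PR."""
  def __init__(self, urns):
    # Maps friendly names (as written in a YAML spec) to transform URNs.
    self._urns = dict(urns)

  def provided_transforms(self):
    # Returns the friendly names only; URNs remain an internal detail.
    return self._urns.keys()

  def urn_for(self, typ):
    # Hypothetical helper: resolves a friendly name to its URN.
    return self._urns[typ]


provider = SketchProvider({'MyTransform': 'example:schematransform:my:v1'})
print(list(provider.provided_transforms()))
```

Under this assumption the method name `provided_transforms` is accurate, since callers see the names they would write in a spec rather than the URNs behind them.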
Updated per some of the feedback.
Assigning reviewers. If you would like to opt out of this review, comment R: @AnandInguva for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Reminder, please take a look at this pr: @AnandInguva
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label python. Available commands:
stop reviewer notifications
(looks like Cham has been reviewing, I'll let him take the review side to completion)
Stopping reviewer notifications for this pull request: requested by reviewer
Thanks!
    pipeline_spec = yaml.load(
        known_args.pipeline_spec, Loader=yaml_transform.SafeLineLoader)
  else:
    raise ValueError(
Seems like this check is not done at the correct location (it will be triggered if just known_args.pipeline_spec_file is set).
known_args.pipeline_spec was populated from the file above in that case. But this didn't catch the case where both were set. Spelling it out more explicitly now.
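As a rough illustration of "spelling it out explicitly", the check could look like the sketch below. The function name `resolve_pipeline_spec` and the error messages are assumptions made for illustration; only the two option names come from the code under review, and the PR's actual implementation may differ.

```python
def resolve_pipeline_spec(pipeline_spec=None, pipeline_spec_file=None):
  """Returns the raw YAML text, requiring exactly one source to be set."""
  if pipeline_spec is not None and pipeline_spec_file is not None:
    # The case the original check missed: both options given.
    raise ValueError(
        'Exactly one of pipeline_spec or pipeline_spec_file must be set, '
        'but both were given.')
  if pipeline_spec_file is not None:
    with open(pipeline_spec_file) as fin:
      return fin.read()
  if pipeline_spec is not None:
    return pipeline_spec
  raise ValueError(
      'Exactly one of pipeline_spec or pipeline_spec_file must be set, '
      'but neither was given.')
```

Checking the "both set" case first keeps the error independent of whether the file contents were already copied into the inline option.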
  def __init__(self, spec, providers={}):  # pylint: disable=dangerous-default-value
    if isinstance(spec, str):
      spec = yaml.load(spec, Loader=SafeLineLoader)
    self._spec = spec
Can/should we validate the spec somehow before accepting it?
I added a schema that we can validate against.
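The schema itself is not shown in this thread. As a stand-in, the sketch below shows the kind of structural check such validation might perform on an already-parsed spec. The rules (a mapping with a string `type` and an optional `transforms` list of nested specs) are assumptions inferred from the YAML snippets in this review, and `validate_spec` is a hypothetical helper, not the PR's schema.

```python
def validate_spec(spec, path='spec'):
  """Raises ValueError on the first structural problem found."""
  if not isinstance(spec, dict):
    raise ValueError(
        f'{path}: expected a mapping, got {type(spec).__name__}')
  if not isinstance(spec.get('type'), str):
    raise ValueError(f'{path}: missing or non-string "type"')
  transforms = spec.get('transforms', [])
  if not isinstance(transforms, list):
    raise ValueError(f'{path}.transforms: expected a list')
  # Recurse so that errors in nested transforms report their location.
  for ix, sub in enumerate(transforms):
    validate_spec(sub, f'{path}.transforms[{ix}]')


validate_spec({'type': 'chain', 'transforms': [{'type': 'PyMap'}]})
```

Carrying the path through the recursion means a failure deep inside a composite names the offending entry rather than just the top-level spec.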
      self._schema_transforms = []
    urn = self._urns[type]
    if urn in self._schema_transforms:
      return external.SchemaAwareExternalTransform(urn, self._service, **args)
Noting that this will limit Yaml-based pipelines to portable runners (which is fine but should be documented).
True, but all available runners in Python are portable runners (and I'd be tempted to say that the other runners aren't "real" beam Runners :-). Is there somewhere specific you'd want this mentioned?
I don't have a good place here but this should go to Yaml runner documentation when we have it.
input:
  elements: input
transforms:
  - type: PyMap
Is there any meaning to deeper levels of nesting, or should all transforms be listed at the top level?
This creates a composite transform. One could also do
YamlTransform(
'''
type: PyMap
name: Cube
'''
)
to get a single transform. I agree the input here is a bit hacky; I dropped a TODO to think about this.
      assert_that(result, equal_to([1, 4, 9, 1, 8, 27]))

  def test_chain_with_input(self):
    with beam.Pipeline(options=beam.options.pipeline_options.PipelineOptions(
We should also add tests for other types of providers.
True, but that requires invoking the whole cross-language testing infrastructure, and I was hoping to run these as unit tests (mostly validating the parsing and construction).
Thanks. LGTM.
Seems like we might need to add more tests and documentation but those can be added in follow up PRs.
    return constructor >> FullyQualifiedNamedTransform(
        constructor, args, kwargs)


class Flatten(beam.PTransform):
I see. Let's add a comment here regarding needing the extra Flatten to support one or zero PCollections.
Thanks!
E.g.