-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add I/O Standards Page #24962
Add I/O Standards Page #24962
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome 🎉
A few initial comments ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hermanmak I continued with the next section, but only got to Java so far. Will continue tomorrow...
</tr> | ||
<tr> | ||
<td> | ||
<p>I/Os should provide links to the Source/Sink documentation within <strong>Before you start Header</strong> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>I/Os should provide links to the Source/Sink documentation within <strong>Before you start Header</strong> | |
<p>I/Os should provide links to the Source/Sink documentation within a <strong>Before you start</strong> header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
|
||
## Development | ||
|
||
This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors. | |
This section outlines API syntax, semantics and recommendations for features that should be adopted for new as well as existing Apache Beam I/O Connectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
|
||
This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors. | ||
|
||
Development guidelines are written with the following principles in mind: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Development guidelines are written with the following principles in mind: | |
The I/O development guidelines are written with the following principles in mind: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
|
||
|
||
|
||
* Consistency makes an API easier to learn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Consistency makes an API easier to learn | |
* Consistency makes an API easier to learn. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
|
||
* Consistency makes an API easier to learn | ||
* If there are multiple ways of doing something, we should strive to be consistent first | ||
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors | |
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
<p>Sink Windowing | ||
</td> | ||
<td> | ||
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API. | |
<p>A sink should be Window agnostic and handle elements sent with any windowing method, unless explicitly parameterized or expressed in its API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
</td> | ||
<td> | ||
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API. | ||
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be: | |
<p>A sink may change windowing of a PCollection internally in any way. However, the metadata it returns as part of its result object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API. | ||
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be: | ||
<ul> | ||
<li>In the same window, unless explicitly declared in the API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit unclear, same
could also refer to the internal window...
<li>In the same window, unless explicitly declared in the API | |
<li>must be in the same window as the input, unless explicitly declared otherwise in the API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be: | ||
<ul> | ||
<li>In the same window, unless explicitly declared in the API | ||
<li>With accurate timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<li>With accurate timestamps | |
<li>must have accurate timestamps. |
<ul> | ||
<li>In the same window, unless explicitly declared in the API | ||
<li>With accurate timestamps | ||
<li><strong>It may</strong> also return metadata with information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<li><strong>It may</strong> also return metadata with information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it). | |
<li>may contain additional information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it). |
or even about internal windowing
for clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this in the repo, it gives us something to iterate on! This looks reasonable as a first start from the "relational" perspective.
I didn't read the whole thing, it is somewhat a wall of text. It might make sense to split it up into more manageable chunks, possibly by language?
|
||
## Documentation | ||
|
||
This section lays out the superset of all documentation that is expected to be made available with an I/O. The Apache Beam documentation referenced throughout this section can be found [here](https://beam.apache.org/documentation/). And generally a good example to follow would be the built-in I/O, [Snowflake I/O](https://beam.apache.org/documentation/io/built-in/snowflake/). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Internal links should be relative. Drop https://beam.apache.org
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced all links pointing to documentation with the relative path.
<li>Authentication | ||
<li>Reading from {Connector} | ||
<li>Writing to {Connector} | ||
<li><a href="#unit-tests">Resource scalability</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link is odd. Why does Resource scalability
link to #unit-tests
? Should the rest be links too?
<p>I/O Connectors should include a note indicating <a href="https://2022.beamsummit.org/sessions/relational-beam/">Relational Features</a> supported in their page under <strong>I/O connector guides</strong>. | ||
<p>Relational Features are concepts that can help improve efficiency and can optionally be implemented by an I/O Connector. Using end user supplied pipeline configuration (<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/io/SchemaIO.html">SchemaIO</a>) and user query (<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/FieldAccessDescriptor.html">FieldAccessDescriptor</a>) data, relational theory is applied to derive improvements such as faster pipeline execution, lower operation costs and less data read/written. | ||
<p>Example table: | ||
<p><img src="/images/io-standards/io-supported-relational-features-table.png" width="" alt="Supported Relational Features" title="Supported Relational Features"></img> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this text and not a table. Markdown tables might be a good choice, if not HTML tables work too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for text instead of the image. Unfortunately markdown isn't going to work inside of html formatted stuff as far as I know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, how about I have sample markdown, comment blocks which the I/O writer can simple copy and fill out and add?
</ul> | ||
<p>Instead, we highly recommend exposing Beam-native interfaces and an adaptor be implemented to translate. | ||
<p>If you believe that the library in question is extremely static in nature. Please note it in the I/O itself. | ||
<p>As part of the API of an I/O, it is <strong>highly discouraged</strong> to expose third-party libraries in the public API of a Beam API or connector,. Instead, a Beam-native interface should be used and adapted into the third-library object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This repeats the same thing again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hermanmak looks like you copied the text using a view that showed editing markup (such as removed / edited text) of the google doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I rewrote this and forgot to remove the old!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 Done with the remainder of the document, but I skipped over the Python section.
Would be good to get someone else to review that section.
Awesome work pushing this forward, great to finally get this on the website!
<td> | ||
<p>The unit/integration/performance tests should live under the package <strong>org.apache.beam.sdk.io.{connector}.testing</strong>. This will cause the various tests to work with the standard user-facing interfaces of the connector. | ||
<p>Unit tests should reside in the same package (i.e. <strong>org.apache.beam.sdk.io.{connector}</strong>), as they may often test internals of the connector. | ||
<p>The BigQueryIO belongs in the java package <a href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java">org.apache.beam.sdk.io.bigquery</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate from above, line can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed extra.
<tr> | ||
<td> | ||
<p>An I/O transform should avoid receiving user lambdas to map elements from a user type to a connector-specific type. Instead, they should interface with a connector-specific data type (with schema information when possible). | ||
<p>When necessary, then an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>When necessary, then an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform. | |
<p>When necessary, an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the then
<p>class IO.Read | ||
</td> | ||
<td> | ||
<p>Gives access to the class that represents reads within the I/O. The Read class should implement an fluent interface similar to the fluentbuilder pattern (e.g. withX(...).withY(...)). Together with default values; it provides a fail-fast (with immediate validation feedback after each .withX() and slightly less verbosity compared to builder pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Text contained editing markup, also 2nd part wasn't clear. Does that capture what you wanted to say?
<p>Gives access to the class that represents reads within the I/O. The Read class should implement an fluent interface similar to the fluentbuilder pattern (e.g. withX(...).withY(...)). Together with default values; it provides a fail-fast (with immediate validation feedback after each .withX() and slightly less verbosity compared to builder pattern. | |
<p>Gives access to the class that represents reads within the I/O. The <code>Read</code> class should implement a fluent interface pattern (e.g. <code>withX(...).withY(...)</code>). Together with default values, it provide fail-fast configuration (with immediate validation after each <code>withX()</code>) that is slightly less verbose than the builder pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding using <code>
tags, i personally think it helps the reader. But that's certainly optional. It probably only makes sense if applied more consistently throughout the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added code blocks
</ul> | ||
<p>Instead, we highly recommend exposing Beam-native interfaces and an adaptor be implemented to translate. | ||
<p>If you believe that the library in question is extremely static in nature. Please note it in the I/O itself. | ||
<p>As part of the API of an I/O, it is <strong>highly discouraged</strong> to expose third-party libraries in the public API of a Beam API or connector,. Instead, a Beam-native interface should be used and adapted into the third-library object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hermanmak looks like you copied the text using a view that showed editing markup (such as removed / edited text) of the google doc
<p>A few different sources implement runtime configuration for reading from a data source. This is a valuable pattern because it enables a purely batch source to become a more sophisticated streaming source. | ||
<p>As much as possible, this type of transform should have the type richness of a construction-time-configured transform: | ||
<ul> | ||
<li>Support Beam Row output with a schema known at construction-time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<li>Support Beam Row output with a schema known at construction-time | |
<li>Support Beam Row output with a schema known at construction-time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mosche yes, haha I made a new version (strikeout old) in Google Docs and it was also ported over, cleaned up now!
|
||
|
||
* To evaluate if the cost and performance of a specific I/O or dataflow template meets the customer’s business requirements. | ||
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes | |
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
|
||
* To evaluate if the cost and performance of a specific I/O or dataflow template meets the customer’s business requirements. | ||
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes | ||
* To help end customers estimate costs, plan capacity to meet their SLOs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* To help end customers estimate costs, plan capacity to meet their SLOs | |
* To help end customers estimate costs, plan capacity to meet their SLOs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
|
||
#### Dashboard | ||
|
||
Google performs the performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google performs the performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1). | |
Google runs performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](https://s.apache.org/beam-community-metrics/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](https://s.apache.org/beam-community-metrics/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
<td> | ||
<p>Include a <strong>resource scalability</strong> section into your page under <strong> <a href="#built-in-io">Built-in I/O connector guides</a> </strong>documentation<strong> </strong> which will indicate the upper bounds which the IO has integration tests for. | ||
<p>For example: | ||
<p>An indication that kafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>An indication that kafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths. | |
<p>An indication that KafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed!
</tr> | ||
<tr> | ||
<td> | ||
<p>Document the performance / internal metrics that you I/O collects, including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<p>Document the performance / internal metrics that you I/O collects, including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc) | |
<p>Document the performance / internal metrics that your I/O collects including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
Codecov Report
@@ Coverage Diff @@
## master #24962 +/- ##
=======================================
Coverage 73.13% 73.13%
=======================================
Files 735 735
Lines 98146 98151 +5
=======================================
+ Hits 71777 71782 +5
Misses 25006 25006
Partials 1363 1363
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Adds the I/O Standards Page
fixes #21741
https://apache-beam-website-pull-requests.storage.googleapis.com/24962/documentation/io/io-standards/index.html
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.