Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add I/O Standards Page #24962

Merged
merged 8 commits into from
Jan 23, 2023
Merged

Add I/O Standards Page #24962

merged 8 commits into from
Jan 23, 2023

Conversation

hermanmak
Copy link
Contributor

@hermanmak hermanmak commented Jan 10, 2023

Adds the I/O Standards Page
fixes #21741

https://apache-beam-website-pull-requests.storage.googleapis.com/24962/documentation/io/io-standards/index.html


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

Copy link
Member

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 🎉
A few initial comments ...

@hermanmak hermanmak requested review from mosche and removed request for mosche January 16, 2023 13:42
Copy link
Member

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hermanmak I continued with the next section, but only got to Java so far. Will continue tomorrow...

</tr>
<tr>
<td>
<p>I/Os should provide links to the Source/Sink documentation within <strong>Before you start Header</strong>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>I/Os should provide links to the Source/Sink documentation within <strong>Before you start Header</strong>
<p>I/Os should provide links to the Source/Sink documentation within a <strong>Before you start</strong> header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved


## Development

This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors.
This section outlines API syntax, semantics and recommendations for features that should be adopted for new as well as existing Apache Beam I/O Connectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.


This section outlines API Syntax, Semantics and Feature Adoption recommendations for new and existing Apache Beam I/O Connectors.

Development guidelines are written with the following principles in mind:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Development guidelines are written with the following principles in mind:
The I/O development guidelines are written with the following principles in mind:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed




* Consistency makes an API easier to learn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Consistency makes an API easier to learn
* Consistency makes an API easier to learn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.


* Consistency makes an API easier to learn
* If there are multiple ways of doing something, we should strive to be consistent first
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors
* With a couple minutes of studying documentation, users should be able to pick up most I/O connectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

<p>Sink Windowing
</td>
<td>
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API.
<p>A sink should be Window agnostic and handle elements sent with any windowing method, unless explicitly parameterized or expressed in its API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

</td>
<td>
<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API.
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be:
<p>A sink may change windowing of a PCollection internally in any way. However, the metadata it returns as part of its result object:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

<p>A sink should be Window agnostic and handle elements sent with any Windowing methodexpect elements to be sent to it in the Global Window, unless explicitly parameterized or expressed in its API.
<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be:
<ul>
<li>In the same window, unless explicitly declared in the API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit unclear, same could also refer to the internal window...

Suggested change
<li>In the same window, unless explicitly declared in the API
<li>must be in the same window as the input, unless explicitly declared otherwise in the API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

<p>A sink may change the windowing of a PCollection internally however it needs, however, the metadata that it returns as part of its Result object must be:
<ul>
<li>In the same window, unless explicitly declared in the API
<li>With accurate timestamps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<li>With accurate timestamps
<li>must have accurate timestamps.

<ul>
<li>In the same window, unless explicitly declared in the API
<li>With accurate timestamps
<li><strong>It may</strong> also return metadata with information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<li><strong>It may</strong> also return metadata with information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it).
<li>may contain additional information about windowing (e.g. a BigQuery job may have a timestamp, but also a window associated with it).

or even about internal windowing for clarity?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

@apilloud
Copy link
Member

Copy link
Member

@apilloud apilloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this in the repo, it gives us something to iterate on! This looks reasonable as a first start from the "relational" perspective.

I didn't read the whole thing, it is somewhat a wall of text. It might make sense to split it up into more manageable chunks, possibly by language?


## Documentation

This section lays out the superset of all documentation that is expected to be made available with an I/O. The Apache Beam documentation referenced throughout this section can be found [here](https://beam.apache.org/documentation/). And generally a good example to follow would be the built-in I/O, [Snowflake I/O](https://beam.apache.org/documentation/io/built-in/snowflake/).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Internal links should be relative. Drop https://beam.apache.org

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced all links pointing to documentation with the relative path.

<li>Authentication
<li>Reading from {Connector}
<li>Writing to {Connector}
<li><a href="#unit-tests">Resource scalability</a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link is odd. Why does Resource scalability link to #unit-tests? Should the rest be links too?

<p>I/O Connectors should include a note indicating <a href="https://2022.beamsummit.org/sessions/relational-beam/">Relational Features</a> supported in their page under <strong>I/O connector guides</strong>.
<p>Relational Features are concepts that can help improve efficiency and can optionally be implemented by an I/O Connector. Using end user supplied pipeline configuration (<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/io/SchemaIO.html">SchemaIO</a>) and user query (<a href="https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/FieldAccessDescriptor.html">FieldAccessDescriptor</a>) data, relational theory is applied to derive improvements such as faster pipeline execution, lower operation costs and less data read/written.
<p>Example table:
<p><img src="/images/io-standards/io-supported-relational-features-table.png" width="" alt="Supported Relational Features" title="Supported Relational Features"></img>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this text and not a table. Markdown tables might be a good choice, if not HTML tables work too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for text instead of the image. Unfortunately markdown isn't going to work inside of html formatted stuff as far as I know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, how about I have sample markdown, comment blocks which the I/O writer can simple copy and fill out and add?

</ul>
<p>Instead, we highly recommend exposing Beam-native interfaces and an adaptor be implemented to translate.
<p>If you believe that the library in question is extremely static in nature. Please note it in the I/O itself.
<p>As part of the API of an I/O, it is <strong>highly discouraged</strong> to expose third-party libraries in the public API of a Beam API or connector,. Instead, a Beam-native interface should be used and adapted into the third-library object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repeats the same thing again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hermanmak looks like you copied the text using a view that showed editing markup (such as removed / edited text) of the google doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I rewrote this and forgot to remove the old!

Copy link
Member

@mosche mosche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 Done with the remainder of the document, but I skipped over the Python section.
Would be good to get someone else to review that section.

Awesome work pushing this forward, great to finally get this on the website!

<td>
<p>The unit/integration/performance tests should live under the package <strong>org.apache.beam.sdk.io.{connector}.testing</strong>. This will cause the various tests to work with the standard user-facing interfaces of the connector.
<p>Unit tests should reside in the same package (i.e. <strong>org.apache.beam.sdk.io.{connector}</strong>), as they may often test internals of the connector.
<p>The BigQueryIO belongs in the java package <a href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java">org.apache.beam.sdk.io.bigquery</a>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate from above, line can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed extra.

<tr>
<td>
<p>An I/O transform should avoid receiving user lambdas to map elements from a user type to a connector-specific type. Instead, they should interface with a connector-specific data type (with schema information when possible).
<p>When necessary, then an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>When necessary, then an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform.
<p>When necessary, an I/O transform should receive a type parameter that specifies the input type (for sinks) or output type (for sources) of the transform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the then

<p>class IO.Read
</td>
<td>
<p>Gives access to the class that represents reads within the I/O. The Read class should implement an fluent interface similar to the fluentbuilder pattern (e.g. withX(...).withY(...)). Together with default values; it provides a fail-fast (with immediate validation feedback after each .withX() and slightly less verbosity compared to builder pattern.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text contained editing markup, also 2nd part wasn't clear. Does that capture what you wanted to say?

Suggested change
<p>Gives access to the class that represents reads within the I/O. The Read class should implement an fluent interface similar to the fluentbuilder pattern (e.g. withX(...).withY(...)). Together with default values; it provides a fail-fast (with immediate validation feedback after each .withX() and slightly less verbosity compared to builder pattern.
<p>Gives access to the class that represents reads within the I/O. The <code>Read</code> class should implement a fluent interface pattern (e.g. <code>withX(...).withY(...)</code>). Together with default values, it provide fail-fast configuration (with immediate validation after each <code>withX()</code>) that is slightly less verbose than the builder pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding using <code> tags, i personally think it helps the reader. But that's certainly optional. It probably only makes sense if applied more consistently throughout the document.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added code blocks

</ul>
<p>Instead, we highly recommend exposing Beam-native interfaces and an adaptor be implemented to translate.
<p>If you believe that the library in question is extremely static in nature. Please note it in the I/O itself.
<p>As part of the API of an I/O, it is <strong>highly discouraged</strong> to expose third-party libraries in the public API of a Beam API or connector,. Instead, a Beam-native interface should be used and adapted into the third-library object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hermanmak looks like you copied the text using a view that showed editing markup (such as removed / edited text) of the google doc

<p>A few different sources implement runtime configuration for reading from a data source. This is a valuable pattern because it enables a purely batch source to become a more sophisticated streaming source.
<p>As much as possible, this type of transform should have the type richness of a construction-time-configured transform:
<ul>
<li>Support Beam Row output with a schema known at construction-time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<li>Support Beam Row output with a schema known at construction-time
<li>Support Beam Row output with a schema known at construction-time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosche yes, haha I made a new version (strikeout old) in Google Docs and it was also ported over, cleaned up now!



* To evaluate if the cost and performance of a specific I/O or dataflow template meets the customer’s business requirements.
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.


* To evaluate if the cost and performance of a specific I/O or dataflow template meets the customer’s business requirements.
* To illustrate performance regressions/improvements to I/O or dataflow templates between code changes
* To help end customers estimate costs, plan capacity to meet their SLOs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* To help end customers estimate costs, plan capacity to meet their SLOs
* To help end customers estimate costs, plan capacity to meet their SLOs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.


#### Dashboard

Google performs the performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Google performs the performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1).
Google runs performance tests routinely for built-in I/Os and publishes them to an externally viewable dashboard for [Java](https://s.apache.org/beam-community-metrics/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1) and [Python](https://s.apache.org/beam-community-metrics/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

<td>
<p>Include a <strong>resource scalability</strong> section into your page under <strong> <a href="#built-in-io">Built-in I/O connector guides</a> </strong>documentation<strong> </strong> which will indicate the upper bounds which the IO has integration tests for.
<p>For example:
<p>An indication that kafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>An indication that kafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths.
<p>An indication that KafkaIO has integration tests with <strong>xxxx</strong> topics. The documentation can state if the connector authors believe that the connector can scale beyond the integration test number, however this will make it clear to the user the limits of the tested paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed!

</tr>
<tr>
<td>
<p>Document the performance / internal metrics that you I/O collects, including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<p>Document the performance / internal metrics that you I/O collects, including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc)
<p>Document the performance / internal metrics that your I/O collects including what they mean, and how they can be used (some connectors collect and publish performance metrics like latency/bundle size/etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@codecov
Copy link

codecov bot commented Jan 18, 2023

Codecov Report

Merging #24962 (d9d322b) into master (3838528) will increase coverage by 0.00%.
The diff coverage is 33.33%.

❗ Current head d9d322b differs from pull request most recent head 215feef. Consider uploading reports for the commit 215feef to get more accurate results

@@           Coverage Diff           @@
##           master   #24962   +/-   ##
=======================================
  Coverage   73.13%   73.13%           
=======================================
  Files         735      735           
  Lines       98146    98151    +5     
=======================================
+ Hits        71777    71782    +5     
  Misses      25006    25006           
  Partials     1363     1363           
Flag Coverage Δ
go 51.83% <22.22%> (+<0.01%) ⬆️
python 82.68% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/go/pkg/beam/core/runtime/graphx/coder.go 53.48% <0.00%> (+0.11%) ⬆️
sdks/go/pkg/beam/io/mongodbio/read.go 30.97% <0.00%> (-0.24%) ⬇️
...s/python/apache_beam/examples/snippets/snippets.py 76.70% <0.00%> (ø)
sdks/go/pkg/beam/core/runtime/graphx/serialize.go 27.56% <66.66%> (+0.22%) ⬆️
...dks/python/apache_beam/options/pipeline_options.py 93.97% <100.00%> (+0.02%) ⬆️
sdks/python/apache_beam/io/source_test_utils.py 88.47% <0.00%> (-1.39%) ⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py 93.28% <0.00%> (-0.75%) ⬇️
...on/apache_beam/runners/dataflow/dataflow_runner.py 81.74% <0.00%> (-0.15%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.42% <0.00%> (-0.13%) ⬇️
...eam/runners/interactive/interactive_environment.py 92.02% <0.00%> (+0.30%) ⬆️
... and 3 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@johnjcasey johnjcasey merged commit 5cfc09c into apache:master Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add connector standards to Developing a new I/O connector page
4 participants