Commit: Merge pull request #30007 [YAML] Several improvements to the documentation.

Showing 4 changed files with 128 additions and 40 deletions.
@@ -48,6 +48,42 @@ It should be noted that everything here is still under development, but any
 features already included are considered stable. Feedback is welcome at
 [email protected].
 
+## Running pipelines
+
+The Beam YAML parser is currently included as part of the Apache Beam Python SDK.
+It can be installed (e.g. within a virtual environment) as
+
+```
+pip install apache_beam[yaml,gcp]
+```
+
+In addition, several of the provided transforms (such as SQL) are implemented
+in Java, and their expansion requires a working Java interpreter. (The
+requisite artifacts will be automatically downloaded from the Apache Maven
+repositories, so no further installation is required.)
+Docker is also currently required for local execution of these
+cross-language transforms, but it is not needed for submission to a non-local
+runner such as Flink or Dataflow.
+
+Once the prerequisites are installed, you can execute a pipeline defined
+in a YAML file as
+
+```
+python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner]
+```
+
+You can do a dry run of your pipeline using the render runner to see what the
+execution graph is, e.g.
+
+```
+python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml --runner=apache_beam.runners.render.RenderRunner --render_output=out.png [--render_port=0]
+```
+
+(This requires [Graphviz](https://graphviz.org/download/) to be installed to render the pipeline.)
+
+We intend to support running a pipeline on Dataflow by directly passing the
+YAML specification to a template, with no local installation of the Beam SDKs required.
+
 ## Example pipelines
 
 Here is a simple pipeline that reads some data from csv files and
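For reference, the run commands above expect a pipeline file to point at. A minimal sketch of such a `pipeline.yaml`, assembled from the transforms shown in this document (the csv and output paths are placeholders, not part of this change), might be:

```yaml
# Minimal hypothetical pipeline.yaml for the commands above.
# Paths are placeholders; substitute real input/output locations.
pipeline:
  transforms:
    - type: ReadFromCsv
      config:
        path: /path/to/input*.csv
    - type: WriteToJson
      config:
        path: /path/to/output.json
      input: ReadFromCsv
```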
@@ -98,16 +134,45 @@ pipeline:
         keep: "col3 > 100"
       input: ReadFromCsv
     - type: Sql
       name: MySqlTransform
       config:
         query: "select col1, count(*) as cnt from PCOLLECTION group by col1"
       input: Filter
     - type: WriteToJson
       config:
         path: /path/to/output.json
       input: Sql
 ```
 
+Transforms can be named to help with monitoring and debugging.
+
+```
+pipeline:
+  transforms:
+    - type: ReadFromCsv
+      name: ReadMyData
+      config:
+        path: /path/to/input*.csv
+    - type: Filter
+      name: KeepBigRecords
+      config:
+        language: python
+        keep: "col3 > 100"
+      input: ReadMyData
+    - type: Sql
+      name: MySqlTransform
+      config:
+        query: "select col1, count(*) as cnt from PCOLLECTION group by col1"
+      input: KeepBigRecords
+    - type: WriteToJson
+      name: WriteTheOutput
+      config:
+        path: /path/to/output.json
+      input: MySqlTransform
+```
+
+(This is also needed to disambiguate if more than one transform of the same
+type is used.)
+
 If the pipeline is linear, we can let the inputs be implicit by designating
 the pipeline as a `chain` type.
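As a hedged sketch of that `chain` form (reusing the transforms from the named example above; this block is illustrative and not a line from this change), the explicit `input:` keys disappear and each transform implicitly consumes the previous one:

```yaml
# Hypothetical chain-style rewrite: inputs are implicit in a linear pipeline.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: /path/to/input*.csv
    - type: Filter
      config:
        language: python
        keep: "col3 > 100"
    - type: WriteToJson
      config:
        path: /path/to/output.json
```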
@@ -180,10 +245,10 @@ pipeline:
     - type: Sql
       config:
-        query: select left.col1, right.col2 from left join right using (col3)
+        query: select A.col1, B.col2 from A join B using (col3)
       input:
-        left: ReadLeft
-        right: ReadRight
+        A: ReadLeft
+        B: ReadRight
     - type: WriteToJson
       name: WriteAll
@@ -224,10 +289,10 @@ pipeline:
     - type: Sql
       config:
-        query: select left.col1, right.col2 from left join right using (col3)
+        query: select A.col1, B.col2 from A join B using (col3)
       input:
-        left: ReadLeft
-        right: ReadRight
+        A: ReadLeft
+        B: ReadRight
     - type: WriteToJson
       name: WriteAll
@@ -285,7 +350,9 @@ pipeline:
       windowing:
         type: fixed
         size: 60s
-    - type: SomeAggregation
+    - type: SomeGroupingTransform
+      config:
+        arg: ...
     - type: WriteToPubSub
       config:
         topic: anotherPubSubTopic
@@ -305,7 +372,9 @@ pipeline:
         topic: myPubSubTopic
         format: ...
         schema: ...
-    - type: SomeAggregation
+    - type: SomeGroupingTransform
+      config:
+        arg: ...
       windowing:
         type: sliding
         size: 60s
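Putting the streaming fragments above together, a complete pipeline with transform-level windowing might look like the following sketch. This is a reconstruction, not a line from this change: the topic names, `format`/`schema` ellipses, and `arg` placeholder mirror the hunks above, and a sliding window may additionally need a period parameter in practice:

```yaml
# Hypothetical streaming pipeline assembled from the hunks above.
pipeline:
  type: chain
  transforms:
    - type: ReadFromPubSub
      config:
        topic: myPubSubTopic
        format: ...
        schema: ...
    - type: SomeGroupingTransform
      config:
        arg: ...
      windowing:
        type: sliding
        size: 60s
    - type: WriteToPubSub
      config:
        topic: anotherPubSubTopic
```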
@@ -363,10 +432,10 @@ pipeline:
     - type: Sql
       config:
-        query: select left.col1, right.col2 from left join right using (col3)
+        query: select A.col1, B.col2 from A join B using (col3)
       input:
-        left: ReadLeft
-        right: ReadRight
+        A: ReadLeft
+        B: ReadRight
       windowing:
         type: fixed
         size: 60s
@@ -504,26 +573,15 @@ providers:
     MyCustomTransform: "pkg.subpkg.PTransformClassOrCallable"
 ```
 
-## Running pipelines
+## Other Resources
 
-The Beam yaml parser is currently included as part of the Apache Beam Python SDK.
-This can be installed (e.g. within a virtual environment) as
+* [Example pipelines](https://gist.github.com/robertwb/2cb26973f1b1203e8f5f8f88c5764da0)
+* [More examples](https://github.com/Polber/beam/tree/jkinard/bug-bash/sdks/python/apache_beam/yaml/examples)
+* [Transform glossary](https://gist.github.com/robertwb/64e2f51ff88320eeb6ffd96634202df7)
 
-```
-pip install apache_beam[yaml,gcp]
-```
+Additional documentation in this directory:
 
-In addition, several of the provided transforms (such as SQL) are implemented
-in Java and their expansion will require a working Java interpeter. (The
-requisite artifacts will be automatically downloaded from the apache maven
-repositories, so no further installs will be required.)
-Docker is also currently required for local execution of these
-cross-language-requiring transforms, but not for submission to a non-local
-runner such as Flink or Dataflow.
-
-Once the prerequisites are installed, you can execute a pipeline defined
-in a yaml file as
-
-```
-python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner]
-```
+* [Mapping](yaml_mapping.md)
+* [Aggregation](yaml_combine.md)
+* [Error handling](yaml_errors.md)
+* [Inlining Python](inline_python.md)