Add working example for Per Entity Training #25081

shub-kris · 2023-01-19T14:19:50Z

This PR adds an example of how to do Per Entity Training using Apache Beam. The pipeline performs the following steps:

Reads data from a CSV file
Filters out unwanted rows
Creates a key for grouping
Groups the rows by that key
Preprocesses each group
Trains a DecisionTree classifier per group to predict if salary >= 50k
Saves the trained models
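
The steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the four-column record layout, the `>50K` label string, and the helper names `parse_row`, `to_keyed`, `train_model`, and `build_pipeline` are all hypothetical, in-memory lines stand in for the CSV read, and the preprocessing and saving steps are left out.

```python
import csv


def parse_row(line):
    """Parse one CSV line into a dict (column layout is assumed for this sketch)."""
    age, education, hours, label = next(csv.reader([line], skipinitialspace=True))
    return {"age": int(age), "education": education, "hours": int(hours), "label": label}


def to_keyed(row):
    """Key each record by its entity (here, hypothetically, the education level)."""
    return row["education"], row


def train_model(keyed_rows):
    """Fit one DecisionTreeClassifier on the rows belonging to a single key."""
    # Imported here so the helpers above stay usable without scikit-learn installed.
    from sklearn.tree import DecisionTreeClassifier

    key, rows = keyed_rows
    rows = list(rows)  # GroupByKey yields an iterable of values
    features = [[r["age"], r["hours"]] for r in rows]
    labels = [r["label"] == ">50K" for r in rows]  # assumed label string
    return key, DecisionTreeClassifier().fit(features, labels)


def build_pipeline(pipeline, lines):
    """Wire the steps together; `pipeline` is expected to be a beam.Pipeline."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    return (
        pipeline
        | "Read" >> beam.Create(lines)  # stand-in for reading the CSV file
        | "Filter" >> beam.Filter(lambda line: "?" not in line)
        | "Parse" >> beam.Map(parse_row)
        | "CreateKey" >> beam.Map(to_keyed)
        | "Group" >> beam.GroupByKey()
        | "Train" >> beam.Map(train_model)
    )
```

The result of `build_pipeline` is a PCollection of `(key, trained_model)` pairs, one per group, which is the shape the save step later in this thread consumes.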

shub-kris · 2023-01-19T14:19:56Z

@damccorm, please have a look. I will add a documentation page soon.

codecov · 2023-01-19T14:40:47Z

Codecov Report

Merging #25081 (3919cba) into master (cd20288) will decrease coverage by 0.06%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #25081      +/-   ##
==========================================
- Coverage   73.14%   73.08%   -0.06%     
==========================================
  Files         735      736       +1     
  Lines       98161    98224      +63     
==========================================
- Hits        71796    71784      -12     
- Misses      25002    25076      +74     
- Partials     1363     1364       +1

Flag	Coverage Δ
python	`82.60% <0.00%> (-0.10%)`	⬇️

Impacted Files	Coverage Δ
...python/apache_beam/examples/per_entity_training.py	`0.00% <0.00%> (ø)`
sdks/go/pkg/beam/core/metrics/dumper.go	`49.20% <0.00%> (-4.77%)`	⬇️
...python/apache_beam/runners/worker/worker_status.py	`75.33% <0.00%> (-1.34%)`	⬇️
...eam/runners/portability/fn_api_runner/execution.py	`92.49% <0.00%> (-0.64%)`	⬇️
sdks/python/apache_beam/runners/direct/executor.py	`96.46% <0.00%> (-0.55%)`	⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py	`89.24% <0.00%> (-0.17%)`	⬇️
...hon/apache_beam/runners/worker/bundle_processor.py	`93.54% <0.00%> (-0.13%)`	⬇️
...on/apache_beam/runners/dataflow/dataflow_runner.py	`81.74% <0.00%> (ø)`
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py	`75.51% <0.00%> (+2.04%)`	⬆️

github-actions · 2023-01-19T16:12:15Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

damccorm · 2023-01-19T20:40:14Z

sdks/python/apache_beam/examples/per_entity_training.py

+  """Discard data point if contains ?, and
+  doesn't have all features, and
+  doesn't have Bachelors, Masters or a Doctorate Degree"""


Suggested change

-  """Discard data point if contains ?, and
-  doesn't have all features, and
-  doesn't have Bachelors, Masters or a Doctorate Degree"""
+  """Discard data point if contains ?,
+  doesn't have all features, or
+  doesn't have Bachelors, Masters or a Doctorate Degree"""
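
To make the difference between "and" and "or" in the docstring concrete, here is a small sketch of a predicate that discards a row when any one of the three conditions holds, which is what the suggested wording describes. The column count, the position of the education field, and the name `keep_row` are assumptions for illustration; the filter in the PR may be organized differently.

```python
EXPECTED_COLUMNS = 15  # assumed width of a complete record (hypothetical)
EDUCATION_INDEX = 3    # assumed position of the education column (hypothetical)
DEGREES = {"Bachelors", "Masters", "Doctorate"}


def keep_row(row):
    """Return True only if the row passes all three checks.

    A row is discarded if it contains "?", OR does not have all features,
    OR has an education level outside DEGREES.
    """
    if "?" in row:
        return False
    if len(row) != EXPECTED_COLUMNS:
        return False
    return row[EDUCATION_INDEX] in DEGREES
```

Each check rejects on its own, so a row with a Masters degree but a single "?" field is still dropped. The sketch assumes the fields have already been split and stripped of whitespace.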

damccorm · 2023-01-19T20:46:53Z

sdks/python/apache_beam/examples/per_entity_training.py

+  """Saves the trained model to specified location."""
+  def process(self, element, path, *args, **kwargs):
+    key, trained_model = element
+    dump(trained_model, os.path.join(path, f"{key}_model.joblib"))


Instead of using dump with a fixed path, could we please use fileio with dynamic destinations? That way it's easy to write to GCS or other known filesystems.

It would also be good to make that path a configurable, optional known_arg.

damccorm

This looks good apart from the comments above and the associated documentation, which is still needed.

damccorm

This looks pretty good to me. @rszper, would you mind taking a look at the Markdown portions (staged here)?

damccorm · 2023-01-20T19:45:28Z