Add working example for Per Entity Training #25081

Merged: 10 commits, Jan 24, 2023

shub-kris
Contributor

This PR adds an example that illustrates how to do Per Entity Training using Apache Beam. The pipeline performs the following steps:

  • Reads data from a CSV file
  • Filters the rows
  • Creates a key for grouping
  • Groups the rows by key
  • Preprocesses each group
  • Trains a DecisionTree classifier per group to predict whether salary is above 50K
  • Saves the trained models
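
The steps above can be sketched in plain Python, without Beam, to show the shape of the data at each stage. This is a minimal sketch and not the code from this PR: the sample rows, the use of the education level as the grouping key, and the majority-label stand-in for the DecisionTree classifier are all assumptions made for illustration. In a Beam pipeline the same stages would typically map to `beam.Filter`, `beam.Map`, and `beam.GroupByKey`.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (education, hours_per_week, income_label).
ROWS = [
    ("Bachelors", 40, "<=50K"),
    ("Bachelors", 50, ">50K"),
    ("Bachelors", 45, ">50K"),
    ("Masters", 60, ">50K"),
    ("HS-grad", 40, "<=50K"),
    ("Doctorate", 35, ">50K"),
]

# Assumption: only these education levels are kept and used as entity keys.
KEPT_DEGREES = {"Bachelors", "Masters", "Doctorate"}


def train_stand_in(examples):
    """Stand-in for model training: returns the most common label.

    The pipeline in the PR trains a DecisionTree classifier instead.
    """
    labels = [label for _, _, label in examples]
    return Counter(labels).most_common(1)[0][0]


def train_per_entity(rows):
    # Filter: keep only rows with one of the selected degrees.
    kept = [row for row in rows if row[0] in KEPT_DEGREES]
    # Key and group: collect rows under their education level.
    groups = defaultdict(list)
    for row in kept:
        groups[row[0]].append(row)
    # Train: one independent model per group.
    return {key: train_stand_in(examples) for key, examples in groups.items()}


models = train_per_entity(ROWS)
print(sorted(models))
```

Each key ends up with its own model, which is the core idea of per-entity training; the final save step would write one artifact per key.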


@shub-kris
Contributor Author

@damccorm, please have a look. I will add a documentation page soon.

@codecov

codecov bot commented Jan 19, 2023

Codecov Report

Merging #25081 (3919cba) into master (cd20288) will decrease coverage by 0.06%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #25081      +/-   ##
==========================================
- Coverage   73.14%   73.08%   -0.06%     
==========================================
  Files         735      736       +1     
  Lines       98161    98224      +63     
==========================================
- Hits        71796    71784      -12     
- Misses      25002    25076      +74     
- Partials     1363     1364       +1     
Flag | Coverage Δ
python | 82.60% <0.00%> (-0.10%) ⬇️

Impacted Files | Coverage Δ
...python/apache_beam/examples/per_entity_training.py | 0.00% <0.00%> (ø)
sdks/go/pkg/beam/core/metrics/dumper.go | 49.20% <0.00%> (-4.77%) ⬇️
...python/apache_beam/runners/worker/worker_status.py | 75.33% <0.00%> (-1.34%) ⬇️
...eam/runners/portability/fn_api_runner/execution.py | 92.49% <0.00%> (-0.64%) ⬇️
sdks/python/apache_beam/runners/direct/executor.py | 96.46% <0.00%> (-0.55%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py | 89.24% <0.00%> (-0.17%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py | 93.54% <0.00%> (-0.13%) ⬇️
...on/apache_beam/runners/dataflow/dataflow_runner.py | 81.74% <0.00%> (ø)
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py | 75.51% <0.00%> (+2.04%) ⬆️

@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Comment on lines 55 to 57
"""Discard data point if contains ?, and
doesn't have all features, and
doesn't have Bachelors, Masters or a Doctorate Degree"""
Contributor

Suggested change
- """Discard data point if contains ?, and
- doesn't have all features, and
- doesn't have Bachelors, Masters or a Doctorate Degree"""
+ """Discard data point if contains ?,
+ doesn't have all features, or
+ doesn't have Bachelors, Masters or a Doctorate Degree"""
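
The corrected docstring describes three independent discard conditions joined by "or". A minimal sketch of such a predicate follows; the function name, the expected column count of 15, and the position of the education column are assumptions based on the usual layout of `adult.data`, not code taken from this PR.

```python
EXPECTED_COLUMNS = 15  # Assumption: 14 features plus the income label.
EDUCATION_INDEX = 3    # Assumption: education is the fourth column.
KEPT_DEGREES = {"Bachelors", "Masters", "Doctorate"}


def keep_row(line):
    """Return True if the row should be kept, False if it is discarded."""
    fields = [field.strip() for field in line.split(",")]
    if "?" in fields:  # contains a missing value
        return False
    if len(fields) != EXPECTED_COLUMNS:  # doesn't have all features
        return False
    if fields[EDUCATION_INDEX] not in KEPT_DEGREES:  # other education level
        return False
    return True


sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
)
print(keep_row(sample))
```

A predicate of this shape could be passed to `beam.Filter`, which keeps only the elements for which the function returns a true value.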

  """Saves the trained model to specified location."""
  def process(self, element, path, *args, **kwargs):
    key, trained_model = element
    dump(trained_model, os.path.join(path, f"{key}_model.joblib"))
Contributor

Instead of using dump with a fixed path, could we please use fileio with dynamic destinations? That way it's easy to write to GCS or other known filesystems.

It would also be good to make that path a configurable, optional known_arg.
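
One way to make the output location configurable is an optional argument parsed with `parse_known_args`, so that any remaining flags can still be forwarded as pipeline options. A minimal sketch, assuming a hypothetical `--output-model-dir` flag and a hypothetical default of `models` (neither is confirmed by this thread); only `--input` appears in the PR's documentation:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input", required=True, help="Path to the input CSV file.")
    # Hypothetical flag name and default; the PR may use different ones.
    parser.add_argument(
        "--output-model-dir",
        default="models",
        help="Directory where trained models are written.")
    # Unrecognized flags are returned separately instead of raising an error,
    # so they can be handed to the pipeline as options.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args


known_args, pipeline_args = parse_args(
    ["--input", "adult.data", "--runner", "DirectRunner"])
print(known_args.output_model_dir, pipeline_args)
```

The sketch covers only the argument handling; writing through Beam's `fileio` module, as requested above, is not shown here.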

Contributor

@damccorm damccorm left a comment

This looks good, apart from the comments above and the associated documentation that is still needed.

@github-actions github-actions bot added website and removed website labels Jan 20, 2023
@shub-kris shub-kris requested a review from damccorm January 20, 2023 18:21
Contributor

@damccorm damccorm left a comment

This looks pretty good to me, @rszper would you mind taking a look at the markdown portions (staged here)?

### Run the Pipeline ?
First, install the required packages `apache-beam==2.44.0`, `scikit-learn==1.0.2` and `pandas==1.3.5`.
You can view the code on [GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/per_entity_training.py).
Use `python per_entity_training.py --input path_to_data`
Contributor

path_to_data - could you provide more specific info on the expected data/format? Is this supposed to be adult.data from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/?

Contributor Author

Yes, it can use this file, since it's a CSV file. By default, the dataset downloads in CSV format.

Contributor

Suggested change
- Use `python per_entity_training.py --input path_to_data`
+ Use `python per_entity_training.py --input path/to/adult.data`

Thanks for clarifying!

website/www/site/content/en/documentation/ml/overview.md (outdated, resolved)
@github-actions github-actions bot removed the website label Jan 23, 2023
@github-actions github-actions bot added website and removed website labels Jan 23, 2023
@shub-kris shub-kris requested a review from damccorm January 23, 2023 09:55
Contributor

@damccorm damccorm left a comment

Just a couple small nits and then this should be good to merge!


- * It can also address the issue of bias and fairness, as a single model trained on a diverse dataset may not generalize well to certain groups, separate models for each group can reduce the impact of bias.
+ * Having seperate models can address issues of bias and fairness. Because a single model trained on a diverse dataset might not generalize well to certain groups, separate models for each group can reduce the impact of bias.
Contributor

Suggested change
- * Having seperate models can address issues of bias and fairness. Because a single model trained on a diverse dataset might not generalize well to certain groups, separate models for each group can reduce the impact of bias.
+ * Having separate models can address issues of bias and fairness. Because a single model trained on a diverse dataset might not generalize well to certain groups, separate models for each group can reduce the impact of bias.


## Dataset
- This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50K. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features.
+ This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50,000 USD. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features.The pipeline expects the `adult.data` CSV file which can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).
Contributor

Suggested change
- This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50,000 USD. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features.The pipeline expects the `adult.data` CSV file which can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).
+ This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50,000 USD. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features. The pipeline expects the `adult.data` CSV file as an input. This file can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).
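
To make the expected input concrete, the following sketch splits one row in the style of `adult.data` into its features and a binary label. It is an illustration only: the helper name is hypothetical, and the label strings `>50K` and `<=50K` and the comma-plus-space separator are assumptions about the file's format rather than details stated in this thread.

```python
import csv
import io

# One sample row in the style of adult.data (assumed format).
SAMPLE = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)


def split_features_and_label(row):
    """Separate the last column (income) from the feature columns."""
    *features, income = row
    # Encode the label as 1 for income above 50K, 0 otherwise.
    return features, int(income == ">50K")


# skipinitialspace drops the space that follows each comma.
reader = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = [split_features_and_label(row) for row in reader if row]
features, label = rows[0]
print(len(features), label)
```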

@github-actions github-actions bot added website and removed website labels Jan 24, 2023
Contributor

@damccorm damccorm left a comment

LGTM, thanks!

@damccorm
Contributor

I will merge once presubmit checks pass

@damccorm
Contributor

Looks like Jenkins statuses may be broken on this PR, but all checks have passed

@damccorm damccorm merged commit 89cd059 into apache:master Jan 24, 2023
@shub-kris
Contributor Author

Thanks @damccorm
