Add working example for Per Entity Training #25081
Conversation
@damccorm please have a look. I will add a documentation page soon.
Codecov Report
Coverage Diff (master vs. #25081):

|           | master | #25081 | +/-    |
|-----------|--------|--------|--------|
| Coverage  | 73.14% | 73.08% | -0.06% |
| Files     | 735    | 736    | +1     |
| Lines     | 98161  | 98224  | +63    |
| Hits      | 71796  | 71784  | -12    |
| Misses    | 25002  | 25076  | +74    |
| Partials  | 1363   | 1364   | +1     |
Flags with carried forward coverage won't be shown.
Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`.
R: @AnandInguva for label python.
The PR bot will only process comments in the main thread (not review comments).
"""Discard data point if contains ?, and | ||
doesn't have all features, and | ||
doesn't have Bachelors, Masters or a Doctorate Degree""" |
"""Discard data point if contains ?, and | |
doesn't have all features, and | |
doesn't have Bachelors, Masters or a Doctorate Degree""" | |
"""Discard data point if contains ?, | |
doesn't have all features, or | |
doesn't have Bachelors, Masters or a Doctorate Degree""" |
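For context, here is a minimal sketch of a filter matching that docstring. The function name, the assumption that each element is one comma-separated line of `adult.data`, and the column index are illustrative, not the PR's actual code:

```python
# Hypothetical filter predicate matching the suggested docstring; the real
# example's implementation may differ.
VALID_EDUCATION = {"Bachelors", "Masters", "Doctorate"}
NUM_FIELDS = 15  # adult.data rows have 14 attributes plus the income label


def keep_data_point(line: str) -> bool:
  """Return True only for complete records with a qualifying education level."""
  fields = [field.strip() for field in line.split(",")]
  if "?" in fields or len(fields) != NUM_FIELDS:
    return False
  return fields[3] in VALID_EDUCATION  # education is the fourth column
```

A predicate like this could be applied in the pipeline with `beam.Filter(keep_data_point)`.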
"""Saves the trained model to specified location.""" | ||
def process(self, element, path, *args, **kwargs): | ||
key, trained_model = element | ||
dump(trained_model, os.path.join(path, f"{key}_model.joblib")) |
Instead of using dump with a fixed path, could we please use fileio with dynamic destinations? That way it's easy to write to GCS or other known filesystems.
It would also be good to make that path a configurable optional known_arg.
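A rough sketch of that suggestion might look like the following; the `ModelSink` class, the `--output` known arg, and the `beam.Create` stand-in for the real (key, trained model) PCollection are all illustrative names, not the PR's actual code:

```python
import argparse

import apache_beam as beam
from apache_beam.io import fileio
from joblib import dump
from sklearn.tree import DecisionTreeClassifier


class ModelSink(fileio.FileSink):
  """Illustrative sink: serializes one trained model per output file with joblib."""

  def open(self, fh):
    self._fh = fh

  def write(self, record):
    _, trained_model = record
    dump(trained_model, self._fh)

  def flush(self):
    self._fh.flush()


parser = argparse.ArgumentParser()
parser.add_argument(
    '--output', required=True,
    help='Output directory (local path or gs:// location) for saved models')
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as pipeline:
  # Stand-in for the real (key, trained_model) PCollection produced by training.
  models = pipeline | beam.Create([('Bachelors', DecisionTreeClassifier())])
  _ = models | fileio.WriteToFiles(
      path=known_args.output,
      destination=lambda record: record[0],  # route each model by entity key
      sink=lambda dest: ModelSink(),
      file_naming=fileio.destination_prefix_naming(suffix='_model.joblib'))
```

Because `fileio.WriteToFiles` goes through Beam's filesystem layer, the same code can write to a local directory or a `gs://` path, and the `destination` callable keeps one file per entity key.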
This looks good other than the comments and still needing the associated documentation
### Run the Pipeline
First, install the required packages `apache-beam==2.44.0`, `scikit-learn==1.0.2` and `pandas==1.3.5`.
You can view the code on [GitHub](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/per_entity_training.py).
Use `python per_entity_training.py --input path_to_data`
`path_to_data` - could you provide more specific info on the expected data/format? Is this supposed to be `adult.data` from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/?
Yes, it can use this file, since it's a CSV file. By default the dataset downloads in CSV format.
Suggested change:
Use `python per_entity_training.py --input path/to/adult.data`
Thanks for clarifying!
Just a couple small nits and then this should be good to merge!
Previous text:
* It can also address the issue of bias and fairness, as a single model trained on a diverse dataset may not generalize well to certain groups, separate models for each group can reduce the impact of bias.

Current text:
* Having seperate models can address issues of bias and fairness. Because a single model trained on a diverse dataset might not generalize well to certain groups, separate models for each group can reduce the impact of bias.

Suggested change (spelling fix):
* Having separate models can address issues of bias and fairness. Because a single model trained on a diverse dataset might not generalize well to certain groups, separate models for each group can reduce the impact of bias.
## Dataset
This example uses [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult). The dataset contains information about individuals, including their demographic characteristics, employment status, and income level. The dataset includes both categorical and numerical features, such as age, education, occupation, and hours worked per week, as well as a binary label indicating whether an individual's income is above or below 50,000 USD. The primary goal of this dataset is to be used for classification tasks, where the model will predict whether an individual's income is above or below a certain threshold based on the provided features.The pipeline expects the `adult.data` CSV file which can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/).

Suggested change (ending): "... based on the provided features. The pipeline expects the `adult.data` CSV file as an input. This file can be downloaded from [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)."
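To make the expected input concrete, a small snippet along these lines loads `adult.data` with pandas; the column names come from the UCI `adult.names` description, and the snippet is illustrative rather than part of the example pipeline:

```python
import pandas as pd

# adult.data has no header row; names follow the UCI dataset description.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

df = pd.read_csv(
    "adult.data",
    header=None,
    names=COLUMNS,
    skipinitialspace=True,  # fields are separated by ", " in the raw file
)
print(df[["education", "income"]].head())
```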
LGTM, thanks!
I will merge once presubmit checks pass.
Looks like Jenkins statuses may be broken on this PR, but all checks have passed.
Thanks @damccorm
This PR aims to illustrate how to do Per Entity Training using Apache Beam. The pipeline performs the following steps: filter the input data, create a key based on the education level, train a separate model for each key, and save the trained models.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

* Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
* Update `CHANGES.md` with noteworthy changes.

See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.