-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add e2e test for train API #2199
base: master
Are you sure you want to change the base?
Add e2e test for train API #2199
Conversation
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Pull Request Test Coverage Report for Build 12425333047Details
💛 - Coveralls |
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich No problem at all. We can also keep the e2e test for train API in integration tests, but skip this if use gang scheduling. Which way do you think is better? |
Yes, maybe it is less changes. Let's just log in the tests that we don't run this tests with gang-scheduling. |
Signed-off-by: helenxie-bit <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Hezhi (Helen) Xie <[email protected]>
@andreyvelich The e2e test for PyTorchJob is still failing due to an image pull backoff error. I suspect it might be caused by insufficient disk space. I think we have two options:
Which approach do you think is better? |
Signed-off-by: helenxie-bit <[email protected]>
Can you try to separate them and see if issue will be resolved ? |
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich I've separated the e2e test for train API and now it works. Please review when you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this overall lgtm, just small comment.
/assign @deepanker13 @kubeflow/wg-training-leads @Electronic-Waste
/lgtm |
Signed-off-by: helenxie-bit <[email protected]>
New changes are detected. LGTM label has been removed. |
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
I've updated the Kubernetes version to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM. I left some comments for you @helenxie-bit
strategy: | ||
fail-fast: false | ||
matrix: | ||
kubernetes-version: ["v1.31.4"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we change the Kubernetes version to be aligned with other ci tests? Like:
training-operator/.github/workflows/test-example-notebooks.yaml
Lines 16 to 18 in 69094e1
matrix: | |
kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"] | |
python-version: ["3.9", "3.10", "3.11"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to save compute resources, I think for now we can just run this test on a single k8s version, since we run the rests E2E tests on the all versions.
WDYT @Electronic-Waste ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. Maybe we can select one k8s version from this list:)
What this PR does / why we need it:
Add an e2e test in the
test_e2e_train_api.py
for the train API.Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: