Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add e2e test for train API #2199

Open
wants to merge 112 commits into
base: master
Choose a base branch
from

Conversation

helenxie-bit
Copy link
Contributor

What this PR does / why we need it:
Add an e2e test in the test_e2e_train_api.py for the train API.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@coveralls
Copy link

coveralls commented Aug 9, 2024

Pull Request Test Coverage Report for Build 12425333047

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12381609583: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit helenxie-bit changed the title Add e2e test for train API [WIP]Add e2e test for train API Aug 9, 2024
@helenxie-bit helenxie-bit changed the title [WIP]Add e2e test for train API [WIP] Add e2e test for train API Aug 9, 2024
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Copy link
Contributor Author

helenxie-bit commented Dec 18, 2024

I noticed that other e2e tests include separate test functions for gang scheduling and plain scheduling. Should the e2e test for the train API also cover both cases? Currently, it fails under gang scheduling, and I’m not sure why.

Oh, I see, we run our integration test with scheduling plugins as well. Maybe to unblock this PR, let's create separate GitHub action for now, sorry to ask you again refactor this. In the V2, we can refactor it to make it more consistent.

@andreyvelich No problem at all. We can also keep the e2e test for train API in integration tests, but skip this if use gang scheduling. Which way do you think is better?

@andreyvelich
Copy link
Member

We can also keep the e2e test for train API in integration tests, but skip this if use gang scheduling

Yes, maybe it is less changes. Let's just log in the tests that we don't run this tests with gang-scheduling.

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Hezhi (Helen) Xie <[email protected]>
@helenxie-bit
Copy link
Contributor Author

helenxie-bit commented Dec 18, 2024

@andreyvelich The e2e test for PyTorchJob is still failing due to an image pull backoff error. I suspect it might be caused by insufficient disk space. I think we have two options:

  1. Run the e2e test for the train API as a separate GitHub Action.
  2. Reduce the size of the kubeflow/pytorch-dist-mnist:latest image by switching to a CPU-based version.

Which approach do you think is better?

Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich
Copy link
Member

Run the e2e test for the train API as a separate GitHub Action.

Can you try to separate them and see if issue will be resolved ?
cc @kubeflow/wg-training-leads

Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Copy link
Contributor Author

Run the e2e test for the train API as a separate GitHub Action.

Can you try to separate them and see if issue will be resolved ? cc @kubeflow/wg-training-leads

@andreyvelich I've separated the e2e test for train API and now it works. Please review when you have time.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this overall lgtm, just small comment.
/assign @deepanker13 @kubeflow/wg-training-leads @Electronic-Waste

.github/workflows/e2e-test-train-api.yaml Outdated Show resolved Hide resolved
@deepanker13
Copy link
Contributor

/lgtm
Thanks @helenxie-bit

Signed-off-by: helenxie-bit <[email protected]>
Copy link

New changes are detected. LGTM label has been removed.

@google-oss-prow google-oss-prow bot removed the lgtm label Dec 20, 2024
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
Signed-off-by: helenxie-bit <[email protected]>
@helenxie-bit
Copy link
Contributor Author

helenxie-bit commented Dec 20, 2024

I've updated the Kubernetes version to v1.31.4. Please review when you have time @andreyvelich @Electronic-Waste @kubeflow/wg-training-leads

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM. I left some comments for you @helenxie-bit

strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.31.4"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we change the Kubernetes version to be aligned with other ci tests? Like:

matrix:
kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
python-version: ["3.9", "3.10", "3.11"]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to save compute resources, I think for now we can just run this test on a single k8s version, since we run the rests E2E tests on the all versions.
WDYT @Electronic-Waste ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. Maybe we can select one k8s version from this list:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants