Add better error handling for non-rank 0 during Monolithic Checkpoint Loading #3647

Merged: 7 commits, Oct 14, 2024

Conversation

@j316chuck (Contributor) commented Oct 11, 2024

What does this PR do?

Previously, ranks 1 through N would all raise a file-not-found error when local rank 0 failed to download the monolithic checkpoint, which made the failure confusing to debug.

The new logic changes the exception raised on ranks other than global rank 0 to an error that points users to local rank 0, where the actual download failure is reported, for a better debugging experience.
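
For readers skimming the thread, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in composer/utils/checkpoint.py: the function name `checkpoint_path_or_raise`, its parameters, and the error wording are all hypothetical, and it assumes the download attempt on rank 0 has already finished by the time the check runs.

```python
import os


def checkpoint_path_or_raise(path: str, global_rank: int) -> str:
    """Return `path` if the checkpoint file exists, else raise a rank-specific error.

    Hypothetical helper for illustration only; not part of Composer's API.
    """
    if os.path.exists(path):
        return path

    if global_rank == 0:
        # Rank 0 reports the plain missing-file failure.
        raise FileNotFoundError(f'Checkpoint not found at {path}')

    # Other ranks only observe the symptom (a missing file), so the message
    # redirects the reader to the rank that attempted the download.
    raise RuntimeError(
        f'Rank {global_rank} could not find the checkpoint at {path}. '
        'This usually means local rank 0 failed to download it; '
        'check the error raised on local rank 0.',
    )
```

Raising a different exception type on the other ranks keeps the original failure easy to find in a multi-rank log, since only one rank reports the missing file directly.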

What issue(s) does this change relate to?

https://databricks.atlassian.net/browse/GRT-3308

Tests

EXTRA_ARGS='-vv -k test_load_incorrect_path' WORLD_SIZE=2 make test-dist-gpu

@j316chuck requested a review from mvpatel2000 October 11, 2024 23:23
@j316chuck force-pushed the chuck/add_better_error_handling branch from 21e5588 to 2905366 October 12, 2024 00:14
@mvpatel2000 (Contributor) left a comment

Minor nits on wording

tests/trainer/test_checkpoint.py (outdated, resolved)
composer/utils/checkpoint.py (outdated, resolved)
@j316chuck force-pushed the chuck/add_better_error_handling branch from 029f5b8 to edaf21c October 14, 2024 19:18
@j316chuck requested a review from mvpatel2000 October 14, 2024 19:19
@j316chuck enabled auto-merge (squash) October 14, 2024 19:20
@j316chuck disabled auto-merge October 14, 2024 19:43
@j316chuck requested a review from mvpatel2000 October 14, 2024 19:45
@j316chuck force-pushed the chuck/add_better_error_handling branch from 95624dd to add1fa1 October 14, 2024 19:50
@j316chuck requested a review from mvpatel2000 October 14, 2024 19:51
@mvpatel2000 (Contributor) left a comment

LGTM!

@j316chuck enabled auto-merge (squash) October 14, 2024 19:52
@j316chuck force-pushed the chuck/add_better_error_handling branch from add1fa1 to 915251b October 14, 2024 19:56
@j316chuck merged commit 2972a2a into main Oct 14, 2024
14 checks passed
@j316chuck deleted the chuck/add_better_error_handling branch October 14, 2024 21:08
2 participants