Pass PG into checkpoint load and load rng with state_dict #2897

Merged
mvpatel2000 merged 10 commits into mosaicml:dev from mvpatel2000/fix-ckpt-load
Jan 24, 2024

mvpatel2000
Contributor

What does this PR do?

  1. Pass the process group into the checkpoint load call (which should be faster)
  2. Load the RNG state in the same call as the rest of the state_dict to avoid duplicate reads
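
To illustrate the second point, here is a minimal sketch of why folding the RNG state into the main load avoids a duplicate read. It is not Composer's code and does not use the PyTorch checkpoint API: `CountingReader`, `load`, and the `"state"`/`"rng"` keys are hypothetical stand-ins, and each `read` call is assumed to model one full pass over the checkpoint.

```python
from typing import Any, Dict, List


class CountingReader:
    """Hypothetical stand-in for a checkpoint storage reader that counts read passes."""

    def __init__(self, stored: Dict[str, Any]) -> None:
        self._stored = stored
        self.read_passes = 0

    def read(self, keys: List[str]) -> Dict[str, Any]:
        # Each call models one full pass over the checkpoint on storage.
        self.read_passes += 1
        return {key: self._stored[key] for key in keys}


def load(state_dict: Dict[str, Any], reader: CountingReader) -> None:
    """Hypothetical loader: fill state_dict in place with the stored values for its keys."""
    state_dict.update(reader.read(list(state_dict)))


stored = {"state": {"model": [1.0, 2.0]}, "rng": [{"python": 123}]}

# Two separate load calls: one for the training state, one for the RNG state.
separate_reader = CountingReader(stored)
state_only: Dict[str, Any] = {"state": None}
rng_only: Dict[str, Any] = {"rng": None}
load(state_only, separate_reader)
load(rng_only, separate_reader)

# One combined load call: the RNG state rides along with the rest of the state dict.
combined_reader = CountingReader(stored)
combined: Dict[str, Any] = {"state": None, "rng": None}
load(combined, combined_reader)
```

In this toy model the separate path makes two read passes and the combined path makes one, while both end up with the same loaded values. How much this saves in practice depends on the real reader and storage backend.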

@mvpatel2000 mvpatel2000 requested a review from eracah January 23, 2024 19:24
eracah
eracah previously approved these changes Jan 23, 2024
Contributor

@eracah eracah left a comment

LGTM just a few clarifying comments to add to make it easier to read

composer/utils/checkpoint.py (resolved)
composer/utils/checkpoint.py (resolved)
composer/utils/checkpoint.py (outdated, resolved)
@eracah eracah self-requested a review January 23, 2024 19:54
@eracah eracah dismissed their stale review January 23, 2024 19:54

tests?

@eracah
Contributor

eracah commented Jan 23, 2024

Tests?

@mvpatel2000
Contributor Author

Tests?

I think the existing sharded checkpoint tests cover this. Unfortunately, we cannot test HSDP with 2 GPUs, so I've queued manual tests, which I'll DM you.

Contributor

@eracah eracah left a comment

Looks decent

@mvpatel2000 mvpatel2000 merged commit cfc439a into mosaicml:dev Jan 24, 2024
16 checks passed
@mvpatel2000 mvpatel2000 deleted the mvpatel2000/fix-ckpt-load branch January 24, 2024 00:50
ShashankMosaicML pushed a commit to ShashankMosaicML/composer that referenced this pull request Feb 3, 2024
)

* checkdown

* remove comment

* lint

* comments

* fix

* accelerate test

* fix test

* lint

* fix test
2 participants