DDP Enhancements #63
Merged
Conversation
1. Added a timeout to the hparams for `initialize_process_group`. The default of 30 minutes was too long for failing tests, preventing any meaningful log from being collected.
2. In `cleanup()`, using `os.killpg` to terminate DDP subprocesses instead of just `kill()`.
3. In `cleanup()`, attempting a `SIGTERM` before resorting to a `SIGKILL` 5 seconds later. Graceful cleanup is preferred!
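A minimal sketch of the timeout change in point 1. The 300-second value and the `gloo` backend here are illustrative assumptions, not the defaults this PR actually picked:

```python
import datetime

# Illustrative value: fail a stuck rendezvous after 5 minutes instead of
# waiting out the library default of 30 minutes and losing the test logs.
DDP_TIMEOUT = datetime.timedelta(seconds=300)

# In an actual DDP setup (requires torch; shown as a comment-only sketch):
# torch.distributed.init_process_group(backend="gloo", timeout=DDP_TIMEOUT)
```

Passing a `datetime.timedelta` is how `torch.distributed.init_process_group` accepts its `timeout` argument, so shrinking it is a one-line hparams change.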
Fixed escape sequence
Tests were failing when Composer was not installed with `-e`, which caused the YAMLs to not be found and included
ravi-mosaicml requested review from ajaysaini725, anisehsani, bandish-shah, and Averylamp on November 6, 2021 01:38
jbloxham added a commit to jbloxham/composer that referenced this pull request on Nov 8, 2021
ajaysaini725 reviewed on Nov 8, 2021
ajaysaini725 approved these changes on Nov 8, 2021
LGTM 👍
jbloxham added a commit to jbloxham/composer that referenced this pull request on Nov 9, 2021
jbloxham added a commit that referenced this pull request on Nov 10, 2021
* it basically works
* WIP, seeing if CircleCI can handle the dreaded 20-wide batch
* use torch.distributed.run instead
* finally onto something
* god forbid python make any sense as a programming language
* a bit of trim
* the tests pass
* minor cleanup
* rebasing and restoring
* replace filestore with hashstore
* formatting
* pyright cleanup
* more pyright cleanup
* everything should now be green
* last pyright error
* cleanup
* fix train_model test to reduce losses across processes
* get rid of torch.distributed.run
* don't need higher version
* incorporating parts of #63
* integrate ddp sync strategy
* fix the tests
* cleanup
* address comments on launcher script
* addressing more comments
* fix pyright
* address the final comments, i think
* fix pyright
* fixing some print statements in the launch script
* how did i miss this
* initial docs
* address comments, sans docs
* fix up the docs
* fix pyright
* formatting
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request on Feb 23, 2022
- Added a timeout to the hparams for `initialize_process_group`. The default of 30 minutes was too long for failing tests, preventing any meaningful log from being collected.
- In `cleanup()`, using `os.killpg` to terminate DDP subprocesses instead of just `subprocess.kill()`. It appears that zombie processes (e.g. from DDP / dataloader workers) would sometimes linger otherwise.
- In `cleanup()`, attempting a `SIGTERM` before resorting to a `SIGKILL` 5 seconds later. Graceful cleanup is preferred!
- Directing output from stdout and stderr to tempfiles instead of `subprocess.PIPE`, which can hang if a subprocess generates significant output.
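The `os.killpg` plus SIGTERM-then-SIGKILL escalation described above can be sketched with the standard library alone. `terminate_group` and its grace period are hypothetical names for illustration, not Composer's actual helper; this works on POSIX systems only:

```python
import os
import signal
import subprocess
import sys
import time

def terminate_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """SIGTERM the whole process group, escalating to SIGKILL after a grace period."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)   # ask every process in the group to exit cleanly
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return                    # the leader exited; assume workers followed
        time.sleep(0.1)
    os.killpg(pgid, signal.SIGKILL)   # force-kill any stragglers

# start_new_session=True puts the child in its own process group, so killpg
# also reaches grandchildren such as dataloader workers that plain kill() misses.
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
terminate_group(proc, grace_seconds=2.0)
proc.wait()
```

Signaling the group rather than the single child is what prevents the zombie dataloader workers the commit message mentions, since workers forked by the child stay in the same process group.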
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request on Feb 23, 2022
* it basically works
* WIP, seeing if CircleCI can handle the dreaded 20-wide batch
* use torch.distributed.run instead
* finally onto something
* god forbid python make any sense as a programming language
* a bit of trim
* the tests pass
* minor cleanup
* rebasing and restoring
* replace filestore with hashstore
* formatting
* pyright cleanup
* more pyright cleanup
* everything should now be green
* last pyright error
* cleanup
* fix train_model test to reduce losses across processes
* get rid of torch.distributed.run
* don't need higher version
* incorporating parts of mosaicml#63
* integrate ddp sync strategy
* fix the tests
* cleanup
* address comments on launcher script
* addressing more comments
* fix pyright
* address the final comments, i think
* fix pyright
* fixing some print statements in the launch script
* how did i miss this
* initial docs
* address comments, sans docs
* fix up the docs
* fix pyright
* formatting
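The remaining cleanup change from the Feb 23 commit message, redirecting child stdout/stderr to temp files instead of `subprocess.PIPE`, avoids the classic deadlock where a full pipe buffer blocks a chatty subprocess that nobody is reading. A minimal stdlib sketch (the 1 MB payload is just for demonstration):

```python
import subprocess
import sys
import tempfile

# A pipe's OS buffer is small (typically tens of KB), so a child that writes a
# lot of output while the parent isn't reading will block forever on write().
# Spilling output to temp files on disk cannot deadlock.
with tempfile.TemporaryFile() as out_f, tempfile.TemporaryFile() as err_f:
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('x' * 1_000_000)"],  # ~1 MB of stdout
        stdout=out_f,
        stderr=err_f,
    )
    proc.wait()        # safe even though the child produced lots of output
    out_f.seek(0)      # rewind to read back whatever the child wrote
    captured = out_f.read()
```

With `stdout=subprocess.PIPE`, the same `proc.wait()` would hang once the child filled the pipe; this is exactly the failure mode the Python docs warn about and the reason the commit moved to temp files.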