Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DDP Enhancements #63

Merged
merged 23 commits into from
Nov 8, 2021
Merged

DDP Enhancements #63

merged 23 commits into from
Nov 8, 2021

Conversation

ravi-mosaicml
Copy link
Contributor

@ravi-mosaicml ravi-mosaicml commented Nov 4, 2021

  1. Added timeout to the hparams for initialize_process_group. The default of 30 minutes was too long for failing tests, which prevents one from getting any meaningful log.
  2. In cleanup(), using os.killpg to terminate DDP subprocesses instead of just subprocess.kill(). It appears that sometimes zombie processes would be still there (e.g. from ddp / dataloader workers)
  3. In cleanup(), attempting a sigterm before resorting to a sigkill() 5 seconds later. Graceful cleanup is preferred!
  4. Directing output from stdout and stderr to tempfiles instead of subprocess.PIPE, which can hang if a subprocess generates significant output

1. Added timeout to the hparams for initialize_process_group. The default of 30 minutes was too long for failing tests, which prevents one from getting any meaningful log.
2. In cleanup(), using `os.killpg` to terminate DDP subprocesses instead of just kill().
3.  In cleanup(), attempting a `sigterm` before resorting to a `sigkill()` 5 seconds later. Graceful cleanup is preferred!
@ravi-mosaicml ravi-mosaicml changed the base branch from dev to ravi/log_debug_in_tests November 4, 2021 16:41
@hanlint
Copy link
Contributor

hanlint commented Nov 8, 2021

is this going to conflict with #65 ? cc: @jbloxham

@jbloxham
Copy link
Contributor

jbloxham commented Nov 8, 2021

is this going to conflict with #65 ? cc: @jbloxham

Hmm, I definitely need to incorporate these changes into that PR, but I don't see any reason that shouldn't work. :)

jbloxham added a commit to jbloxham/composer that referenced this pull request Nov 8, 2021
composer/trainer/ddp.py Show resolved Hide resolved
composer/trainer/ddp.py Outdated Show resolved Hide resolved
composer/trainer/ddp.py Outdated Show resolved Hide resolved
composer/trainer/ddp.py Outdated Show resolved Hide resolved
Base automatically changed from ravi/log_debug_in_tests to dev November 8, 2021 22:42
Copy link
Contributor

@ajaysaini725 ajaysaini725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@ravi-mosaicml ravi-mosaicml merged commit 94ebc10 into dev Nov 8, 2021
@ravi-mosaicml ravi-mosaicml deleted the ravi/ddp_enhancements branch November 8, 2021 23:28
jbloxham added a commit to jbloxham/composer that referenced this pull request Nov 9, 2021
jbloxham added a commit that referenced this pull request Nov 10, 2021
* it basically works

* WIP, seeing if CircleCI can handle the dreaded 20-wide batch

* use torch.distributed.run instead

* finally onto something

* god forbid python make any sense as a programming language

* a bit of trim

* the tests pass

* minor cleanup

* rebasing and restoring

* replace filestore with hashstore

* formatting

* pyright cleanup

* more pyright cleanup

* everything should now be green

* last pyright error

* cleanup

* fix train_model test to reduce losses across processes

* get rid of torch.distributed.run

* don't need higher version

* incorporating parts of #63

* integrate ddp sync strategy

* fix the tests

* cleanup

* address comments on launcher script

* addressing more comments

* fix pyright

* address the final comments, i think

* fix pyright

* fixing some print statements in the launch script

* how did i miss this

* initial docs

* address comments, sans docs

* fix up the docs

* fix pyright

* formatting
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
Added timeout to the hparams for initialize_process_group. The default of 30 minutes was too long for failing tests, which prevents one from getting any meaningful log.
In cleanup(), using os.killpg to terminate DDP subprocesses instead of just subprocess.kill(). It appears that sometimes zombie processes would be still there (e.g. from ddp / dataloader workers)
In cleanup(), attempting a sigterm before resorting to a sigkill() 5 seconds later. Graceful cleanup is preferred!
Directing output from stdout and stderr to tempfiles instead of subprocess.PIPE, which can hang if a subprocess generates significant output
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
* it basically works

* WIP, seeing if CircleCI can handle the dreaded 20-wide batch

* use torch.distributed.run instead

* finally onto something

* god forbid python make any sense as a programming language

* a bit of trim

* the tests pass

* minor cleanup

* rebasing and restoring

* replace filestore with hashstore

* formatting

* pyright cleanup

* more pyright cleanup

* everything should now be green

* last pyright error

* cleanup

* fix train_model test to reduce losses across processes

* get rid of torch.distributed.run

* don't need higher version

* incorporating parts of mosaicml#63

* integrate ddp sync strategy

* fix the tests

* cleanup

* address comments on launcher script

* addressing more comments

* fix pyright

* address the final comments, i think

* fix pyright

* fixing some print statements in the launch script

* how did i miss this

* initial docs

* address comments, sans docs

* fix up the docs

* fix pyright

* formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants