Executable for distributed launching #65

jbloxham · 2021-11-04T21:52:38Z

Refactors DDP initialization to make use of a dedicated executable for launching multiprocessing, rather than relaunching in the middle of the trainer initialization itself. For now, this executable takes only a single argument, --nproc, but this can easily be extended to handle multinode setups, and could also invoke the deepspeed exectuable. Most of the previous DDP hparams, like DDP store and fork_rank_0, are no more.

I experimented a bit with having this executable wrap torch.distributed.run, but I found that script to be buggy, verbose, and lacking good config options (for example, the script doesn't support nproc=1 at all, and some of the config options get confused if you try passing a 2-digit number for node rank). Instead, I repurposed a lot of the existing DDP code.

Here's a pair of RN50 runs on 4x3080s to demonstrate mathematical equivalence: https://wandb.ai/mosaic-ml/jamie-ddp-launch-regression?workspace=user-jbloxham. I accidentally only ran to 5 epochs, but I think this is enough to be compelling.

Most of the complexity of this PR comes from refactoring the tests so that they are no longer dependent on the previous behavior. The solution I landed on was to introduce a helper function with_distributed that runs a function with multiprocessing. This turned out to be fairly versatile - it was convenient for checkpointing tests to be able to invoke this function twice within a single test, for instance, which wasn't possible before.

Admittedly, I still need to add docs, and I swear by my honor that I will not merge this PR before then. The code is final though, and ready for review.

ravi-mosaicml

Great work. Love the simplicity of the launch script. Main comment relates to testing and whether it would be possible to use the launch script to run pytest, instead of using mp.spawn. Other than that, most other comments are minor.

composer/cli/launcher.py

tests/fixtures/ddp_fixtures.py

tests/trainer/test_trainer.py

tests/utils/trainer_fit.py

ravi-mosaicml

Looking really close...just a final few more comments.

composer/cli/launcher.py

composer/trainer/ddp.py

composer/utils/ddp.py

docs/source/getting_started/distributed.rst

composer/utils/ddp.py

composer/cli/launcher.py

…o distributed-launch

ravi-mosaicml

💯

#65 made the global rank available in the process start, so it is no longer necessarry to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state.

#65 made the global rank available in the process start, so it is no longer necessarry to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in init. This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state.

Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algroithms, etc... to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality. Also removed DDP from the state, as that is available globally.

1. dataset.initialize_object() returns a Dataloader (or a Dataloader, split_fn, and preprocessing_fn tuple), rather than a dataset and dataloader init args. This is made possible by #65. 2. Better synethic dataset support. Each dataset hparams can optionally have a `synthetic` field. If this field exists and is not None, then `dataset.initialize_object()` should return a synthetic dataset instead of the real one. The `synthetic` field can contain user-specified hparams for the synthetic dataset, such as the # of unique samples to create, device, or memory format. 3. Subset datasets. Each dataset hprams can optionally have a `num_total_batches` field. If this field exists and is not None, then the dataset should limit the total number of samples to this value times the batch size parameter (which is passed in on initialize_object). This field allows for smoketesting (e.g. set num_total_batches to 1 to ensure that the path is accessible on disk). If used with a synthetic dataset, then the synthetic dataset will have this many batches. TODO: - [ ] Fix tests - [ ] Update model yamls for new format - [ ] Create issues for adding synthetic datagen for NLP models and brats - [ ] Add tests for num_total_batches. There will probably be some issues with DDP and DistributedSampler.

## 1. Added a `run_event` function for callbacks. Previously, callbacks had one method corresponding to each event name. This format isn't flexible to callbacks that need to run on all events, and to future profiling work that may use callbacks to run on things other than events. Instead, this PR adds a more generic structure, where a callback can instead implement a `def _run_event(event, state, logger)`. The engine calls `callback.run_event()` which dispatches to `callback._run_event()`. By default, `callback._run_event()` calls the methods named after each event, but individual callbacks can override it for generic functionality. Closes #11, cleaned up some of the tests and rank zero callbacks. ## 2. Removed deferred logging #65 made the rank available upon startup, which made the deferred logging docs out of date and the use case for deferred logging obsolete. Now, callbacks initialize themselves upon `Event.INIT`. Closes #87.

Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algorithms, etc... to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality. Also removed DDP from the state, as that is available globally.

* Added `run_event` to callback Closes #11 This PR helps clean up some of the tests, rank zero callbacks, and will be used by future profiling work. * Removed callback helper methods * Fixed tests * Formatting * Addressed PR feedback * Fixed tests * Formatting * Fixed _run_event * Formatting * Removed ip * Instrumentation WIP * Stash * Create dataloader on trainer __init__() #65 made the global rank available in the process start, so it is no longer necessarry to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state. * Stash * Added JSON trace handler * Formatting * Fixed trace generation * Prettified memory * Fixed setup.py * Changed setup.py * testing * Removed prepare * Run Directory Uploader Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Closes #98. Depends on #85 and (for tests) #92. * Supporting both styles for callbacks Removed deferred logging since rank is now known at the init event * Minimizing Diff * Fixed tests * Added fasteners * Fixed tests * Formatting * Lazy population of kwargs * 1. Added object_name_prefix 2. Tested on google cloud storage 3. Added exponential backoff and retrying for transient errors * Addressed PR feedback * Remove the composer.trainer.ddp class Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algroithms, etc... to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality. Also removed DDP from the state, as that is available globally. * Added in DDP barrier * Fixed tests * Update composer/utils/ddp.py * Update composer/utils/ddp.py * Switched tqdm to using callback hooks Added test case for TQDM * Fixed pyright * Fixed DDP barriers * Increased timeout for run directory uploader * Switched callback format for run directory uploader * Replaced `atexit` with cleanup methods When running the trainer multiple times, such as in interactive enviroments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks. `.close()` can be used to write and flush files. `.post_close()` can be used to backup the run directory and capture any changes that may have been made on `.close()` * Uncommented code * Running callbacks befor algorithms for the INIT event in the engine * For the INIT event, run the callbacks first to initialize the loggers. * For other events, run the algorithms first, so the callbacks have the state after algorithms modify it. * Fixed tests * Addressed PR feedback * Added in the scheduler * Added instant events * Fixes * Fixed profile scheduling * Added decorator option * Formatting * Added documentation for the profiler * 1. Added test cases 2. Fixed trace files to be proper json on successful training runs * Profiler entry point * Ravi/instrumentation point (#140) 1. Using `os.getpid()` for process IDs to enable synchronization with the pytorch profiler 2. Switched to using object format instead of array format for the traces 3. Added in extra metadata such as global rank and timestamps for clock syncing * Writing metadata to a seperate file * Fixed tests * Removed the perf counter * Recording IO stats * Log global rank in each torch profiler file * Merging process traces (#144) * Refactor the system profiler and dataloader profiler into callbacks Configuring the pytorch profiler based off of the mosaic profiler hparams * 1. Updated the merge script to merge pytorch trace files 2. Renamed the `MosaicProfiler` to `Profiler` * Increased timeout * Formatting * Fixed the `run_mosaic_profiler` * Added detailed option * Added sort index * Setting `pid` to global rank and `tid` to `os.getpid()` The pytorch profiler uses `os.getpid()` for the thread id. Updating the training loop profiler to be consistent so the events will interleave. Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since pytorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict. * Simplifying diff * Put the backwards thread second * Thread sorting in trace * Fix * Fixes * Fixed tests * Fixed the profiler * Fixes Co-authored-by: Jamie Bloxham <[email protected]> Co-authored-by: Bandish Shah <[email protected]> Co-authored-by: anisehsani <[email protected]>

* it basically works * WIP, seeing if CircleCI can handle the dreaded 20-wide batch * use torch.distributed.run instead * finally onto something * god forbid python make any sense as a programming language * a bit of trim * the tests pass * minor cleanup * rebasing and restoring * replace filestore with hashstore * formatting * pyright cleanup * more pyright cleanup * everything should now be green * last pyright error * cleanup * fix train_model test to reduce losses across processes * get rid of torch.distributed.run * don't need higher version * incorporating parts of mosaicml#63 * integrate ddp sync strategy * fix the tests * cleanup * address comments on launcher script * addressing more comments * fix pyright * address the final comments, i think * fix pyright * fixing some print statements in the launch script * how did i miss this * initial docs * address comments, sans docs * fix up the docs * fix pyright * formatting

mosaicml#65 made the global rank available in the process start, so it is no longer necessarry to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in init. This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state.

## 1. Added a `run_event` function for callbacks. Previously, callbacks had one method corresponding to each event name. This format isn't flexible to callbacks that need to run on all events, and to future profiling work that may use callbacks to run on things other than events. Instead, this PR adds a more generic structure, where a callback can instead implement a `def _run_event(event, state, logger)`. The engine calls `callback.run_event()` which dispatches to `callback._run_event()`. By default, `callback._run_event()` calls the methods named after each event, but individual callbacks can override it for generic functionality. Closes #11, cleaned up some of the tests and rank zero callbacks. ## 2. Removed deferred logging mosaicml#65 made the rank available upon startup, which made the deferred logging docs out of date and the use case for deferred logging obsolete. Now, callbacks initialize themselves upon `Event.INIT`. Closes mosaicml#87.

…aicml#105) Before mosaicml#65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algorithms, etc... to use DDP (such as barriers and reductions) as needed. mosaicml#97 and mosaicml#101 depend on this functionality. Also removed DDP from the state, as that is available globally.

* Added `run_event` to callback Closes #11 This PR helps clean up some of the tests, rank zero callbacks, and will be used by future profiling work. * Removed callback helper methods * Fixed tests * Formatting * Addressed PR feedback * Fixed tests * Formatting * Fixed _run_event * Formatting * Removed ip * Instrumentation WIP * Stash * Create dataloader on trainer __init__() mosaicml#65 made the global rank available in the process start, so it is no longer necessarry to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__. This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state. * Stash * Added JSON trace handler * Formatting * Fixed trace generation * Prettified memory * Fixed setup.py * Changed setup.py * testing * Removed prepare * Run Directory Uploader Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin. Closes mosaicml#98. Depends on mosaicml#85 and (for tests) mosaicml#92. * Supporting both styles for callbacks Removed deferred logging since rank is now known at the init event * Minimizing Diff * Fixed tests * Added fasteners * Fixed tests * Formatting * Lazy population of kwargs * 1. Added object_name_prefix 2. Tested on google cloud storage 3. Added exponential backoff and retrying for transient errors * Addressed PR feedback * Remove the composer.trainer.ddp class Before mosaicml#65, composer.trainer.ddp ensured that DDP functionality was accessed only after ddp was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp. This change allows callbacks, algroithms, etc... to use DDP (such as barriers and reductions) as needed. mosaicml#97 and mosaicml#101 depend on this functionality. Also removed DDP from the state, as that is available globally. * Added in DDP barrier * Fixed tests * Update composer/utils/ddp.py * Update composer/utils/ddp.py * Switched tqdm to using callback hooks Added test case for TQDM * Fixed pyright * Fixed DDP barriers * Increased timeout for run directory uploader * Switched callback format for run directory uploader * Replaced `atexit` with cleanup methods When running the trainer multiple times, such as in interactive enviroments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks. `.close()` can be used to write and flush files. `.post_close()` can be used to backup the run directory and capture any changes that may have been made on `.close()` * Uncommented code * Running callbacks befor algorithms for the INIT event in the engine * For the INIT event, run the callbacks first to initialize the loggers. * For other events, run the algorithms first, so the callbacks have the state after algorithms modify it. * Fixed tests * Addressed PR feedback * Added in the scheduler * Added instant events * Fixes * Fixed profile scheduling * Added decorator option * Formatting * Added documentation for the profiler * 1. Added test cases 2. Fixed trace files to be proper json on successful training runs * Profiler entry point * Ravi/instrumentation point (mosaicml#140) 1. Using `os.getpid()` for process IDs to enable synchronization with the pytorch profiler 2. Switched to using object format instead of array format for the traces 3. Added in extra metadata such as global rank and timestamps for clock syncing * Writing metadata to a seperate file * Fixed tests * Removed the perf counter * Recording IO stats * Log global rank in each torch profiler file * Merging process traces (mosaicml#144) * Refactor the system profiler and dataloader profiler into callbacks Configuring the pytorch profiler based off of the mosaic profiler hparams * 1. Updated the merge script to merge pytorch trace files 2. Renamed the `MosaicProfiler` to `Profiler` * Increased timeout * Formatting * Fixed the `run_mosaic_profiler` * Added detailed option * Added sort index * Setting `pid` to global rank and `tid` to `os.getpid()` The pytorch profiler uses `os.getpid()` for the thread id. Updating the training loop profiler to be consistent so the events will interleave. Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since pytorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict. * Simplifying diff * Put the backwards thread second * Thread sorting in trace * Fix * Fixes * Fixed tests * Fixed the profiler * Fixes Co-authored-by: Jamie Bloxham <[email protected]> Co-authored-by: Bandish Shah <[email protected]> Co-authored-by: anisehsani <[email protected]>

jbloxham force-pushed the distributed-launch branch 2 times, most recently from 1fe40e7 to e381a34 Compare November 5, 2021 19:02

jbloxham marked this pull request as ready for review November 5, 2021 23:31

jbloxham requested review from Averylamp, ravi-mosaicml and ajaysaini725 and removed request for Averylamp November 5, 2021 23:32

hanlint mentioned this pull request Nov 8, 2021

DDP Enhancements #63

Merged

jbloxham force-pushed the distributed-launch branch from ad74470 to e99b84d Compare November 8, 2021 19:50

ravi-mosaicml suggested changes Nov 8, 2021

View reviewed changes

jbloxham added 19 commits November 9, 2021 00:20

it basically works

b46feb6

WIP, seeing if CircleCI can handle the dreaded 20-wide batch

30b12f6

use torch.distributed.run instead

d311429

finally onto something

fee7e2d

god forbid python make any sense as a programming language

ee32119

a bit of trim

1dd32f0

the tests pass

f66150a

minor cleanup

cee8bff

rebasing and restoring

298769f

replace filestore with hashstore

76b91bf

formatting

188dfc6

pyright cleanup

cb7033a

more pyright cleanup

8c69837

everything should now be green

1686428

last pyright error

d77898d

cleanup

a0bcf17

fix train_model test to reduce losses across processes

0ea8c79

get rid of torch.distributed.run

4805487

don't need higher version

2964a9d

ravi-mosaicml suggested changes Nov 10, 2021

View reviewed changes

jbloxham added 4 commits November 10, 2021 21:21

address comments, sans docs

586ea19

Merge branch 'distributed-launch' of github.com:jbloxham/composer int…

5b129db

…o distributed-launch

fix up the docs

c9f5225

fix pyright

d5427e9

jbloxham requested a review from ravi-mosaicml November 10, 2021 21:50

formatting

24e40fd

ravi-mosaicml approved these changes Nov 10, 2021

View reviewed changes

jbloxham merged commit fe8f90e into mosaicml:dev Nov 10, 2021

hanlint mentioned this pull request Nov 11, 2021

Training run processes do not stop at the end of training #51

Closed

ravi-mosaicml mentioned this pull request Nov 15, 2021

Launch DDP processes before initializing trainer #59

Closed

ravi-mosaicml linked an issue Nov 15, 2021 that may be closed by this pull request

Launch DDP processes before initializing trainer #59

Closed

This was referenced Nov 15, 2021

Auto-TCP Port Selection by default #77

Closed

Remove deferred logging #87

Closed

This was referenced Nov 19, 2021

Create dataloader on trainer __init__() #92

Merged

run_event for callbacks, removed deferred logging #85

Merged

ravi-mosaicml mentioned this pull request Nov 23, 2021

Remove composer.trainer.ddp; replace with composer.utils.ddp #105

Merged

ravi-mosaicml mentioned this pull request Nov 30, 2021

Synthetic Datasets and Subset Sampling #110

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Executable for distributed launching #65

Executable for distributed launching #65

jbloxham commented Nov 4, 2021 •

edited

Loading

ravi-mosaicml left a comment

ravi-mosaicml left a comment

ravi-mosaicml left a comment

Executable for distributed launching #65

Executable for distributed launching #65

Conversation

jbloxham commented Nov 4, 2021 • edited Loading

ravi-mosaicml left a comment

Choose a reason for hiding this comment

ravi-mosaicml left a comment

Choose a reason for hiding this comment

ravi-mosaicml left a comment

Choose a reason for hiding this comment

jbloxham commented Nov 4, 2021 •

edited

Loading