Create dataloader on trainer __init__() #92

Merged
merged 3 commits into from
Nov 23, 2021

Conversation

ravi-mosaicml

#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__.

This change will help with dataloader profiling, as the dataloader is now bound to the state immediately.
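
As a rough illustration of the change described above, here is a minimal sketch of a trainer that builds its dataloaders in `__init__` and binds them to the state right away. The names `SketchState`, `SketchTrainer`, and the `make_*_dataloader` factories are hypothetical and are not taken from the actual code; plain lists stand in for real dataloaders, and the global rank is assumed to be known when the process starts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class SketchState:
    # Hypothetical, trimmed-down stand-in for the trainer state.
    max_epochs: int
    train_dataloader: Iterable
    eval_dataloader: Iterable


class SketchTrainer:
    def __init__(
        self,
        make_train_dataloader: Callable[[int], Iterable],
        make_eval_dataloader: Callable[[int], Iterable],
        max_epochs: int,
        global_rank: int,
    ) -> None:
        # The rank is assumed to be known at process start, so the loaders
        # can be built here instead of in a later training_start() hook.
        self.state = SketchState(
            max_epochs=max_epochs,
            train_dataloader=make_train_dataloader(global_rank),
            eval_dataloader=make_eval_dataloader(global_rank),
        )

    def fit(self) -> int:
        # Count the training batches seen; a real loop would train on them.
        batches_seen = 0
        for _ in range(self.state.max_epochs):
            for _batch in self.state.train_dataloader:
                batches_seen += 1
        return batches_seen


trainer = SketchTrainer(
    make_train_dataloader=lambda rank: [f"rank{rank}-batch{i}" for i in range(3)],
    make_eval_dataloader=lambda rank: [f"rank{rank}-eval"],
    max_epochs=2,
    global_rank=0,
)
# The dataloaders are bound to the state before fit() is ever called,
# so anything that inspects the state (such as a profiler) can see them.
print(trainer.state.train_dataloader)
print(trainer.fit())
```

Because the loaders exist as soon as the trainer is constructed, code that wraps or instruments them does not need to wait for a training-start event.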
@ravi-mosaicml ravi-mosaicml requested review from Averylamp, jbloxham and a team November 19, 2021 19:29
@ravi-mosaicml ravi-mosaicml mentioned this pull request Nov 20, 2021
13 tasks
ravi-mosaicml added a commit that referenced this pull request Nov 22, 2021
Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin.

Closes #98. Depends on #85 and (for tests) #92.
@ravi-mosaicml ravi-mosaicml mentioned this pull request Nov 22, 2021
4 tasks

@ajaysaini725 ajaysaini725 left a comment


This doesn't change the external API at all so LGTM

I think we should discuss whether __init__ should take any dataset information at all, or whether the train and eval datasets should instead be passed directly into fit() (this is standard among other ML libraries such as sklearn and HuggingFace).

@@ -108,6 +108,10 @@ class State(Serializable):
# stopping conditions
max_epochs: int

# dataloaders
train_dataloader: types.DataLoader
eval_dataloader: types.DataLoader

@anisehsani should be aware of this change since he's working on supporting multiple eval dataloaders


Thanks for the ping. This actually simplifies some things for me!


I'd be happy to discuss this, though my initial thought is to keep fit() as a zero-argument function, which allows __init__ to serve as a kind of static analysis check.
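
To make the "static analysis check" idea concrete, here is a small sketch, assuming a hypothetical `ValidatingTrainer` whose constructor receives everything and validates it up front, leaving `fit()` with no arguments. The class, its parameters, and the specific checks are illustrative assumptions, not the project's actual API.

```python
from typing import Sequence


class ValidatingTrainer:
    """Hypothetical trainer: all configuration goes through __init__."""

    def __init__(
        self,
        train_batches: Sequence,
        eval_batches: Sequence,
        max_epochs: int,
    ) -> None:
        # Validate everything up front so a bad configuration fails here,
        # before any training time is spent.
        if max_epochs <= 0:
            raise ValueError("max_epochs must be positive")
        if len(train_batches) == 0:
            raise ValueError("train_batches must not be empty")
        if len(eval_batches) == 0:
            raise ValueError("eval_batches must not be empty")
        self.train_batches = train_batches
        self.eval_batches = eval_batches
        self.max_epochs = max_epochs

    def fit(self) -> int:
        # Zero-argument fit(): nothing new can be misconfigured here.
        # Returns the number of training batches that would be processed.
        return self.max_epochs * len(self.train_batches)


ok = ValidatingTrainer(train_batches=[1, 2], eval_batches=[3], max_epochs=2)
print(ok.fit())

try:
    ValidatingTrainer(train_batches=[], eval_batches=[3], max_epochs=2)
except ValueError as err:
    # The misconfiguration surfaces at construction time, not mid-training.
    print(f"rejected: {err}")
```

The trade-off is that a constructed trainer is tied to one set of datasets, which is the flexibility the `fit(train, eval)` style used by other libraries offers instead.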

@ravi-mosaicml ravi-mosaicml merged commit 2219555 into dev Nov 23, 2021
@ravi-mosaicml ravi-mosaicml deleted the ravi/create_dataloaders_in_init branch November 23, 2021 00:50
hanlint pushed a commit that referenced this pull request Jan 19, 2022
* Added `run_event` to callback

Closes #11

This PR helps clean up some of the tests, rank zero callbacks, and will be used by future profiling work.

* Removed callback helper methods

* Fixed tests

* Formatting

* Addressed PR feedback

* Fixed tests

* Formatting

* Fixed _run_event

* Formatting

* Removed ip

* Instrumentation WIP

* Stash

* Create dataloader on trainer __init__()

#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__.

This change will help with dataloader profiling, as the dataloader is now bound to the state immediately.

* Stash

* Added JSON trace handler

* Formatting

* Fixed trace generation

* Prettified memory

* Fixed setup.py

* Changed setup.py

* testing

* Removed prepare

* Run Directory Uploader

Added uploading of the run directory to various cloud providers via a callback. Depends on the LibCloud plugin.

Closes #98. Depends on #85 and (for tests) #92.

* Supporting both styles for callbacks
Removed deferred logging since rank is now known at the init event

* Minimizing Diff

* Fixed tests

* Added fasteners

* Fixed tests

* Formatting

* Lazy population of kwargs

* 1. Added object_name_prefix
2. Tested on google cloud storage
3. Added exponential backoff and retrying for transient errors

* Addressed PR feedback

* Remove the composer.trainer.ddp class

Before #65, composer.trainer.ddp ensured that DDP functionality was accessed only after DDP was initialized. Now, DDP is available from process start, so this class is no longer needed. Moved all the functionality from this class to the global composer.utils.ddp.

This change allows callbacks, algorithms, etc. to use DDP (such as barriers and reductions) as needed. #97 and #101 depend on this functionality.

Also removed DDP from the state, as that is available globally.

* Added in DDP barrier

* Fixed tests

* Update composer/utils/ddp.py

* Update composer/utils/ddp.py

* Switched tqdm to using callback hooks
Added test case for TQDM

* Fixed pyright

* Fixed DDP barriers

* Increased timeout for run directory uploader

* Switched callback format for run directory uploader

* Replaced `atexit` with cleanup methods

When running the trainer multiple times, such as in interactive environments, `atexit` does not fire. Instead, replaced it with `.close()` and `.post_close()` hooks on callbacks.

`.close()` can be used to write and flush files. `.post_close()` can be used to back up the run directory and capture any changes that may have been made on `.close()`.

* Uncommented code

* Running callbacks before algorithms for the INIT event in the engine

* For the INIT event, run the callbacks first to initialize the loggers.
* For other events, run the algorithms first, so the callbacks have the state after algorithms modify it.

* Fixed tests

* Addressed PR feedback

* Added in the scheduler

* Added instant events

* Fixes

* Fixed profile scheduling

* Added decorator option

* Formatting

* Added documentation for the profiler

* 1. Added test cases
2. Fixed trace files to be proper json on successful training runs

* Profiler entry point

* Ravi/instrumentation point (#140)

1. Using `os.getpid()` for process IDs to enable synchronization with the pytorch profiler
2. Switched to using object format instead of array format for the traces
3. Added in extra metadata such as global rank and timestamps for clock syncing

* Writing metadata to a separate file

* Fixed tests

* Removed the perf counter

* Recording IO stats

* Log global rank in each torch profiler file

* Merging process traces (#144)

* Refactor the system profiler and dataloader profiler into callbacks
Configuring the pytorch profiler based off of the mosaic profiler hparams

* 1. Updated the merge script to merge pytorch trace files
2. Renamed the `MosaicProfiler` to `Profiler`

* Increased timeout

* Formatting

* Fixed the `run_mosaic_profiler`

* Added detailed option

* Added sort index

* Setting `pid` to global rank and `tid` to `os.getpid()`

The pytorch profiler uses `os.getpid()` for the thread id. Updating the training loop profiler to be consistent so the events will interleave.

Updated the merge script to replace the PID with the global rank. This ensures that GPU streams will show up under the correct rank, since pytorch by default uses the local GPU rank as the PID. This change also ensures that traces will merge properly across nodes where PIDs could conflict.

* Simplifying diff

* Put the backwards thread second

* Thread sorting in trace

* Fix

* Fixes

* Fixed tests

* Fixed the profiler

* Fixes

Co-authored-by: Jamie Bloxham <[email protected]>
Co-authored-by: Bandish Shah <[email protected]>
Co-authored-by: anisehsani <[email protected]>
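
One of the commits above changes the engine so that callbacks run before algorithms for the INIT event, while algorithms run first for every other event. A minimal sketch of that ordering rule follows; the `run_event` signature, the `Handler` type, the dict-based state, and the `BATCH_END` event name are assumptions made for illustration and do not reflect the engine's real interface.

```python
from typing import Callable, List, Sequence

# Hypothetical handler type: takes the event name and a mutable state dict.
Handler = Callable[[str, dict], None]


def run_event(
    event: str,
    state: dict,
    algorithms: Sequence[Handler],
    callbacks: Sequence[Handler],
) -> None:
    if event == "INIT":
        # Callbacks first, so loggers are initialized before algorithms run.
        ordered = [*callbacks, *algorithms]
    else:
        # Algorithms first, so callbacks observe the state after it changes.
        ordered = [*algorithms, *callbacks]
    for handler in ordered:
        handler(event, state)


def make_handler(name: str, log: List[str]) -> Handler:
    # Build a handler that records when it was invoked.
    def handler(event: str, state: dict) -> None:
        log.append(f"{event}:{name}")

    return handler


log: List[str] = []
algorithms = [make_handler("algorithm", log)]
callbacks = [make_handler("callback", log)]
run_event("INIT", {}, algorithms, callbacks)
run_event("BATCH_END", {}, algorithms, callbacks)
print(log)
# → ['INIT:callback', 'INIT:algorithm', 'BATCH_END:algorithm', 'BATCH_END:callback']
```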
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
mosaicml#65 made the global rank available at process start, so it is no longer necessary to wait until training_start() to create the dataloader. Instead, dataloaders are now initialized in __init__.

This change will help with dataloader profiling, as now the dataloader will be immediately bound to the state.
coryMosaicML pushed a commit to coryMosaicML/composer that referenced this pull request Feb 23, 2022
3 participants