[feat] batch broadcast requests into a configurable buffer #43
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   94.18%   94.26%   +0.07%
==========================================
  Files          35       35
  Lines        2065     2092      +27
==========================================
+ Hits         1945     1972      +27
  Misses        120      120
```
benchmarks/oss.py (outdated)
```diff
@@ -119,7 +119,7 @@ def closure():
     print(f"[{dist.get_rank()}] : Mean speed: {mean:.2f} +/- {std:.2f}")

     if use_oss and check_regression and dist.get_rank() == 0:
-        assert (mean - 3.0 * std) < reference_speed, "Speed regression detected"
+        assert (mean + 3.0 * std) > reference_speed, "Speed regression detected"
```
This was actually wrong before: we want speed to be equal to or better than the baseline, not lower. With the corrected direction, the assert only fires when even mean + 3 * std falls short of the reference speed, i.e. when the slowdown is beyond measurement noise.
benchmarks/oss.py (outdated)
```diff
-parser.add_argument("--check_regression", action="store", default=True, type=bool)
-parser.add_argument("--reference_speed", action="store", default=39.82, type=float)
+parser.add_argument("--check_regression", action="store_true", default=False)
+parser.add_argument("--reference_speed", action="store", default=33, type=float)
```
A previous PR switched the optimizer from SGD to RMSprop, which is a tad slower. I checked that this 33 fps value is not degraded by this PR.
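For context, the `store_true` change in this diff also fixes a classic argparse pitfall: with `type=bool`, argparse calls `bool()` on the raw string, and any non-empty string is truthy. A minimal illustration (`--broken_flag` is a made-up name for demonstration):

```python
import argparse

parser = argparse.ArgumentParser()
# Pitfall: bool("False") is True, so "--broken_flag False" still yields True.
parser.add_argument("--broken_flag", action="store", default=True, type=bool)
# Fix: a store_true flag is False unless it is passed on the command line.
parser.add_argument("--check_regression", action="store_true", default=False)

args = parser.parse_args(["--broken_flag", "False"])
assert args.broken_flag is True      # surprising, but that's how bool() works
assert args.check_regression is False
```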
It is not safe to assume that all params in a param_group are on the same device.
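A minimal sketch of what honoring that could look like, bucketing each parameter by its own device instead of reading the device off the first one (`group_params_by_device` is an illustrative helper, not the PR's code):

```python
from collections import defaultdict
from typing import Dict, List

import torch

def group_params_by_device(param_groups) -> Dict[torch.device, List[torch.Tensor]]:
    """Bucket parameters by their actual device; a single param_group
    may legitimately mix CPU and CUDA tensors."""
    per_device: Dict[torch.device, List[torch.Tensor]] = defaultdict(list)
    for group in param_groups:
        for param in group["params"]:
            per_device[param.device].append(param)
    return per_device
```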
fairscale/optim/oss.py
```diff
@@ -44,12 +45,25 @@ class OSS(Optimizer):
         optimizer to shard (default: SGD)
     group (group):
         torch.distributed group (default: group.WORLD)
+    buffer_size (int, optional): number of elements to buffer before
```
What does optional mean in this context? The parameter does not look optional.
I meant to write that people are free to pass it in or not; there's a default provided.
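In other words, `optional` follows the usual docstring convention: the argument has a default and may be omitted. A hypothetical call site (assuming, as the OSS signature quoted below suggests, that extra kwargs such as `lr` are forwarded to the wrapped optimizer; the explicit buffer value is illustrative, not the library default):

```python
import torch
from fairscale.optim.oss import OSS

model = torch.nn.Linear(128, 128)

# buffer_size can be omitted; OSS supplies a default.
opt_default = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)

# ...or passed explicitly to tune the broadcast batching.
opt_tuned = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1, buffer_size=2**20)
```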
fairscale/optim/oss.py (outdated)
```diff
@@ -67,6 +81,12 @@ def __init__(self, params: _params_t, optim: Type[Optimizer] = SGD, group: Any =
     # Current device is set by the parameters allocated to this rank
     self._device = split_param_groups[self.rank][0]["params"][0].device

+    # Broadcast buffer settings
+    self._buffer: Optional[torch.Tensor] = None
```
Parameters cannot change after init, so you could pre-process the params and pre-create the batch buffers.
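A minimal sketch of that suggestion (the helper name and sizing policy are illustrative, not the PR's eventual code):

```python
import torch

def precreate_broadcast_buffers(split_param_groups, buffer_size, device):
    """One flat broadcast buffer per rank, sized at construction time.

    Since the parameter set is fixed after __init__, each rank's buffer
    can be sized up front (capped at buffer_size) so that step() never
    has to allocate.
    """
    buffers = []
    for rank_groups in split_param_groups:
        # Total elements of this rank's "small" params, i.e. those that
        # will go through the batched path rather than a direct broadcast.
        small_total = sum(
            p.numel()
            for group in rank_groups
            for p in group["params"]
            if p.numel() < buffer_size
        )
        buffers.append(torch.empty(min(small_total, buffer_size), device=device))
    return buffers
```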
Now deduplicating between oss_ddp and oss @min-xu-ai, slightly tighter. I think it would be nicer to adapt the buffering parameters to the loaded model, since that's something we know at construction time (very large model and few shards -> adapt the buffer size).
Something else I'm planning to do is to pre-sort the parameters by size at construction time, so that the step and oss_ddp logic is simpler, as suggested by Min; a sketch of the idea follows.
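A sketch of that pre-sort (illustrative helper, not the merged code): sorting once at construction lets step() fill the broadcast buffer greedily and fall back to direct broadcasts only for the oversized tail.

```python
def presort_params_by_size(param_groups):
    """Sort each group's params smallest-first, in place, once at init."""
    for group in param_groups:
        group["params"].sort(key=lambda p: p.numel())
    return param_groups
```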
What does this PR do?
Improves on #42 by batching the small broadcasts into a bigger one. There's a tradeoff between doing more copies and incurring less latency on the communication side, so both the broadcast buffer and the size above which a broadcast goes out directly are configurable.
This is a long-lived PR, so some listed commits are unrelated and come from merges with upstream master over time.
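The batching idea in miniature, as a hedged sketch: `SMALL_LIMIT`, `buffer`, and `batched_broadcast` are stand-ins for the PR's configurable settings and actual API, which also handles async requests and per-rank buffers.

```python
import torch
import torch.distributed as dist

SMALL_LIMIT = 2**17          # illustrative: above this, broadcast directly
buffer = torch.empty(2**20)  # illustrative: pre-allocated flat buffer

def batched_broadcast(tensors, src):
    """Pack small tensors into one flat broadcast; send large ones directly.

    Assumes contiguous tensors on the buffer's device.
    """
    offset, packed = 0, []
    for t in tensors:
        n = t.numel()
        if n > SMALL_LIMIT or offset + n > buffer.numel():
            dist.broadcast(t, src=src)  # big (or overflow): one direct call
        else:
            buffer[offset : offset + n].copy_(t.view(-1))  # only matters on src
            packed.append((t, offset, n))
            offset += n
    if packed:
        dist.broadcast(buffer[:offset], src=src)  # one call for all small tensors
        for t, off, n in packed:
            t.view(-1).copy_(buffer[off : off + n])  # unpack on every rank
```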
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃