Allow sharded grad scaler to cpu offload with FSDP #831
Conversation
Added a test for the ShardedGradScaler class with no cpu offload.
@anj-s some of the tests are failing because isort checks are failing on many of the files that I did not touch. I will apply isort to the entire repo in a different PR so that this is easy to review.
fairscale/optim/grad_scaler.py
Outdated
super().__init__(
    init_scale=init_scale,
    growth_factor=growth_factor,
    backoff_factor=backoff_factor,
    growth_interval=growth_interval,
    enabled=enabled,
)
self.display_warning = True
self.group = process_group
if enabled and amp_definitely_not_available():  # type: ignore
How does this work for the CPU-only version of GradScaler? Even though this is a CPU-only version, we are assuming that we have GPUs to run the model. Just wondering if that is a fair assumption to make.
I don't have a clear answer to this. Would the cpu-tests be enough to ensure that this works?
A bunch of tests are now skipped with 1.8 as well. Since 1.8 is LTS, do we still want those to be tested?
tests/nn/data_parallel/test_fsdp.py
Outdated
@@ -305,6 +314,24 @@ def test_cpu_offload_and_cpu_grads(self):
        )
        spawn_and_init(test_fn)

    def test_no_cpu_offload_with_sharded_grad_scaler(self):
        # We don't test the False condition because that requires the optimizer to internally do
Nit: I know you did not add it, but I wonder if we need this comment duplicated in all three tests. Also, if you could mention what the False property is, it would make it a lot clearer.
updated comments to reflect what we are testing now.
tests/nn/data_parallel/test_fsdp.py
Outdated
        )
        spawn_and_init(test_fn)

    def test_no_cpu_offload_with_sharded_grad_scaler_and_mixed_precision(self):
Can we use parametrization to combine these 4 tests? You are testing mixed_precision=True/False and cpu_offload=True/False. We can skip move_grads_to_cpu since that is set by default by the cpu_offload param. Also wanted to mention that we have deprecated the cpu_offload param; move_params_to_cpu should be used instead.
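A minimal sketch of the suggested parametrization, assuming the parameterized library plus the rename_test, spawn_and_init, _test_identical_outputs, TransformerWithSharedParams, and DistributedTest names from test_fsdp.py; the class name and test body below are illustrative, not the exact change made in this PR:

import functools

from parameterized import parameterized

class TestShardedGradScaler(DistributedTest):
    @parameterized.expand(
        [(True, True), (True, False), (False, True), (False, False)],
        name_func=rename_test,
    )
    def test_sharded_grad_scaler(self, mixed_precision, move_params_to_cpu):
        # One test method covers all four combinations; only the config changes.
        config = {"mixed_precision": mixed_precision, "move_params_to_cpu": move_params_to_cpu}
        test_fn = functools.partial(self._test_identical_outputs, TransformerWithSharedParams, config)
        spawn_and_init(test_fn)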
done.
Upon changing the config dictionary key from "cpu_offload" to "move_params_to_cpu", the test_cpu_offload_and_cpu_grads test breaks with the following error message:
E Traceback (most recent call last):
E File "/private/home/anupamb/fairscale/tests/nn/data_parallel/test_fsdp.py", line 133, in _test_identical_outputs
E torch.testing.assert_allclose(ref_loss, shard_loss)
E File "/private/home/anupamb/miniconda3/lib/python3.9/site-packages/torch/testing/__init__.py", line 222, in assert_allclose
E result, debug_msg = _compare_tensors_internal(actual, expected,
E File "/private/home/anupamb/miniconda3/lib/python3.9/site-packages/torch/testing/__init__.py", line 130, in _compare_tensors_internal
E if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
E RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
E
E During handling of the above exception, another exception occurred:
E
E Traceback (most recent call last):
E File "/private/home/anupamb/miniconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
E fn(i, *args)
E File "/private/home/anupamb/fairscale/tests/nn/data_parallel/test_fsdp.py", line 814, in init_and_run
E fn(rank, group, *args)
E File "/private/home/anupamb/fairscale/tests/nn/data_parallel/test_fsdp.py", line 136, in _test_identical_outputs
E raise Exception(f"FullyShardedDataParallel didn't match PyTorch DDP using config: {config}\n\n {e}")
E Exception: FullyShardedDataParallel didn't match PyTorch DDP using config: {'mixed_precision': True, 'move_params_to_cpu': True, 'compute_dtype': torch.float32}
E
E Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
tests/nn/data_parallel/test_fsdp.py
Outdated
@@ -268,6 +276,7 @@ def rename_test(testcase_func, param_num, param):
    return "%s_%s" % (testcase_func.__name__, parameterized.to_safe_name(str(param.args)),)


@pytest.mark.skipif(torch_version() < (1, 9, 0), reason="pytorch version >= 1.9.0 required")
Should we add a class-level pytest skip annotation? Maybe I missed that another test class is being run that does not call the _train_for_several_steps function.
I don't follow this comment. Can you please elaborate?
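For reference, a class-level skip annotation in pytest would look roughly like the sketch below, assuming the torch_version helper and DistributedTest base class from test_fsdp.py; the class name is illustrative and this is only a sketch of the reviewer's suggestion, not necessarily the change made in this PR:

import pytest

# A single class-level marker skips every test method in the class on older PyTorch,
# instead of repeating the same skipif on each individual test.
@pytest.mark.skipif(torch_version() < (1, 9, 0), reason="pytorch version >= 1.9.0 required")
class TestShardedGradScalerOffload(DistributedTest):
    def test_no_cpu_offload_with_sharded_grad_scaler(self):
        ...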
Thank you for the PR! We have a lot of skip messages added to tests. I am worried that it might reduce our test coverage. Is there any way we can import selected functions? (I know we discussed this offline as well.)
We should also update the CHANGELOG.md file that contains the release notes.
Hi @anupambhatnagar! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Changed the minimum version requirement to 1.8 and made sure all tests are passing with 1.8.
I have made code changes allowing us to support PyTorch 1.8. The skip messages are there, but they do not bring a decline in coverage. It is impossible to support 1.7 as it does not have the
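One common pattern for gating functionality on the installed PyTorch version looks like the following; this is only an illustrative sketch, not necessarily how this PR handles the 1.8 requirement:

import torch

# Parse the major/minor version, e.g. "1.8.1+cu111" -> (1, 8).
_TORCH_VERSION = tuple(int(v) for v in torch.__version__.split(".")[:2])

if _TORCH_VERSION >= (1, 8):
    # The CPU sharded grad scaler path relies on ops that older releases do not provide.
    from fairscale.optim.grad_scaler import ShardedGradScaler
else:
    ShardedGradScaler = None  # callers must check for None before constructing a scaler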
Nit: Would it be possible to add comments highlighting where CPU-scaler-specific code has been added? I know we discussed this offline, but it would be good to add this to help future development/debugging.
I have added comments above both functions, which are the key pieces of this implementation. See lines 147-149 and 320-322.
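For context, the kind of work such a CPU-side helper has to do looks roughly like the sketch below; the function name and signature are hypothetical and this is not the actual _foreach_check_finite_and_unscale_cpu_ implementation:

from typing import List

import torch

def check_finite_and_unscale_cpu_(grads: List[torch.Tensor], found_inf: torch.Tensor, inv_scale: torch.Tensor) -> None:
    # CPU stand-in for torch's fused CUDA kernel: flag non-finite gradients and
    # unscale everything in place so the optimizer sees true gradient values.
    for grad in grads:
        if not torch.isfinite(grad).all():
            found_inf.fill_(1.0)  # signal that this shard produced inf/nan gradients
        grad.mul_(inv_scale)      # undo the loss scaling in place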
What does this PR do?
Fixes Issue 421 and Issue 834
The ShardedGradScaler class implements the _amp_update_scale_cpu_ and _foreach_check_finite_and_unscale_cpu_ functions. These functions are required to enable loss scaling when FSDP is used along with cpu_offload.
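A minimal usage sketch of what this enables, assuming a distributed process group has already been initialized and a GPU is available; the model, data, and hyperparameters below are placeholders, not code from this PR:

import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.optim.grad_scaler import ShardedGradScaler

# Params stay on CPU (move_params_to_cpu) while compute runs on the GPU.
model = FSDP(torch.nn.Linear(8, 8), mixed_precision=True, move_params_to_cpu=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = ShardedGradScaler()

for _ in range(10):
    inputs = torch.randn(4, 8).cuda()
    optimizer.zero_grad()
    loss = model(inputs).sum()
    scaler.scale(loss).backward()  # scale the loss before backward
    scaler.step(optimizer)         # unscale the (possibly CPU-resident) grads, then step
    scaler.update()                # adjust the scale factor for the next iteration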
Additional Changes:
Removed test_fsdp_grad_scaler.py since we have implemented the test in test_fsdp.py.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.