Fix dynamo issue #6527
Merged Oct 25, 2024 (16 commits)
Conversation

@oraluben (Contributor) commented Sep 12, 2024

Dynamo uses FakeTensor to trace tensor ops. In some cases, this mechanism breaks compilation with DeepSpeed.

An example can be found at https://gist.github.com/oraluben/9b8240c2fe482eb4382453d6c97a5f76; to see the issues, install deepspeed==0.14.4 instead of my fork.

Without this PR, Llama cannot be compiled.

Detailed explanation:

  1. ZeROOrderedDict
    Dynamo uses deepcopy to copy tensors, which calls object.__reduce__. When copying a ZeROOrderedDict, the default implementation does not copy its _parent_module, which leads to a failure (see the sketch after this list).
  2. A param may be a FakeTensor and may not have ds_status yet, but during tracing it is fine to simply skip register_external_parameter; that registration should have happened long before tracing.
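
For illustration, here is a minimal sketch of the fix described in item 1: override __reduce__ so that deepcopy reconstructs the dict with its parent module. This is a simplified stand-alone version, not the exact code in deepspeed/runtime/zero/parameter_offload.py:

    from collections import OrderedDict

    class ZeROOrderedDict(OrderedDict):
        def __init__(self, parent_module, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._parent_module = parent_module

        def __reduce__(self):
            # OrderedDict.__reduce__ returns (callable, ctor_args, state, ...).
            # The default ctor_args is an empty tuple, so deepcopy would call
            # ZeROOrderedDict() without a parent module and fail. Substitute
            # (self._parent_module,) so the copy is reconstructed correctly.
            r0, _, *r2 = super().__reduce__()
            return (r0, (self._parent_module,), *r2)

    # With the override, deepcopy round-trips the parent reference
    # (a plain string stands in for the real parent nn.Module here):
    import copy
    d = ZeROOrderedDict(parent_module="module-stub")
    assert copy.deepcopy(d)._parent_module == "module-stub"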

@oraluben marked this pull request as ready for review September 12, 2024 06:17
@oraluben requested a review from tjruwase as a code owner September 12, 2024 06:17
@oraluben changed the title from "Fix dynamo issue in llama" to "Fix dynamo issue" Sep 12, 2024
@tjruwase requested a review from tohtana September 12, 2024 15:57
@tohtana (Contributor) left a comment

@oraluben Thank you for the great investigation! I think this is a clean and simple solution to the issue.

@oraluben (Contributor, Author) commented Sep 13, 2024

torch.compiler.is_compiling() should be a better fit for this case; however, there is still an issue, presumably on the Dynamo side (since we have a FakeTensor, we are definitely tracing). So I'm keeping this for now.

[rank1]:   File "/home/yyc/accelerate-demo/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1720, in __getattr__
[rank1]:     return _parameters[name]
[rank1]:   File "/home/yyc/repo/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 67, in __getitem__
[rank1]:     if not is_compiling() and param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
[rank1]: torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___self_attn_q_proj(*(FakeTensor(..., device='cuda:1', size=(1, s0, 4096), dtype=torch.float16,
[rank1]:            grad_fn=<MulBackward0>),), **{}):
[rank1]: 'FakeTensor' object has no attribute 'ds_status'

My patch in deepspeed.runtime:

diff --git a/deepspeed/runtime/compiler.py b/deepspeed/runtime/compiler.py
index 879c0a1a..3994c1f5 100644
--- a/deepspeed/runtime/compiler.py
+++ b/deepspeed/runtime/compiler.py
@@ -10,6 +10,15 @@ def is_compile_supported():
     return hasattr(torch, "compiler") and hasattr(torch.nn.Module, "compile")
 
 
+def is_compiling():
+    if not is_compile_supported():
+        return False
+    elif hasattr(torch.compiler, 'is_compiling'):  # torch >= 2.3
+        return torch.compiler.is_compiling()
+    else:
+        return torch._dynamo.is_compiling()
+
+
 def disable(func):
     if is_compile_supported():
         return torch.compiler.disable(func)
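
For context, the traceback above shows where this helper ends up being used: ZeROOrderedDict.__getitem__ guards the ZeRO availability check with is_compiling(), so a FakeTensor seen during tracing is never asked for ds_status. A rough sketch of that call site, simplified from the traceback (the real method in deepspeed/runtime/zero/parameter_offload.py does more):

    from collections import OrderedDict

    from deepspeed.runtime.compiler import is_compiling
    from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

    class ZeROOrderedDict(OrderedDict):
        def __getitem__(self, key):
            param = super().__getitem__(key)
            # Under Dynamo tracing, `param` may be a FakeTensor with no
            # ds_status attribute, so short-circuit before touching it.
            if not is_compiling() and param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
                pass  # fetch / register the external parameter here
            return param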

@loadams (Contributor) commented Oct 23, 2024

@oraluben - sorry this PR has taken so long to merge; I think it just needed master merged in again to pick up the XPU fixes.

@loadams enabled auto-merge October 23, 2024 22:04
@loadams added this pull request to the merge queue Oct 23, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 24, 2024
@tohtana enabled auto-merge October 24, 2024 05:12
@tohtana added this pull request to the merge queue Oct 25, 2024
Merged via the queue into microsoft:master with commit 3d5cf73 Oct 25, 2024
13 checks passed
@oraluben deleted the fix-compile-deepcopy branch October 25, 2024 03:39