Allow layer-wise recompute #18566

pengwa · 2023-11-23T07:21:35Z

Allow layer-wise recompute

Previously, users or developers had to specify the subgraphs to recompute. This PR introduces a more user-friendly way to enable recompute for all detected stashed-activation recomputation subgraphs. This sacrifices getting the best configuration, but makes it easier to support users when they switch from PyTorch per-layer gradient checkpointing to ORTModule.

ORTMODULE_MEMORY_OPT_LEVEL is introduced to control the usage. By default it is 0, i.e. USER_SPECIFIED: all subgraphs defined in ORTMODULE_MEMORY_OPT_CONFIG will be recomputed. This is compatible with existing recompute usage in ORTModule-integrated models.

With ORTMODULE_MEMORY_OPT_LEVEL=1, all detected recompute plans are enabled, so the configs in ORTMODULE_MEMORY_OPT_CONFIG are no longer respected.
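A minimal sketch of how the two levels described above could be selected from Python. Only the two environment variable names and the meaning of levels 0 and 1 come from this PR description; the helper `recompute_env`, the constant name `LAYERWISE_RECOMPUTE`, and the assumption that the variables are read when the model is wrapped with ORTModule are illustrative, not the actual implementation. The format of the ORTMODULE_MEMORY_OPT_CONFIG value is not shown here; see the Memory_Optimizer.md document linked below.

```python
import os
from typing import Dict, Optional

# Level values as described in this PR: 0 = user-specified subgraphs,
# 1 = enable every detected recompute plan.
USER_SPECIFIED = 0
LAYERWISE_RECOMPUTE = 1  # hypothetical name for level 1, used only in this sketch


def recompute_env(level: int = USER_SPECIFIED, config: Optional[str] = None) -> Dict[str, str]:
    """Build the environment variables for the chosen memory optimization level."""
    if level not in (USER_SPECIFIED, LAYERWISE_RECOMPUTE):
        raise ValueError(f"unsupported memory optimization level: {level}")
    env = {"ORTMODULE_MEMORY_OPT_LEVEL": str(level)}
    # At level 1 every detected plan is enabled, so a user config would not
    # be respected; it is only forwarded at level 0.
    if level == USER_SPECIFIED and config:
        env["ORTMODULE_MEMORY_OPT_CONFIG"] = config
    return env


# Apply the settings to the current process. Assumption: this should happen
# before the PyTorch model is wrapped with ORTModule.
os.environ.update(recompute_env(LAYERWISE_RECOMPUTE))
```

With level 0 and a config string, the same helper returns both variables, which mirrors the existing user-specified workflow.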

Add unit tests using a 3-layer BLOOM model.

https://github.com/microsoft/onnxruntime/blob/pengwa/add_aggresive_recompute/docs/Memory_Optimizer.md

orttraining/orttraining/python/training/ortmodule/options.py

AdamLouly · 2023-11-29T21:55:59Z

How does ORT find all the recompute opportunities in this implementation?
Are there numbers that show how much memory we save against how much perf we lose?


pengwa · 2023-11-30T11:49:38Z

How does ORT find all the recompute opportunities in this implementation?
It is here: https://github.com/microsoft/onnxruntime/blob/f3369a8bf87190552ad551a6de56df01cccf7a62/orttraining/orttraining/core/optimizer/memory_optimizer/memory_insight.cc#L249C23-L249C53.

Find the stashed activations that are used by backward operators and treat all of them as candidates. For each candidate, https://github.com/microsoft/onnxruntime/blob/f3369a8bf87190552ad551a6de56df01cccf7a62/orttraining/orttraining/core/optimizer/memory_optimizer/memory_insight.cc#L272C9-L272C30 checks whether it is recomputable and what the recompute subgraph looks like.
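The two steps described above can be sketched on a toy graph. This is an illustration only, not the logic in memory_insight.cc: the `Node` class, the `RECOMPUTABLE_OPS` allow-list, and both function names are hypothetical, and the real recomputability check is considerably more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    op_type: str
    inputs: List[str]
    output: str
    is_backward: bool = False


# Hypothetical allow-list; the real set of recomputable ops lives in the ORT sources.
RECOMPUTABLE_OPS = {"Gelu", "Reshape", "Add"}


def stashed_activations(nodes: List[Node]) -> List[str]:
    """Step 1: forward outputs that at least one backward node reads."""
    forward_outputs = {n.output for n in nodes if not n.is_backward}
    candidates: List[str] = []
    for n in nodes:
        if n.is_backward:
            for value in n.inputs:
                if value in forward_outputs and value not in candidates:
                    candidates.append(value)
    return candidates


def recompute_subgraph(activation: str, nodes: List[Node]) -> Optional[List[str]]:
    """Step 2: walk producers upward while they are recomputable.

    Returns the node names in the subgraph (in the order of `nodes`),
    or None if the activation's own producer is not recomputable.
    """
    producers = {n.output: n for n in nodes if not n.is_backward}
    start = producers[activation]
    if start.op_type not in RECOMPUTABLE_OPS:
        return None
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        for value in node.inputs:
            producer = producers.get(value)
            if producer is not None and producer.op_type in RECOMPUTABLE_OPS:
                stack.append(producer)
    return [n.name for n in nodes if n.name in seen]


# Toy forward/backward graph, listed in topological order.
graph = [
    Node("matmul", "MatMul", ["x", "w"], "h"),
    Node("bias", "Add", ["h", "b"], "hb"),
    Node("gelu", "Gelu", ["hb"], "act"),
    Node("gelu_grad", "GeluGrad", ["dy", "hb"], "dhb", is_backward=True),
    Node("matmul2_grad", "MatMulGrad", ["dy2", "act"], "dact", is_backward=True),
]

for candidate in stashed_activations(graph):
    print(candidate, recompute_subgraph(candidate, graph))
```

In this toy graph the walk stops at the MatMul because it is not in the allow-list, so its output `h` remains stashed and serves as the input of the recompute subgraph.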

Are there numbers that show how much memory we save against how much perf we lose?

No. That is the long-term goal: to have all those features ready so that we can dynamically choose a good plan for users.


onnxruntime/core/graph/graph_viewer.cc

orttraining/orttraining/python/training/ortmodule/_training_manager.py

orttraining/orttraining/core/optimizer/memory_optimizer/recompute_analysis.cc

orttraining/orttraining/python/training/ortmodule/_runtime_inspector.py

orttraining/orttraining/test/optimizer/memory_optimizer_test.cc


pengwa · 2023-12-12T00:44:25Z

Thank you @askhade, @zhijxu-MS !!


full recompute

7164c2f

pengwa requested review from zhijxu-MS, jambayk and askhade November 23, 2023 07:22

pengwa added the "training" label (issues related to ONNX Runtime training; typically submitted using template) Nov 23, 2023

github-advanced-security bot found potential problems Nov 23, 2023

orttraining/orttraining/python/training/ortmodule/options.py (fixed)

pengwa added 7 commits November 23, 2023 07:40

docs

7217ad3

fix

eed2916

minor

afc9674

fix memory inspector

1df5dd7

move files

deedc44

typo

15a0640

allow layer-wise recompute

4a88196

pengwa changed the title ~~Allow AGGRESSIVE_FULL_RECOMPUTE in memory optimization~~ Allow layer-wise recompute Nov 24, 2023

pengwa added 7 commits November 24, 2023 10:22

fixes

f81a3d6

add layerwise recompute test case

d79d171

missing test file

c5ba05a

refine a bit

3334d4d

fixes

bf89c04

minor

38fa66f

fix

e8817c7

pengwa added 4 commits November 30, 2023 07:44

Gelu is followed with a Reshape for gradient ops, but its NodeIndex is even smaller than the other one in forward pass (FusedMatmul which is replaced by a new node after gradient graph is built)

cb4eb7b

fix

486da22

fix

c4e72bb

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/add_aggresive_recompute

f3369a8

pengwa added 3 commits December 4, 2023 09:44

fix CIs

e342f59

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/add_aggresive_recompute

7c3bbc3

enable tile recompute UT

30aebbc

askhade reviewed Dec 5, 2023

onnxruntime/core/graph/graph_viewer.cc (resolved)

askhade reviewed Dec 5, 2023

orttraining/orttraining/python/training/ortmodule/_training_manager.py (outdated, resolved)

zhijxu-MS reviewed Dec 8, 2023


pengwa added 2 commits December 8, 2023 08:27

refine review comments

b8e6088

Merge branch 'main' of https://github.com/microsoft/onnxruntime into pengwa/add_aggresive_recompute

0a5f4d4

zhijxu-MS previously approved these changes Dec 8, 2023


lint

0fdfa4b

pengwa dismissed zhijxu-MS’s stale review via 0fdfa4b December 8, 2023 10:44

pengwa added 2 commits December 8, 2023 10:45

typos

1d0c2bc

typo

1602060

zhijxu-MS approved these changes Dec 11, 2023


askhade approved these changes Dec 11, 2023


pengwa merged commit ccf3b20 into main Dec 12, 2023
96 checks passed

pengwa deleted the pengwa/add_aggresive_recompute branch December 12, 2023 00:44

snnn added a commit that referenced this pull request Dec 12, 2023

Revert "Allow layer-wise recompute (#18566)"

9323dec

This reverts commit ccf3b20.

snnn mentioned this pull request Dec 12, 2023

Revert "Allow layer-wise recompute" to fix a build pipeline break #18796

Closed
