LF-MMI GPU OOM #196

Open
wwxm0523 opened this issue Jan 29, 2022 · 6 comments

Comments

@wwxm0523

I hit a GPU OOM error when training with LF-MMI. My token set has about 1300 entries. How can I avoid this problem?

@csukuangfj
Collaborator

What's your training command? What's the value of --max-duration?

@danpovey
Collaborator

It would be helpful to see the traceback from when it dies.

@wwxm0523
Author

This is the error log. (When the number of phones is 220, training runs normally.)
`2022-01-30 05:34:59,582 INFO Loading L.fst
INFO from MMI module:
device: cuda
use pruned_intersect: True
use segment info: True
self.lo Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=256, out_features=1253, bias=True)
)
number of phones 1252
2022-01-30 05:35:05,540 INFO Epoch 0 TRAIN info lr 4e-08
2022-01-30 05:35:05,542 INFO using accumulate grad, new batch size is 4 timeslarger than before
2022-01-30 05:35:06,842 DEBUG TRAIN Batch 0/15013 loss 247.649350 loss_att 77.322586 loss_mmi 110.531494 lr 0.00000004 rank 0
2022-01-30 05:36:13,933 DEBUG TRAIN Batch 100/15013 loss 338.543274 loss_att 106.091759 loss_mmi 123.042969 lr 0.00000104 rank 0
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 1.73 GiB (GPU 0; 23.70 GiB total capacity; 19.65 GiB already allocated; 1.06 GiB free; 21.29 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:288 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f71382b72f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1bc21 (0x7f7138516c21 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1c944 (0x7f7138517944 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1cf63 (0x7f7138517f63 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::Allocator::raw_allocate(unsigned long) + 0x2f (0x7f709044afaf in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #5: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5f (0x7f709044b65f in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #6: k2::NewRegion(std::shared_ptr<k2::Context>, unsigned long) + 0x175 (0x7f709016b015 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #7: k2::Renumbering::ComputeOld2New() + 0x96 (0x7f70901288f6 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #8: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f70902b5b78 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #9: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x907 (0x7f70902c7547 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #10: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f70902ca58e in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #11: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f709041027d in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #12: <unknown function> + 0xc9039 (0x7f719dc49039 in /opt/conda/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #13: <unknown function> + 0x76db (0x7f71c00216db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f71bfd4a71f in /lib/x86_64-linux-gnu/libc.so.6)

Killing subprocess 3803024
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
`

@danpovey
Collaborator

danpovey commented Jan 30, 2022 via email

Hm, there should be a max_arcs option to MultiGraphDenseIntersectPruned() [I forget the python-level wrapper, probably intersect_dense_pruned()]. Setting that to, e.g., 1000 may resolve the issue. Early in training you can get too many arcs active, and if you are using the "normal" topology (not the modified topology), the LF-MMI denominator graph size is quadratic in the number of symbols.

@wwxm0523
Author

Thanks. Is it max_active_states? Will lowering this parameter hurt training accuracy?
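
For a rough sense of the quadratic growth in the number of symbols mentioned above, here is a back-of-the-envelope sketch. It assumes only that the arc count scales as N² with the number of symbols N; the constant factor is unknown, so only the ratio between the two phone-set sizes from this thread is meaningful. `quadratic_arc_estimate` is a hypothetical helper for illustration, not part of k2.

```python
# Back-of-the-envelope only: assume the arc count grows as N**2 in the
# number of symbols N. The constant factor is unknown, so the absolute
# numbers are in relative units and only the ratio is meaningful.


def quadratic_arc_estimate(num_symbols: int) -> int:
    """Hypothetical helper: arcs ~ N^2, ignoring constant factors."""
    return num_symbols * num_symbols


small = quadratic_arc_estimate(220)    # phone set that trained normally
large = quadratic_arc_estimate(1252)   # phone set that hit the OOM
ratio = large / small

print(f"220 phones  -> ~{small:,} arcs (relative units)")
print(f"1252 phones -> ~{large:,} arcs (relative units)")
print(f"growth factor: ~{ratio:.1f}x")
```

Under that assumption, going from 220 to 1252 phones grows the graph by roughly 32x, which fits a configuration that ran at 220 phones running out of memory at 1252.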

@danpovey
Collaborator

danpovey commented Jan 31, 2022 via email
