RuntimeError: Specified device cuda:0 does not match device of data cuda:-2 #33
Comments
Note that it could also be a memory issue, because I have limited memory (16 GB). However, if the problem were a memory issue, I would expect to observe an error like:
|
Perhaps it's trying to use >1 GPU somehow? (But it shouldn't.) If that's the case, setting something like CUDA_VISIBLE_DEVICES=0 (or whatever) should address it. Another possibility is that cuda:-2 is not a real device but some kind of error code. That error message likely comes from torch. I think it would be worthwhile to try to catch the error in pdb and print out the devices of all inputs to the function that failed. Once we know which object has the bad device, we can more easily debug. |
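A minimal sketch of the two suggestions above: pin the process to one GPU and print the device of every input to the failing call. The helper name `describe_inputs` is hypothetical (it is not part of k2 or icefall), and the `SimpleNamespace` object is only a stand-in for a tensor so the snippet runs without a GPU.

```python
import os
from types import SimpleNamespace

# Only takes effect if it runs before torch/k2 initialize CUDA.
# setdefault() leaves an already-exported value untouched.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")


def describe_inputs(**named_inputs):
    """Return one line per input with its type, device, and shape."""
    lines = []
    for name, value in named_inputs.items():
        # Plain Python values (e.g. a float default) have no device or shape.
        device = getattr(value, "device", "n/a")
        shape = getattr(value, "shape", None)
        shape = tuple(shape) if shape is not None else "n/a"
        lines.append(
            f"{name}: type={type(value).__name__} device={device} shape={shape}"
        )
    return lines


# Stand-in for a tensor; in a real session pass src, index, default_value.
fake_index = SimpleNamespace(device="cuda:0", shape=(0,))
for line in describe_inputs(index=fake_index, default_value=0.0):
    print(line)
```

Inside pdb, the same helper could be called at the breakpoint, for example `describe_inputs(src=src, index=index, default_value=default_value)`. Printing the shape alongside the device also makes an empty input visible.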
Could you modify
It may show something that is useful. |
@csukuangfj I already printed the devices before, but all of them were cuda:0. |
@danpovey I have 4 devices, but before training, I am setting CUDA_VISIBLE_DEVICES=0. I will also try to debug with pdb. |
I added a try-except block to the function decode_one_batch() in decode.py as:
when I run
The problem occurs in nbest_decoding(). Only the lattice tensor is given to that function, and its device is 0. |
I think you are not quite at the place where it failed; maybe you need to do "c" (continue)? |
Without the try-except block, the log is:
I can't reach the lattice after the error, hence I added the try-except block. |
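The original try-except code is not shown in the thread. As an illustration only, one way to inspect arguments at the point of failure is a small decorator that prints each argument's device and then re-raises. The name `report_devices_on_error` is hypothetical and is not part of icefall or k2.

```python
import functools


def report_devices_on_error(func):
    """Print the device of every argument if func raises, then re-raise."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give positional arguments placeholder names so they can be listed.
            named = {f"arg{i}": a for i, a in enumerate(args)}
            named.update(kwargs)
            for name, value in named.items():
                print(f"{name}: device={getattr(value, 'device', 'n/a')}")
            raise  # bare raise keeps the original traceback

    return wrapper
```

Because the exception is re-raised unchanged, running the script under `python3 -m pdb` still drops into post-mortem debugging at the original frame. Calling `pdb.post_mortem()` inside the `except` block would be an alternative if an interactive session is wanted right there.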
I added a breakpoint at the place @csukuangfj suggested. The log is here:
The place in miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py:
|
It might be possible to catch the exception in gdb by doing:
gdb --args python3 whatever.py
(gdb) catch throw
(gdb) r
...
On Thu, Sep 2, 2021 at 9:54 PM Yunusemre wrote:
I added a breakpoint at the place @csukuangfj said. Log is here:
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb) src.device; index.device; default_value;
device(type='cuda', index=0)
device(type='cuda', index=0)
0.0
(Pdb) c
Traceback (most recent call last):
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1705, in main
pdb._runscript(mainpyfile)
File "/path/to/miniconda3/envs/k2/lib/python3.8/pdb.py", line 1573, in _runscript
self.run(statement)
File "/path/to/miniconda3/envs/k2/lib/python3.8/bdb.py", line 580, in run
exec(cmd, globals, locals)
File "<string>", line 1, in <module>
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 435, in <module>
main()
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 418, in main
results_dict = decode_dataset(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 253, in decode_dataset
hyps_dict = decode_one_batch(
File "/path/to/k2/icefall/egs/sestek/ASR/tdnn_lstm_ctc/decode.py", line 176, in decode_one_batch
best_path = nbest_decoding(
File "/path/to/k2/icefall/icefall/decode.py", line 208, in nbest_decoding
path_lattice = _intersect_device(
File "/path/to/k2/icefall/icefall/decode.py", line 25, in _intersect_device
return k2.intersect_device(
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/fsa_algo.py", line 204, in intersect_device
out_fsas = k2.utils.fsa_from_binary_function_tensor(a_fsas, b_fsas,
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/utils.py", line 581, in fsa_from_binary_function_tensor
value = index_select(a_value, a_arc_map, default_value=filler) \
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 161, in index_select
ans = _IndexSelectFunction.apply(src, index, default_value)
File "/path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py", line 67, in forward
return _k2.index_select(src, index, default_value)
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
Exception raised from from_blob at /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/include/ATen/Functions.h:2267 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe9a54c82f2 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fe9a54c567b in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x28200 (0x7fe904576200 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x10e0a1 (0x7fe90465c0a1 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x84bce (0x7fe9045d2bce in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x8858f (0x7fe9045d658f in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x9f876 (0x7fe9045ed876 in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x1dfcf (0x7fe90456bfcf in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #13: THPFunction_apply(_object*, _object*) + 0x8fd (0x7fe9fc41f41d in /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /path/to/miniconda3/envs/k2/lib/python3.8/site-packages/k2/ops.py(67)forward()
-> return _k2.index_select(src, index, default_value)
(Pdb)
|
... running with a debug version of k2 would help, there, though.
|
https://k2.readthedocs.io/en/latest/installation/for_developers.html The above link contains instructions to build a debug version of k2. |
Could you also print the shape of
to verify that neither of them is empty? |
I checked if index or src is empty, and noticed that index is empty when the problem occurs.
|
@EmreOzkose $ python3 -m k2.version should give you such information. |
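If only the package version strings are needed, for instance to log them at the start of a decoding run, the standard library can read them from the installed package metadata. This sketch assumes k2 was installed as a distribution named `k2` (as with pip). A build used directly from a source tree may have no such metadata, in which case the function returns None and `python3 -m k2.version` remains the more complete source.

```python
from importlib import metadata  # available from Python 3.8


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("k2", "torch"):
    print(f"{name}: {installed_version(name)}")
```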
@csukuangfj
I think I understand the issue. I am trying different architectures and features. Since my memory is small, when I increase the number of layers of the model, I have to decrease |
I would recommend updating your k2. k2 v1.6 contains several bug fixes, including, I think, one for the issue you are facing. |
Thank you so much! I am updating at once. |
I want to report back here. I updated k2 and ran decode.py again. The problem no longer occurs, thank you. However, the hyps are coming out empty :). From here on, it is a problem with my design :). |
Hello,
I am training a TDNN-LSTM model with the librispeech recipe on 100 hours of 16k data. After training, I run decode.py. I sometimes observe a CUDA issue (given below). Have you ever observed something like that? I think it is related to something during training, because after some training runs decode.py works well, but after others decode.py gives this error. I googled the
RuntimeError: Specified device cuda:0 does not match device of data cuda:-2
error, but found nothing. I have a Tesla P100 with 16 GB. I should also mention that 1best works well, but the problem occurs during nbest and rescoring.