Bug of CuDNN RNN with variable sequence length #10453
Comments
The following code, which always uses `real_seq_len = 500`, does not trigger the error:

```python
from mxnet.gluon.rnn import LSTM
import mxnet as mx
import numpy as np

ctx = mx.gpu()
lstm = LSTM(num_layers=1, hidden_size=200, dropout=0.0)
lstm.initialize(ctx=ctx)
batch_size = 32
for seq_len in range(500, 10, -1):
    for repeat in range(10):
        real_seq_len = 500
        print(real_seq_len, repeat)
        inputs_nd = mx.nd.random.normal(0, 1, shape=(real_seq_len, batch_size, 200), ctx=ctx)
        out = lstm(inputs_nd)
        print(out[0].sum().asscalar())
        mx.nd.waitall()
```
The bug occurs when we have a variable sequence length. I think it may be related to how MXNet reuses memory.
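Since the fixed-length version above runs cleanly, one possible workaround (an assumption on my part, not something verified in this thread) is to pad every batch along the time axis to one fixed maximum length, so the cuDNN LSTM always sees the same sequence length. A minimal sketch:

```python
import mxnet as mx

def pad_to_fixed_length(inputs_nd, max_len):
    """Zero-pad a (seq_len, batch, feature) NDArray along the time axis so
    every batch fed to the cuDNN LSTM has the same sequence length."""
    seq_len = inputs_nd.shape[0]
    if seq_len >= max_len:
        return inputs_nd[:max_len]
    pad = mx.nd.zeros((max_len - seq_len,) + inputs_nd.shape[1:], ctx=inputs_nd.context)
    return mx.nd.concat(inputs_nd, pad, dim=0)
```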
I was able to finish running the script by setting
@szha how much memory consumption did you observe?
What I observed is that it doesn't fail consistently on a specific batch. Another team observed the same issue before; it is likely caused by our backend memory pool holding too much memory, in which case cuRAND doesn't have enough memory to keep the random number generator states for each streaming multiprocessor.
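If the memory pool really is starving cuRAND, one thing to try (a guess based on the explanation above, not a confirmed fix for this crash) is reserving a larger share of GPU memory for allocations outside MXNet's pool via `MXNET_GPU_MEM_POOL_RESERVE`, set before MXNet is initialized:

```python
import os

# Reserve a larger percentage of GPU memory for things other than the array
# memory pool (e.g. cuDNN workspaces, cuRAND generator states, kernel launches).
# The default is 5; 10 here is just an example value.
os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = "10"

import mxnet as mx  # import after the environment variable is set
```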
I have a similar issue when training a speech model, even after
I find that the
It's related to pytorch/pytorch#953.
#11004 "fixes" this issue. The filter descriptors that are freed in the destructor were not created if cudaMalloc would fail during Now the following error will be returned in an OOM situation:
In particular, #11004 makes sure that the descriptors are always created during class initialization and not just somewhere further down the line.
Description
Segfault will be triggered by the following code:
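A minimal sketch of such a variable-length loop, mirroring the fixed-length snippet in the comments above; drawing `real_seq_len` at random on every iteration is an assumption, not necessarily the exact original script:

```python
from mxnet.gluon.rnn import LSTM
import mxnet as mx
import numpy as np

ctx = mx.gpu()
lstm = LSTM(num_layers=1, hidden_size=200, dropout=0.0)
lstm.initialize(ctx=ctx)
batch_size = 32
for seq_len in range(500, 10, -1):
    for repeat in range(10):
        # Sequence length changes from batch to batch (assumed random here).
        real_seq_len = np.random.randint(seq_len // 2, seq_len)
        inputs_nd = mx.nd.random.normal(0, 1, shape=(real_seq_len, batch_size, 200), ctx=ctx)
        out = lstm(inputs_nd)
        print(out[0].sum().asscalar())
        mx.nd.waitall()
```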
I'm using a V100 + CUDA 9.0 + cuDNN 7.0.4 (P3 instance). The GPU memory keeps increasing and finally a segfault is raised.
Also, the same script and configuration has not triggered an error on an M60 (g3 instance).
@eric-haibin-lin @DickJC123 @szha @szhengac
backtrace: