mxnet gets stuck on cudaMemGetInfo #6281
Comments
Cannot reproduce, I have |
yes, my driver version is 375.26... |
@sifmelcara I just tried 375.51. It takes 6 min to alloc memory... |
Do you mean the |
I know what happened. This issue is caused by the CUDA JIT. I checked the default configuration of the Makefile: the gen_code option for nvcc doesn't include the CUDA compute 6.1 arch, so every kernel is compiled from PTX to compute 6.1 at runtime because my GPU arch is 6.1. As mentioned in https://groups.google.com/d/msg/arrayfire-users/D3RORyrvn4s/N7AoKueSCAAJ, the conversion takes on the order of minutes, which matches my observation. It works well after adding sm_61 to the gen_code option. @piiswrong CMake handles different CUDA archs correctly (https://github.com/dmlc/mshadow/blob/master/cmake/Cuda.cmake). Do you think it's a good idea to port that part to the Makefile? I don't know how the pip release handles this problem, but since C++ users need to compile by themselves, checking GPU archs automatically would make life easier. |
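To illustrate the fix described above, here is a minimal sketch of how a build script might assemble `-gencode` flags for nvcc and check whether a given GPU would need the PTX JIT. The helper names `gencode_flags` and `has_native_code` are hypothetical and are not part of the MXNet or mshadow build system. The `-gencode arch=compute_XX,code=sm_XX` syntax is standard nvcc usage, and the check assumes the usual CUDA rule that binary code built for compute capability X.y runs on devices X.z with z >= y.

```python
import re


def gencode_flags(archs, ptx_for_newest=True):
    """Build nvcc -gencode flags for compute capabilities such as "52" or "6.1"."""
    # Normalise "6.1" -> "61", drop duplicates, and sort numerically.
    archs = sorted({str(a).replace(".", "") for a in archs}, key=int)
    # One native (SASS) target per requested architecture.
    flags = [f"-gencode arch=compute_{a},code=sm_{a}" for a in archs]
    if ptx_for_newest and archs:
        # Also embed PTX for the newest arch so future GPUs can still JIT from it.
        newest = archs[-1]
        flags.append(f"-gencode arch=compute_{newest},code=compute_{newest}")
    return flags


def has_native_code(flags, arch):
    """Return True if the flags include binary code usable on `arch` without JIT."""
    target = int(str(arch).replace(".", ""))
    built = {int(m) for m in re.findall(r"sm_(\d+)", " ".join(flags))}
    # Assumed rule: same major version, and built minor <= device minor.
    return any(b // 10 == target // 10 and b <= target for b in built)


if __name__ == "__main__":
    old_flags = gencode_flags(["52"])
    print(has_native_code(old_flags, "6.1"))  # no sm_6x target, so JIT is needed
    print(" ".join(gencode_flags(["52", "61"])))
```

If `has_native_code` returns False for the installed GPU, the driver has to JIT-compile every kernel from PTX on first use, which is the multi-minute stall seen in the backtrace in this issue.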
The pip packages rely on NVRTC, and building for all archs is turned off. |
@lx75249 is this still on-going? |
Environment info
Operating System: CentOS with cuda V8.0.61
Compiler: g++ 5.3.1
MXNet commit hash (git rev-parse HEAD): 3d545d7
Steps to reproduce
Part of gdb backtrace:
#0 0x00007fff5d718990 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#1 0x00007fff5d718ac6 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#2 0x00007fff5d778e8a in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#3 0x00007fff5d71fecb in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#4 0x00007fff5d99becf in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#5 0x00007fff5d99bf39 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#6 0x00007fff5d5eed6d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#7 0x00007fff5d5f64f8 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#8 0x00007fff5dbf140d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#9 0x00007fff5d5f9b94 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#10 0x00007fff5d5fb2e9 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#11 0x00007fff5d5f1abc in _cuda_CallJitEntryPoint ()
from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#12 0x00007fffc4bff582 in fatBinaryCtl_Compile ()
from /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.375.26
#13 0x00007fffd3625e42 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#14 0x00007fffd36269c3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#15 0x00007fffd357f35e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#16 0x00007fffd357f640 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x00007fffe30dfa5d in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#18 0x00007fffe30d3e60 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#19 0x00007fffe30decc6 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#20 0x00007fffe30e3401 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#21 0x00007fffe30d672e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#22 0x00007fffe30c3e8e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#23 0x00007fffe30f417c in cudaMemGetInfo () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#24 0x00007fffe652aea5 in mxnet::storage::GPUPooledStorageManager::Alloc (this=0xa5fe80,
raw_size=401408) at src/storage/./pooled_storage_manager.h:77
#25 0x00007fffe652b3f9 in mxnet::StorageImpl::Alloc (this=0x7fff6c0052d0, size=401408, ctx=...)
at src/storage/storage.cc:86
#26 0x00007fffe6010bfa in mxnet::NDArray::Chunk::CheckAndAlloc (this=0xa6c790)
at include/mxnet/./ndarray.h:391
#27 0x00007fffe6010bb5 in mxnet::NDArray::Chunk::Chunk (this=0xa6c790, size=100352, ctx=...,
delay_alloc=false, dtype=0) at include/mxnet/./ndarray.h:386
It only gets stuck with CUDA 8.0.61. I tried another machine with CUDA 8.0.44 and it worked well.