Hi, I am testing NCCL on CentOS 7.3 with glibc 2.17. I tested with CUDA 8.0, openmpi/2.1.2, and openmpi/3.0.0, using driver version 384.66. The test program hangs in all cases.
The build succeeds, but the test just hangs:
./build/test/single/all_reduce_test 10000000
Using devices
Rank 0 uses device 0 [0x0d] Tesla P100-PCIE-16GB
Rank 1 uses device 1 [0x13] Tesla P100-PCIE-16GB
Rank 2 uses device 2 [0x8e] Tesla P100-PCIE-16GB
Rank 3 uses device 3 [0x91] Tesla P100-PCIE-16GB
out-of-place in-place
bytes N type op time algbw busbw res time algbw busbw res
I checked the process using gstack, and it says:
Thread 1 (process 24745):
#0 0x00007ffed7dd47c2 in clock_gettime ()
#1 0x00002b1973e0424d in __GI___clock_gettime (clock_id=, tp=) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00002b196e2fb3ee in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b196e38e805 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b196e2e66c3 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b196e2e6819 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b196e200287 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b196e344ae2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00002b196dca8f60 in ?? () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#9 0x00002b196dcdc47d in cudaStreamSynchronize () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#10 0x00000000004076be in void RunTest(char**, char**, int, ncclDataType_t, ncclRedOp_t, ncclComm**, std::vector<int, std::allocator > const&) ()
#11 0x0000000000408178 in void RunTests(int, ncclDataType_t, ncclComm**, std::vector<int, std::allocator > const&) ()
#12 0x00000000004027fa in main ()
I installed glibc 2.19 from source and linked against it, but the test still hangs. Is there a step I missed? Any comments are appreciated.
Byoungseon