nccl all_reduce_test hangs #117

hpjeonGIT · 2017-10-28T17:15:15Z

Hi, I am testing NCCL on CentOS 7.3 with glibc 2.17. I tested with CUDA 8.0 and with both openmpi/2.1.2 and openmpi/3.0.0, using driver version 384.66. The test program hangs in all cases.

Building is OK but testing just hangs:
./build/test/single/all_reduce_test 10000000

Using devices
Rank 0 uses device 0 [0x0d] Tesla P100-PCIE-16GB
Rank 1 uses device 1 [0x13] Tesla P100-PCIE-16GB
Rank 2 uses device 2 [0x8e] Tesla P100-PCIE-16GB
Rank 3 uses device 3 [0x91] Tesla P100-PCIE-16GB
out-of-place in-place
bytes N type op time algbw busbw res time algbw busbw res

I checked the process using gstack, and it says:
Thread 1 (process 24745):
#0 0x00007ffed7dd47c2 in clock_gettime ()
#1 0x00002b1973e0424d in __GI___clock_gettime (clock_id=, tp=) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00002b196e2fb3ee in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b196e38e805 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b196e2e66c3 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b196e2e6819 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b196e200287 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b196e344ae2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00002b196dca8f60 in ?? () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#9 0x00002b196dcdc47d in cudaStreamSynchronize () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#10 0x00000000004076be in void RunTest(char**, char**, int, ncclDataType_t, ncclRedOp_t, ncclComm**, std::vector<int, std::allocator<int> > const&) ()
#11 0x0000000000408178 in void RunTests(int, ncclDataType_t, ncclComm**, std::vector<int, std::allocator<int> > const&) ()
#12 0x00000000004027fa in main ()

I also installed glibc 2.19 from source and linked against it, but the test still hangs. Is there a step I missed? Any comments are appreciated.

Byoungseon
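
For anyone reproducing this, a minimal sketch of how the hang could be told apart from a slow run: wrap the test in a time limit and check the exit status. It assumes GNU coreutils `timeout` is installed (it exits with status 124 when the limit is reached); the helper name `run_with_timeout` and the 60-second limit are illustrative and do not come from the thread.

```shell
# run_with_timeout SECONDS CMD [ARGS...]
# Runs CMD under a time limit and reports whether it exited by itself.
# Assumes GNU coreutils `timeout`, which returns 124 when the limit is hit.
run_with_timeout() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang suspected: no exit within ${limit}s"
    else
        echo "exited with status $status"
    fi
    return "$status"
}

# Example with the command from this report (uncomment on the test machine):
# run_with_timeout 60 ./build/test/single/all_reduce_test 10000000
```

If the limit is reached, the process has already been killed by `timeout`, so a stack dump with gstack still has to be taken from a separate run that is left hanging.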

nluehr · 2017-10-30T16:15:55Z

This sounds similar to #19. See in particular this comment.

hpjeonGIT · 2017-10-30T18:05:04Z

Hi nluehr,

Thanks for the comment. After applying sudo setpci -s 8d:10.0 f2a.w=0000 to each of the PCI devices, it works now. I appreciate your help.

Best regards,

Byoungseon
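
The fix reported above was applied once per PCI device. A minimal sketch of that loop follows, as a dry run by default so the commands can be reviewed first. The helper name `apply_setpci` and the `APPLY` switch are hypothetical; only the address 8d:10.0 and the register write f2a.w=0000 come from the thread, and the right addresses for another machine have to be found there (for example with lspci).

```shell
# apply_setpci ADDR [ADDR...]
# Prints the setpci command from the thread for each PCI address.
# With APPLY=1 it runs the command instead (needs root and real hardware).
apply_setpci() {
    for addr in "$@"; do
        if [ "${APPLY:-0}" = "1" ]; then
            sudo setpci -s "$addr" f2a.w=0000
        else
            echo "sudo setpci -s $addr f2a.w=0000"
        fi
    done
}

# Dry run with the one address given in the thread; list further
# addresses as extra arguments.
apply_setpci 8d:10.0
```

A write made with setpci is not expected to survive a reboot, so it would likely need to be reapplied after each boot.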

nluehr · 2017-10-30T18:08:19Z

Glad you were able to get it running!

nluehr closed this as completed Oct 30, 2017
