nccl all_reduce_test hangs #117

Closed
hpjeonGIT opened this issue Oct 28, 2017 · 3 comments

Comments

@hpjeonGIT

Hi, I am testing NCCL on CentOS 7.3 with glibc 2.17. I tried CUDA 8.0 with both openmpi/2.1.2 and openmpi/3.0.0, using driver version 384.66. The test program hangs in all cases.

The build succeeds, but the test just hangs:
./build/test/single/all_reduce_test 10000000

Using devices
Rank 0 uses device 0 [0x0d] Tesla P100-PCIE-16GB
Rank 1 uses device 1 [0x13] Tesla P100-PCIE-16GB
Rank 2 uses device 2 [0x8e] Tesla P100-PCIE-16GB
Rank 3 uses device 3 [0x91] Tesla P100-PCIE-16GB
out-of-place in-place
bytes N type op time algbw busbw res time algbw busbw res


I checked the process with gstack, which reports:
Thread 1 (process 24745):
#0 0x00007ffed7dd47c2 in clock_gettime ()
#1 0x00002b1973e0424d in __GI___clock_gettime (clock_id=, tp=) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00002b196e2fb3ee in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b196e38e805 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b196e2e66c3 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b196e2e6819 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b196e200287 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b196e344ae2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00002b196dca8f60 in ?? () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#9 0x00002b196dcdc47d in cudaStreamSynchronize () from /usr/nic/libs/cuda/8.0/lib64/libcudart.so.8.0
#10 0x00000000004076be in void RunTest(char**, char**, int, ncclDataType_t, ncclRedOp_t, ncclComm**, std::vector<int, std::allocator<int> > const&) ()
#11 0x0000000000408178 in void RunTests(int, ncclDataType_t, ncclComm**, std::vector<int, std::allocator<int> > const&) ()
#12 0x00000000004027fa in main ()

I also installed glibc 2.19 from source and linked against it, but the test still hangs. Is there a step I missed? Any comments are appreciated.

Byoungseon
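For anyone reproducing this, a hang like the one above can be told apart from a slow run by putting a time limit on the test. Below is a minimal Python sketch, not part of NCCL or its test suite; the helper name run_with_timeout and the choice of time limit are assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and return its exit code, or None if it did not exit in time.

    subprocess.run kills the child process when the timeout expires,
    so a hung test does not linger in the background.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        print(f"no exit after {seconds}s: {' '.join(cmd)}", file=sys.stderr)
        return None
```

Calling run_with_timeout(["./build/test/single/all_reduce_test", "10000000"], 120) would return None for the hang described here. Because the process is killed on timeout, run gstack by hand while the test is still running if a stack trace is needed.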

@nluehr
Contributor

nluehr commented Oct 30, 2017

This sounds similar to #19. See in particular this comment.

@hpjeonGIT
Author

Hi nluehr,

Thanks for the comment. After applying sudo setpci -s 8d:10.0 f2a.w=0000 to all of the PCI devices, it works now. I appreciate your help.

Best regards,

Byoungseon
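Applying the same setpci write to several devices can be scripted. The Python sketch below only builds and prints the commands by default. The register f2a.w, the value 0000 and the address 8d:10.0 come from the comment above; the function names and the DEVICES list are hypothetical, and which addresses need the write depends on the machine (see #19 for background).

```python
import subprocess


def build_setpci_commands(devices, register="f2a.w", value="0000"):
    """Return one 'sudo setpci' argument list per PCI device address."""
    return [["sudo", "setpci", "-s", dev, f"{register}={value}"] for dev in devices]


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical list: only 8d:10.0 is named in this thread.
# Fill in the addresses that apply to your own system.
DEVICES = ["8d:10.0"]

if __name__ == "__main__":
    apply(build_setpci_commands(DEVICES))  # dry run: prints, changes nothing
```

Keeping the dry run as the default lets you review the exact commands before anything is written to PCI configuration space. Pass dry_run=False only once the list looks right.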

@nluehr
Contributor

nluehr commented Oct 30, 2017

Glad you were able to get it running!

nluehr closed this as completed Oct 30, 2017