Got a segmentation fault when running the newest nccl-tests against the newest NCCL with NCCL_SHM_DISABLE=1 set.
all_reduce_perf -g 4 -b 8 -e 16
[tb2b03210:78142] *** Process received signal ***
[tb2b03210:78142] Signal: Segmentation fault (11)
[tb2b03210:78142] Signal code: Address not mapped (1)
[tb2b03210:78142] Failing at address: 0x8
[tb2b03210:78142] [ 0] /lib64/libpthread.so.0(+0x115d0)[0x7f8eaaf0e5d0]
[tb2b03210:78142] [ 1] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x3533c)[0x7f8eab35933c]
[tb2b03210:78142] [ 2] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x2508e)[0x7f8eab34908e]
[tb2b03210:78142] [ 3] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x10acd)[0x7f8eab334acd]
[tb2b03210:78142] [ 4] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x15f15)[0x7f8eab339f15]
[tb2b03210:78142] [ 5] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x16e47)[0x7f8eab33ae47]
[tb2b03210:78142] [ 6] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x1fa56)[0x7f8eab343a56]
[tb2b03210:78142] [ 7] /lib64/libpthread.so.0(+0x76ca)[0x7f8eaaf046ca]
[tb2b03210:78142] [ 8] /lib64/libc.so.6(clone+0x5f)[0x7f8eaa213edf]
Here is the backtrace:
(gdb) bt
#0 0x00007f1f0ba9b33c in pathDistance (links=<optimized out>) at graph/topo.cc:618
#1 ncclTopoGpuDistance (system=system@entry=0x1c9dd40, busId1=<optimized out>, busId2=<optimized out>, distance=distance@entry=0x7ffe15bd0b10) at graph/topo.cc:627
#2 0x00007f1f0ba8b08e in p2pCanConnect (ret=0x7ffe15bd0bf0, topo=0x1c9dd40, graph=<optimized out>, info1=0x470d700, info2=0x470d670) at transport/p2p.cc:116
#3 0x00007f1f0ba76d4b in selectTransport<1> (channelId=0, buffSize=4194304, connector=0x4713ff0, connect=0x7ffe15bd0c00, peerInfo=0x470d670, myInfo=0x470d700, graph=0x7ffe15bd8f10, topo=0x1c9dd40) at init.cc:281
#4 p2pSetup (comm=comm@entry=0x4706f80, graph=graph@entry=0x7ffe15bd8f10, channel=channel@entry=0x4706f80, nrecv=nrecv@entry=1, peerRecv=peerRecv@entry=0x4706f80, nsend=nsend@entry=1, peerSend=peerSend@entry=0x4706f84) at init.cc:404
#5 0x00007f1f0ba7bf15 in initTransportsRank (comm=comm@entry=0x4706f80, commId=commId@entry=0x7ffe15be1590) at init.cc:617
#6 0x00007f1f0ba7ce47 in ncclCommInitRankSync (newcomm=newcomm@entry=0x1bd83e8, nranks=nranks@entry=4, commId=..., myrank=myrank@entry=3, cudaDev=cudaDev@entry=3) at init.cc:732
#7 0x00007f1f0ba7da84 in ncclCommInitRankDev (cudaDev=3, myrank=3, commId=..., nranks=4, newcomm=0x1bd83e8) at init.cc:771
#8 ncclCommInitRank (newcomm=0x1bd83e8, nranks=4, commId=..., myrank=3) at init.cc:782
This error occurs with both the master branch and release version 2.5.6. Release version 2.4.2 works fine.
# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PIX     SYS     SYS     0-95
GPU1    PIX      X      SYS     SYS     0-95
GPU2    SYS     SYS      X      PIX     0-95
GPU3    SYS     SYS     PIX      X      0-95
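For anyone reading the matrix above: in the `nvidia-smi topo -m` legend, PIX means the path crosses at most a single PCIe bridge, and SYS means it crosses PCIe plus the interconnect between NUMA nodes. The Python sketch below is only an illustration, not part of NCCL or nccl-tests. It groups the GPU pairs from the pasted matrix by link type, using made-up helper names (`parse_topo`, `pairs_by_link`). It does not show which pair triggers the crash; the backtrace only shows that the fault is reached through `p2pCanConnect` and `ncclTopoGpuDistance`.

```python
from itertools import combinations

# Matrix copied from the `nvidia-smi topo -m` output in this report
# (the CPU Affinity column is dropped to keep parsing simple).
TOPO = """\
      GPU0 GPU1 GPU2 GPU3
GPU0  X    PIX  SYS  SYS
GPU1  PIX  X    SYS  SYS
GPU2  SYS  SYS  X    PIX
GPU3  SYS  SYS  PIX  X
"""


def parse_topo(text):
    """Return {(row_gpu, col_gpu): link_type} for every off-diagonal cell."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    links = {}
    for row in rows:
        name, cells = row[0], row[1:]
        for col, cell in zip(header, cells):
            if name != col:  # skip the "X" diagonal
                links[(name, col)] = cell
    return links


def pairs_by_link(links):
    """Group each unordered GPU pair under its link type."""
    gpus = sorted({row for row, _ in links})
    groups = {}
    for a, b in combinations(gpus, 2):
        groups.setdefault(links[(a, b)], []).append((a, b))
    return groups


groups = pairs_by_link(parse_topo(TOPO))
for link, pairs in sorted(groups.items()):
    print(link, pairs)
```

On this machine that gives two PIX pairs (GPU0/GPU1 and GPU2/GPU3) and four SYS pairs, so every rank has at least one peer that is only reachable across the NUMA interconnect.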
Could you check whether this is fixed in 2.6? Thanks!
https://github.com/nvidia/nccl/tree/v2.6
It works well in 2.6. Thanks.
Closing as this is fixed in 2.6.