Segmentation fault when setting the environment NCCL_SHM_DISABLE=1 #291

Closed
shaochuang-wsc opened this issue Feb 12, 2020 · 3 comments

@shaochuang-wsc

shaochuang-wsc commented Feb 12, 2020

I get a segmentation fault when running the latest nccl-tests with the latest NCCL and NCCL_SHM_DISABLE=1 set.

all_reduce_perf -g 4 -b 8 -e 16

[tb2b03210:78142] *** Process received signal ***
[tb2b03210:78142] Signal: Segmentation fault (11)
[tb2b03210:78142] Signal code: Address not mapped (1)
[tb2b03210:78142] Failing at address: 0x8
[tb2b03210:78142] [ 0] /lib64/libpthread.so.0(+0x115d0)[0x7f8eaaf0e5d0]
[tb2b03210:78142] [ 1] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x3533c)[0x7f8eab35933c]
[tb2b03210:78142] [ 2] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x2508e)[0x7f8eab34908e]
[tb2b03210:78142] [ 3] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x10acd)[0x7f8eab334acd]
[tb2b03210:78142] [ 4] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x15f15)[0x7f8eab339f15]
[tb2b03210:78142] [ 5] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x16e47)[0x7f8eab33ae47]
[tb2b03210:78142] [ 6] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x1fa56)[0x7f8eab343a56]
[tb2b03210:78142] [ 7] /lib64/libpthread.so.0(+0x76ca)[0x7f8eaaf046ca]
[tb2b03210:78142] [ 8] /lib64/libc.so.6(clone+0x5f)[0x7f8eaa213edf]

Here is the backtrace from gdb:

(gdb) bt
#0  0x00007f1f0ba9b33c in pathDistance (links=<optimized out>) at graph/topo.cc:618
#1  ncclTopoGpuDistance (system=system@entry=0x1c9dd40, busId1=<optimized out>, busId2=<optimized out>,
    distance=distance@entry=0x7ffe15bd0b10) at graph/topo.cc:627
#2  0x00007f1f0ba8b08e in p2pCanConnect (ret=0x7ffe15bd0bf0, topo=0x1c9dd40, graph=<optimized out>, info1=0x470d700,
    info2=0x470d670) at transport/p2p.cc:116
#3  0x00007f1f0ba76d4b in selectTransport<1> (channelId=0, buffSize=4194304, connector=0x4713ff0, connect=0x7ffe15bd0c00,
    peerInfo=0x470d670, myInfo=0x470d700, graph=0x7ffe15bd8f10, topo=0x1c9dd40) at init.cc:281
#4  p2pSetup (comm=comm@entry=0x4706f80, graph=graph@entry=0x7ffe15bd8f10, channel=channel@entry=0x4706f80, nrecv=nrecv@entry=1,
    peerRecv=peerRecv@entry=0x4706f80, nsend=nsend@entry=1, peerSend=peerSend@entry=0x4706f84) at init.cc:404
#5  0x00007f1f0ba7bf15 in initTransportsRank (comm=comm@entry=0x4706f80, commId=commId@entry=0x7ffe15be1590) at init.cc:617
#6  0x00007f1f0ba7ce47 in ncclCommInitRankSync (newcomm=newcomm@entry=0x1bd83e8, nranks=nranks@entry=4, commId=...,
    myrank=myrank@entry=3, cudaDev=cudaDev@entry=3) at init.cc:732
#7  0x00007f1f0ba7da84 in ncclCommInitRankDev (cudaDev=3, myrank=3, commId=..., nranks=4, newcomm=0x1bd83e8) at init.cc:771
#8  ncclCommInitRank (newcomm=0x1bd83e8, nranks=4, commId=..., myrank=3) at init.cc:782

The error occurs on both the master branch and release 2.5.6.
Release 2.4.2 works correctly.
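
To compare NCCL releases consistently, the run can be wrapped in a small script that sets the variable and reports whether the process died from a signal. This is only a sketch: the helper names `run_with_env` and `describe` are made up for illustration, and the binary path in the usage note is an assumption about where nccl-tests was built.

```python
import os
import signal
import subprocess


def run_with_env(cmd, extra_env):
    """Run cmd with extra_env merged into the current environment.

    Returns the child's return code. On POSIX, subprocess reports a
    child killed by signal N as the negative value -N.
    """
    env = dict(os.environ)
    env.update(extra_env)
    return subprocess.run(cmd, env=env).returncode


def describe(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # Negative codes mean the child was terminated by a signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode
```

For example, `describe(run_with_env(["./build/all_reduce_perf", "-g", "4", "-b", "8", "-e", "16"], {"NCCL_SHM_DISABLE": "1"}))` would be expected to report `killed by SIGSEGV` on an affected build when the binary is started directly. If it is started through a launcher such as mpirun, the status seen is the launcher's rather than the test's.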

# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	PIX	SYS	SYS	0-95
GPU1	PIX	 X 	SYS	SYS	0-95
GPU2	SYS	SYS	 X 	PIX	0-95
GPU3	SYS	SYS	PIX	 X 	0-95
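
In the nvidia-smi legend, PIX means the two GPUs are connected through at most a single PCIe bridge, and SYS means the path crosses PCIe and the interconnect between NUMA nodes. As a rough illustration of what the matrix above says, the following sketch groups the GPU pairs by link type. The function name `pairs_by_link` is hypothetical, and the matrix is retyped from the output above with the CPU Affinity column dropped.

```python
from itertools import combinations

# GPU-to-GPU part of the `nvidia-smi topo -m` matrix shown above.
TOPO = """\
GPU0 X PIX SYS SYS
GPU1 PIX X SYS SYS
GPU2 SYS SYS X PIX
GPU3 SYS SYS PIX X
"""


def pairs_by_link(matrix_text):
    """Group unordered GPU pairs by the link type reported between them."""
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    names = [row[0] for row in rows]
    groups = {}
    for i, j in combinations(range(len(names)), 2):
        # Column 0 holds the row label, so GPU j sits in column j + 1.
        link = rows[i][j + 1]
        groups.setdefault(link, []).append((names[i], names[j]))
    return groups


for link, pairs in sorted(pairs_by_link(TOPO).items()):
    print(link, pairs)
```

On this machine that gives two PIX pairs (GPU0/GPU1 and GPU2/GPU3) and four SYS pairs. Whether peer-to-peer is usable across the SYS pairs depends on the platform; where it is not, shared memory is the transport NCCL would normally fall back to, which is the path that NCCL_SHM_DISABLE=1 turns off.
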
@sjeaugey
Member

Could you check whether this is fixed in 2.6? Thanks!

https://github.com/nvidia/nccl/tree/v2.6

@shaochuang-wsc
Author

> Could you check whether this is fixed in 2.6? Thanks!
>
> https://github.com/nvidia/nccl/tree/v2.6

It works well in 2.6. Thanks.

@sjeaugey
Member

Closing as this is fixed in 2.6.
