Segmentation fault when setting the environment NCCL_SHM_DISABLE=1 #291

shaochuang-wsc · 2020-02-12T02:59:51Z

I get a segmentation fault when running the newest nccl-tests against the newest NCCL with NCCL_SHM_DISABLE=1 set.

all_reduce_perf -g 4 -b 8 -e 16

[tb2b03210:78142] *** Process received signal ***
[tb2b03210:78142] Signal: Segmentation fault (11)
[tb2b03210:78142] Signal code: Address not mapped (1)
[tb2b03210:78142] Failing at address: 0x8
[tb2b03210:78142] [ 0] /lib64/libpthread.so.0(+0x115d0)[0x7f8eaaf0e5d0]
[tb2b03210:78142] [ 1] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x3533c)[0x7f8eab35933c]
[tb2b03210:78142] [ 2] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x2508e)[0x7f8eab34908e]
[tb2b03210:78142] [ 3] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x10acd)[0x7f8eab334acd]
[tb2b03210:78142] [ 4] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x15f15)[0x7f8eab339f15]
[tb2b03210:78142] [ 5] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x16e47)[0x7f8eab33ae47]
[tb2b03210:78142] [ 6] /usr/local/cuda-10.0/targets/x86_64-linux/lib/libnccl.so.2(+0x1fa56)[0x7f8eab343a56]
[tb2b03210:78142] [ 7] /lib64/libpthread.so.0(+0x76ca)[0x7f8eaaf046ca]
[tb2b03210:78142] [ 8] /lib64/libc.so.6(clone+0x5f)[0x7f8eaa213edf]

Here is the backtrace:

(gdb) bt
#0  0x00007f1f0ba9b33c in pathDistance (links=<optimized out>) at graph/topo.cc:618
#1  ncclTopoGpuDistance (system=system@entry=0x1c9dd40, busId1=<optimized out>, busId2=<optimized out>,
    distance=distance@entry=0x7ffe15bd0b10) at graph/topo.cc:627
#2  0x00007f1f0ba8b08e in p2pCanConnect (ret=0x7ffe15bd0bf0, topo=0x1c9dd40, graph=<optimized out>, info1=0x470d700,
    info2=0x470d670) at transport/p2p.cc:116
#3  0x00007f1f0ba76d4b in selectTransport<1> (channelId=0, buffSize=4194304, connector=0x4713ff0, connect=0x7ffe15bd0c00,
    peerInfo=0x470d670, myInfo=0x470d700, graph=0x7ffe15bd8f10, topo=0x1c9dd40) at init.cc:281
#4  p2pSetup (comm=comm@entry=0x4706f80, graph=graph@entry=0x7ffe15bd8f10, channel=channel@entry=0x4706f80, nrecv=nrecv@entry=1,
    peerRecv=peerRecv@entry=0x4706f80, nsend=nsend@entry=1, peerSend=peerSend@entry=0x4706f84) at init.cc:404
#5  0x00007f1f0ba7bf15 in initTransportsRank (comm=comm@entry=0x4706f80, commId=commId@entry=0x7ffe15be1590) at init.cc:617
#6  0x00007f1f0ba7ce47 in ncclCommInitRankSync (newcomm=newcomm@entry=0x1bd83e8, nranks=nranks@entry=4, commId=...,
    myrank=myrank@entry=3, cudaDev=cudaDev@entry=3) at init.cc:732
#7  0x00007f1f0ba7da84 in ncclCommInitRankDev (cudaDev=3, myrank=3, commId=..., nranks=4, newcomm=0x1bd83e8) at init.cc:771
#8  ncclCommInitRank (newcomm=0x1bd83e8, nranks=4, commId=..., myrank=3) at init.cc:782

This error happens on the master branch and on release version 2.5.6.
It works correctly with release version 2.4.2.

# nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	PIX	SYS	SYS	0-95
GPU1	PIX	 X 	SYS	SYS	0-95
GPU2	SYS	SYS	 X 	PIX	0-95
GPU3	SYS	SYS	PIX	 X 	0-95
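
For readers skimming the matrix above, here is a minimal Python sketch that groups the GPU pairs by the link type nvidia-smi reports. In the nvidia-smi legend, PIX means the path crosses at most a single PCIe bridge, and SYS means it crosses PCIe plus the interconnect between NUMA nodes. The `TOPO` string and the `pairs_by_link` helper are illustrative names, not part of NCCL or nvidia-smi, and the CPU Affinity column is left out for brevity.

```python
# Topology matrix copied from the `nvidia-smi topo -m` output in this report.
# The CPU Affinity column is dropped; only the GPU-to-GPU links are kept.
TOPO = """\
GPU0 X PIX SYS SYS
GPU1 PIX X SYS SYS
GPU2 SYS SYS X PIX
GPU3 SYS SYS PIX X
"""


def pairs_by_link(matrix_text):
    """Group unordered GPU pairs by link type (hypothetical helper)."""
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    names = [row[0] for row in rows]
    groups = {}
    for i, row in enumerate(rows):
        links = row[1:1 + len(names)]
        # Only look above the diagonal so each pair is listed once.
        for j in range(i + 1, len(names)):
            groups.setdefault(links[j], []).append((names[i], names[j]))
    return groups


if __name__ == "__main__":
    for link, pairs in sorted(pairs_by_link(TOPO).items()):
        print(link, pairs)
```

On this machine that gives two PIX pairs (GPU0/GPU1 and GPU2/GPU3) and four SYS pairs, so every ring over all four GPUs has to cross a SYS link at some point.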


sjeaugey · 2020-02-12T17:21:48Z

Could you check whether this is fixed in 2.6? Thanks!

https://github.com/nvidia/nccl/tree/v2.6

shaochuang-wsc · 2020-02-13T01:47:19Z

> Could you check whether this is fixed in 2.6? Thanks!
>
> https://github.com/nvidia/nccl/tree/v2.6

It works well in 2.6. Thanks

sjeaugey · 2020-04-15T01:55:44Z

Closing as this is fixed in 2.6.

sjeaugey closed this as completed Apr 15, 2020
