nccl-test hung and tcp socket failed sometimes #914

Closed
dbfancier opened this issue Jul 12, 2023 · 5 comments

dbfancier commented Jul 12, 2023

I am using nccl-test to test the performance of the RoCE network, but occasionally I encounter tcp sockets being closed by the peer (this peer is often itself), and the trigger probability increases with the increase in the number of nodes. How do I identify the problem?

NCCL Version: 2.13.4
cudaDriverVersion: 12010

host:22337:22453 [6] NCCL INFO misc/socket.cc:373 -> 2
host:22337:22453 [6] NCCL INFO misc/socket.cc:387 -> 2
host:22337:22453 [6] NCCL INFO transport/net_ib.cc:605 -> 2
host:22337:22453 [6] NCCL INFO include/net.h:26 -> 2
host:22337:22453 [6] NCCL INFO transport/net.cc:515 -> 2
host:22337:22453 [6] NCCL INFO proxy.cc:968 -> 2

host:22337:22453 [6] proxy.cc:1097 NCCL WARN [Proxy Service 14] Failed to execute operation Connect from rank 14, retcode 2

host:22337:22400 [6] misc/socket.cc:521 NCCL WARN Net : Connection closed by remote peer 10.104.148.191<37877>
host:22337:22400 [6] NCCL INFO misc/socket.cc:529 -> 6
host:22337:22400 [6] NCCL INFO misc/socket.cc:541 -> 6
host:22337:22400 [6] NCCL INFO proxy.cc:859 -> 6

host:22337:22400 [6] proxy.cc:862 NCCL WARN Proxy Call to rank 14 failed (Connect)
host:22337:22400 [6] NCCL INFO transport/net.cc:269 -> 6
host:22337:22400 [6] NCCL INFO transport.cc:124 -> 6
host:22337:22400 [6] NCCL INFO init.cc:784 -> 6
host:22337:22400 [6] NCCL INFO init.cc:1045 -> 6
host:22337:22337 [6] NCCL INFO group.cc:59 -> 6
host: Test NCCL failure common.cu:958 'remote process exited or there was a network error / '
 .. host pid 22337: Test failure common.cu:842

sjeaugey (Member) commented:

Could it be the remote process died and this is just a side-effect? Other than that, there isn't much we can do; we need sockets to work properly to exchange IB QP data. Did you try using a different IP interface for out-of-band bootstrap? Like the system Eth NIC, or IP over IB?


dbfancier commented Jul 12, 2023

> Could it be the remote process died and this is just a side-effect? Other than that, there isn't much we can do; we need sockets to work properly to exchange IB QP data. Did you try using a different IP interface for out-of-band bootstrap? Like the system Eth NIC, or IP over IB?

I find that the poll() revents checked in getFdState returns POLLERR (0x0008) at this time. I am now using the system's bond0 device as NCCL_SOCKET_IFNAME. I once tried using the net device corresponding to the RoCE NIC instead, and saw a similar problem, so I wonder whether some TCP-related configuration on my side is wrong.

I am currently running nccl-test in a container with hostNetwork; the same phenomenon also occurs in actual training.

Addition: occasionally POLLOUT, POLLHUP and POLLERR appear at the same time, so the check EQCHECK(ret == 1 && (pfd.revents & POLLOUT), 0) passes, but getsockopt then returns a value that is not 0, EINPROGRESS or ECONNREFUSED, causing ncclSocketError to be returned.


dbfancier commented Jul 13, 2023

It seems that ncclIbConnect is called to try to connect to the same local device before the problem occurs; that connection eventually gets the POLLERR revent in getFdState, with errno ETIMEDOUT.


dbfancier commented Jul 14, 2023

It appears that node A wants to reach NCCL Service 1 (ncclProxyService) on node B, the connection stays in the SYN_SENT state, and the peer NCCL service never returns a SYN-ACK.

I traced the thread listening on this port on node B and found that it keeps calling accept and poll, with both returning EAGAIN. Looking at the socket fds in the poll set and the processes holding them, I found that one of them belongs to a thread in the same process as the listener.

The thread appears to be in the ncclTransportP2pSetup function:

#0  0x00007f0a5e81691e in __libc_recv (fd=276, buf=0x7f0918cc3ba0, len=20640, flags=flags@entry=64) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  0x00007f0a5ea866a2 in recv (__flags=64, __n=<optimized out>, __buf=<optimized out>, __fd=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/socket2.h:44
#2  ncclSocketProgressOpt (sock=sock@entry=0x7f0918c0ef10, ptr=ptr@entry=0x7f0918cc3ba0, size=size@entry=20640, offset=offset@entry=0x7f0a15dcc200, 
    closed=closed@entry=0x7f0a15dcc204, block=0, op=1) at misc/socket.cc:517
#3  0x00007f0a5ea88b50 in ncclSocketProgress (op=1, offset=0x7f0a15dcc200, size=20640, ptr=0x7f0918cc3ba0, sock=0x7f0918c0ef10) at misc/socket.cc:542
#4  ncclSocketWait (op=1, offset=0x7f0a15dcc200, size=20640, ptr=0x7f0918cc3ba0, sock=0x7f0918c0ef10) at misc/socket.cc:553
#5  ncclSocketRecv (sock=sock@entry=0x7f0918c0ef10, ptr=ptr@entry=0x7f0918cc3ba0, size=20640) at misc/socket.cc:565
#6  0x00007f0a5ea7d442 in ncclProxyCall (proxyConn=proxyConn@entry=0x7f0918c41fd8, type=<optimized out>, type@entry=4, 
    reqBuff=reqBuff@entry=0x7f0a15dcc400, reqSize=<optimized out>, reqSize@entry=4, respBuff=respBuff@entry=0x7f0918cc3ba0, respSize=<optimized out>, 
    respSize@entry=20640) at proxy.cc:859
#7  0x00007f0a5ea90168 in recvConnect (comm=<optimized out>, connectInfo=0x7f0a15dcc400, nranks=<optimized out>, rank=<optimized out>, recv=0x7f0918c41fd0)
    at transport/net.cc:343
#8  0x00007f0a5ea6c7d6 in ncclTransportP2pSetup (comm=comm@entry=0x7f0918000c20, graph=graph@entry=0x7f0a15dd6770, connIndex=connIndex@entry=0, 
    highestTransportType=highestTransportType@entry=0x0) at transport.cc:134
#9  0x00007f0a5ea5bbf8 in initTransportsRank (comm=comm@entry=0x7f0918000c20, commId=commId@entry=0x7f0a15de6e60) at init.cc:784
#10 0x00007f0a5ea5db12 in ncclCommInitRankFunc (job_=<optimized out>) at init.cc:1045
#11 0x00007f0a5ea75507 in ncclAsyncJobMain (arg=0x5637ac0ce330) at group.cc:41
#12 0x00007f0a5e80c6db in start_thread (arg=0x7f0a15ded700) at pthread_create.c:463
#13 0x00007f0a5de0e61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Continuing the search through that fd's thread, I found that the thread it established a TCP connection with also seems to belong to the same process. But I haven't added any extra debug messages to NCCL, so it is not yet clear which thread in which process tried to connect, or what state it is in.

@sjeaugey I wonder if you could give me some help. Thanks


dbfancier commented Jul 15, 2023

I'm sorry, this turned out to be an almost embarrassing mistake on our side. Our ops colleague set the parameter net.ipv4.ip_local_port_range to 20000 65535, but also created three Kubernetes NodePort services inside that range (30080, 32000, 30050).

When bind or connect picks a random ephemeral port that happens to match an IPVS rule for one of those NodePorts, the peer does not reply with an ACK because of the sequence numbers.
