nccl-test hung and tcp socket failed sometimes #914

Closed
dbfancier opened this issue Jul 12, 2023 · 5 comments

dbfancier commented Jul 12, 2023

I am using nccl-test to test the performance of the RoCE network, but occasionally I encounter tcp sockets being closed by the peer (this peer is often itself), and the trigger probability increases with the increase in the number of nodes. How do I identify the problem?

NCCL Version: 2.13.4
cudaDriverVersion: 12010

host:22337:22453 [6] NCCL INFO misc/socket.cc:373 -> 2
host:22337:22453 [6] NCCL INFO misc/socket.cc:387 -> 2
host:22337:22453 [6] NCCL INFO transport/net_ib.cc:605 -> 2
host:22337:22453 [6] NCCL INFO include/net.h:26 -> 2
host:22337:22453 [6] NCCL INFO transport/net.cc:515 -> 2
host:22337:22453 [6] NCCL INFO proxy.cc:968 -> 2

host:22337:22453 [6] proxy.cc:1097 NCCL WARN [Proxy Service 14] Failed to execute operation Connect from rank 14, retcode 2

host:22337:22400 [6] misc/socket.cc:521 NCCL WARN Net : Connection closed by remote peer 10.104.148.191<37877>
host:22337:22400 [6] NCCL INFO misc/socket.cc:529 -> 6
host:22337:22400 [6] NCCL INFO misc/socket.cc:541 -> 6
host:22337:22400 [6] NCCL INFO proxy.cc:859 -> 6

host:22337:22400 [6] proxy.cc:862 NCCL WARN Proxy Call to rank 14 failed (Connect)
host:22337:22400 [6] NCCL INFO transport/net.cc:269 -> 6
host:22337:22400 [6] NCCL INFO transport.cc:124 -> 6
host:22337:22400 [6] NCCL INFO init.cc:784 -> 6
host:22337:22400 [6] NCCL INFO init.cc:1045 -> 6
host:22337:22337 [6] NCCL INFO group.cc:59 -> 6
host: Test NCCL failure common.cu:958 'remote process exited or there was a network error / '
 .. host pid 22337: Test failure common.cu:842

sjeaugey (Member) commented:

Could it be the remote process died and this is just a side-effect? Other than that, there isn't much we can do; we need sockets to work properly to exchange IB QP data. Did you try using a different IP interface for out-of-band bootstrap? Like the system Eth NIC, or IP over IB?


dbfancier commented Jul 12, 2023

> Could it be the remote process died and this is just a side-effect? Other than that, there isn't much we can do; we need sockets to work properly to exchange IB QP data. Did you try using a different IP interface for out-of-band bootstrap? Like the system Eth NIC, or IP over IB?

I find that the poll() revents checked in getFdState returns POLLERR (0x0008) at this time. I am now using the system's bond0 device as NCCL_SOCKET_IFNAME. I once tried using the net device corresponding to the RoCE NIC instead, and saw a similar problem, so I wonder whether some TCP-related configuration on my side is wrong.

I am currently running nccl-test in a container with hostNetwork; the same phenomenon also occurs in actual training.

Addition: occasionally POLLOUT, POLLHUP and POLLERR appear at the same time, so the check EQCHECK(ret == 1 && (pfd.revents & POLLOUT), 0) passes, but getsockopt then returns a value that is not 0, EINPROGRESS or ECONNREFUSED, causing ncclSocketError to be returned.


dbfancier commented Jul 13, 2023

It seems that ncclIbConnect is called to try to connect to the same local device before the problem occurs; that connection eventually gets the POLLERR revent in getFdState, with errno ETIMEDOUT.


dbfancier commented Jul 14, 2023

It appears that node A wants to reach NCCL Service 1 (ncclProxyService) on node B, the connection stays in the SYN_SENT state, and the peer NCCL service never returns a SYN-ACK.

I traced the thread listening on this port on node B and found that it keeps calling accept and poll, with both returning EAGAIN. Looking at the socket fds in the poll set and the processes holding them, I found that one of them belongs to a thread in the same process as the listener.

The thread appears to be in the ncclTransportP2pSetup function:

#0  0x00007f0a5e81691e in __libc_recv (fd=276, buf=0x7f0918cc3ba0, len=20640, flags=flags@entry=64) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  0x00007f0a5ea866a2 in recv (__flags=64, __n=<optimized out>, __buf=<optimized out>, __fd=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/socket2.h:44
#2  ncclSocketProgressOpt (sock=sock@entry=0x7f0918c0ef10, ptr=ptr@entry=0x7f0918cc3ba0, size=size@entry=20640, offset=offset@entry=0x7f0a15dcc200, 
    closed=closed@entry=0x7f0a15dcc204, block=0, op=1) at misc/socket.cc:517
#3  0x00007f0a5ea88b50 in ncclSocketProgress (op=1, offset=0x7f0a15dcc200, size=20640, ptr=0x7f0918cc3ba0, sock=0x7f0918c0ef10) at misc/socket.cc:542
#4  ncclSocketWait (op=1, offset=0x7f0a15dcc200, size=20640, ptr=0x7f0918cc3ba0, sock=0x7f0918c0ef10) at misc/socket.cc:553
#5  ncclSocketRecv (sock=sock@entry=0x7f0918c0ef10, ptr=ptr@entry=0x7f0918cc3ba0, size=20640) at misc/socket.cc:565
#6  0x00007f0a5ea7d442 in ncclProxyCall (proxyConn=proxyConn@entry=0x7f0918c41fd8, type=<optimized out>, type@entry=4, 
    reqBuff=reqBuff@entry=0x7f0a15dcc400, reqSize=<optimized out>, reqSize@entry=4, respBuff=respBuff@entry=0x7f0918cc3ba0, respSize=<optimized out>, 
    respSize@entry=20640) at proxy.cc:859
#7  0x00007f0a5ea90168 in recvConnect (comm=<optimized out>, connectInfo=0x7f0a15dcc400, nranks=<optimized out>, rank=<optimized out>, recv=0x7f0918c41fd0)
    at transport/net.cc:343
#8  0x00007f0a5ea6c7d6 in ncclTransportP2pSetup (comm=comm@entry=0x7f0918000c20, graph=graph@entry=0x7f0a15dd6770, connIndex=connIndex@entry=0, 
    highestTransportType=highestTransportType@entry=0x0) at transport.cc:134
#9  0x00007f0a5ea5bbf8 in initTransportsRank (comm=comm@entry=0x7f0918000c20, commId=commId@entry=0x7f0a15de6e60) at init.cc:784
#10 0x00007f0a5ea5db12 in ncclCommInitRankFunc (job_=<optimized out>) at init.cc:1045
#11 0x00007f0a5ea75507 in ncclAsyncJobMain (arg=0x5637ac0ce330) at group.cc:41
#12 0x00007f0a5e80c6db in start_thread (arg=0x7f0a15ded700) at pthread_create.c:463
#13 0x00007f0a5de0e61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Continuing the search through that fd's thread, I found that the thread it established a TCP connection with also seems to belong to the same process. But I haven't added any extra debug messages to NCCL, so it is not yet clear which thread in which process tried to connect, or what state it is in.

@sjeaugey I wonder if you could give me some help. Thanks


dbfancier commented Jul 15, 2023

I'm sorry, this turned out to be an almost embarrassing mistake on our side. Our ops colleague set the parameter net.ipv4.ip_local_port_range to 20000 65535, but also created three Kubernetes NodePort services inside that range (30080, 32000, 30050).

When bind or connect picks a random ephemeral port that happens to match an IPVS rule for one of those NodePorts, the peer does not reply with an ACK because of the sequence numbers.
