nccl-test hung and tcp socket failed sometimes #914
Comments
Could it be that the remote process died and this is just a side effect? Other than that, there isn't much we can do; we need sockets to work properly to exchange IB QP data. Did you try using a different IP interface for out-of-band bootstrap, such as the system Ethernet NIC or IP over IB?
I found that the poll revents in getFdState is POLLERR (0x0008) at this point. I currently use the system's bond0 device as NCCL_SOCKET_IFNAME. I previously tried the net device corresponding to RoCE as NCCL_SOCKET_IFNAME and hit a similar problem, so I'm wondering whether some TCP-related configuration of mine is wrong. I am currently running nccl-test in a container with hostNetwork. The same behavior also occurs in actual training.
It seems that ncclIbConnect is called to connect to the same local device before the problem occurs; getFdState eventually sees the POLLERR revent, and errno is ETIMEDOUT.
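For readers following along, here is a minimal sketch of how a revents value such as 0x0008 and an errno such as ETIMEDOUT can be decoded. It is not NCCL code: the helper names `decode_revents` and `describe_errno` are hypothetical, and the bit values are the Linux `<poll.h>` ones, which is an assumption about the platform.

```python
import errno
import os

# Linux <poll.h> bit values (assumption: other platforms may differ).
POLL_FLAGS = {
    0x0001: "POLLIN",
    0x0002: "POLLPRI",
    0x0004: "POLLOUT",
    0x0008: "POLLERR",
    0x0010: "POLLHUP",
    0x0020: "POLLNVAL",
}


def decode_revents(revents: int) -> list:
    """Return the names of the poll flags set in a revents bitmask."""
    return [name for bit, name in POLL_FLAGS.items() if revents & bit]


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(decode_revents(0x0008))
    print(describe_errno(errno.ETIMEDOUT))
```

POLLERR only says that the socket has a pending error; the errno (here ETIMEDOUT) is what indicates that the TCP handshake never completed.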
It appears that node A wants to reach NCCL Service 1 (ncclProxyService) on node B. The connection stays in the SYN_SENT state, and the peer NCCL service does not appear to return a SYN-ACK. I traced the thread listening on this port on node B and found that it keeps calling accept and poll, and both return EAGAIN. I looked at the socket fd in the poll fd set and at the process holding it, and found that one of the threads belongs to the same process as the listening thread. That thread appears to be in the ncclTransportP2pSetup function:
Continuing the search through that fd, I found that the thread it had established a TCP connection with also seemed to belong to the same process. I haven't added any more debug messages to NCCL, so it's not clear which thread in which process tried to connect, or what state it is in. @sjeaugey, could you give me some help? Thanks.
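One way to spot connections stuck in SYN_SENT without extra NCCL debug output is to read `/proc/net/tcp` on the node. The following is a rough sketch, not part of NCCL: the function names and the `SAMPLE` text are hypothetical, only IPv4 is handled, and the address decoding assumes a little-endian host.

```python
import socket
import struct

# Subset of the state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV", "0A": "LISTEN"}


def decode_addr(hex_addr):
    """Decode 'HEXIP:HEXPORT' as printed on a little-endian host (assumption)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def sockets_in_state(proc_net_tcp_text, state="SYN_SENT"):
    """Return (local, remote) address pairs for sockets in the given state."""
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        if TCP_STATES.get(fields[3]) == state:
            results.append((decode_addr(fields[1]), decode_addr(fields[2])))
    return results


# Hypothetical sample in the /proc/net/tcp layout, for illustration only.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:7530 0100007F:1F90 02
   1: 00000000:1F90 00000000:0000 0A
"""

if __name__ == "__main__":
    print(sockets_in_state(SAMPLE))
    # On a real node: sockets_in_state(open("/proc/net/tcp").read())
```

The local port of each stuck socket is the interesting part, since it can then be compared against other rules or services on the host.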
I'm sorry, that's an almost idiotic question. Our ops colleague set the parameter net.ipv4.ip_local_port_range to 20000 65535, but created 3 NodePort services inside that port range (30080, 32000, 30050). When bind or connect picks a random port that matches an IPVS rule, it does not reply ACK because of seq
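The overlap described above can be checked mechanically. This is a small sketch using the values quoted in this comment; the helper names `parse_port_range` and `conflicting_ports` are hypothetical.

```python
def parse_port_range(text):
    """Parse the 'LOW HIGH' format used by net.ipv4.ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def conflicting_ports(port_range, service_ports):
    """Return the service ports that fall inside the ephemeral port range."""
    low, high = port_range
    return sorted(p for p in service_ports if low <= p <= high)


if __name__ == "__main__":
    ephemeral = parse_port_range("20000 65535")  # value quoted in the comment
    node_ports = [30080, 32000, 30050]           # NodePort services from the comment
    print(conflicting_ports(ephemeral, node_ports))
    # On a real node the range can be read from
    # /proc/sys/net/ipv4/ip_local_port_range instead of being hard-coded.
```

Any port the check reports can be handed out by the kernel as a source port and then collide with a service rule. A possible mitigation, not confirmed in this thread, is to narrow the range or to list those ports in net.ipv4.ip_local_reserved_ports.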
I am using nccl-test to test the performance of a RoCE network, but occasionally TCP sockets are closed by the peer (the peer is often the node itself), and the probability of hitting this grows with the number of nodes. How can I identify the problem?
NCCL Version: 2.13.4
cudaDriverVersion: 12010