Nccl socketStartConnect: Connect to x.x.x.x<xxxx> failed : Software caused connection abort #1515

Open
913871734 opened this issue Nov 16, 2024 · 13 comments

Comments

@913871734

I ran into a tricky problem. When I run a job, it sometimes reports the following error:

socketStartConnect: Connect to 10.45.234.83<47527> failed : Software caused connection abort
Image

The problem does not occur on every run, but it is very frequent: roughly nine out of ten runs fail.
I have checked basic network connectivity and it is fine: I can connect between the two pods with nc, and ping works.

What's even stranger is that I modified misc/socket.cc and recompiled a new libnccl.so to overwrite the previous one. However, the error message reported by the task did not match the code I had just compiled, as if the task was not actually using my new libnccl.so. Yet when I ran all_reduce_perf, some of the log messages I had added were printed. Please help, I have run out of ideas...

@kiskra-nvidia
Member

The specific message you're seeing (Software caused connection abort) is due to a bug in NCCL that's already been fixed for the next release. However, that bug is most likely a secondary effect here and not the true source of your problems (it's a bug in the error handling code -- but what caused the error in the first place?). Have you tried running NCCL with NCCL_DEBUG=INFO environment variable set? If it's still reproducible then, we'd like to see the debug output it produces. If you can't reproduce it with INFO (which could happen if it's a race condition), try with the significantly less verbose NCCL_DEBUG=WARN.

In principle modifying and recompiling NCCL is easy, and indeed it should be enough to replace the single libnccl.so.2 file with the new version. Given the difficulties you described, I suggest that you make sure that the new version is included in both running pods, check with ldd that the dynamic loader is in fact loading the library from the location you expect (you may want to double-check at run time with something like grep libnccl /proc/pid/maps), and finally make sure that libnccl.so.2 (which is typically a soft link) points to your modified variant.
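
The run-time check can also be scripted. The following is a minimal sketch, assuming the Linux /proc/<pid>/maps text format (address, perms, offset, dev, inode, pathname); the function names are hypothetical and are not part of NCCL:

```cpp
#include <fstream>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Returns the distinct mapped file paths in a /proc/<pid>/maps stream whose
// path contains `needle` (for example "libnccl").
std::set<std::string> mappedPathsContaining(std::istream& maps,
                                            const std::string& needle) {
  std::set<std::string> paths;
  std::string line;
  while (std::getline(maps, line)) {
    std::istringstream fields(line);
    std::string range, perms, offset, dev, inode, path;
    // Skip lines that do not have the five leading columns.
    if (!(fields >> range >> perms >> offset >> dev >> inode)) continue;
    // The rest of the line is the path; it is empty for anonymous mappings.
    std::getline(fields >> std::ws, path);
    if (path.find(needle) != std::string::npos) paths.insert(path);
  }
  return paths;
}

// Convenience wrapper for a live process, e.g. pid = "self" or "12345".
std::set<std::string> mappedPathsForPid(const std::string& pid,
                                        const std::string& needle) {
  std::ifstream maps("/proc/" + pid + "/maps");
  return mappedPathsContaining(maps, needle);
}
```

If more than one path comes back, or the path is not the one you overwrote, the process is loading a different copy of the library than the one you rebuilt.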

@913871734
Author

913871734 commented Nov 19, 2024

  1. The screenshot above is the complete output after setting NCCL_DEBUG=INFO. There are no other abnormal messages; I did not observe anything else useful, which is also what puzzles me.
  2. I am very curious what the bug you mentioned (Software caused connection abort) is and how it was fixed.
  3. I am quite sure that the libnccl.so.2 I compiled replaced the original shared library, because I had these pods run all_reduce_perf before running the task, and the output log confirmed that my libnccl.so.2 was loaded correctly before the task started.
    Looking at the error stack, the error was triggered by the upper-level broadcast. Could these operations load a different shared library?

@kiskra-nvidia
Member

Are you saying that setting NCCL_DEBUG=INFO does not generate tons of debug output for you, at least on startup? I don't know the details of your set-up, but it should, so I'm guessing that that output must be going somewhere in your case, maybe simply not where you expect? You could try passing something like NCCL_DEBUG_FILE=$HOME/nccl_debug.%h.%p, which should ensure that the output from each NCCL process goes to a separate file in $HOME.
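
If the variables are easier to set from inside the launching program than from the shell, a sketch like the following may help. It assumes the variables are set before the first NCCL call in the process; the helper name is hypothetical:

```cpp
#include <cstdlib>
#include <string>

// Sets the NCCL debug variables for the current process.
// `dir` is the directory that should receive one log file per process.
// Returns true if both variables were set.
bool enableNcclDebugLogging(const std::string& dir) {
  // %h and %p are left for NCCL to expand, so they are passed through verbatim.
  const std::string file = dir + "/nccl_debug.%h.%p";
  return setenv("NCCL_DEBUG", "INFO", 1) == 0 &&
         setenv("NCCL_DEBUG_FILE", file.c_str(), 1) == 0;
}
```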

Unfortunately the fix for the "software caused connection abort" bug is not a one-liner and extracting it from the ~500 lines of changes to misc/socket.cc that we've accumulated for the next release is nontrivial. But the problem was basically that in case of ECONNREFUSED and ETIMEDOUT errors from the first call to connect in socketStartConnect, NCCL should've been closing the socket and opening a new one before retrying. Because it wasn't, on the next call to connect, at least if the socket was nonblocking, connect would fail with ECONNABORTED. The same problem was present in socketPollConnect.
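
To illustrate the idea only (this is not the actual NCCL change), here is a minimal sketch of a retry loop that discards the socket after ECONNREFUSED or ETIMEDOUT and opens a new one before trying again. It assumes blocking IPv4 sockets, and the function name and parameters are made up for the example:

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Tries to connect to `addr`, retrying up to `maxRetries` times on
// ECONNREFUSED / ETIMEDOUT. Returns a connected fd, or -1 with errno set.
int connectWithFreshSocket(const sockaddr_in& addr, int maxRetries,
                           unsigned int sleepUs) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  for (int attempt = 0;; ++attempt) {
    if (connect(fd, reinterpret_cast<const sockaddr*>(&addr),
                sizeof(addr)) == 0) {
      return fd;
    }
    const int err = errno;
    const bool retryable = (err == ECONNREFUSED || err == ETIMEDOUT);
    // A socket whose connect() failed is never reused: close it first.
    close(fd);
    if (!retryable || attempt >= maxRetries) {
      errno = err;  // report the original connect() error, not close()'s
      return -1;
    }
    fd = socket(AF_INET, SOCK_STREAM, 0);  // fresh socket for the next attempt
    if (fd < 0) return -1;
    usleep(sleepUs);
  }
}
```

With this structure the caller sees the original error (for example ECONNREFUSED) once the retries are exhausted, rather than a secondary error from reusing the failed socket.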

The question is though: why were you getting ECONNREFUSED or ETIMEDOUT in the first place?

@kkkstra

kkkstra commented Nov 25, 2024

I met the same problem today, and upgrading NCCL to the latest version solved it…

@gangxie112

gangxie112 commented Nov 28, 2024

We hit the same issue recently. According to my understanding, we only get ECONNABORTED when we try to connect on a stale socket. But the log shows no other socket error before ECONNABORTED, and ECONNABORTED should not be the first one. Does NCCL swallow the earlier error?
My NCCL version is 2.21.5. @kiskra-nvidia could you share more information about the bug you mentioned?

2024-11-27T11:49:23.229871710Z tj5-cloudml-prod-g8gm402-slave87-20240727:256:547 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0x:10.113.0.10<0>
2024-11-27T11:49:23.230063420Z tj5-cloudml-prod-g8gm402-slave87-20240727:256:547 [6] NCCL INFO Using non-device net plugin version 0
2024-11-27T11:49:23.230190542Z tj5-cloudml-prod-g8gm402-slave87-20240727:256:547 [6] NCCL INFO Using network IBext
2024-11-27T11:49:23.596757833Z 
2024-11-27T11:49:23.596778645Z tj5-cloudml-prod-g8gm402-slave87-20240727:257:542 [7] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 10.113.0.31<46915> failed : Software caused connection abort
2024-11-27T11:49:23.596790026Z tj5-cloudml-prod-g8gm402-slave87-20240727:257:542 [7] NCCL INFO misc/socket.cc:567 -> 2
2024-11-27T11:49:23.596798241Z tj5-cloudml-prod-g8gm402-slave87-20240727:257:542 [7] NCCL INFO misc/socket.cc:621 -> 2
2024-11-27T11:49:23.596805633Z tj5-cloudml-prod-g8gm402-slave87-20240727:257:542 [7] NCCL INFO bootstrap.cc:285 -> 2

@kiskra-nvidia
Member

Yes, the code currently silently retries on ECONNREFUSED and ETIMEDOUT, up to a predefined maximum number of retries (at which point a WARN-level diagnostic is issued). But, as I had said, the retry code has a bug in it... This code will see an overhaul in the upcoming 2.24 release (there will be an additional INFO-level diagnostic on every error, and EHOSTUNREACH will join the list of errors that trigger a retry).

@gangxie112

That's exactly the cause. From what I observed, my errors look like ECONNREFUSED, because the connection failed with ECONNABORTED very quickly. So there is another question: could there be a race condition between the client and the server? I mean, when rank 0 tries to connect to rank 1, rank 1 should already be listening, right?
I'm asking because the issue disappeared after I replaced the server. To check for a network issue, I ran some curl tests between the two containers of the two ranks and found no routing problem.

@kiskra-nvidia
Member

Sorry for the delay in responding. I don't think there are race conditions possible here because the port number to connect to isn't known until after the listening socket has been created by the other side. I think the reason why we retry on ECONNREFUSED is in case the server gets overloaded with connection requests, which can happen especially when bootstrapping very large communicators (the new ncclCommInitRankScalable API introduced in NCCL 2.23 should address such cases as well).
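
The ordering argument can be sketched as follows. This is a generic illustration, not NCCL's bootstrap code, and the helper name is hypothetical: the listener binds to port 0, the kernel assigns a port, and only after listen() has succeeded is that port available to hand to peers.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Opens a listening TCP socket on an OS-assigned port and reports that port.
// Returns the listening fd, or -1 on error.
int openListener(uint16_t* port, int backlog) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = 0;  // let the kernel choose a free port

  socklen_t len = sizeof(addr);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(fd, backlog) != 0 ||
      getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  // The socket is already listening by the time the port can be advertised.
  *port = ntohs(addr.sin_port);
  return fd;
}
```

Because a peer cannot learn the port before this function returns, a connection attempt to that port cannot arrive before the socket is listening.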

@gangxie112

Thanks for the detailed explanation, @kiskra-nvidia. Back to the original issue of "Software caused connection abort": we have hit this again several times recently. The log makes it really hard to find the first failure. So, when is the next release, which fixes the issue, expected?

@kiskra-nvidia
Member

Well, when it's ready 😉. Given the holidays later this month, probably not until early 2025. FYI, here's a patch against 2.21.5 you can try in the meantime:

--- src/misc/socket.cc.orig     2024-12-03 12:20:12.913833404 -0800
+++ src/misc/socket.cc  2024-12-03 12:15:11.381312457 -0800
@@ -467,6 +467,8 @@ static ncclResult_t socketStartConnect(s
       WARN("socketStartConnect: exceeded retries (%d)", sock->refusedRetries);
       return ncclRemoteError;
     }
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     if (sock->refusedRetries % 1000 == 0) INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
     return ncclSuccess;
@@ -476,6 +478,8 @@ static ncclResult_t socketStartConnect(s
       WARN("socketStartConnect: exceeded timeouts (%d)", sock->timedOutRetries);
       return ncclRemoteError;
     }
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     return ncclSuccess;
   } else {
@@ -516,6 +520,8 @@ static ncclResult_t socketPollConnect(st
       WARN("socketPollConnect: exceeded retries (%d)", sock->refusedRetries);
       return ncclRemoteError;
     }
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     if (sock->refusedRetries % 1000 == 0) INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
     usleep(SLEEP_INT);
     sock->state = ncclSocketStateConnecting;
@@ -525,6 +531,8 @@ static ncclResult_t socketPollConnect(st
       WARN("socketPollConnect: exceeded timeouts (%d)", sock->timedOutRetries);
       return ncclRemoteError;
     }
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     sock->state = ncclSocketStateConnecting;
   } else if (ret != EINPROGRESS) {

Note that it's completely untested (other than that it compiles) and I didn't bother with error checking and some other subtleties, but it may get you going...
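
For anyone who wants to add the missing error checking, one possible shape is sketched below. It is an assumption rather than the upstream fix: the helper name is hypothetical, and the only extra detail it handles is that a new descriptor does not inherit the old one's file status flags (such as O_NONBLOCK), so they are copied across. Socket options set with setsockopt are not carried over here.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

// Replaces *fd with a fresh TCP socket of the given address family, copying
// the old descriptor's file status flags (e.g. O_NONBLOCK) to the new one.
// Returns 0 on success; on failure returns -1 with errno set and leaves *fd
// untouched.
int reopenSocket(int* fd, int family) {
  const int flags = fcntl(*fd, F_GETFL);
  if (flags < 0) return -1;

  const int fresh = socket(family, SOCK_STREAM, 0);
  if (fresh < 0) return -1;

  if (fcntl(fresh, F_SETFL, flags) < 0) {
    const int err = errno;
    close(fresh);
    errno = err;  // keep the fcntl() error for the caller
    return -1;
  }
  close(*fd);  // only drop the old socket once the new one is ready
  *fd = fresh;
  return 0;
}
```

In the patch above, the two added lines in each retry branch would then become a single checked call such as reopenSocket(&sock->fd, sock->addr.sa.sa_family).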

@gangxie112

Thanks, let me try it.

@ghtaro

ghtaro commented Dec 11, 2024

@kiskra-nvidia Thank you very much for providing the patch file.

I tried the patch, but still got the same error messages...

I would like to run DeepSpeed training with Slurm. My computational environment is:

Slurm

  • login node: m5.xlarge
  • compute nodes: AWS EC2 p5.48xlarge x 2 nodes, using the DLAMI (the latest one for x86 Ubuntu 22.04)

DeepSpeed

  • confirmed that DeepSpeed ZeRO-3 runs without errors on p5.48xlarge x 2 nodes (no Slurm, no login node)

I did the following on both compute nodes and then reran the Slurm batch file:

  • git cloned the nccl repo into /home/ubuntu/nccl
  • applied the patch to it (it reported success)
  • ran make -j src.build again

ip-10-0-29-193: [rank10]:     dist.broadcast(param.data, 0, self.get_dp_process_group())
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
ip-10-0-29-193: [rank10]:     return func(*args, **kwargs)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
ip-10-0-29-193: [rank10]:     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
ip-10-0-29-193: [rank10]:     return fn(*args, **kwargs)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/deepspeed/comm/torch.py", line 200, in broadcast
ip-10-0-29-193: [rank10]:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
ip-10-0-29-193: [rank10]:     return func(*args, **kwargs)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]:   File "/home/ubuntu/.pyenv/versions/anaconda3-2024.10-1/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2417, in broadcast
ip-10-0-29-193: [rank10]:     work = default_pg.broadcast([tensor], opts)
ip-10-0-29-193: [rank10]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ip-10-0-29-193: [rank10]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ip-10-0-29-193: [rank10]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
ip-10-0-29-193: [rank10]: Last error:
ip-10-0-29-193: [rank10]: socketStartConnect: Connect to 10.0.19.16<58843> failed : Software caused connection abort
ip-10-0-19-16: 2024/12/11 16:12:32 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...

@kiskra-nvidia
Member

@ghtaro Strange that you are still seeing these errors, although as I had said I haven't actually tested this patch... Do you get any more messages in the debug log (was this run with NCCL_DEBUG=INFO)? You may want to try the below patch, which adds additional diagnostics:

--- a/src/misc/socket.cc
+++ b/src/misc/socket.cc
@@ -468,8 +468,10 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
       WARN("socketStartConnect: exceeded retries (%d)", sock->refusedRetries);
       return ncclRemoteError;
     }
+    INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
-    if (sock->refusedRetries % 1000 == 0) INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
     return ncclSuccess;
   } else if (errno == ETIMEDOUT) {
     if (++sock->timedOutRetries == RETRY_TIMEDOUT_TIMES) {
@@ -477,6 +479,9 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
       WARN("socketStartConnect: exceeded timeouts (%d)", sock->timedOutRetries);
       return ncclRemoteError;
     }
+    INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     return ncclSuccess;
   } else {
@@ -518,7 +523,9 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
       WARN("socketPollConnect: exceeded retries (%d)", sock->refusedRetries);
       return ncclRemoteError;
     }
-    if (sock->refusedRetries % 1000 == 0) INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
+    INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     sock->state = ncclSocketStateConnecting;
   } else if (ret == ETIMEDOUT) {
@@ -527,6 +534,9 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
       WARN("socketPollConnect: exceeded timeouts (%d)", sock->timedOutRetries);
       return ncclRemoteError;
     }
+    INFO(NCCL_ALL, "Call to connect returned %s, retrying", strerror(errno));
+    close(sock->fd);
+    sock->fd = socket(sock->addr.sa.sa_family, SOCK_STREAM, 0);
     usleep(SLEEP_INT);
     sock->state = ncclSocketStateConnecting;
   } else if (ret != EINPROGRESS) {
