Nccl socketStartConnect: Connect to x.x.x.x<xxxx> failed : Software caused connection abort #1515
Comments
The specific message you're seeing ( In principle modifying and recompiling NCCL is easy, and indeed it should be enough to replace the single |
|
Are you saying that setting Unfortunately the fix for the "software caused connection abort" bug is not a one-liner and extracting it from the ~500 lines of changes to The question is though: why were you getting |
I met the same problem today, and upgrading NCCL to the latest version solved it… |
We hit the same issue recently. According to my understanding, we only get ECONNABORTED when we try to connect to a stale socket. But there is no other socket error logged before ECONNABORTED, and ECONNABORTED should not be the first one. Is NCCL swallowing the error?
|
Yes, the code currently silently retries on |
That's exactly the cause. From what I observed, my errors look like ECONNREFUSED, because the connection failed with ECONNABORTED quickly. So there is another question: could there be a race condition between the client and server? I mean, when rank0 tries to connect to rank1, rank1 should already be listening anyway, right? |
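For anyone trying to find the first failure: below is a minimal Python sketch of a connect-with-retry loop that logs every failed attempt instead of only the final one. It is not NCCL's code. The function name, the `RETRYABLE` set and the parameters are all hypothetical and only illustrate why silent retries can hide the original errno.

```python
import errno
import socket
import time

# Hypothetical set of errno values treated as retryable; NOT taken from NCCL.
RETRYABLE = {errno.ECONNREFUSED, errno.ECONNABORTED, errno.ETIMEDOUT}


def connect_with_retries(addr, attempts=5, delay=0.1,
                         connect=socket.create_connection, log=print):
    """Connect to addr (a (host, port) tuple), logging every failed attempt.

    Returns (socket_or_None, list_of_errno_values_seen_in_order).
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            return connect(addr), failures
        except OSError as exc:
            failures.append(exc.errno)
            name = errno.errorcode.get(exc.errno, exc.errno)
            # Log each failure so the first errno is never swallowed.
            log(f"attempt {attempt}: connect to {addr[0]}<{addr[1]}> failed: {name}")
            if exc.errno not in RETRYABLE:
                break  # give up immediately on errors we do not retry
            time.sleep(delay)
    return None, failures
```

With logging like this, an ECONNREFUSED on the first attempt would show up in the log before any later ECONNABORTED, which is the information that is missing in the reports above.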
Sorry for the delay in responding. I don't think there are race conditions possible here because the port number to connect to isn't known until after the listening socket has been created by the other side. I think the reason why we retry on |
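To illustrate the ordering argument (the port is only known after the listening socket exists), here is a small Python sketch using plain TCP sockets on loopback. It is not NCCL code; `open_listener` and `demo` are hypothetical names made up for this example.

```python
import socket


def open_listener(host="127.0.0.1"):
    """Create a listening socket on a kernel-chosen port and return (socket, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0 asks the kernel to pick a free port
    srv.listen()
    # The port number only exists from this point on, so a peer cannot
    # learn it (and try to connect) before the socket is already listening.
    return srv, srv.getsockname()[1]


def demo():
    srv, port = open_listener()
    # "Advertise" the port to the client only after listen() has returned.
    cli = socket.create_connection(("127.0.0.1", port), timeout=5)
    conn, _ = srv.accept()
    cli.sendall(b"hello")
    data = conn.recv(5)
    for s in (conn, cli, srv):
        s.close()
    return port, data
```

Because the client gets the port from the listener itself, a connection refusal in this scheme points to something other than "the server was not listening yet", for example the listener having gone away in the meantime.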
Thanks for the detailed explanation, @kiskra-nvidia. Back to the original issue of "Software caused connection abort": we have hit this again several times recently. The log makes it really hard to find the first failure. So, what is the date of the next release that fixes the issue? |
Well, when it's ready 😉. Given the holidays later this month, probably not until early 2025. FYI, here's a patch against 2.21.5 you can try in the meantime:
Note that it's completely untested (other than that it compiles) and I didn't bother with error checking and some other subtleties, but it may get you going... |
Thanks, let me try it. |
@kiskra-nvidia Thank you very much for providing the patch file. I tried the patch, but still got the same error messages... I would like to run deepspeed training with slurm. My computational environment is: Slurm
DeepSpeed
What I did was the following on both compute nodes, and then I reran the Slurm batch file.
|
@ghtaro Strange that you are still seeing these errors, although as I had said I haven't actually tested this patch... Do you get any more messages in the debug log (was this run with
|
I have run into a tricky problem. When I run a job, it sometimes reports the following error:
socketStartConnect: Connect to 10.45.234.83<47527> failed : Software caused connection abort
The problem does not occur every time, but it is very likely: roughly nine out of ten runs fail.
I have checked the basic network and it is OK: I can use nc to connect between the two pods, and ping works.
What's even stranger is that I modified the misc/socket.cc file and recompiled a new libnccl.so to overwrite the previous libnccl.so. However, the error messages reported by the job did not match the code I had just compiled, as if the job was not actually using my new libnccl.so. When I ran all_reduce_perf, on the other hand, some of the log messages I had added were printed. Please help me, I don't have any clue anymore...
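One way to check which libnccl a process has actually loaded is to look at its memory mappings. The sketch below is Linux-only and assumes the `/proc/<pid>/maps` layout (address, permissions, offset, device, inode, path); the function names are hypothetical.

```python
def mapped_libraries(maps_text, needle="libnccl"):
    """Return the unique file paths in /proc/<pid>/maps-style text that contain needle."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if needle in path and path not in paths:
            paths.append(path)
    return paths


def libraries_in_this_process(needle="libnccl"):
    """Inspect the current process; call this after the framework has loaded NCCL."""
    with open("/proc/self/maps") as f:
        return mapped_libraries(f.read(), needle)
```

Calling `libraries_in_this_process()` from inside the training script, after the framework has initialized NCCL, should list the path of the shared library in use. An empty list may mean that NCCL is statically linked into the framework or has not been loaded yet; in the static case, overwriting libnccl.so on disk would have no effect on that job, which could explain the mismatch with all_reduce_perf.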