Added retries to EHOSTUNREACH socket error. #1311

newellz2 · 2024-06-03T16:17:28Z

We've seen jobs using IPoIB hit EHOSTUNREACH and then succeed when relaunched immediately. For example, a link flap caused EHOSTUNREACH, and when the job was relaunched, it started and ran successfully. I've added "no route to host" retries to NCCL so that such jobs do not have to be relaunched.

shamisp · 2024-06-03T16:47:51Z

src/include/socket.h

@@ -20,6 +20,7 @@
 #define SLEEP_INT            1000 // connection retry sleep interval in usec
 #define RETRY_REFUSED_TIMES   2e4 // connection refused retry times before reporting a timeout (20 sec)
 #define RETRY_TIMEDOUT_TIMES    3 // connection timed out retry times (each one can take 20s)
+#define RETRY_NO_ROUTE_TIMES    3 // connection no route to host retry times (each one can take 20s)


I think it would be better to make this configurable.

I agree. Maybe something similar to NCCL_IB_RETRY_CNT? What do you think about making RETRY_REFUSED_TIMES and RETRY_TIMEDOUT_TIMES configurable via environment variables as well?
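
For illustration only, here is a minimal sketch of how a retry count could be read from an environment variable with a compile-time fallback. The helper name `retryCountFromEnv` is hypothetical and is not part of NCCL or of this PR; it only shows the parse-and-fall-back pattern being discussed.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper (not NCCL code): read a non-negative retry count from
// the environment. Returns defaultValue when the variable is unset, empty,
// negative, out of range, or not a plain decimal integer.
long retryCountFromEnv(const char* name, long defaultValue) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return defaultValue;

  errno = 0;
  char* end = nullptr;
  long parsed = std::strtol(value, &end, 10);
  // Reject overflow, trailing garbage, and negative counts.
  if (errno != 0 || *end != '\0' || parsed < 0) return defaultValue;
  return parsed;
}
```

A caller would pass the existing macro as the fallback, for example `retryCountFromEnv("SOME_VARIABLE_NAME", RETRY_NO_ROUTE_TIMES)`, where the variable name is whatever the maintainers settle on.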

shamisp · 2024-06-03T16:49:28Z

src/misc/socket.cc

@@ -478,6 +478,14 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
    }
    usleep(SLEEP_INT);
    return ncclSuccess;
+  } else if (errno == EHOSTUNREACH) {
+    if (++sock->noRouteRetries == RETRY_NO_ROUTE_TIMES) {


In a healthy infrastructure you are not supposed to see this error.
I would advocate that the error be reported on every retry, to notify admins about the infrastructure problem.

I think that's a great idea.
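
A rough sketch of what "report on every retry" could look like, written as a standalone function so it can be read without the rest of `socket.cc`. Everything here is an assumption for illustration: `onNoRouteToHost`, `RetryDecision`, and the log text are hypothetical, and real NCCL code would use its own logging facilities rather than a `std::ostream`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

enum class RetryDecision { Retry, GiveUp };

// Hypothetical stand-in for the EHOSTUNREACH branch (not NCCL code).
// Emits one warning per attempt, including the final one, so that admins
// see the problem even when a later retry succeeds.
RetryDecision onNoRouteToHost(int& retries, int maxRetries, std::ostream& log) {
  assert(maxRetries > 0);
  ++retries;
  log << "WARN: connect() returned EHOSTUNREACH (no route to host), attempt "
      << retries << "/" << maxRetries << "\n";
  // >= rather than == so a counter that somehow overshoots still terminates.
  return retries >= maxRetries ? RetryDecision::GiveUp : RetryDecision::Retry;
}
```

The caller would sleep and retry on `Retry`, and return an error on `GiveUp`.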

@newellz2 - thanks for putting this PR together

@shamisp - I'm the originator of the request for this. Please reach out to Sam Simcoe for details; the NVIDIA support case is 00705873. The high-level summary is that the SM (subnet manager) takes maybe 6-8 seconds to program the fabric. In the NCCL failure scenarios I'm seeing, the connect() attempt occurs during those 6-8 seconds, resulting in EHOSTUNREACH. When connect() is called at other times (when the SM is not busy programming the fabric), NCCL startup proceeds normally. A link flap that causes the SM to program the fabric is of course not ideal, but there are certainly other causes for the SM to program the fabric, and these are simply part of daily life on an IB cluster.
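
To make the sizing question concrete: the retries only help if their combined duration outlasts the window in which the SM is programming the fabric. The sketch below is plain arithmetic under stated assumptions. `attemptsToCover` is a hypothetical helper, and the time one failing attempt takes (the failing connect() plus the sleep) is not given in this thread, so it is treated as a measured input.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not NCCL code): the smallest number of attempts whose
// combined duration is at least `window`, i.e. ceil(window / perAttempt).
// `perAttempt` is the measured time one failing attempt takes.
long long attemptsToCover(std::chrono::milliseconds window,
                          std::chrono::milliseconds perAttempt) {
  assert(perAttempt.count() > 0);
  if (window.count() <= 0) return 1;  // always try at least once
  return (window.count() + perAttempt.count() - 1) / perAttempt.count();
}
```

For example, if each failing attempt were to return after roughly the 1000 usec SLEEP_INT alone, covering an 8-second window would need on the order of 8000 attempts, which is one argument for making the count configurable.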

sjeaugey · 2024-06-19T08:29:13Z

src/misc/socket.cc

@@ -98,6 +98,21 @@ static int envSocketFamily(void) {
  return family;
 }

+/* Set the number of retries for no route to host*/
+static int envNoRouteRetryCount(void) {


Any reason to not use the standard NCCL param definition:

NCCL_PARAM(NoRouteRetryCount, "NO_ROUTE_RETRY_COUNT", RETRY_NO_ROUTE_TIMES);

Then call ncclParamNoRouteRetryCount() instead of envNoRouteRetryCount().
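
For readers unfamiliar with the pattern, the following is a simplified, self-contained stand-in for a "define a cached parameter getter" macro. It is NOT NCCL's implementation: `SKETCH_PARAM`, `sketchParamNoRouteRetryCount`, and `noRouteRetriesExhausted` are hypothetical names, and the sketch assumes the environment variable carries an `NCCL_` prefix and is read only once.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

#define RETRY_NO_ROUTE_TIMES 3  // default taken from the diff above

// Simplified stand-in (assumption, not NCCL's macro): defines
// sketchParam<name>(), which reads "NCCL_<env>" once, caches the result,
// and falls back to defaultValue when the variable is unset or malformed.
#define SKETCH_PARAM(name, env, defaultValue)                            \
  std::int64_t sketchParam##name() {                                     \
    static const std::int64_t cached = []() -> std::int64_t {            \
      const char* str = std::getenv("NCCL_" env);                        \
      if (str == nullptr || *str == '\0') return (defaultValue);         \
      char* end = nullptr;                                               \
      long long parsed = std::strtoll(str, &end, 10);                    \
      return *end == '\0' ? static_cast<std::int64_t>(parsed)            \
                          : static_cast<std::int64_t>(defaultValue);     \
    }();                                                                 \
    return cached;                                                       \
  }

SKETCH_PARAM(NoRouteRetryCount, "NO_ROUTE_RETRY_COUNT", RETRY_NO_ROUTE_TIMES)

// Hypothetical call site: compare the per-socket counter to the parameter.
bool noRouteRetriesExhausted(int retries) {
  assert(retries >= 0);
  return retries >= sketchParamNoRouteRetryCount();
}
```

Because the value is cached on first use, changing the environment variable after the first call has no effect in this sketch.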

shamisp reviewed Jun 3, 2024


Added retries to EHOSTUNREACH socker error.

0b5dfad

newellz2 force-pushed the master branch from 1f509df to 0b5dfad Compare June 4, 2024 04:12

newellz2 changed the title ~~Added retries to EHOSTUNREACH socker error.~~ Added retries to EHOSTUNREACH socket error. Jun 4, 2024

sjeaugey reviewed Jun 19, 2024


Merge branch 'NVIDIA:master' into master

a6979c9
