-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added retries to EHOSTUNREACH socket error. #1311
base: master
Are you sure you want to change the base?
Conversation
@@ -20,6 +20,7 @@ | |||
#define SLEEP_INT 1000 // connection retry sleep interval in usec | |||
#define RETRY_REFUSED_TIMES 2e4 // connection refused retry times before reporting a timeout (20 sec) | |||
#define RETRY_TIMEDOUT_TIMES 3 // connection timed out retry times (each one can take 20s) | |||
#define RETRY_NO_ROUTE_TIMES 3 // connection no route to host retry times (each one can take 20s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to have configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Maybe something similar to NCCL_IB_RETRY_CNT? What do you think about making RETRY_REFUSED_TIMES and RETRY_TIMEDOUT_TIMES also configurable via environmental variables?
src/misc/socket.cc
Outdated
@@ -478,6 +478,14 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) { | |||
} | |||
usleep(SLEEP_INT); | |||
return ncclSuccess; | |||
} else if (errno == EHOSTUNREACH) { | |||
if (++sock->noRouteRetries == RETRY_NO_ROUTE_TIMES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a healthy infra you not supposed to see this error.
I would advocate the that error has to be reported on every retry to notify admins about the infra problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a great idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@newellz2 - thanks for putting this PR together
@shamisp - I'm the originator of the request for this. Please reach out to Sam Simcoe for details, nvidia support case is 00705873. High-level summary is that the SM takes maybe 6-8 seconds to program the fabric. In the NCCL failure scenarios I'm seeing the connect() attempt occurs during that 6-8 seconds, resulting in EHOSTUNREACH. When connect() is called at other times (when SM is not busy programming the fabric) then NCCL startup proceeds normally. A link flap that causes the SM to program the fabric is of course not ideal, but there are certainly other causes for the SM to program the fabric which are simply part of IB cluster daily life.
@@ -98,6 +98,21 @@ static int envSocketFamily(void) { | |||
return family; | |||
} | |||
|
|||
/* Set the number of retries for no route to host*/ | |||
static int envNoRouteRetryCount(void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not use the standard NCCL param definition:
NCCL_PARAM(NoRouteRetryCount, "NO_ROUTE_RETRY_COUNT", RETRY_NO_ROUTE_TIMES);
Then call ncclParamNoRouteRetryCount()
instead of envNoRouteRetryCount()
.
We've seen jobs encounter EHOSTUNREACH when using IPoIB that could be relaunched immediately. For example, a link flap caused EHOSTUNREACH, and when the job was relaunched, it started and ran successfully. I've added no route retries to NCCL to avoid having to relaunch.