Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added retries to EHOSTUNREACH socket error. #1311

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

newellz2
Copy link

@newellz2 newellz2 commented Jun 3, 2024

We've seen jobs encounter EHOSTUNREACH when using IPoIB that could be relaunched immediately. For example, a link flap caused EHOSTUNREACH, and when the job was relaunched, it started and ran successfully. I've added no route retries to NCCL to avoid having to relaunch.

@@ -20,6 +20,7 @@
#define SLEEP_INT 1000 // connection retry sleep interval in usec
#define RETRY_REFUSED_TIMES 2e4 // connection refused retry times before reporting a timeout (20 sec)
#define RETRY_TIMEDOUT_TIMES 3 // connection timed out retry times (each one can take 20s)
#define RETRY_NO_ROUTE_TIMES 3 // connection no route to host retry times (each one can take 20s)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to have configurable

Copy link
Author

@newellz2 newellz2 Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Maybe something similar to NCCL_IB_RETRY_CNT? What do you think about making RETRY_REFUSED_TIMES and RETRY_TIMEDOUT_TIMES also configurable via environmental variables?

@@ -478,6 +478,14 @@ static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
}
usleep(SLEEP_INT);
return ncclSuccess;
} else if (errno == EHOSTUNREACH) {
if (++sock->noRouteRetries == RETRY_NO_ROUTE_TIMES) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a healthy infra you not supposed to see this error.
I would advocate the that error has to be reported on every retry to notify admins about the infra problem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a great idea.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@newellz2 - thanks for putting this PR together

@shamisp - I'm the originator of the request for this. Please reach out to Sam Simcoe for details, nvidia support case is 00705873. High-level summary is that the SM takes maybe 6-8 seconds to program the fabric. In the NCCL failure scenarios I'm seeing the connect() attempt occurs during that 6-8 seconds, resulting in EHOSTUNREACH. When connect() is called at other times (when SM is not busy programming the fabric) then NCCL startup proceeds normally. A link flap that causes the SM to program the fabric is of course not ideal, but there are certainly other causes for the SM to program the fabric which are simply part of IB cluster daily life.

@newellz2 newellz2 changed the title Added retries to EHOSTUNREACH socker error. Added retries to EHOSTUNREACH socket error. Jun 4, 2024
@@ -98,6 +98,21 @@ static int envSocketFamily(void) {
return family;
}

/* Set the number of retries for no route to host*/
static int envNoRouteRetryCount(void) {
Copy link
Member

@sjeaugey sjeaugey Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not use the standard NCCL param definition:

NCCL_PARAM(NoRouteRetryCount, "NO_ROUTE_RETRY_COUNT", RETRY_NO_ROUTE_TIMES);

Then call ncclParamNoRouteRetryCount() instead of envNoRouteRetryCount().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants