NCCL 2.22.3 core dump when specify NCCL_IB_ROCE_VERSION_NUM #1538
Comments
With more info log:
|
It would look like NCCL is failing to open a file like:
Can you run If you upgrade to a newer NCCL version, NCCL should print an explicit |
Only NCCL_IB_ROCE_VERSION_NUM fails. If I specify NCCL_IB_GID_INDEX=7 instead, it works well. |
That is strange. The internal IB plugin is going through the GID table looking for the RoCE version:
static ncclResult_t ncclIbRoceGetVersionNum(const char* deviceName, int portNum, int gidIndex, int* version) {
char gidRoceVerStr[16] = { 0 };
char roceTypePath[PATH_MAX] = { 0 };
sprintf(roceTypePath, "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d", deviceName, portNum, gidIndex);
int fd = open(roceTypePath, O_RDONLY);
if (fd == -1) {
return ncclSystemError;
}
int ret = read(fd, gidRoceVerStr, 15);
close(fd);
...
}
It looks like the plugin can't find the roceTypePath containing the RoCE version for that port. What is your OS? |
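To make the two failure modes concrete, here is a minimal sketch (not NCCL's actual code) of reading a `gid_attrs/types` entry and mapping its text to a version number. The helper names `parse_roce_version` and `read_roce_version` and their return codes are assumptions made for illustration; the strings "IB/RoCE v1" and "RoCE v2" are the values the kernel exposes in that sysfs file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map the text of a gid_attrs/types entry to a RoCE
 * version. Returns 1 or 2, or 0 if the string is not recognized. */
static int parse_roce_version(const char *s) {
  if (strncmp(s, "IB/RoCE v1", strlen("IB/RoCE v1")) == 0) return 1;
  if (strncmp(s, "RoCE v1", strlen("RoCE v1")) == 0) return 1;
  if (strncmp(s, "RoCE v2", strlen("RoCE v2")) == 0) return 2;
  return 0;
}

/* Hypothetical helper: read the type file at `path` and store the version.
 * Returns 0 on success, -1 if open() failed, -2 if read() failed or the
 * file was empty. The failing path and reason are printed to stderr so the
 * two cases can be told apart. */
static int read_roce_version(const char *path, int *version) {
  char buf[16] = {0};
  int fd = open(path, O_RDONLY);
  if (fd == -1) {
    fprintf(stderr, "open of %s failed: %s\n", path, strerror(errno));
    return -1;
  }
  ssize_t n = read(fd, buf, sizeof(buf) - 1);
  int saved = errno;  /* close() may overwrite errno */
  close(fd);
  if (n <= 0) {
    fprintf(stderr, "read of %s failed: %s\n", path,
            n < 0 ? strerror(saved) : "empty file");
    return -2;
  }
  *version = parse_roce_version(buf);
  return 0;
}
```

Calling `read_roce_version` on a path such as /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/7 would show whether the open or the read is the step that fails on a given system.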
Could you also attach the output of |
|
Nothing strange there. Can you also cat the content of /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/7 and attach it here? |
|
That looks normal too. Can you set |
With NCCL_IB_ROCE_VERSION_NUM=2 ? |
env like: |
Unfortunately, not much info in the logs either. Please try NCCL v2.23 (as Sylvain suggested) since it adds WARN messages when opening and reading the RoCE version file in /sys. At this point it is not clear whether the file can't be opened correctly or it is opened correctly but can't be read. |
NCCL hangs when I try 2.23.4 with NCCL_IB_ROCE_VERSION_NUM=2. NCCL_IB_GID_INDEX=7 works well. show_gids
env like: |
Try running with NCCL_NET_PLUGIN=none. It seems you are running in a container, so the net plugin may not be compatible. With older plugin versions, NCCL_IB_GID_INDEX is required.
|
env like:
|
I think I caught this error. @GeofferyGeng @gcongiu The top-level loop returns as soon as any GID hits such an error. The system has many GID indexes (some of them invalid), so the loop returns a failure, which causes the connection to fail! |
Yes, especially since your GIDs start from index 4, so some unexpected error occurred. |
I think this is a normal case, since RoCE reserves some indexes. |
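The failure mode described in these comments can be sketched as follows. This is an illustration under assumptions, not NCCL's code: `gid_reader_fn`, `find_gid_index`, and `fake_reader` are hypothetical names, and the table layout in `fake_reader` is made up. The point is that a scan which skips unreadable entries still finds a usable index, whereas one that returns on the first read error would give up at index 0.

```c
/* Hypothetical callback: reports the RoCE version of GID index `idx`.
 * Returns 0 on success, nonzero if the entry cannot be read. */
typedef int (*gid_reader_fn)(int idx, int *version);

/* Return the first index whose RoCE version equals `wanted`, or -1 if
 * there is none. Entries that fail to read are skipped rather than
 * treated as fatal. */
static int find_gid_index(gid_reader_fn reader, int table_len, int wanted) {
  for (int idx = 0; idx < table_len; idx++) {
    int version = 0;
    if (reader(idx, &version) != 0) continue; /* unreadable entry: skip */
    if (version == wanted) return idx;
  }
  return -1;
}

/* Made-up table for illustration: indexes 0-3 are unreadable (reserved or
 * empty), 4-5 are RoCE v1, 6-7 are RoCE v2. */
static int fake_reader(int idx, int *version) {
  if (idx < 4) return -1;
  *version = (idx < 6) ? 1 : 2;
  return 0;
}
```

With this fake table, `find_gid_index(fake_reader, 8, 2)` skips the four unreadable entries and returns 6.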
The call to
static bool configuredGid(union ibv_gid* gid) {
const struct in6_addr *a = (struct in6_addr *)gid->raw;
int trailer = (a->s6_addr32[1] | a->s6_addr32[2] | a->s6_addr32[3]);
if (((a->s6_addr32[0] | trailer) == 0UL) || ((a->s6_addr32[0] == htonl(0xfe800000)) && (trailer == 0UL))) {
return false;
}
return true;
}
static bool linkLocalGid(union ibv_gid* gid) {
const struct in6_addr *a = (struct in6_addr *)gid->raw;
if (a->s6_addr32[0] == htonl(0xfe800000) && a->s6_addr32[1] == 0UL) {
return true;
}
return false;
}
static bool validGid(union ibv_gid* gid) {
return (configuredGid(gid) && !linkLocalGid(gid));
}
Thus, if the GID is configured NCCL should never return with an error. Instead, both the external and internal plugins are trying (and failing) to read the RoCE version from the /sys filesystem for a correctly configured GID. The WARN log does not say which file failed to be read, but it should:
WARN("NET/IB: read of %s failed in ncclIbRoceGetVersionNum: %s", roceTypePath, strerror(errno));
I shall add the above to future releases as well. EDIT: link local GIDs are not considered valid by NCCL but are shown by |
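To see how the three predicates classify concrete addresses, here is a self-contained sketch that mirrors their logic on raw bytes. It assumes a stand-in type `gid16_t` in place of `union ibv_gid` so that it builds without `<infiniband/verbs.h>`; the function names are likewise illustrative, not NCCL's.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for union ibv_gid: 16 raw bytes in network order. */
typedef struct { uint8_t raw[16]; } gid16_t;

static bool all_zero(const uint8_t *p, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (p[i] != 0) return false;
  }
  return true;
}

/* Configured unless the GID is all zeros or exactly fe80:: (prefix only). */
static bool configured_gid(const gid16_t *g) {
  static const uint8_t ll_prefix[4] = {0xfe, 0x80, 0x00, 0x00};
  bool trailer_zero = all_zero(g->raw + 4, 12);
  if (all_zero(g->raw, 4) && trailer_zero) return false;
  if (memcmp(g->raw, ll_prefix, 4) == 0 && trailer_zero) return false;
  return true;
}

/* Link-local: the first 64 bits are fe80:0000:0000:0000. */
static bool link_local_gid(const gid16_t *g) {
  static const uint8_t ll_prefix64[8] = {0xfe, 0x80, 0, 0, 0, 0, 0, 0};
  return memcmp(g->raw, ll_prefix64, 8) == 0;
}

/* Valid: configured and not link-local. */
static bool valid_gid(const gid16_t *g) {
  return configured_gid(g) && !link_local_gid(g);
}
```

Under these definitions an all-zero GID is unconfigured, a GID of the form fe80::&lt;interface id&gt; is configured but link-local (so not valid), and an IPv4-mapped GID such as ::ffff:10.0.0.1 is valid.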
@gcongiu Can we use the newer APIs (ibv_query_gid_ex/ibv_query_gid_table) to handle GIDs? They would be more convenient. I understand that sticking to the basic API improves compatibility, but wherever the latest NCCL version is used, the installed rdma-core/MLNX_OFED should also include these APIs. If so, I can provide a patch for you to check. |
I am running a 2-node all-reduce over a RoCE v2 network (each node has 8 GPUs and 8 RoCE v2 NICs). NCCL core dumps with NCCL_IB_ROCE_VERSION_NUM=2.
When I use NCCL_IB_GID_INDEX instead of NCCL_IB_ROCE_VERSION_NUM, it works well.
NCCL version:
env like:
NCCL core dump like: