NCCL hang issue #394
It seems we have the logs of all ranks indeed, except rank 0, which only reports the NCCL version and nothing else. This is quite strange, and I'm not sure why that would be the case; at the very least we should see the beginning of the traces from ncclCommInitRank. The only explanation I can see would be if mpirun were not passing the same environment to rank 0. That said, it seems all ranks exit from ncclCommInitRank and start calling broadcast, so rank 0 has somehow participated in that init phase; otherwise all ranks would be stuck in init as well. The stack.log you attached confirms that we are stuck in ncclBroadcast, if that stack was taken from rank 0. Still, it would be good to fix the rank 0 logging to make sure the environment is indeed the same on all ranks, since a mismatch there could be a cause for the hang.
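For reference, the pattern described above boils down to something like the following minimal sketch (assumptions: MPI is used to distribute the NCCL unique id, and each rank/container owns a single GPU; this is not the reporter's actual test code). Every rank first goes through ncclCommInitRank and then calls ncclBroadcast, which is where the attached stack shows the hang.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  /* Rank 0 creates the unique id and shares it with every rank. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(0);                       /* assumption: one GPU per rank/container */
  float* buf;
  cudaMalloc((void**)&buf, 1024 * sizeof(float));

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);   /* all ranks appear to get past this */
  printf("rank %d: init done\n", rank);

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  /* The reported hang sits in this collective, per the pstack output. */
  ncclBroadcast(buf, buf, 1024, ncclFloat, 0, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDestroy(comm);
  cudaFree(buf);
  MPI_Finalize();
  return 0;
}
```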
Thanks @sjeaugey. Also, I forgot to mention that GPU utilization was at 100% during the hang. So if I understand correctly, all ranks had to successfully exit from ncclCommInitRank, meaning they all finished the initialization process, before the broadcast could happen. Based on this, there's no need to check whether rank 0 was stuck somewhere in initialization, although it is odd that we don't see an NCCL log line showing rank 0 finished initialization. Am I right? The environment should be the same as far as I know, but I'll double-check and report back if I find anything strange. Thanks!
Hi Sylvain, regarding the rank 0 log issue: I double-checked some working cases and they don't print logs on rank 0 either, so I guess it has nothing to do with the hang, weird as it is. I did some more testing and noticed that in the hanging cases I'm getting the message "transport/net_ib.cc:80 NCCL WARN NET/IB : Got async event : GID table change" on some of the ranks. From some searching, this corresponds to the IB async event IBV_EVENT_GID_CHANGE, which means an entry in the GID table changed. I'm wondering: can this cause a hang, does NCCL handle the case where the GID index changes, or might this event not cause a problem at all? Thanks!
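For context, IBV_EVENT_GID_CHANGE is delivered through the libibverbs async event queue; a minimal, hedged sketch of a standalone watcher (separate from NCCL, which only logs the warning quoted above) could look like this. The choice of device (devs[0]) is an assumption.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
  int num = 0;
  struct ibv_device** devs = ibv_get_device_list(&num);
  if (!devs || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

  struct ibv_context* ctx = ibv_open_device(devs[0]);  /* assumption: first device, e.g. mlx5_0 */
  if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

  for (;;) {
    struct ibv_async_event ev;
    if (ibv_get_async_event(ctx, &ev)) break;          /* blocks until an async event arrives */
    if (ev.event_type == IBV_EVENT_GID_CHANGE)
      printf("GID table change on port %d\n", ev.element.port_num);
    else
      printf("async event: %s\n", ibv_event_type_str(ev.event_type));
    ibv_ack_async_event(&ev);
  }

  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}
```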
Same issue here. I ran the nccl-tests; the command is:
The test then hangs, and the pstack log of rank 0 is:
and there is only part of the nccl-tests log:
This could indeed cause a hang. It would be good to understand what changed in the GID table and why. Did some interfaces appear/disappear? Or did some IP address change?
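One hedged way to answer that question is to snapshot the GID table with libibverbs and diff snapshots taken before and after the warning appears; the device index, port number, and number of entries dumped below are assumptions.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
  int num = 0;
  struct ibv_device** devs = ibv_get_device_list(&num);
  if (!devs || num == 0) return 1;
  struct ibv_context* ctx = ibv_open_device(devs[0]);  /* assumption: first device */
  if (!ctx) return 1;

  /* Dump the first 8 entries of port 1's GID table; run twice and diff. */
  for (int i = 0; i < 8; i++) {
    union ibv_gid gid;
    if (ibv_query_gid(ctx, 1, i, &gid)) continue;
    printf("gid[%d] = ", i);
    for (int b = 0; b < 16; b++)
      printf("%02x%s", gid.raw[b], b == 15 ? "\n" : ":");
  }

  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}
```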
Unfortunately, I haven't been able to root-cause the GID change yet (I didn't see any change when checking the files under /sys/class/infiniband/mlx5_..., still digging into it). One thing I did notice is some errors under the hw_counters folder. In the meantime, I tried NCCL v2.7.8-1 in my case and the hang is gone. So my question is whether there are differences between v2.5.7-1 and v2.7.8-1 that could explain the different behavior?
I do not see any change between 2.5 and 2.7 which would explain a different behavior upon a GID_INDEX change.
I have reproduced the hang with GDR enabled, so it seems the GID_INDEX change isn't the root cause. Environment: 2 nodes with 8 V100 GPUs each. If I run the nccl-tests with one docker container per GPU, the test hangs; if I run it with 2 docker containers, 8 GPUs per container, it runs normally. If I run the per-GPU-container mode but disable GDR by setting NCCL_NET_GDR_READ=0 and NCCL_NET_GDR_LEVEL=0, the nccl-tests run successfully.
@weberxie it would seem GPU Direct RDMA is not functional on your setup. This could be due to ACS being enabled, or something else causing your PCI switches to not correctly process PCI peer-to-peer requests.
Actually @weberxie, GPU Direct being broken may not explain why it works when launching a single container with 8 GPUs but not when launching 1 GPU per container. Could you try applying the attached patch on top of 2.7.8 and see if it fixes the issue? Alternatively, as a workaround, can you try setting NCCL_PROTO=^LL128?
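For completeness, the workarounds mentioned in this thread are plain environment variables, so they can be exported in the launch command or, as sketched below, set in the process before NCCL initializes. The helper name is hypothetical; only NCCL_PROTO=^LL128 is the workaround suggested here, while the GDR variables reproduce the earlier experiment.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: must run in every rank before ncclCommInitRank is called. */
void apply_nccl_workarounds(void) {
  /* Workaround suggested above: disable the LL128 protocol. */
  setenv("NCCL_PROTO", "^LL128", 1);

  /* Earlier experiment from this thread: disable GPU Direct RDMA on the network path. */
  setenv("NCCL_NET_GDR_READ", "0", 1);
  setenv("NCCL_NET_GDR_LEVEL", "0", 1);
}
```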
@sjeaugey Thanks. Both approaches fix this issue. Could you explain why they work? Also, the patch seems to be missing a line.
Thanks for the confirmation. Attaching the fixed patch for reference; I forgot one line when porting it back to 2.7. The explanation is a bit complicated, but in short: when using 1 process per node, we may enable LL128 for inter-node communication, but only for the GPUs which are close to the NIC, i.e. those that use GPU Direct RDMA. So half the ranks would have LL128 enabled and the other half would not, causing a protocol mismatch and a hang.
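To illustrate the failure mode (this is not NCCL's internal code, just a sketch of the agreement the fix effectively restores): each rank decides locally whether LL128 is usable, and the protocol should only be enabled if every rank in the communicator agrees; otherwise you get the half-LL128 split described above.

```c
#include <mpi.h>

/* Returns 1 only if every rank can use LL128; any rank that cannot
 * (e.g. no GPU Direct RDMA path to the NIC) disables it for all,
 * avoiding the half-and-half protocol mismatch described above. */
int ll128_enabled_everywhere(int local_ll128_ok, MPI_Comm comm) {
  int global_ok = 0;
  MPI_Allreduce(&local_ll128_ok, &global_ok, 1, MPI_INT, MPI_LAND, comm);
  return global_ok;
}
```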
- Optimization for Tree allreduce on A100.
- Improve aggregation performance.
- Use shared buffers for inter-node send/recv.
- Add NVTX profiling hooks.
- Accelerate alltoall connections by merging communication for all channels.
- Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1.
- Improve alltoall scheduling to better balance intra/inter node communication.
- Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer.
- Net: move to v4.
- Net: make flush operation asynchronous to accelerate alltoall.
- Net: define maximum number of requests.
- Fix hang when using LL128 protocol after 2^31 steps.
- Fix #379 : topology injection failing when using less GPUs than described in the XML.
- Fix #394 : protocol mismatch causing hangs or crashes when using one GPU per node.
@nachtsky1077 we are seeing a similar issue where the GID table changed during training and the process hung.
@sjeaugey Is there any operation in NCCL that would trigger a GID change?
No, NCCL should not cause GID changes.
The fix from the version upgrade was related to the GDR issue mentioned by @weberxie, which has nothing to do with the GID table change. I haven't had a chance to dig into what caused the GID table change.
Environment:
NCCL version 2.5.7 + CUDA 10.0
40 ranks with one GPU per node, each rank is a docker container
Observation:
NCCL hangs during the initialization process; rank 0 didn't finish initialization.
Attached are the pstack log captured during the hang and the NCCL log:
nccl.log
pstack.log
Any idea about what might cause the hang? Thanks!