Hi, I am currently using NCCL send/recv for pipeline parallelism (VPP) while training an LLM (Baichuan), and I'd appreciate your advice on something.
When training the model on H100 GPUs, the training hangs after a few steps. Each step of the pipeline schedule should perform exactly the same communication pattern, yet if I increase the NCCL buffer size, the model runs for more steps before hanging again. It looks as though the NCCL buffers are not being released properly on H100 cards. The exact same code runs without issues on A100 GPUs. My NCCL version is 2.17.
I’d like to ask if there are any known issues like this with H100 GPUs, and if there is any way to monitor NCCL buffer usage in real time.
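For context, this is roughly how I raise the buffer size and turn on logging before communicator creation (a sketch assuming PyTorch with the NCCL backend; the 32 MiB value is illustrative, not my actual training config):

```python
# Set NCCL env vars before the communicator is created (assumption: PyTorch
# with the NCCL backend; the buffer size shown here is only an example).
import os
import torch.distributed as dist

os.environ["NCCL_BUFFSIZE"] = str(32 * 1024 * 1024)  # default is 4 MiB per channel
os.environ["NCCL_DEBUG"] = "INFO"                     # print NCCL log lines
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P"          # focus on init and send/recv paths

dist.init_process_group(backend="nccl")
```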
I am not aware of any such issues. Having said that, the current NCCL version is 2.23.4, and it features 18 months' worth of improvements and fixes over 2.17. I encourage you to try with the latest version first, and if that doesn't help, you'll need to post a reproducer that we can use to investigate.
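For reference, a minimal standalone loop like the one below is the kind of reproducer that helps isolate whether plain send/recv alone hangs on H100 (a sketch assuming PyTorch's torch.distributed with the NCCL backend; the two-rank setup, tensor size, and step count are illustrative):

```python
# Minimal two-rank send/recv loop in the spirit of a pipeline schedule.
# Launch with e.g. `torchrun --nproc_per_node=2 repro.py` on two GPUs.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    peer = 1 - rank  # two-rank "pipeline": rank 0 <-> rank 1

    x = torch.ones(64 * 1024 * 1024, device="cuda")  # activation-sized tensor (~256 MB)

    for step in range(1000):
        if rank == 0:
            dist.send(x, dst=peer)   # "forward" activation
            dist.recv(x, src=peer)   # "backward" gradient
        else:
            dist.recv(x, src=peer)
            dist.send(x, dst=peer)
        if step % 100 == 0 and rank == 0:
            print(f"step {step} ok", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If a loop of this shape runs indefinitely on H100 with the latest NCCL, the hang is more likely tied to the framework's pipeline schedule than to NCCL's buffer management.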