Hi, I am currently using NCCL send/recv for pipeline parallelism (VPP) while training an LLM (Baichuan), and I'd appreciate your advice on something.
When training the model on H100 GPUs, the training hangs after a few steps. Each step of the pipeline schedule should perform exactly the same communication pattern, yet if I increase the NCCL buffer size, the model runs for more steps before hanging again. It looks as though the NCCL buffers are not being released properly on H100 cards. The exact same code runs without issues on A100 GPUs. My NCCL version is 2.17.
I’d like to ask if there are any known issues like this with H100 GPUs, and if there is any way to monitor NCCL buffer usage in real time.
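For context, this is roughly how I raise the buffer size and turn on logging before communicator creation (a sketch assuming PyTorch with the NCCL backend; the 32 MiB value is illustrative, not my actual training config):

```python
# Set NCCL env vars before the communicator is created (assumption: PyTorch
# with the NCCL backend; the buffer size shown here is only an example).
import os
import torch.distributed as dist

os.environ["NCCL_BUFFSIZE"] = str(32 * 1024 * 1024)  # default is 4 MiB per channel
os.environ["NCCL_DEBUG"] = "INFO"                     # print NCCL log lines
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P"          # focus on init and send/recv paths

dist.init_process_group(backend="nccl")
```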
I am not aware of any such issues. Having said that, the current NCCL version is 2.23.4, and it features 18 months' worth of improvements and fixes over 2.17. I encourage you to try with the latest version first, and if that doesn't help, you'll need to post a reproducer that we can use to investigate.
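For reference, a minimal standalone loop like the one below is the kind of reproducer that helps isolate whether plain send/recv alone hangs on H100 (a sketch assuming PyTorch's torch.distributed with the NCCL backend; the two-rank setup, tensor size, and step count are illustrative):

```python
# Minimal two-rank send/recv loop in the spirit of a pipeline schedule.
# Launch with e.g. `torchrun --nproc_per_node=2 repro.py` on two GPUs.
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    peer = 1 - rank  # two-rank "pipeline": rank 0 <-> rank 1

    x = torch.ones(64 * 1024 * 1024, device="cuda")  # activation-sized tensor (~256 MB)

    for step in range(1000):
        if rank == 0:
            dist.send(x, dst=peer)   # "forward" activation
            dist.recv(x, src=peer)   # "backward" gradient
        else:
            dist.recv(x, src=peer)
            dist.send(x, dst=peer)
        if step % 100 == 0 and rank == 0:
            print(f"step {step} ok", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If a loop of this shape runs indefinitely on H100 with the latest NCCL, the hang is more likely tied to the framework's pipeline schedule than to NCCL's buffer management.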