NCCL buffer might not be releasing properly on H100 cards #1541

Open
AndSonder opened this issue Dec 13, 2024 · 1 comment

Comments

@AndSonder

Hi, I am currently using NCCL send/recv for pipeline parallelism (VPP) to train a large language model (Baichuan). I'd appreciate your advice on an issue I'm seeing.

When training the model on H100 GPUs, training hangs after a few steps, even though every step of the pipeline-parallel schedule performs exactly the same communication pattern. If I increase the NCCL buffer size, the model runs for more steps before hanging again, which suggests that the NCCL buffer is not being released properly on H100 cards. The exact same code runs without issues on A100 GPUs. My NCCL version is 2.17.
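
For reference, the usual way to raise the buffer size is the documented `NCCL_BUFFSIZE` environment variable, which NCCL reads once at communicator initialization. Below is a minimal, self-contained sketch (not the reporter's code) of a single-process, two-GPU send/recv with the buffer doubled from its 4 MiB default; the device indices and message size are illustrative:

```c
/* Sketch: two-GPU NCCL send/recv in one process, with NCCL_BUFFSIZE
 * raised before communicator init (NCCL reads it at init time). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); \
  } } while (0)

int main(void) {
  /* Double the default 4 MiB per-channel buffer (8388608 = 8 MiB). */
  setenv("NCCL_BUFFSIZE", "8388608", 1);

  int ndev = 2, devs[2] = {0, 1};
  ncclComm_t comms[2];
  CHECK_NCCL(ncclCommInitAll(comms, ndev, devs));

  float *buf[2];
  cudaStream_t streams[2];
  const size_t count = 1 << 20;  /* 1M floats per message */
  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void **)&buf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* One pipeline "hop": rank 0 sends its buffer to rank 1. */
  CHECK_NCCL(ncclGroupStart());
  CHECK_NCCL(ncclSend(buf[0], count, ncclFloat, 1, comms[0], streams[0]));
  CHECK_NCCL(ncclRecv(buf[1], count, ncclFloat, 0, comms[1], streams[1]));
  CHECK_NCCL(ncclGroupEnd());

  for (int i = 0; i < ndev; i++) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaFree(buf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("send/recv completed\n");
  return 0;
}
```

Build with something like `nvcc -lnccl` (or `gcc` with `-lcudart -lnccl`); it needs a machine with at least two GPUs.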

I’d like to ask if there are any known issues like this with H100 GPUs, and if there is any way to monitor NCCL buffer usage in real time.
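
As far as I know, NCCL does not expose a public API that reports its internal buffer usage in real time. Two common proxies, sketched here under that assumption: enabling allocation logging with `NCCL_DEBUG=INFO` (optionally narrowed with `NCCL_DEBUG_SUBSYS=ALLOC`, available in recent NCCL versions), and polling total device memory through NVML while training runs:

```c
/* Sketch: poll per-GPU memory with NVML as a rough proxy for NCCL
 * buffer growth. This only shows total device-memory movement, not
 * NCCL's internal accounting. Link with -lnvidia-ml. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
  nvmlInit();
  unsigned int n;
  nvmlDeviceGetCount(&n);
  for (unsigned int i = 0; i < n; i++) {
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    nvmlDeviceGetHandleByIndex(i, &dev);
    nvmlDeviceGetMemoryInfo(dev, &mem);
    printf("GPU %u: used %llu / %llu MiB\n", i,
           (unsigned long long)(mem.used >> 20),
           (unsigned long long)(mem.total >> 20));
  }
  nvmlShutdown();
  return 0;
}
```

Running this in a loop (or using `nvidia-smi` for the same data) across training steps would show whether device memory keeps growing step over step.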

@kiskra-nvidia
Member

I am not aware of any such issues. Having said that, the current NCCL version is 2.23.4, and it features 18 months' worth of improvements and fixes over 2.17. I encourage you to try with the latest version first, and if that doesn't help, you'll need to post a reproducer that we can use to investigate.
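
One quick sanity check when upgrading: confirm which NCCL library the process actually loads at run time, since the installed headers and the loaded `.so` can differ. A minimal sketch using the public `ncclGetVersion()` API:

```c
/* Sketch: print the NCCL version of the library loaded at run time. */
#include <stdio.h>
#include <nccl.h>

int main(void) {
  int v = 0;
  ncclGetVersion(&v);
  /* For NCCL >= 2.9 the encoding is major*10000 + minor*100 + patch. */
  printf("NCCL runtime version: %d.%d.%d\n",
         v / 10000, (v % 10000) / 100, v % 100);
  return 0;
}
```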
