NCCL hang when invoking Reduce and ncclSend/Recv concurrently #1192
Comments
This should probably be an isend.
@sjeaugey Thanks for your reply. I have updated the code, but I ran this program with two GPUs, so it never enters the else branch, and the hang still occurs.
Update: after replacing the P2P op with batchP2P ops, the hang issue was resolved. The code is:
So my question is: why doesn't batchP2P hang, while calling the asynchronous P2P interface directly does? @sjeaugey Do you have any insights on this? Thanks.
Not sure. Perhaps because now, the reduce is always done first, and the p2p operation second (after the if/elif/else)?
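For readers following along, the ordering suggested above might look roughly like the sketch below. It is not the code from this issue (which is not reproduced here). It assumes PyTorch's `torch.distributed` with an already initialized process group, and the function name, buffer names, and `peer`/`dst` parameters are hypothetical. With the NCCL backend, `batch_isend_irecv` is expected to submit the send and the receive together as one group, which may be why the relative order of reduce and P2P ends up the same on every rank.

```python
def reduce_then_batched_p2p(tensor, send_buf, recv_buf, peer, dst=0):
    """Hypothetical helper: every rank issues the reduce first, then one batched send/recv.

    Assumes torch.distributed is installed and init_process_group() was already called.
    """
    import torch.distributed as dist  # imported lazily so the sketch can be loaded without torch

    # Collective first: all ranks enqueue the reduce before any P2P work.
    dist.reduce(tensor, dst=dst, op=dist.ReduceOp.SUM)

    # Describe both directions, then submit them in a single batched call
    # instead of calling isend/irecv one by one inside rank-dependent branches.
    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    requests = dist.batch_isend_irecv(ops)

    # Wait for every returned request before the buffers are reused.
    for req in requests:
        req.wait()
    return recv_buf
```

Each rank would call this with its own peer rank, for example `peer = 1 - rank` in the two-GPU run described in this thread.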
The code is:
and the launch cmd is:
The stack trace of gdb is:
According to my understanding, if there are enough SM resources for the two communication kernels to run concurrently, then the program will not hang. Am I correct?
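To make that hypothesis concrete, here is a toy model in plain Python. It is only an illustration of the reasoning, not of NCCL internals: it assumes that a communication kernel can finish only once the matching kernel is resident on every rank, and that each rank can keep at most `slots` kernels resident at once. The function name `can_finish` and the kernel labels are made up for this sketch.

```python
def can_finish(launch_order, slots):
    """Toy model: return True if every kernel on every rank eventually completes.

    launch_order maps rank -> list of kernel names in the order that rank launches them.
    slots is the assumed number of kernels a rank can keep resident at the same time.
    """
    queues = {rank: list(ops) for rank, ops in launch_order.items()}
    running = {rank: [] for rank in queues}

    while True:
        # Each rank starts queued kernels, in launch order, while it has free slots.
        for rank in queues:
            while queues[rank] and len(running[rank]) < slots:
                running[rank].append(queues[rank].pop(0))

        # A kernel can complete only if it is resident on all ranks simultaneously.
        ready = set.intersection(*(set(kernels) for kernels in running.values()))
        if not ready:
            break  # either everything is done, or nobody can make progress
        for rank in running:
            running[rank] = [k for k in running[rank] if k not in ready]

    return all(not queues[rank] and not running[rank] for rank in queues)


# Ranks launch the two operations in opposite orders.
mismatched = {0: ["reduce", "p2p"], 1: ["p2p", "reduce"]}
print(can_finish(mismatched, slots=1))  # False: each rank waits on a kernel the peer has not started
print(can_finish(mismatched, slots=2))  # True: both kernels are resident, so both can complete

# Same order on every rank completes even with a single slot.
ordered = {0: ["reduce", "p2p"], 1: ["reduce", "p2p"]}
print(can_finish(ordered, slots=1))  # True
```

Under these assumptions, a consistent launch order avoids the stall regardless of how many kernels can run at once, while a mismatched order depends on both kernels being co-resident.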
Any reply will be appreciated, thanks.