Is it safe to start p2p send/recv on a communicator while another communicator is being initialized in another thread? #1082

Open
SpiritedAwayCN opened this issue Nov 21, 2023 · 3 comments

Comments

@SpiritedAwayCN

My requirement: I want to run the following two tasks asynchronously within a single process:

  • An expensive batch of p2p communication over an old communicator
  • Creating a new communication group, with some of the old processes removed and new processes joining, to be used later.

Since both tasks are expensive, I want them to run asynchronously (or even simultaneously). Is it possible to assign them to different threads? Thanks!

@sjeaugey
Member

That's a good question. In general I'd think it should work, but there may be CUDA calls during ncclCommInitRank which could cause an implicit inter-device synchronization. If that is the case, then you could end up with a deadlock if:

  • the p2p communication launches on GPU A but not on GPU B
  • the init is blocking the launch on GPU B, waiting for GPU A to complete its CUDA work, including the NCCL operation which is stuck.
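
The two conditions above form a circular wait. As a rough illustration only, it can be modeled as a wait-for graph in which an edge X -> Y means "X cannot make progress until Y does". The node names and both scenario functions below are hypothetical labels for this sketch. They do not correspond to NCCL or CUDA APIs, and the edges encode the scenario as described, not measured behavior.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Wait-for graph: an edge X -> Y means "X cannot make progress until Y does".
using WaitGraph = std::map<std::string, std::vector<std::string>>;

// Depth-first search; returns true if a cycle is reachable from `node`.
static bool reaches_cycle(const WaitGraph& g, const std::string& node,
                          std::set<std::string>& on_path,
                          std::set<std::string>& done) {
    if (on_path.count(node)) return true;  // back edge: circular wait
    if (done.count(node)) return false;    // already explored, no cycle found
    on_path.insert(node);
    auto it = g.find(node);
    if (it != g.end()) {
        for (const auto& next : it->second) {
            if (reaches_cycle(g, next, on_path, done)) return true;
        }
    }
    on_path.erase(node);
    done.insert(node);
    return false;
}

// A deadlock is modeled as any cycle in the wait-for graph.
bool has_deadlock(const WaitGraph& g) {
    std::set<std::string> on_path, done;
    for (const auto& entry : g) {
        if (reaches_cycle(g, entry.first, on_path, done)) return true;
    }
    return false;
}

// Scenario from the reply above (hypothetical node names).
WaitGraph deadlock_scenario() {
    return {
        {"p2p_kernel_on_A", {"p2p_launch_on_B"}},  // A's op needs its peer on B
        {"p2p_launch_on_B", {"comm_init"}},        // launch on B blocked by init
        {"comm_init", {"p2p_kernel_on_A"}},        // implicit sync waits on A
    };
}

// Same scenario if init performs no implicit inter-device synchronization.
WaitGraph no_sync_scenario() {
    return {
        {"p2p_kernel_on_A", {"p2p_launch_on_B"}},
        {"p2p_launch_on_B", {"comm_init"}},
    };
}
```

Under this model, `has_deadlock(deadlock_scenario())` reports a cycle, while removing the edge from the init back to GPU A's work, as in `no_sync_scenario()`, breaks it.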

@SpiritedAwayCN
Author

> That's a good question. In general I'd think it should work, but there may be CUDA calls during ncclCommInitRank which could cause an implicit inter-device synchronization. If that is the case, then you could end up with a deadlock if:
>
>   • the p2p communication launches on GPU A but not on GPU B
>   • the init is blocking the launch on GPU B, waiting for GPU A to complete its CUDA work, including the NCCL operation which is stuck.

Thank you for your reply! Unfortunately, my batch p2p communication is very complicated, so deadlock case #1 usually occurs in practice (during the call to ncclCommInitRank). I'm a bit confused about why the initialization will cause inter-device synchronization: shouldn't it just set network-related parameters? And are there any alternatives that would meet my needs?

@sjeaugey
Member

> I'm a bit confused about why the initialization will cause inter-device synchronization

In theory it should not. In NCCL 2.19 we replaced a lot of CUDA calls with cuMem* calls, so the situation should improve, but some remaining calls might still cause syncs, in particular when we share buffers between CUDA devices and map them on remote GPUs.

2 participants