My requirement is to run the following two tasks asynchronously within a single process:
1. an expensive batched p2p communication over an old communicator;
2. creating a new communicator, with some of the old processes removed and new processes joining, to be used later.
Since both tasks incur considerable overhead, I would like them to execute asynchronously (or even simultaneously). Is it possible to assign them to different threads? Thanks!
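For concreteness, here is a minimal sketch of the kind of split being asked about, with each task on its own std::thread. Everything below is a hypothetical illustration, not code from this thread: it assumes oldComm is an existing communicator, p2pStream is a dedicated CUDA stream, the device buffers are already allocated, and newId is a ncclUniqueId for the new group that has already been exchanged out of band.

```cpp
// Sketch only: error handling omitted; all names (oldComm, newId, P2pOp, ...)
// are hypothetical placeholders rather than anything from this issue.
#include <functional>
#include <thread>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

struct P2pOp { float* sendBuf; float* recvBuf; size_t count; int peer; };

// Task 1: an expensive batched p2p exchange over the old communicator.
void runBatchP2p(int cudaDev, ncclComm_t oldComm,
                 const std::vector<P2pOp>& ops, cudaStream_t stream) {
  cudaSetDevice(cudaDev);               // device selection is per host thread
  ncclGroupStart();
  for (const P2pOp& op : ops) {
    ncclSend(op.sendBuf, op.count, ncclFloat, op.peer, oldComm, stream);
    ncclRecv(op.recvBuf, op.count, ncclFloat, op.peer, oldComm, stream);
  }
  ncclGroupEnd();                       // returns once the work is enqueued
  cudaStreamSynchronize(stream);        // wait for the exchange to finish
}

// Task 2: create the new communicator (some old ranks removed, new ranks joining).
void initNewComm(int cudaDev, ncclComm_t* newComm, int newNranks,
                 ncclUniqueId newId, int newRank) {
  cudaSetDevice(cudaDev);
  ncclCommInitRank(newComm, newNranks, newId, newRank);
}

// Run both tasks concurrently within one process, each on its own thread.
void overlapP2pAndInit(int cudaDev, ncclComm_t oldComm,
                       const std::vector<P2pOp>& ops, cudaStream_t p2pStream,
                       ncclComm_t* newComm, int newNranks,
                       ncclUniqueId newId, int newRank) {
  std::thread p2pThread(runBatchP2p, cudaDev, oldComm, std::cref(ops), p2pStream);
  std::thread initThread(initNewComm, cudaDev, newComm, newNranks, newId, newRank);
  p2pThread.join();
  initThread.join();
}
```

This is exactly the pattern the reply below warns about: whether it is safe depends on what ncclCommInitRank does internally while the p2p kernels are in flight.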
That's a good question. In general I'd think it should work, but there may be CUDA calls during ncclCommInitRank which could cause an implicit inter-device synchronization. If that is the case, you could end up with a deadlock if:
1. the p2p communication launches on GPU A but not on GPU B;
2. the init is blocking the launch on GPU B, waiting for GPU A to complete its CUDA work, including the NCCL operation, which is stuck.
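One way to rule out condition 1, sketched here purely as an editorial illustration (it is not something proposed in this thread), is to make sure every rank has launched its p2p work before any rank enters ncclCommInitRank, e.g. with a host-side barrier from whatever bootstrap is available (MPI is assumed below). The GPU-side p2p then still overlaps with the host-side init, but no rank can be stuck waiting for a peer whose launch is blocked.

```cpp
// Editorial sketch, not from this thread: enqueue the p2p work everywhere,
// pass a host-side barrier, and only then create the new communicator.
// Assumes MPI has been initialized elsewhere; all names are hypothetical,
// and the batch is reduced to a single send/recv pair for brevity.
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

void enqueueP2pThenInit(int cudaDev, ncclComm_t oldComm,
                        float* sendBuf, float* recvBuf, size_t count, int peer,
                        cudaStream_t stream, ncclComm_t* newComm,
                        int newNranks, ncclUniqueId newId, int newRank) {
  cudaSetDevice(cudaDev);

  // 1) Enqueue the batched p2p; with a (default) blocking communicator,
  //    ncclGroupEnd returns once the operations are launched onto the stream.
  ncclGroupStart();
  ncclSend(sendBuf, count, ncclFloat, peer, oldComm, stream);
  ncclRecv(recvBuf, count, ncclFloat, peer, oldComm, stream);
  ncclGroupEnd();

  // 2) Host-side barrier: nobody proceeds until every rank has launched its
  //    p2p work, so "launched on GPU A but not on GPU B" cannot happen.
  MPI_Barrier(MPI_COMM_WORLD);

  // 3) Create the new communicator. If its internal CUDA calls wait on
  //    outstanding GPU work, they wait on a p2p exchange that can complete,
  //    not one stuck waiting for a missing peer launch.
  ncclCommInitRank(newComm, newNranks, newId, newRank);

  // 4) Finally wait for the p2p exchange itself.
  cudaStreamSynchronize(stream);
}
```

The trade-off: if ncclCommInitRank really does synchronize with outstanding GPU work, this ordering avoids the deadlock but may serialize part of the init behind the p2p rather than fully overlapping them.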
Thank you for your reply! Unfortunately, my batch p2p communication is very complicated, so deadlock condition #1 usually does occur in practice (during the call to ncclCommInitRank). I'm a bit confused about why the initialization would cause an inter-device synchronization; shouldn't it just set up network-related parameters? And are there alternatives that would achieve what I need?
> I'm a bit confused about why the initialization would cause an inter-device synchronization
In theory it should not, and in NCCL 2.19 we have replaced a lot of CUDA calls with cuMem* calls, so the situation should improve, but we might still have some calls causing syncs, in particular when we share buffers between CUDA devices and map them on remote GPUs.
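Since the behavior depends on the NCCL version in use (the cuMem* path mentioned above arrived in 2.19), a small check like the following, given here only as a convenience sketch, reports the version the binary was built against and the one actually loaded at runtime, using the ncclGetVersion call and the NCCL_VERSION/NCCL_VERSION_CODE macros from nccl.h.

```cpp
// Convenience sketch: compare the compile-time and runtime NCCL versions and
// report whether the loaded library is at least 2.19.
#include <cstdio>
#include <nccl.h>

int main() {
  int runtimeVersion = 0;
  if (ncclGetVersion(&runtimeVersion) != ncclSuccess) {
    std::fprintf(stderr, "ncclGetVersion failed\n");
    return 1;
  }
  std::printf("built against NCCL %d.%d.%d (code %d)\n",
              NCCL_MAJOR, NCCL_MINOR, NCCL_PATCH, NCCL_VERSION_CODE);
  std::printf("runtime NCCL version code: %d\n", runtimeVersion);
  std::printf("runtime library is %s 2.19\n",
              runtimeVersion >= NCCL_VERSION(2, 19, 0) ? "at least" : "older than");
  return 0;
}
```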