// calling NCCL communication API. Group API is required when using
// multiple devices per thread
NCCLCHECK(ncclGroupStart());
for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum,
                            comms[i], s[i]));
NCCLCHECK(ncclGroupEnd());
Hi,
I've been experimenting with the NCCL examples published in the official NCCL documentation (NVIDIA Collective Communication Library (NCCL) Documentation, NCCL 2.19.3 documentation).
The part I want to ask about is the snippet shown above.
The comment says the Group API is required in this case, because multiple devices (4 by default) are driven from a single thread. Luckily, I have access to exactly this setup and was able to run the example on it. It runs correctly every time, whether I keep the ncclGroupStart / ncclGroupEnd pair in place or remove it completely. I tried to provoke the expected deadlock by changing the number of devices used in examples 1 and 3, but it never appears.
Could you please comment on this? Does this experiment merely show that the deadlock is possible but very rare in this case, or can it simply not occur here, meaning the comment is wrong?