Question about ncclCommAbort stuck issue #1013
That's expected if you are using the same stream, or different streams where CUDA schedules the work in sequence. In that case, if you abort pg first, the abort waits for pg1.allreduce to complete (which will never complete since one rank is dead), so you hang forever. However, if you abort pg1 first, the allreduce can exit, and then the broadcast can exit as well. To me it is not an NCCL bug but an application bug.
Thanks for your reply. I am wondering, when we first abort pg, why does it have to wait for the collective operation on pg1 to complete? Does it mean that when there are multiple ongoing collective operations and an error happens on one rank, we must abort these operations in the same order as they were enqueued? And what about when the allreduce and broadcast are both executed on pg1, like
Even when there is no ongoing operation on
Because you issue the pg1 allreduce first and it cannot complete, it can block the pg broadcast. abort will wait for its issued collectives to complete before aborting everything. You need to be very careful when using multiple communicators; it is a complex topic. If you enqueue collectives into different streams, and each stream gets the resources to issue its workload, then the abort order does not matter. But as long as one collective can block another, you can easily reach a hang. For the case
If you abort pg1 first, does it not hang? If so, can you provide me the gdb backtrace when it hangs?
For the second case, yes, if we abort
Could you elaborate more on the situation where one collective operation can block another? In my first case, the broadcast and allreduce operate on different process groups and different tensors, so why would pg1.allreduce block pg.broadcast?
I found the root cause. It is because during abort, NCCL will call
As I explained above, there are two major reasons why the pg1 allreduce can block the pg broadcast. One is that you issue them on the same stream; the other is that the GPU does not have enough resources, so the CUDA runtime decides to schedule them in sequence even if they are on different streams.
Thanks for your explanation. Do you have a plan and timeline to solve this issue?
Need to discuss with the team. Will let you know when we have a plan.
Can you try
No, it does not help.
Even if we abort
Another thing is that if we wait until the underlying NCCL watchdog thread catches the timeout and do the abort in the watchdog thread, it aborts successfully. But if we want to abort in the main process, it fails.
I remember the watchdog thread is per comm, so watchdog threads won't block each other the way the main thread does, since the main thread aborts them one by one. Just to confirm one more point: when you enable NCCL_CUMEM_ENABLE=1 and get stuck, can you show me the backtrace of all threads?
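A minimal sketch of the per-communicator behaviour described above, mimicking the watchdog by aborting each group from its own thread so a blocked abort on one group does not serialize behind another. `abort_group` is a placeholder for whatever ends up calling ncclCommAbort for that group (the reporter's `_abort()`), and this assumes that call releases the GIL while it blocks:

```python
import threading

def abort_all(groups, abort_group):
    # One thread per process group, so the aborts are issued concurrently
    # rather than one by one from the main thread.
    threads = [threading.Thread(target=abort_group, args=(g,)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Usage sketch: abort_all([pg, pg1], lambda g: g._abort())
```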
gdb backtrace (only the thread list and the last frames of the main thread survived in this copy):
Thread 7 (Thread 0x7f272e99a640 (LWP 455407) "python3")
Thread 6 (Thread 0x7f273eecc640 (LWP 455403) "python3")
Thread 5 (Thread 0x7f27816b3640 (LWP 455379) "python3")
Thread 4 (Thread 0x7f2781eb4640 (LWP 455378) "python3")
Thread 3 (Thread 0x7f2788d7a640 (LWP 455372) "cuda-EvtHandlr")
Thread 2 (Thread 0x7f278957b640 (LWP 455369) "cuda-EvtHandlr")
Thread 1 (Thread 0x7f281f893740 (LWP 455355) "python3"):
#52 0x00007f281f629e40 in __libc_start_main_impl (main=0x5879f0, argc=3, argv=0x7fff075450e8, init=..., fini=..., rtld_fini=..., stack_end=0x7fff075450d8) at ../csu/libc-start.c:392
#53 0x00000000005878ee in _start ()
Hi,
Hi, Kaiming. Thanks for your update, but it does not work for the aforementioned script (e.g.
I have checked the version of
It looks like your branch is based on NCCL 2.19.3. But after I built your branch,
It seems you are not linking to the installed NCCL. Can you try setting LD_LIBRARY_PATH to the path where you installed NCCL?
I have exported LD_LIBRARY_PATH as
PyTorch might link NCCL internally. Maybe this post helps: https://discuss.pytorch.org/t/ncc-version-and-pytorch-nccl-version-mismatch/87771
@acphile Please set
Thanks @KaimingOuyang and team for providing a fix! I'm wondering if there is a planned release for this fix? And, is
Great news! There is no performance impact from using
Hi, I tried recompiling PyTorch and now
But when we change the abort order, like
It can abort successfully. So it looks like the abort order needs to match the order of the NCCL collectives. Do you have a fix for that? @KaimingOuyang
And for this case
Only rank 0 and rank 1 print
No, it should abort successfully. Can you provide me the backtrace of every thread in ranks 1 and 3?
You mean that for the above two cases your updates should abort successfully? And for which case do you want the backtrace?
The case where you get the hang, i.e.
#1013 (comment) For this case, it is a little weird. Initially only rank 0 prints
Can you gdb into rank 1 and provide me the backtrace when it hangs? On the other hand, to make sure it is not due to an OS issue, please leave your program running for at least 10 seconds after calling abort and see whether all ranks abort successfully (see #992).
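A tiny sketch of the suggestion above (keep the process alive for a while after the aborts so slower ranks get a chance to finish); the abort calls themselves are left as placeholders:

```python
import time

# ... issue the per-group aborts here (e.g. the reporter's _abort() calls) ...

# Give every rank at least 10 seconds to finish aborting before the process
# exits and the OS tears down the CUDA context (see the note about #992).
time.sleep(10)
```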
It is for the case here #1013 (comment)
And for rank 3
Hmm, no idea about the gdb part.
@acphile I remember that exit() won't call abort for pg, based on my investigation (#1013 (comment)). Could you please verify that all ranks have called abort?
The exit() rank 2 would indeed enter the
If that's the case, what does
translate to? Since you are using the master thread to abort everything, we need all ranks to call abort like
Can your implementation guarantee that pg and pg1 are aborted in the same group for all ranks?
Does it meet your requirement? Or does it require that, for every rank, all the communicators from the different process groups be put in a single block, like
If so, @wconstab do you have an API to do that?
That means you only abort pg, which causes the problem.
We might need to add a 'global abort' API to ProcessGroup (it would be class-wide and do a start/end group around aborting all the comms for all the PGs). We should open an issue for this (I think @kwen2501 was opening one) so we can discuss API specifics.
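A purely hypothetical sketch of what such a class-wide 'global abort' could look like; none of the names below exist in torch.distributed, and the group start/end stubs stand in for ncclGroupStart/ncclGroupEnd, which (per the PR further down) gain the group-abort ability in NCCL 2.22:

```python
_ALL_PGS = []             # imagine every ProcessGroupNCCL registering itself here

def _nccl_group_start():  # placeholder for a binding of ncclGroupStart
    pass

def _nccl_group_end():    # placeholder for a binding of ncclGroupEnd
    pass

def global_abort():
    # Issue every abort inside one group so that no abort has to wait for
    # another group's stuck collective -- the hang pattern discussed above.
    _nccl_group_start()
    for pg in _ALL_PGS:
        pg._abort()       # the per-group abort used elsewhere in this thread
    _nccl_group_end()
```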
Hi, since we have all the commAbort-related discussion here, I'm putting one more question here for @KaimingOuyang regarding an out-of-order commAbort hang, with NCCL 2.17.1 / 2.18.3 / 2.19.4 (but fixed in NCCL 2.19.3; see more notes below). The hanging program looks like this:
After live debugging, we confirmed all ranks hang in ... We found that this case doesn't hang in NCCL 2.19.3 because it explicitly calls ... In the later releases 2.19.4 / 2.20.3, however, ... I was wondering, is there any reason why that was reverted?
Thanks, Min, for digging out the root cause!
Makes sense. Thanks for the explanation, Kaiming. So it looks like we cannot count 2.19.3 as a proper fix :-(
We recently had some cases where multiple ranks on the same host experienced 'cuda failure out of memory' and ncclCommAbort hangs forever on that host. Some logs from NCCL:
Imported from GitHub PR openxla/xla#13109

This introduces a flag for termination on NCCL async error. With the flag on, XLA will terminate the process on NCCL error. With the flag off, the existing behavior should remain unchanged. The patch is motivated by several problems:

- Without this patch, the heartbeat monitor only checks communicators that are not currently in use by the running executable (because it obtains the communicators with TryAcquire). Since NCCL errors cause a hang in the running communicator, most failing communicators are locked, so their async errors simply go undetected. As a result, XLA often hangs until the gRPC timeout even in cases where ncclCommGetAsyncError would report an error.
- Ideally we would recover by aborting the faulty communicators, but that seems to be unreliable (aborts can cause hangs if NCCL currently hangs on a different communicator than the one being aborted). The NCCL team is aware of this and working on a fix (NVIDIA/nccl#1013). At the moment, there does not seem to be a reliable fast recovery mechanism short of process termination.

We propose to expose a flag for terminating the process on failure so that there is some way to detect and recover from an NCCL failure. Once comm-abort works reliably, we will use it and propagate the error to the API user.

The patch is based on a PoC from [email protected] and [email protected].

Copybara import of the project:

-- 858aeacb2d689e4b03f4e3bcc0595223119143d5 by Jaroslav Sevcik <[email protected]>: Add flag for termination on nccl error

Merging this change closes #13109

PiperOrigin-RevId: 640085317
…oup" Thanks eqy for reminding me of this RFC: #119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such). One can find an example from: NVIDIA/nccl#1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in multi-comm case](NVIDIA/nccl#1013). `Group abort` semantic is added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Thanks eqy for reminding me of this RFC: #119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such). One can find an example from: NVIDIA/nccl#1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in multi-comm case](NVIDIA/nccl#1013). `Group abort` semantic is added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…oup" Thanks eqy for reminding me of this RFC: #119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such). One can find an example from: NVIDIA/nccl#1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in multi-comm case](NVIDIA/nccl#1013). `Group abort` semantic is added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Thanks eqy for reminding me of this RFC: #119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such). One can find an example from: NVIDIA/nccl#1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in multi-comm case](NVIDIA/nccl#1013). `Group abort` semantic is added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Thanks @eqy for reminding me of this RFC: #119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such). One can find an example from: NVIDIA/nccl#1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in multi-comm case](NVIDIA/nccl#1013). `Group abort` semantic is added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). Pull Request resolved: #132291 Approved by: https://github.com/eqy
Thanks @eqy for reminding me of this RFC: #119797

This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recover from them). One can find an example in NVIDIA/nccl#1013.

## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures, similar to `ncclCommDestroy` vs `ncclCommAbort`.

## What's new in `_abort_process_group`?
It adds support for the "group abort" semantic, which can abort multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](#119797) targeting [the hang issue in the multi-comm case](NVIDIA/nccl#1013). The `group abort` semantic is added in NCCL 2.22.

## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to each PG's individual watchdog lacking a "global view". A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).

Pull Request resolved: #132291
Approved by: https://github.com/eqy
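A hedged usage sketch of the API described in this PR. The name `_abort_process_group` comes from the PR text, but its exact import path, signature, and whether multiple calls are grouped together are assumptions here and may differ across PyTorch versions:

```python
import torch.distributed as dist
# Assumption: the helper lives in the private distributed_c10d module.
from torch.distributed.distributed_c10d import _abort_process_group

def bail_out(pg, pg1):
    # Goal per the PR: abort multiple PGs without the serialized ncclCommAbort
    # deadlock (requires NCCL >= 2.22 for the group-abort semantic). Whether
    # these two calls are grouped internally is an assumption; consult the PR
    # and the PyTorch docs for the real contract.
    _abort_process_group(pg)
    _abort_process_group(pg1)
```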
Hi, I found an issue where `ncclCommAbort` hangs when there are multiple ProcessGroups. Here is a simple example on 1 node with 4 ranks:
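The original example script is not preserved in this copy of the issue; below is a hedged reconstruction of the scenario from the discussion above (two process groups, one dead rank, collectives issued before the aborts), not the reporter's exact code:

```python
import os
import torch
import torch.distributed as dist

# 1 node, 4 ranks, launched with torchrun --nproc_per_node=4.
rank = int(os.environ["RANK"])
torch.cuda.set_device(rank)
dist.init_process_group("nccl")                            # default group -> "pg"
pg = dist.group.WORLD
pg1 = dist.new_group(list(range(dist.get_world_size())))   # second group -> "pg1"

x = torch.ones(1, device="cuda")

# Warm up both communicators while all ranks are still alive.
dist.all_reduce(x, group=pg)
dist.all_reduce(x, group=pg1)
torch.cuda.synchronize()

if rank == 0:
    os._exit(1)                                            # simulate a failed rank

dist.all_reduce(x, group=pg1)                              # issued first; can never complete
dist.broadcast(x, src=1, group=pg)                         # issued second, queued behind it

# The reporter's `_abort()` (which ends up in ncclCommAbort) is then called per
# group. Aborting pg first waits for the already-issued pg1 allreduce -> hang.
# Aborting pg1 first lets the allreduce exit, then the broadcast, then pg:
#   pg._abort(); pg1._abort()    # hangs
#   pg1._abort(); pg._abort()    # exits
```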
In this case, we find that the process gets stuck at the first `_abort()`. By gdb, we can see that it hangs in `ncclCommAbort`. However, if we change the order of the two `_abort()` calls (abort `pg1` first, then `pg`), the process can exit successfully. Even if both collective operations happen on `pg1`, when we first try to abort `pg`, the process gets stuck. So is there any bug related to `ncclCommAbort`?

NCCL version = 2.18.3