NCCL Hang with CUDA_LAUNCH_BLOCKING=1 #750
Comments
cc @kwen2501
Confirmed working with 2.10 + CUDA_LAUNCH_BLOCKING. Also confirmed 2.13 + CUDA_LAUNCH_BLOCKING + NCCL_BLOCKING_WAIT doesn't work.
I just noticed this issue. Thanks @xw285cornell for reporting it. @sjeaugey This may seem like a corner case, but: ...
If ..., you can ignore.
We're looking into the regression, but in general I'm not sure what we can guarantee with respect to CUDA_LAUNCH_BLOCKING. In particular, I don't see how it could work in the case of one process managing multiple GPUs and having to launch a kernel on each GPU: the first launch would block and we'd never be able to launch the second one. Basically, given that our kernels block waiting for other GPUs, as soon as some call related to another GPU blocks, we have a deadlock.
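To illustrate the failure mode just described, here is a minimal CUDA sketch (hypothetical, not NCCL code): one host thread launches a kernel on each of two GPUs, and each kernel spins until its peer arrives. It assumes a Linux machine with two GPUs that support concurrent managed access. Without CUDA_LAUNCH_BLOCKING it completes; with CUDA_LAUNCH_BLOCKING=1 the first launch never returns, so the second kernel is never launched and the process deadlocks.

// two_gpu_handshake.cu -- hypothetical sketch, not NCCL code.
// Build: nvcc -o two_gpu_handshake two_gpu_handshake.cu
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU sets its own "arrived" flag, then spins until the peer's flag is set,
// mimicking NCCL kernels that block waiting for their peers.
__global__ void handshake(volatile int* self, volatile int* peer) {
  *self = 1;
  while (*peer == 0) { /* spin */ }
}

int main() {
  volatile int* flags;
  cudaMallocManaged((void**)&flags, 2 * sizeof(int));  // visible to both GPUs
  flags[0] = flags[1] = 0;

  for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    // Launches are normally asynchronous, so both kernels get queued and the
    // handshake completes. With CUDA_LAUNCH_BLOCKING=1 the dev==0 launch blocks
    // here until its kernel finishes -- which never happens, because the dev==1
    // kernel it is waiting for has not been launched yet: deadlock.
    handshake<<<1, 1>>>(&flags[dev], &flags[1 - dev]);
  }

  for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
  }
  printf("handshake completed\n");
  cudaFree((void*)flags);
  return 0;
}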
We (PyTorch) are dropping one-thread multi-GPU support in c10d, so no need to support that case.
@sjeaugey yeah, if you can guarantee CUDA_LAUNCH_BLOCKING works with a single GPU per process, that'll provide huge value to us already.
In 2.13 we moved away from cudaLaunchCooperativeKernelMultiDevice, as it is being deprecated, and have moved to a loop around cudaLaunchKernel. This could explain the hang if there are multiple devices per process. It isn't clear from the initial report: are we dealing with multiple devices per process?
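For context, a rough sketch of the two launch strategies being contrasted; the kernel, array sizes, and setup are made up, and this is not the actual NCCL launch code. The older path submits all per-device kernels through a single cudaLaunchCooperativeKernelMultiDevice call, while the newer path loops over devices calling cudaLaunchKernel, which is where CUDA_LAUNCH_BLOCKING=1 can block on the first device before the others are launched.

// launch_paths.cu -- hypothetical sketch of the two launch strategies.
#include <cuda_runtime.h>

__global__ void ncclLikeKernel(int rank) { /* placeholder for an NCCL device kernel */ }

// Pre-2.13 style: one call launches the kernel on every device together.
// (Being deprecated, hence the move away from it.)
void launchOldStyle(int nDev, cudaStream_t* streams, int* ranks) {
  cudaLaunchParams params[8];   // assumes nDev <= 8 for this sketch
  void* args[8][1];
  for (int d = 0; d < nDev; ++d) {
    args[d][0] = &ranks[d];
    params[d].func = (void*)ncclLikeKernel;
    params[d].gridDim = dim3(1);
    params[d].blockDim = dim3(32);
    params[d].args = args[d];
    params[d].sharedMem = 0;
    params[d].stream = streams[d];   // must be non-default streams
  }
  cudaLaunchCooperativeKernelMultiDevice(params, nDev, 0);
}

// 2.13+ style: an ordinary launch per device. With CUDA_LAUNCH_BLOCKING=1 the
// first iteration blocks until its kernel completes, so kernels that wait on
// their peers never see those peers launched.
void launchNewStyle(int nDev, cudaStream_t* streams, int* ranks) {
  for (int d = 0; d < nDev; ++d) {
    cudaSetDevice(d);
    void* args[] = { &ranks[d] };
    cudaLaunchKernel((void*)ncclLikeKernel, dim3(1), dim3(32), args, 0, streams[d]);
  }
}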
Can you verify whether the hang goes away on a single node, still multi-GPU but, most importantly, with no network? I believe the hang is due to us not pushing work to our proxy thread (which manages NIC traffic) until after cudaLaunchKernel returns, which creates a circular dependency between GPU and network progress.
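A tiny compilable sketch of that ordering; Plan, launchCollective, and enqueueProxyWork are made-up stand-ins for illustration, not real NCCL symbols.

// proxy_ordering.cu -- hypothetical sketch of the ordering described above.
#include <cuda_runtime.h>

struct Plan {                    // stand-in for NCCL's per-collective launch plan
  void*        kernelFn;
  dim3         grid, block;
  void**       args;
  cudaStream_t stream;
};

void enqueueProxyWork(Plan*) { /* stand-in: hand the work to the proxy (NIC) thread */ }

void launchCollective(Plan* plan) {
  // With CUDA_LAUNCH_BLOCKING=1 this call does not return until the kernel
  // completes -- but the kernel needs network progress to complete.
  cudaLaunchKernel(plan->kernelFn, plan->grid, plan->block, plan->args, 0, plan->stream);

  // The proxy thread only receives the work here, after the launch returns:
  // the GPU waits on the network and the network waits on the GPU -- a hang.
  enqueueProxyWork(plan);
}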
@jbachan how do we disable the network? NCCL_SOCKET_IFNAME=lo?
Just running a multi-process, multi-GPU job on a single node should not invoke the network proxies.
Thanks @jbachan @AddyLaddy! We have done the experiments and confirmed what you said:
- single host, no env variables: passes
- two hosts: stuck
- NCCL_P2P_DISABLE=1 + one host: stuck
@jbachan @AddyLaddy @sjeaugey any update on this?
Whoa! I was just directed here after trying to get help for a crashing setup that I was trying to debug with CUDA_LAUNCH_BLOCKING=1. So how do we debug missing tracebacks now? That was the whole point of CUDA_LAUNCH_BLOCKING=1. Thank you!
The easiest thing to do for now might be to revert to 2.10 or earlier (not sure about 2.11-2.12) until this issue is fixed.
Thank you @xw285cornell, that clearly implicates the kernel launch happening before the proxy thread launch as the culprit. I plan to make a fix available in early January.
@xw285cornell please try PR #774.
Thanks @jbachan! Let me try it out!
@xw285cornell were you able to confirm that PR #774 helps? Thanks.
@spotluri tried and confirmed it works!
This PR updates the NCCL submodule to 2.17.1. Closes NVIDIA/nccl#750. Pull Request resolved: #97407. Approved by: https://github.com/ngimel, https://github.com/ptrblck, https://github.com/malfet
Closing as fixed.
It seems to be a regression in v2.13 (2.10 should work): when I run a simple MLP model (with all-reduce), it hangs when CUDA_LAUNCH_BLOCKING is present. The hang happens on all ranks, with the GPU spinning at 100% on something. The backtrace from gdb is below. It looks like ncclStrongStreamLaunchKernel is new in v2.13, so I'm wondering if this is a new regression.
#0 0x00007fec9b630c50 in ?? () from lib/libcuda.so.1
#1 0x00007fec9b49dcf7 in ?? () from lib/libcuda.so.1
#2 0x00007fec9b3f8b9f in ?? () from lib/libcuda.so.1
#3 0x00007fec9b3fa1b8 in ?? () from lib/libcuda.so.1
#4 0x00007fec9b3e301f in ?? () from /lib/libcuda.so.1
#5 0x00007fec9b636830 in ?? () from lib/libcuda.so.1
#6 0x00007fec9b370bd6 in ?? () from lib/libcuda.so.1
#7 0x00007fec9b372bc2 in ?? () from libcuda.so.1
#8 0x00007fec9b40d4d5 in ?? () from lib/libcuda.so.1
#9 0x00007fec9ae1405c in ?? () from /lib/libcudart.so.11.4.108
#10 0x00007fec9ae67f06 in cudaLaunchKernel () from /lib/libcudart.so.11.4.108
#11 0x00007fecdf7d6cf9 in ncclStrongStreamLaunchKernel(ncclCudaGraph, ncclStrongStream*, void*, dim3, dim3, void**, unsigned long) ()
#12 0x00007fecdf7a56c8 in ncclLaunchKernel(ncclComm*, ncclKernelPlan*) ()
#13 0x00007fecdf7c494d in ncclGroupEndInternal() ()
#14 0x00007fecdf7c5301 in ncclGroupEnd () from
#15 0x00007fecdf794f4c in torch::cuda::nccl::AutoNcclGroup::~AutoNcclGroup() ()
#16 0x00007fecdf725655 in c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >&, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllgatherOptions const&) ()
#17 0x00007feccf159640 in c10d::(anonymous namespace)::allgather_(std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, long) ()
#18 0x00007feccf163e8c in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > ()(std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, long> >, std::tuple<std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > (std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, long)>::call(c10::OperatorKernel, c10::DispatchKeySet, std::vector<std::vector<at::Tensor, std::allocatorat::Tensor >, std::allocator<std::vector<at::Tensor, std::allocatorat::Tensor > > > const&, std::vector<at::Tensor, std::allocatorat::Tensor > const&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, long) ()
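For reference, here is a minimal NCCL-level sketch of the kind of workload described above (one process per GPU doing an all-reduce), using MPI for bootstrapping; it is not the reporter's actual PyTorch model, and the launch command, buffer size, and GPU mapping are assumptions. Per the discussion in this thread, running it with CUDA_LAUNCH_BLOCKING=1 on an affected NCCL version should hit the launch path in the backtrace above once the proxy thread is involved (multiple hosts, or NCCL_P2P_DISABLE=1 on one host), while a plain single-node run may pass.

// allreduce_repro.cu -- hypothetical sketch of a one-GPU-per-process all-reduce.
// Run with e.g.: CUDA_LAUNCH_BLOCKING=1 mpirun -np <nGPUs> ./allreduce_repro
#include <cstdio>
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // One GPU per process, as in the report. Crude mapping; assumes <= 8 GPUs per node.
  cudaSetDevice(rank % 8);

  // Rank 0 creates the NCCL unique id and shares it with every rank.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  const size_t count = 1 << 20;   // arbitrary buffer size for the sketch
  float *sendbuf, *recvbuf;
  cudaMalloc((void**)&sendbuf, count * sizeof(float));
  cudaMalloc((void**)&recvbuf, count * sizeof(float));
  cudaMemset(sendbuf, 0, count * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // With NCCL 2.13 + CUDA_LAUNCH_BLOCKING=1, this is where the reported hang
  // shows up once the proxy thread (network or SHM) has to be involved.
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  if (rank == 0) printf("all-reduce completed\n");

  cudaFree(sendbuf);
  cudaFree(recvbuf);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}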