NCCL hung with NCCL_P2P_USE_CUDA_MEMCPY=1 in PyTorch #1509
Comments
Do not set NCCL_P2P_USE_CUDA_MEMCPY=1.
I want to use cudaMemcpy for higher performance. How should I debug this issue?
Why do you think setting NCCL_P2P_USE_CUDA_MEMCPY=1 gives higher performance?
I tested NCCL on 8 GPUs (L20) with nccl-tests. Setting NCCL_P2P_USE_CUDA_MEMCPY=1, NCCL_SHM_USE_CUDA_MEMCPY=1, and NCCL_SHM_MEMCPY_MODE=1 gives higher bandwidth. With those same settings, nccl-tests works, but PyTorch hangs inside NCCL.
In general, I would advise against setting NCCL_P2P_USE_CUDA_MEMCPY=1. It will increase the base latency very significantly and can cause hangs (as you experienced). Now, if you find that it makes a significant performance difference (not just 10%, more like 2x) and large sizes are what your application really relies on, then you can try to set it together with NCCL_CREATE_THREAD_CONTEXT=1. It is not 100% guaranteed to work, though.
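If you do experiment with this, a minimal sketch of how the variables could be applied from a PyTorch script is shown below. Only the variable names come from this thread; the torchrun-style env:// process-group setup and the warm-up all-reduce are illustrative assumptions, not part of the original report.

```python
# Minimal sketch (assumption: launched with torchrun, env:// rendezvous).
# NCCL reads these variables when the communicator is created, so they must
# be set before the first collective runs.
import os

os.environ["NCCL_P2P_USE_CUDA_MEMCPY"] = "1"    # copy-engine P2P path discussed above
os.environ["NCCL_CREATE_THREAD_CONTEXT"] = "1"  # suggested companion setting

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# One small all-reduce to force communicator creation with the settings above.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value={t.item()}")

dist.destroy_process_group()
```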
Hi,
NCCL version: v2.21.5
When I set NCCL_P2P_USE_CUDA_MEMCPY=1 and train a ResNet model with PyTorch on two GPUs in the same NUMA node, NCCL hangs and PyTorch crashes with a timeout.
PyTorch error:
```
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 7, last enqueued NCCL work: 7, last completed NCCL work: 6.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f4a66d98e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7f4a03482121 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a034894e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7f4a0348a3ff in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f4a668b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f4a727eaac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f4a7287c850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
```
I tried to debug this and found the hang in p2pSendProxyProgress, with sub->transmitted=7 and sub->done=0. I think the problem is that the cudaMemcpyAsync has not finished. Why would cudaMemcpyAsync not finish? I tried to write a standalone demo, but it does not reproduce the problem. @sjeaugey, could you give me some advice? Thanks.
mainCIFAR10.txt
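For reference, below is a minimal sketch of the kind of setup described above (DDP ResNet training on two GPUs, where the gradient all-reduce is the ALLREDUCE that times out in the watchdog log). The model choice, batch size, and torchrun launch are illustrative assumptions; this is not the attached mainCIFAR10.txt script.

```python
# Minimal sketch of the reported setup (assumptions: launched with
# `torchrun --nproc_per_node=2 repro.py` and NCCL_P2P_USE_CUDA_MEMCPY=1
# exported in the environment).
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    # ResNet wrapped in DDP; backward() triggers the NCCL all-reduce of
    # gradients, which is where the timeout above is reported.
    model = DDP(torchvision.models.resnet50().cuda(device), device_ids=[device])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):
        x = torch.randn(32, 3, 224, 224, device=device)
        y = torch.randint(0, 1000, (32,), device=device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()  # NCCL all-reduce of gradients
        opt.step()
        if rank == 0:
            print(f"step {step} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```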