NCCL hangs with NCCL_P2P_USE_CUDA_MEMCPY=1 in PyTorch #1509

Open
adofirst2018 opened this issue Nov 8, 2024 · 5 comments

adofirst2018 commented Nov 8, 2024

Hi,
NCCL version: v2.21.5

When I set NCCL_P2P_USE_CUDA_MEMCPY=1 and train a ResNet model using PyTorch with two GPUs on the same NUMA node, NCCL hangs and PyTorch crashes with a timeout.
PyTorch error:

```
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 7, last enqueued NCCL work: 7, last completed NCCL work: 6.

[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=42516042, NumelOut=42516042, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.

Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):

frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f4a66d98e89 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)

frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7f4a03482121 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4a034894e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7f4a0348a3ff in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)

frame #4: + 0xdc253 (0x7f4a668b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)

frame #5: + 0x94ac3 (0x7f4a727eaac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

frame #6: + 0x126850 (0x7f4a7287c850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

```
I tried to debug this and found the hang in p2pSendProxyProgress, with sub->transmitted=7 and sub->done=0. I think the problem is that the cudaMemcpyAsync has still not finished. Why would the cudaMemcpyAsync not finish? I tried to write a demo but could not reproduce the problem. @sjeaugey, can you give me some advice? Thanks. Attachment: mainCIFAR10.txt
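
For context, a minimal sketch of the kind of collective that times out in the ResNet run. This is illustrative only, not the attached mainCIFAR10.txt; the tensor size is taken from the watchdog log above, and as noted, a simplified demo may not reproduce the hang.

```python
# Illustrative sketch of the failing setup (single node, two GPUs on the same
# NUMA domain). Launch with e.g.:
#   NCCL_P2P_USE_CUDA_MEMCPY=1 torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Roughly the tensor size reported by the watchdog (NumelIn=42516042),
    # i.e. the flattened gradients of a ResNet-sized model.
    grad = torch.randn(42_516_042, device="cuda")

    for step in range(10):
        # The ResNet training times out in an all-reduce like this one.
        dist.all_reduce(grad)
        torch.cuda.synchronize()
        if rank == 0:
            print(f"step {step} ok")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```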

sjeaugey commented Nov 8, 2024

> @sjeaugey, can you give me some advice? Thanks.

Do not set NCCL_P2P_USE_CUDA_MEMCPY=1. More generally, do not set environment variables not listed in the system configuration section unless you really know what you're doing.

@adofirst2018 (Author)

> @sjeaugey, can you give me some advice? Thanks.
>
> Do not set NCCL_P2P_USE_CUDA_MEMCPY=1. More generally, do not set environment variables not listed in the system configuration section unless you really know what you're doing.

I want to use cudaMemcpy for higher performance. How should I debug this bug?


sjeaugey commented Nov 8, 2024

> I want to use cudaMemcpy for higher performance. How should I debug this bug?

Why do you think setting NCCL_P2P_USE_CUDA_MEMCPY=1 is going to increase performance?


adofirst2018 commented Nov 10, 2024

> I want to use cudaMemcpy for higher performance. How should I debug this bug?
>
> Why do you think setting NCCL_P2P_USE_CUDA_MEMCPY=1 is going to increase performance?

I tested NCCL on 8 GPUs (L20) with nccl-tests. If I set NCCL_P2P_USE_CUDA_MEMCPY=1, NCCL_SHM_USE_CUDA_MEMCPY=1, and NCCL_SHM_MEMCPY_MODE=1, I get higher bandwidth. Also, with those same settings, nccl-tests works fine, but PyTorch hangs in NCCL.
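
For clarity, a sketch of how these variables could be applied when launching the nccl-tests benchmark. The binary path and the size sweep below are assumptions, not taken from the original report.

```python
# Hypothetical launcher for the nccl-tests all_reduce_perf benchmark with the
# CUDA-memcpy related variables set. Adjust the binary path to your build.
import os
import subprocess

env = dict(
    os.environ,
    NCCL_P2P_USE_CUDA_MEMCPY="1",
    NCCL_SHM_USE_CUDA_MEMCPY="1",
    NCCL_SHM_MEMCPY_MODE="1",
)

# Sweep from 8 bytes to 256 MB, doubling each step, on 8 GPUs.
subprocess.run(
    ["./build/all_reduce_perf", "-b", "8", "-e", "256M", "-f", "2", "-g", "8"],
    env=env,
    check=True,
)
```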

@sjeaugey (Member)

In general, I would advise against setting NCCL_P2P_USE_CUDA_MEMCPY=1. It will increase the base latency very significantly, and can cause hangs (as you experienced). Now, if you find that it makes a significant performance difference (not just 10% .. more like 2x) and large sizes are what your application really relies on, then you can try to set it, together with NCCL_CREATE_THREAD_CONTEXT=1. Not 100% guaranteed to work though.
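
A minimal sketch of what that combination could look like on the PyTorch side, assuming the variables are set before the NCCL communicator is created. This is only a sketch of the suggestion above, not a verified fix.

```python
# Assumption: the variables must be in the environment before the NCCL
# communicator is created, so set them before init_process_group and the
# first collective.
import os
os.environ["NCCL_P2P_USE_CUDA_MEMCPY"] = "1"
os.environ["NCCL_CREATE_THREAD_CONTEXT"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

x = torch.ones(1 << 20, device="cuda")
dist.all_reduce(x)            # first collective creates the NCCL communicator
torch.cuda.synchronize()
dist.destroy_process_group()
```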
