
Installation questions from beginners #1100

Open
MLS2021 opened this issue Dec 2, 2023 · 2 comments

Comments

MLS2021 commented Dec 2, 2023

I've only recently come across distributed training. I want to install NCCL in my server environment, but I don't have root access, and the NCCL website only has installation tutorials that require sudo. How should I proceed? I hope someone can help me.

sjeaugey (Member) commented Dec 4, 2023

You can just build with make as shown in the README. Then, instead of installing on the system, export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH and that will make your programs use the newly built NCCL.
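
For anyone following along, a minimal sketch of that non-root workflow (the clone location and the CUDA_HOME path are assumptions; adjust them to your environment):

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda                # builds into ./build, no root required
export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH     # make programs load this build at runtime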

MLS2021 (Author) commented Dec 7, 2023

> You can just build with make as shown in the README, then instead of installing on the system, export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH and that will make your programs use the newly built NCCL.

Thank you for your answer. I ran into connection timeouts after rebuilding NCCL. Here is the gdb output from the run:
[New Thread 0x7fff756f7640 (LWP 1525768)]
[New Thread 0x7fff40484640 (LWP 1525770)]
[New Thread 0x7fff3fc83640 (LWP 1525771)]
[Detaching after fork from child process 1525772]
[Detaching after fork from child process 1525773]
[Detaching after fork from child process 1525774]
Running DDP example on rank 0.
Running DDP example on rank 1.
initial finished
Rank 1/2 initialized
initial finished
Rank 0/2 initialized
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe54a61 (0x7fffd4e54a61 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe54a61 (0x7fffd4e54a61 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Traceback (most recent call last):
File "/home/l-z/MyCode/Unimatch/test.py", line 66, in
main()
File "/home/l-z/MyCode/Unimatch/test.py", line 63, in main
mp.spawn(demo_ddp, args=(world_size,), nprocs=world_size, join=True)
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
[Thread 0x7fff3fc83640 (LWP 1525771) exited]
[Thread 0x7fff40484640 (LWP 1525770) exited]
[Thread 0x7fff756f7640 (LWP 1525768) exited]
[Inferior 1 (process 1525718) exited with code 01]
(gdb)
While debugging the code, I found that it gets stuck here:
ddp_model = DDP(model, device_ids=[rank])
