
Installation questions from beginners #1100

Open
MLS2021 opened this issue Dec 2, 2023 · 2 comments

Comments

MLS2021 commented Dec 2, 2023

I've only recently come across distributed training. I want to install NCCL in my server environment, but I don't have root access, and the NCCL website only has installation tutorials that require sudo. How should I proceed? I hope someone can help me.

sjeaugey (Member) commented Dec 4, 2023

You can just build with make as shown in the README. Then, instead of installing on the system, export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH and that will make your programs use the newly built NCCL.
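
For anyone following along, a minimal sketch of that non-root workflow (the clone location and the CUDA_HOME path are assumptions; adjust them to your environment):

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda                # builds into ./build, no root required
export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH     # make programs load this build at runtime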

MLS2021 (Author) commented Dec 7, 2023

> You can just build with make as shown in the README, then instead of installing on the system, export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH and that will make your programs use the newly built NCCL.

Thank you for your answer. I ran into connection timeouts after rebuilding NCCL. Here is the gdb output from the run:
[New Thread 0x7fff756f7640 (LWP 1525768)]
[New Thread 0x7fff40484640 (LWP 1525770)]
[New Thread 0x7fff3fc83640 (LWP 1525771)]
[Detaching after fork from child process 1525772]
[Detaching after fork from child process 1525773]
[Detaching after fork from child process 1525774]
Running DDP example on rank 0.
Running DDP example on rank 1.
initial finished
Rank 1/2 initialized
initial finished
Rank 0/2 initialized
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600674 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe54a61 (0x7fffd4e54a61 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
Exception raised from checkTimeout at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x231 (0x7fffd50ec8f1 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x1bd (0x7fffd50f046d in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x130 (0x7fffd50f1030 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ffff539f1ac in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe54a61 (0x7fffd4e54a61 in /home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fffeeaf0e95 in /home/l-z/miniconda3/envs/ENV02/bin/../lib/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7ffff7c94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126a40 (0x7ffff7d26a40 in /lib/x86_64-linux-gnu/libc.so.6)

Traceback (most recent call last):
File "/home/l-z/MyCode/Unimatch/test.py", line 66, in
main()
File "/home/l-z/MyCode/Unimatch/test.py", line 63, in main
mp.spawn(demo_ddp, args=(world_size,), nprocs=world_size, join=True)
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/l-z/miniconda3/envs/ENV02/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
[Thread 0x7fff3fc83640 (LWP 1525771) exited]
[Thread 0x7fff40484640 (LWP 1525770) exited]
[Thread 0x7fff756f7640 (LWP 1525768) exited]
[Inferior 1 (process 1525718) exited with code 01]
(gdb)
While debugging the code, I found that it gets stuck here:
ddp_model = DDP(model, device_ids=[rank])
