Installation questions from beginners #1100
Comments
You can just build with make as shown in the README, then, instead of installing on the system,
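For anyone else without root: one possible way to use a locally built NCCL is to put its build output directory first on `LD_LIBRARY_PATH`. The sketch below assumes the build was done with `make -j src.build` as in the NCCL README, which by default places the libraries under `build/lib`. The checkout path `~/nccl` and the helper names `env_with_local_nccl` and `nccl_version` are hypothetical, chosen only for illustration. Whether a framework such as PyTorch actually picks up this copy depends on how that framework was built and linked.

```python
import ctypes
import os
from pathlib import Path


def env_with_local_nccl(nccl_build_dir, base_env=None):
    """Return a copy of the environment with <nccl_build_dir>/lib first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = str(Path(nccl_build_dir).expanduser() / "lib")
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def nccl_version(lib_path):
    """Load libnccl from an explicit path and return the integer reported by ncclGetVersion."""
    lib = ctypes.CDLL(str(lib_path))
    version = ctypes.c_int()
    result = lib.ncclGetVersion(ctypes.byref(version))
    if result != 0:  # ncclSuccess is 0
        raise RuntimeError(f"ncclGetVersion failed with code {result}")
    return version.value


if __name__ == "__main__":
    build_dir = Path("~/nccl/build").expanduser()  # hypothetical checkout location
    env = env_with_local_nccl(build_dir)
    print(env["LD_LIBRARY_PATH"])
    shared_lib = build_dir / "lib" / "libnccl.so.2"
    if shared_lib.exists():
        # Loading also requires the CUDA runtime libraries to be findable.
        print(nccl_version(shared_lib))
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a training script, so nothing has to be installed system-wide.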
Thank you for your answer. I ran into connection timeouts after rebuilding NCCL. Here is the output from the gdb run:

terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600679 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from ncclCommWatchdog at /home/l-z/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
Traceback (most recent call last):
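The watchdog message above only says that the first ALLGATHER did not finish within the 600000 ms limit; it does not say why. A minimal sketch for collecting more detail, assuming the job is launched so that `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` are already set (for example by `torchrun`): enable NCCL's own logging with `NCCL_DEBUG=INFO` and optionally pin the network interface with `NCCL_SOCKET_IFNAME`. The helper names and the interface name `eth0` are hypothetical. A shorter timeout is used here only so a hang surfaces sooner while debugging.

```python
import os
from datetime import timedelta


def nccl_debug_env(base_env=None, iface=None):
    """Return a copy of the environment with NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # NCCL prints its init and transport choices
    if iface:
        env["NCCL_SOCKET_IFNAME"] = iface  # restrict NCCL to this interface
    return env


def init_with_timeout(minutes=5):
    """Initialise the NCCL process group with an explicit collective timeout."""
    import torch.distributed as dist  # imported lazily; needs PyTorch installed

    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))


if __name__ == "__main__":
    # Apply the settings to this process before any NCCL communicator is created.
    os.environ.update(nccl_debug_env(iface=None))
    print(os.environ["NCCL_DEBUG"])
```

Calling `init_with_timeout()` in each rank after the environment update should make NCCL print which interfaces and transports it selects, which is usually the first thing to compare across ranks when a collective hangs.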
I've only recently started working with distributed training. I want to install NCCL in my server environment, but I don't have root access, and the NCCL website only has installation tutorials that use sudo. How should I proceed? I hope someone can help me.