Error Using Different GPUs for Two Containers on the Same Node #1529

Open
cyberpunk-admin opened this issue Dec 2, 2024 · 6 comments

@cyberpunk-admin

1. Description

I attempted LoRA fine-tuning using llama-factory and employed kubedl for deployment. The setup consisted of two pods on the same node, with each pod allocated four GPUs.

Some additional information (a quick device-visibility check is sketched after this list):

- host network is used
- NCCL version 2.21.5+cuda12.4
- the same two containers work fine when placed on different nodes
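For reference, a minimal diagnostic of what each container can actually see before training. This is a hedged sketch: it assumes PyTorch is available in the container image, and NVIDIA_VISIBLE_DEVICES is only present when the NVIDIA container runtime injects it for the pod.

```python
# Print the GPUs this container is allowed to use. With two pods sharing one node,
# each pod should report a disjoint set of four devices.
import os
import torch

print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # pci_bus_id is only exposed by newer PyTorch builds, hence the fallback.
    print(i, props.name, "pci_bus_id:", getattr(props, "pci_bus_id", "n/a"))
```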

1.1 Log

Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78346.22 examples/s]
[rank0]:[W1202 20:46:22.851750131 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6895:6895 NCCL CALL ncclGetUniqueId(0xc928078d70c6fe2c)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ucpe-resource033018034100:6897:6897 [2] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fc24a600000
ucpe-resource033018034100:6897:6897 [2] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:6896 [1] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:6898 [3] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7ff906600000
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7f2282600000
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fbc02600000
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using network Socket
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using network Socket
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO ncclCommInitRank comm 0xf5c0b50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId d2000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6895:7101 [0] NCCL INFO ncclCommInitRank comm 0x101b6e80 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 89000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6896:7103 [1] NCCL INFO ncclCommInitRank comm 0xf1486b0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8a000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6897:7102 [2] NCCL INFO ncclCommInitRank comm 0xdbe5cd0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId d1000 commId 0xc928078d70c6fe2c - Init START

ucpe-resource033018034100:6896:7103 [1] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6896:6896 [1] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1929 -> 2

ucpe-resource033018034100:6895:7101 [0] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6895:6895 [0] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1929 -> 2

ucpe-resource033018034100:6898:7104 [3] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO group.cc:64 -> 2 [Async thread]

ucpe-resource033018034100:6897:7102 [2] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6897:6897 [2] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1929 -> 2
[rank1]: Traceback (most recent call last):
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank1]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]: return next(self.gen)
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank0]: Traceback (most recent call last):
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank0]: with training_args.main_process_first(desc="load dataset"):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 144, in __exit__
[rank0]: next(self.gen)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2469, in main_process_first
[rank0]: dist.barrier()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank3]: Traceback (most recent call last):
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank3]: launch()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank3]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank3]: with training_args.main_process_first(desc="load dataset"):
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank3]: return next(self.gen)
[rank3]: ^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank3]: dist.barrier()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank3]: work = group.barrier(opts=opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
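
As an aside, the ProcessGroupNCCL warning at the top of the log is about the rank-to-GPU mapping rather than the crash itself. A minimal sketch of the remedy it suggests, assuming a torchrun-style launcher that exports LOCAL_RANK:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# Binding the process group to a device up front removes the ambiguity the
# warning complains about (device_id requires a reasonably recent PyTorch).
dist.init_process_group(backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))

# Alternatively, an individual barrier can name its device explicitly:
dist.barrier(device_ids=[local_rank])
```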

@cyberpunk-admin
Author

cyberpunk-admin commented Dec 9, 2024

When using torch==2.1.0+cu121 with NCCL version 2.18.1+cuda12.1 on NVIDIA A800 GPUs, distributed training across two containers on the same node works well.

Each container has 2 GPUs. As the logs show, GPU 3 and GPU 0 connect via NET/Socket/0, GPU 0 and GPU 1 connect via P2P/IPC/read, and GPU 3 and GPU 2 connect via P2P/IPC/read. This forms a ring topology among the 4 GPUs.

Logs

NCCL version 2.18.1+cuda12.1
resource033052047223:3343:3343 [1] NCCL INFO cudaDriverVersion 12040
resource033052047223:3343:3343 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
resource033052047223:3343:3343 [1] NCCL INFO Bootstrap : Using eth0:33.52.47.223<0>
resource033052047223:3343:3343 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
resource033052047223:3343:3343 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
resource033052047223:3342:3547 [0] NCCL INFO Failed to open libibverbs.so[.1]
resource033052047223:3342:3547 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
resource033052047223:3342:3547 [0] NCCL INFO NET/Socket : Using [0]eth0:33.52.47.223<0>
resource033052047223:3342:3547 [0] NCCL INFO Using network Socket
resource033052047223:3343:3548 [1] NCCL INFO Failed to open libibverbs.so[.1]
resource033052047223:3343:3548 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
resource033052047223:3343:3548 [1] NCCL INFO NET/Socket : Using [0]eth0:33.52.47.223<0>
resource033052047223:3343:3548 [1] NCCL INFO Using network Socket

resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3343:3548 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff,00000000
resource033052047223:3342:3547 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
resource033052047223:3343:3548 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
resource033052047223:3343:3548 [1] NCCL INFO P2P Chunksize set to 131072
resource033052047223:3342:3547 [0] NCCL INFO Channel 00/02 : 0 1 2 3
resource033052047223:3342:3547 [0] NCCL INFO Channel 01/02 : 0 1 2 3
resource033052047223:3342:3547 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
resource033052047223:3342:3547 [0] NCCL INFO P2P Chunksize set to 131072
resource033052047223:3342:3547 [0] NCCL INFO Channel 00/0 : 3[c2000] -> 0[4b000] [receive] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 01/0 : 3[c2000] -> 0[4b000] [receive] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 00/0 : 0[4b000] -> 1[d6000] via P2P/IPC/read
resource033052047223:3343:3548 [1] NCCL INFO Channel 00/0 : 1[d6000] -> 2[63000] [send] via NET/Socket/0
resource033052047223:3343:3548 [1] NCCL INFO Channel 01/0 : 1[d6000] -> 2[63000] [send] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 01/0 : 0[4b000] -> 1[d6000] via P2P/IPC/read
resource033052047223:3342:3547 [0] NCCL INFO Connected all rings
resource033052047223:3343:3548 [1] NCCL INFO Connected all rings
resource033052047223:3343:3548 [1] NCCL INFO Channel 00/0 : 1[d6000] -> 0[4b000] via P2P/IPC/read
resource033052047223:3343:3548 [1] NCCL INFO Channel 01/0 : 1[d6000] -> 0[4b000] via P2P/IPC/read
resource033052047223:3342:3547 [0] NCCL INFO Channel 00/0 : 2[63000] -> 0[4b000] [receive] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 01/0 : 2[63000] -> 0[4b000] [receive] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 00/0 : 0[4b000] -> 2[63000] [send] via NET/Socket/0
resource033052047223:3342:3547 [0] NCCL INFO Channel 01/0 : 0[4b000] -> 2[63000] [send] via NET/Socket/0
resource033052047223:3343:3548 [1] NCCL INFO Connected all trees
resource033052047223:3343:3548 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
resource033052047223:3343:3548 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
resource033052047223:3342:3547 [0] NCCL INFO Connected all trees
resource033052047223:3342:3547 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
resource033052047223:3342:3547 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
resource033052047223:3343:3548 [1] NCCL INFO comm 0xe02e3b0 rank 1 nranks 4 cudaDev 1 busId d6000 commId 0xb9b810dc9b9a2f09 - Init COMPLETE
resource033052047223:3342:3547 [0] NCCL INFO comm 0x10542a10 rank 0 nranks 4 cudaDev 0 busId 4b000 commId 0xb9b810dc9b9a2f09 - Init COMPLETE

@sjeaugey
Member

sjeaugey commented Dec 9, 2024

The errors you're getting seem to relate to the container not being configured correctly and not reporting information NCCL needs, like the PCI topology. Is /sys mounted inside your container?
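
A hedged sketch of how one might verify this from inside the container; the paths are illustrative examples of sysfs entries used for topology and NIC detection, not an exhaustive list of what NCCL reads:

```python
# Check whether the sysfs entries needed for topology detection are visible.
import glob
import os

paths = [
    "/sys/class/net/eth0/speed",   # the NIC speed NCCL warned about in the log
    "/sys/bus/pci/devices",        # PCI device tree used to build the topology
    "/sys/devices/system/node",    # NUMA node layout
]
for p in paths:
    print(p, "->", "present" if os.path.exists(p) else "MISSING")

# If the PCI tree is mounted, list NVIDIA devices (vendor ID 0x10de) by bus ID.
for vendor_file in glob.glob("/sys/bus/pci/devices/*/vendor"):
    with open(vendor_file) as f:
        if f.read().strip() == "0x10de":
            print("NVIDIA PCI device:", os.path.basename(os.path.dirname(vendor_file)))
```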

@cyberpunk-admin
Author

Thank you for your help. My containers do not have the /sys directory mounted. I will try mounting /sys and then proceed with testing. Could you tell me where in /sys I can find the PCI topology information? I noticed that the contents of the /dev/devices/ directory are the same in both containers, even without the /sys directory mounted.

@cyberpunk-admin
Author

I tried mounting the /sys directory and ran the tests again, but the problem is still there.

@sjeaugey
Member

You may need to mount other locations. I was not sure what caused this:

 [2] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

But this:

[0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.

was clearly coming from /sys not being mounted. There could be other places that are missing as well.

In any case, with your current configuration NVML is broken, so NCCL will not be able to operate. NCCL 2.18 had a fallback that relied on CUDA alone when NVML was broken, but that fallback was not robust, so we removed it on the grounds that NVML should not be broken in the first place; a broken NVML could also have other negative consequences on performance.
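
One way to confirm the NVML diagnosis from inside each container is to repeat the lookup NCCL performs, for example with the nvidia-ml-py (pynvml) bindings. This is a hedged sketch: if the container's own GPUs resolve fine, the "Not Found" is presumably triggered by bus IDs belonging to the other container's GPUs, which this container's NVML cannot see.

```python
# For every GPU NVML exposes in this container, try to resolve it again by its
# PCI bus ID, mirroring NCCL's nvmlDeviceGetHandleByPciBusId() call.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId  # bytes on most versions
        label = bus_id.decode() if isinstance(bus_id, bytes) else bus_id
        try:
            pynvml.nvmlDeviceGetHandleByPciBusId(bus_id)
            print(f"GPU {i} ({label}): lookup by PCI bus ID OK")
        except pynvml.NVMLError as err:
            # This matches the "Not Found" warning NCCL prints.
            print(f"GPU {i} ({label}): lookup by PCI bus ID failed: {err}")
finally:
    pynvml.nvmlShutdown()
```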

@cyberpunk-admin
Author

Perhaps there is an environment variable we could set to force network connections between the isolated containers, even when there is NVLink between the GPUs?
