Error Using Different GPUs for Two Containers on the Same Node #1529
When using torch==2.1.0+cu121 with NCCL version 2.18.1+cuda12.1 on NVIDIA A800 GPUs, distributed training across two containers located on the same node works well. Each container has 2 GPUs. As the logs show, GPU 3 and GPU 0 connect via NET/Socket/0, GPU 0 and GPU 1 connect via P2P/IPC/read, and GPU 3 and GPU 2 connect via P2P/IPC/read, forming a ring topology among the 4 GPUs.
Logs:
NCCL version 2.18.1+cuda12.1
resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
The errors you're getting seem to relate to the container not being configured correctly and not reporting information NCCL needs, like the PCI topology. Is /sys mounted inside your container?
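For instance, a minimal sketch of launching the container with the host's sysfs visible, assuming a plain Docker launch with the NVIDIA runtime (image name and command are placeholders, not from this issue):

```shell
# Sketch: expose the host's sysfs inside the container so NCCL/NVML can read
# the PCI topology. Image and training command are hypothetical placeholders.
docker run --gpus all \
  --volume /sys:/sys:ro \
  my-training-image:latest \
  python train.py
```

With Kubernetes-based deployments such as kubedl, the rough equivalent would be a hostPath volume mounting /sys into the pod.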
Thank you for your help. My containers do not have the /sys directory mounted. I will try mounting /sys and then proceed with testing. Could you tell me where in /sys I can find the PCI topology information? I noticed that the contents of the /dev/devices/ directory are the same in both containers, even without the /sys directory mounted.
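For reference, the PCI topology lives under standard Linux sysfs paths such as /sys/bus/pci/devices (one entry per PCI function, named like 0000:89:00.0). A small defensive sketch to check what a container can see (the function name is mine, not from NCCL):

```python
from pathlib import Path

# Sketch: list the PCI functions visible in sysfs. If this returns an empty
# list inside the container, topology detection (and NVML lookups by PCI
# bus ID) cannot work because /sys is not mounted or carries no PCI view.
def pci_functions(sysfs_root: str = "/sys") -> list[str]:
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not devices.is_dir():
        return []  # sysfs PCI view absent in this environment
    return sorted(p.name for p in devices.iterdir())  # e.g. "0000:89:00.0"

print(pci_functions())
```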
I tried mounting the /sys directory and ran the tests again, but the problem is still there.
You may need to mount other locations. I was not sure what caused this:
But this:
was clearly coming from /sys not being mounted. There could be other places which are missing. In any case, with your current configuration NVML is broken, so NCCL will not be able to operate. In NCCL 2.18 we had a fallback to rely on CUDA only when NVML was broken, but that fallback was not robust, so we removed it, on the grounds that NVML should not be broken, as a broken NVML could also have other negative consequences on performance.
Perhaps we can set a certain environment variable to force network connections between isolated containers, even when there is NVLink between the GPUs?
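A hedged sketch of the kind of settings meant here. These are standard NCCL environment variables that steer transport selection away from P2P and shared memory; note that, per the comment above, they would not repair the broken NVML setup itself:

```shell
# Sketch: force NCCL away from intra-node fast paths so ranks in isolated
# containers fall back to the network transport. Does not fix broken NVML.
export NCCL_P2P_DISABLE=1   # disable CUDA P2P (NVLink/PCIe) between GPUs
export NCCL_SHM_DISABLE=1   # disable the shared-memory transport
export NCCL_DEBUG=INFO      # log which transport each GPU pair actually uses
```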
1. Description
I attempted LoRA fine-tuning using llama-factory and employed kubedl for deployment. The setup consisted of two pods on the same node, with each pod allocated four GPUs.
Some info:
- used host network
- NCCL version 2.21.5+cuda12.4
- two containers on different nodes work fine
1.1 Log
Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78346.22 examples/s]
[rank0]:[W1202 20:46:22.851750131 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6895:6895 NCCL CALL ncclGetUniqueId(0xc928078d70c6fe2c)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ucpe-resource033018034100:6897:6897 [2] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fc24a600000
ucpe-resource033018034100:6897:6897 [2] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:6896 [1] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:6898 [3] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7ff906600000
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7f2282600000
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fbc02600000
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using network Socket
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using network Socket
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO ncclCommInitRank comm 0xf5c0b50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId d2000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6895:7101 [0] NCCL INFO ncclCommInitRank comm 0x101b6e80 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 89000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6896:7103 [1] NCCL INFO ncclCommInitRank comm 0xf1486b0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8a000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6897:7102 [2] NCCL INFO ncclCommInitRank comm 0xdbe5cd0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId d1000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6896:7103 [1] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6896:6896 [1] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6895:7101 [0] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6895:6895 [0] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6898:7104 [3] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6897:7102 [2] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6897:6897 [2] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1929 -> 2
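As a side note on reading these logs: the compact busId values in the Init START lines above (89000, 8a000, d1000, d2000) are NCCL's packed integer form of the PCI bus IDs it later hands to nvmlDeviceGetHandleByPciBusId(), which is exactly the call failing with "Not Found" here. A small sketch of the decoding, assuming NCCL's usual domain/bus/device/function hex packing (the function name is mine, not NCCL's):

```python
def nccl_busid_to_pci(busid: int) -> str:
    """Expand a packed NCCL busId into the domain:bus:device.function
    string form that NVML looks up. Bit layout assumed from the hex
    packing seen in NCCL logs: function in the low nibble, then device,
    bus, and domain."""
    function = busid & 0xF
    device = (busid >> 4) & 0xFF
    bus = (busid >> 12) & 0xFF
    domain = busid >> 20
    return f"{domain:08x}:{bus:02x}:{device:02x}.{function:x}"

# busId 89000 from the log corresponds to PCI device 00000000:89:00.0
print(nccl_busid_to_pci(0x89000))
```

If that bus ID is not present under the container's /sys PCI view, the NVML lookup has nothing to find, which matches the failure above.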
[rank1]: Traceback (most recent call last):
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank1]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]: return next(self.gen)
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank0]: Traceback (most recent call last):
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank0]: with training_args.main_process_first(desc="load dataset"):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 144, in __exit__
[rank0]: next(self.gen)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2469, in main_process_first
[rank0]: dist.barrier()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank3]: Traceback (most recent call last):
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank3]: launch()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank3]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank3]: with training_args.main_process_first(desc="load dataset"):
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank3]: return next(self.gen)
[rank3]: ^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank3]: dist.barrier()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank3]: work = group.barrier(opts=opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5