Error Using Different GPUs for Two Containers on the Same Node #1529
When using torch==2.1.0+cu121 with NCCL version 2.18.1+cuda12.1 on NVIDIA A800 GPUs, distributed training across two containers located on the same node works well. Each container has 2 GPUs. As the logs show, GPU 3 and GPU 0 connect via NET/Socket/0, GPU 0 and GPU 1 connect via P2P/IPC/read, and GPU 3 and GPU 2 connect via P2P/IPC/read, forming a ring topology among the 4 GPUs.
Logs:
NCCL version 2.18.1+cuda12.1
resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3343:3548 [1] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
resource033052047223:3342:3547 [0] misc/nvmlwrap.cc:183 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
The errors you're getting seem to relate to the container not being configured correctly and not reporting information NCCL needs, like the PCI topology. Is /sys mounted inside your container?
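For instance, a minimal sketch of launching the container with the host's sysfs visible, assuming a plain Docker launch with the NVIDIA runtime (image name and command are placeholders, not from this issue):

```shell
# Sketch: expose the host's sysfs inside the container so NCCL/NVML can read
# the PCI topology. Image and training command are hypothetical placeholders.
docker run --gpus all \
  --volume /sys:/sys:ro \
  my-training-image:latest \
  python train.py
```

With Kubernetes-based deployments such as kubedl, the rough equivalent would be a hostPath volume mounting /sys into the pod.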
Thank you for your help. My containers do not have the /sys directory mounted. I will try mounting /sys and then proceed with testing. Could you tell me where in /sys I can find the PCI topology information? I noticed that the contents of the /dev/devices/ directory are the same in both containers, even without the /sys directory mounted.
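For reference, the PCI topology lives under standard Linux sysfs paths such as /sys/bus/pci/devices (one entry per PCI function, named like 0000:89:00.0). A small defensive sketch to check what a container can see (the function name is mine, not from NCCL):

```python
from pathlib import Path

# Sketch: list the PCI functions visible in sysfs. If this returns an empty
# list inside the container, topology detection (and NVML lookups by PCI
# bus ID) cannot work because /sys is not mounted or carries no PCI view.
def pci_functions(sysfs_root: str = "/sys") -> list[str]:
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    if not devices.is_dir():
        return []  # sysfs PCI view absent in this environment
    return sorted(p.name for p in devices.iterdir())  # e.g. "0000:89:00.0"

print(pci_functions())
```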
I tried mounting the /sys directory and ran the tests again, but the problem is still there.
You may need to mount other locations. I was not sure what caused this:
But this:
was clearly coming from /sys not being mounted. There could be other places which are missing. In any case, with your current configuration NVML is broken, so NCCL will not be able to operate. In NCCL 2.18 we had a fallback to rely on CUDA only when NVML was broken, but that fallback was not robust, so we removed it, on the grounds that NVML should not be broken, as a broken NVML could also have other negative consequences on performance.
Perhaps we can set a certain environment variable to force network connections between isolated containers, even when there is NVLink between the GPUs?
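A hedged sketch of the kind of settings meant here. These are standard NCCL environment variables that steer transport selection away from P2P and shared memory; note that, per the comment above, they would not repair the broken NVML setup itself:

```shell
# Sketch: force NCCL away from intra-node fast paths so ranks in isolated
# containers fall back to the network transport. Does not fix broken NVML.
export NCCL_P2P_DISABLE=1   # disable CUDA P2P (NVLink/PCIe) between GPUs
export NCCL_SHM_DISABLE=1   # disable the shared-memory transport
export NCCL_DEBUG=INFO      # log which transport each GPU pair actually uses
```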
1. Description
I attempted LoRA fine-tuning using llama-factory and employed kubedl for deployment. The setup consisted of two pods on the same node, with each pod allocated four GPUs.
Some info:
- used host network
- NCCL version 2.21.5+cuda12.4
- two containers on different nodes work fine
1.1 Log
Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 78346.22 examples/s]
[rank0]:[W1202 20:46:22.851750131 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6895:6895 [0] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6895:6895 NCCL CALL ncclGetUniqueId(0xc928078d70c6fe2c)
ucpe-resource033018034100:6895:6895 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ucpe-resource033018034100:6897:6897 [2] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO cudaDriverVersion 12040
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NCCL_SOCKET_IFNAME set to eth
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fc24a600000
ucpe-resource033018034100:6897:6897 [2] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:6896 [1] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:6898 [3] NCCL INFO Bootstrap : Using eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6897:6897 [2] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6896:6896 [1] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
ucpe-resource033018034100:6898:6898 [3] NCCL INFO NET/Plugin: Using internal network plugin.
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7ff906600000
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7f2282600000
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7fbc02600000
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6895:7101 [0] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6895:7101 [0] NCCL INFO Using network Socket
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6897:7102 [2] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6897:7102 [2] NCCL INFO Using network Socket
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Failed to open libibverbs.so[.1]
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
ucpe-resource033018034100:6896:7103 [1] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6898:7104 [3] NCCL INFO NET/Socket : Using [0]eth0:33.18.34.100<0>
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6896:7103 [1] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using non-device net plugin version 0
ucpe-resource033018034100:6898:7104 [3] NCCL INFO Using network Socket
ucpe-resource033018034100:6898:7104 [3] NCCL INFO ncclCommInitRank comm 0xf5c0b50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId d2000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6895:7101 [0] NCCL INFO ncclCommInitRank comm 0x101b6e80 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 89000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6896:7103 [1] NCCL INFO ncclCommInitRank comm 0xf1486b0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8a000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6897:7102 [2] NCCL INFO ncclCommInitRank comm 0xdbe5cd0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId d1000 commId 0xc928078d70c6fe2c - Init START
ucpe-resource033018034100:6896:7103 [1] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6896:7103 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6896:6896 [1] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6896:6896 [1] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6895:7101 [0] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6895:7101 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6895:6895 [0] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6895:6895 [0] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6898:7104 [3] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6898:7104 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6897:7102 [2] misc/nvmlwrap.cc:187 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/xml.cc:850 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO graph/topo.cc:696 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1012 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO init.cc:1548 -> 2
ucpe-resource033018034100:6897:7102 [2] NCCL INFO group.cc:64 -> 2 [Async thread]
ucpe-resource033018034100:6897:6897 [2] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO group.cc:418 -> 2
ucpe-resource033018034100:6897:6897 [2] NCCL INFO init.cc:1929 -> 2
ucpe-resource033018034100:6898:6898 [3] NCCL INFO init.cc:1929 -> 2
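As a side note on reading these logs: the compact busId values in the Init START lines above (89000, 8a000, d1000, d2000) are NCCL's packed integer form of the PCI bus IDs it later hands to nvmlDeviceGetHandleByPciBusId(), which is exactly the call failing with "Not Found" here. A small sketch of the decoding, assuming NCCL's usual domain/bus/device/function hex packing (the function name is mine, not NCCL's):

```python
def nccl_busid_to_pci(busid: int) -> str:
    """Expand a packed NCCL busId into the domain:bus:device.function
    string form that NVML looks up. Bit layout assumed from the hex
    packing seen in NCCL logs: function in the low nibble, then device,
    bus, and domain."""
    function = busid & 0xF
    device = (busid >> 4) & 0xFF
    bus = (busid >> 12) & 0xFF
    domain = busid >> 20
    return f"{domain:08x}:{bus:02x}:{device:02x}.{function:x}"

# busId 89000 from the log corresponds to PCI device 00000000:89:00.0
print(nccl_busid_to_pci(0x89000))
```

If that bus ID is not present under the container's /sys PCI view, the NVML lookup has nothing to find, which matches the failure above.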
[rank1]: Traceback (most recent call last):
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank1]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank1]: return next(self.gen)
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank0]: Traceback (most recent call last):
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank0]: with training_args.main_process_first(desc="load dataset"):
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 144, in __exit__
[rank0]: next(self.gen)
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2469, in main_process_first
[rank0]: dist.barrier()
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: nvmlDeviceGetHandleByPciBusId() failed: Not Found
[rank3]: Traceback (most recent call last):
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 23, in <module>
[rank3]: launch()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/train/sft/workflow.py", line 47, in run_sft
[rank3]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/llamafactory/data/loader.py", line 266, in get_dataset
[rank3]: with training_args.main_process_first(desc="load dataset"):
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/contextlib.py", line 137, in __enter__
[rank3]: return next(self.gen)
[rank3]: ^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/transformers/training_args.py", line 2460, in main_process_first
[rank3]: dist.barrier()
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/miniconda3/envs/llama-factory/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4166, in barrier
[rank3]: work = group.barrier(opts=opts)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1729647378361/work/torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5