Could not enable P2P between devices #1516

Open
ZhiyiHu1999 opened this issue Nov 18, 2024 · 1 comment
ZhiyiHu1999 commented Nov 18, 2024

Hello! I am doing all-to-all communication using ncclSend() and ncclRecv() between 4 GPUs on two nodes, with 2 GPUs per node. However, it seems that GPUs on the same node cannot communicate via P2P; the debug output is below. Could you help me understand why this is the case? Thanks a lot!

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c2927e9a-08b2-a1cd-17a1-092a418bac6e)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-9d7f82f3-c503-8ee9-26f2-424ddefe47e2)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-96b59d15-054d-c99f-68bf-406414f8d051)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-cdb8590d-7b7f-deec-f4f5-8c132c2bbfed)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-4bab2c5f-3d91-e162-3057-f8068c36d803)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-f5a5d6fc-28e6-6fda-a6fe-0c13357c69bb)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-c2927e9a-08b2-a1cd-17a1-092a418bac6e)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-9d7f82f3-c503-8ee9-26f2-424ddefe47e2)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-96b59d15-054d-c99f-68bf-406414f8d051)
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-cdb8590d-7b7f-deec-f4f5-8c132c2bbfed)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-4bab2c5f-3d91-e162-3057-f8068c36d803)
GPU 2: NVIDIA GeForce RTX 3090 (UUID: GPU-f5a5d6fc-28e6-6fda-a6fe-0c13357c69bb)
The local rank is: 0
ault43:1696338:1696338 [0] NCCL INFO Bootstrap : Using ib0:192.168.0.43<0>
ault43:1696338:1696338 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault43:1696338:1696338 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault43:1696338:1696338 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.17.1+cuda11.8
The local rank is: 0
ault44:1424218:1424218 [0] NCCL INFO cudaDriverVersion 12010
ault44:1424218:1424218 [0] NCCL INFO Bootstrap : Using ib0:192.168.0.44<0>
ault44:1424218:1424218 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault44:1424218:1424218 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault44:1424218:1424218 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.44<0>
ault44:1424218:1424218 [0] NCCL INFO Using network IB
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.43<0>
ault43:1696338:1696338 [0] NCCL INFO Using network IB
ault43:1696338:1696338 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to ./results/Topology_Intra_Node.txt
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO NCCL_GRAPH_DUMP_FILE set by environment to ./results/Graph.txt
ault43:1696338:1696338 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault43:1696338:1696338 [0] NCCL INFO Channel 00/02 :    0   1   2   3
ault43:1696338:1696338 [0] NCCL INFO Channel 01/02 :    0   1   2   3
ault43:1696338:1696338 [0] NCCL INFO Ring 00 : 3 -> 0 -> 1
ault43:1696338:1696338 [0] NCCL INFO Ring 01 : 3 -> 0 -> 1
ault43:1696338:1696338 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
ault43:1696338:1696338 [0] NCCL INFO P2P Chunksize set to 131072
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 3[81000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 01 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Connected all rings
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 2[41000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 2[41000] -> 0[41000] [receive] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Channel 01/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Connected all trees
ault43:1696338:1696338 [0] NCCL INFO NCCL_PROTO set by environment to LL
ault43:1696338:1696338 [0] NCCL INFO NCCL_ALGO set by environment to Tree
ault43:1696338:1696338 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault43:1696338:1696338 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault43:1696338:1696338 [0] NCCL INFO comm 0x666cdb0 rank 0 nranks 4 cudaDev 0 busId 41000 commId 0xecdec238786580e6 - Init COMPLETE
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 3[81000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 3[81000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696401 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696401 [0] NCCL INFO Channel 01 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 2[41000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 2[41000] -> 0[41000] [receive] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 00/1 : 0[41000] -> 2[41000] [send] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Channel 01/1 : 0[41000] -> 2[41000] [send] via NET/IB/0/Shared
ault43:1696338:1696401 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault44:1424218:1424218 [0] NCCL INFO Ring 00 : 1 -> 2 -> 3
ault44:1424218:1424218 [0] NCCL INFO Ring 01 : 1 -> 2 -> 3
ault44:1424218:1424218 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
ault44:1424218:1424218 [0] NCCL INFO P2P Chunksize set to 131072
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 1[81000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 1[81000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Channel 00 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Channel 01 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424218 [0] NCCL INFO Connected all rings
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 0[41000] -> 2[41000] [receive] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 00/0 : 2[41000] -> 0[41000] [send] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Channel 01/0 : 2[41000] -> 0[41000] [send] via NET/IB/0
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424218 [0] NCCL INFO Connected all trees
ault44:1424218:1424218 [0] NCCL INFO NCCL_PROTO set by environment to LL
ault44:1424218:1424218 [0] NCCL INFO NCCL_ALGO set by environment to Tree
ault44:1424218:1424218 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault44:1424218:1424218 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault44:1424218:1424218 [0] NCCL INFO comm 0x666d2e0 rank 2 nranks 4 cudaDev 0 busId 41000 commId 0xecdec238786580e6 - Init COMPLETE
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 1[81000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 1[81000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424272 [0] NCCL INFO Channel 00 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424272 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424218:1424272 [0] NCCL INFO Channel 01 : 2[41000] -> 3[81000] via SHM/direct/direct
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 0[41000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 0[41000] -> 2[41000] [receive] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 00/1 : 2[41000] -> 0[41000] [send] via NET/IB/0/Shared
ault44:1424218:1424272 [0] NCCL INFO Channel 01/1 : 2[41000] -> 0[41000] [send] via NET/IB/0/Shared
The local rank is: 1
ault43:1696335:1696335 [1] NCCL INFO cudaDriverVersion 12010
ault43:1696335:1696335 [1] NCCL INFO Bootstrap : Using ib0:192.168.0.43<0>
ault43:1696335:1696335 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault43:1696335:1696335 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault43:1696335:1696335 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.43<0>
ault43:1696335:1696335 [1] NCCL INFO Using network IB
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696335:1696335 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault43:1696335:1696335 [1] NCCL INFO Ring 00 : 0 -> 1 -> 2
ault43:1696335:1696335 [1] NCCL INFO Ring 01 : 0 -> 1 -> 2
ault43:1696335:1696335 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
ault43:1696335:1696335 [1] NCCL INFO P2P Chunksize set to 131072
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 00/0 : 1[81000] -> 2[41000] [send] via NET/IB/0
ault43:1696335:1696335 [1] NCCL INFO Channel 01/0 : 1[81000] -> 2[41000] [send] via NET/IB/0
ault43:1696335:1696335 [1] NCCL INFO Connected all rings
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 00 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696335 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696335 [1] NCCL INFO Channel 01 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696335 [1] NCCL INFO Connected all trees
ault43:1696335:1696335 [1] NCCL INFO NCCL_PROTO set by environment to LL
ault43:1696335:1696335 [1] NCCL INFO NCCL_ALGO set by environment to Tree
ault43:1696335:1696335 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault43:1696335:1696335 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault43:1696335:1696335 [1] NCCL INFO comm 0x666c240 rank 1 nranks 4 cudaDev 1 busId 81000 commId 0xecdec238786580e6 - Init COMPLETE
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 1[81000] -> 2[41000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 1[81000] -> 2[41000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 3[81000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 3[81000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 1[81000] -> 3[81000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 1[81000] -> 3[81000] [send] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 00/1 : 2[41000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Channel 01/1 : 2[41000] -> 1[81000] [receive] via NET/IB/1/Shared
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 00 : 1[81000] -> 0[41000] via SHM/direct/direct
ault43:1696335:1696402 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault43:1696335:1696402 [1] NCCL INFO Channel 01 : 1[81000] -> 0[41000] via SHM/direct/direct

The local rank is: 1
ault44:1424216:1424216 [1] NCCL INFO cudaDriverVersion 12010
ault44:1424216:1424216 [1] NCCL INFO Bootstrap : Using ib0:192.168.0.44<0>
ault44:1424216:1424216 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ault44:1424216:1424216 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ault44:1424216:1424216 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/RoCE [RO]; OOB ib0:192.168.0.44<0>
ault44:1424216:1424216 [1] NCCL INFO Using network IB
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault44:1424216:1424216 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 2.
ault44:1424216:1424216 [1] NCCL INFO Ring 00 : 2 -> 3 -> 0
ault44:1424216:1424216 [1] NCCL INFO Ring 01 : 2 -> 3 -> 0
ault44:1424216:1424216 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
ault44:1424216:1424216 [1] NCCL INFO P2P Chunksize set to 131072
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 00/0 : 3[81000] -> 0[41000] [send] via NET/IB/0
ault44:1424216:1424216 [1] NCCL INFO Channel 01/0 : 3[81000] -> 0[41000] [send] via NET/IB/0
ault44:1424216:1424216 [1] NCCL INFO Connected all rings
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 00 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424216 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424216 [1] NCCL INFO Channel 01 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424216 [1] NCCL INFO Connected all trees
ault44:1424216:1424216 [1] NCCL INFO NCCL_PROTO set by environment to LL
ault44:1424216:1424216 [1] NCCL INFO NCCL_ALGO set by environment to Tree
ault44:1424216:1424216 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
ault44:1424216:1424216 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ault44:1424216:1424216 [1] NCCL INFO comm 0x666e930 rank 3 nranks 4 cudaDev 1 busId 81000 commId 0xecdec238786580e6 - Init COMPLETE
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 3[81000] -> 0[41000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 3[81000] -> 0[41000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 1[81000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 1[81000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 3[81000] -> 1[81000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 3[81000] -> 1[81000] [send] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 00/1 : 0[41000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Channel 01/1 : 0[41000] -> 3[81000] [receive] via NET/IB/1/Shared
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 00 : 3[81000] -> 2[41000] via SHM/direct/direct
ault44:1424216:1424273 [1] NCCL INFO Could not enable P2P between dev 1(=81000) and dev 0(=41000)
ault44:1424216:1424273 [1] NCCL INFO Channel 01 : 3[81000] -> 2[41000] via SHM/direct/direct

sjeaugey (Member) commented

I believe GeForce cards are not P2P-capable. That may not be a big deal, though: if you only have 2 GPUs per node and they are connected directly to the CPU rather than through a PCI switch, going through host memory can give better performance than P2P.
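
To see which transport each pair of ranks actually ended up on, the debug output can be tallied with a small script. The following is only a sketch and not part of NCCL: the function name `summarize_transports`, the sample text, and the regular expressions are assumptions based on the line formats that appear in this issue (NCCL 2.17.1), and other NCCL versions may log differently.

```python
import re
from collections import Counter

# Assumed line shapes, copied from the log in this issue:
#   "... NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct"
#   "... NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [send] via NET/IB/0"
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \S+ : (\d+)\[\w+\] -> (\d+)\[\w+\](?: \[\w+\])? via (\S+)"
)
#   "... NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)"
P2P_FAIL_RE = re.compile(
    r"Could not enable P2P between dev (\d+)\(=\w+\) and dev (\d+)\(=\w+\)"
)


def summarize_transports(log_text):
    """Count channel transports per (src rank, dst rank, transport family)
    and P2P-enable failures per (local dev, local dev) pair."""
    transports = Counter()
    p2p_failures = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            src, dst = int(match.group(1)), int(match.group(2))
            # Keep only the leading component, e.g. "SHM" or "NET".
            family = match.group(3).split("/")[0]
            transports[(src, dst, family)] += 1
            continue
        match = P2P_FAIL_RE.search(line)
        if match:
            p2p_failures[(int(match.group(1)), int(match.group(2)))] += 1
    return transports, p2p_failures


# Three lines taken from the log above, used as a small sample.
SAMPLE_LOG = """\
ault43:1696338:1696338 [0] NCCL INFO Could not enable P2P between dev 0(=41000) and dev 1(=81000)
ault43:1696338:1696338 [0] NCCL INFO Channel 00 : 0[41000] -> 1[81000] via SHM/direct/direct
ault43:1696338:1696338 [0] NCCL INFO Channel 00/0 : 0[41000] -> 2[41000] [send] via NET/IB/0
"""

if __name__ == "__main__":
    transports, p2p_failures = summarize_transports(SAMPLE_LOG)
    for (src, dst, family), count in sorted(transports.items()):
        print(f"rank {src} -> rank {dst}: {family} ({count} channel line(s))")
    for (dev_a, dev_b), count in sorted(p2p_failures.items()):
        print(f"P2P not enabled between dev {dev_a} and dev {dev_b}: {count} message(s)")
```

Run against the full log, a tally like this makes it easy to check that same-node pairs (ranks 0 and 1, ranks 2 and 3) are reported as SHM while cross-node pairs are reported as NET, which matches the fallback described in the reply.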
