In our Kubernetes (k8s) environment, each node has a single InfiniBand (IB) card with 100 Gbps of bandwidth. When 8 containers are scheduled on the same node, NCCL communication over IB is significantly slower than when the same 8 containers are distributed across 8 different nodes. What could be the cause of this issue, and how can it be resolved?
I am confident that the log below shows I am using the IB network:
hostname-0:29:29 [0] NCCL INFO cudaDriverVersion 12020
hostname-0:29:29 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond1,bond0,eth0
hostname-0:29:29 [0] NCCL INFO Bootstrap : Using eth0:10.252.129.107<0>
hostname-0:29:29 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname-0:29:65 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hostname-0:29:65 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond1,bond0,eth0
hostname-0:29:65 [0] NCCL INFO NCCL_IB_HCA set to mlx5
hostname-0:29:65 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.252.129.107<0>
hostname-0:29:65 [0] NCCL INFO Using non-device net plugin version 0
hostname-0:29:65 [0] NCCL INFO Using network IB
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nranks 8 cudaDev 0 nvmlDev 0 busId 47000 commId 0x829dc54d307a72a5 - Init START
hostname-0:29:65 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to LOC
hostname-0:29:65 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nRanks 8 nNodes 8 localRanks 1 localRank 0 MNNVL 0
hostname-0:29:65 [0] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/0/-1->1->3
hostname-0:29:65 [0] NCCL INFO P2P Chunksize set to 131072
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Connected all rings
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 3[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 3[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 2[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Connected all trees
hostname-0:29:65 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hostname-0:29:65 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nranks 8 cudaDev 0 nvmlDev 0 busId 47000 commId 0x829dc54d307a72a5 - Init COMPLETE
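For reference, here is a minimal sketch of how the two placements can be compared from inside each container. This is an illustration, not the exact benchmark I ran: it assumes PyTorch built with NCCL support, one visible GPU per container, and the standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher; the tensor size and iteration counts are arbitrary.

```python
# Minimal NCCL all-reduce bandwidth check (sketch, not the exact benchmark used).
# Assumes PyTorch with NCCL, one GPU per container, and RANK/WORLD_SIZE/
# MASTER_ADDR/MASTER_PORT provided by the launcher.
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")  # reads rank/world size from env
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(0)  # one visible GPU per container in this setup

    # 256 MiB of float32 data; large enough to be bandwidth-bound.
    numel = 64 * 1024 * 1024
    x = torch.ones(numel, dtype=torch.float32, device="cuda")

    # Warm-up so connection setup is not included in the timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    # Ring all-reduce moves 2*(n-1)/n of the buffer per rank ("bus bandwidth").
    bytes_per_iter = x.numel() * x.element_size()
    busbw = bytes_per_iter * 2 * (world_size - 1) / world_size / elapsed / 1e9
    if rank == 0:
        print(f"avg time {elapsed * 1000:.2f} ms, bus bandwidth {busbw:.2f} GB/s")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running the same script once with the 8 containers packed on one node and once with them spread over 8 nodes makes the per-rank bandwidth gap easy to quantify.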
I'm not sure I understand where your surprise comes from. If you have 8 containers on the same node sharing a single NIC, then performance should be worse than when the 8 containers are on 8 different nodes using 8 different NICs, right?
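A quick back-of-the-envelope calculation shows why that is expected. This is only a sketch using the 100 Gbps / 8-container numbers from the question, ignoring NCCL protocol overhead:

```python
# Rough per-rank network bandwidth for the two placements
# (sketch; uses the numbers from the question, ignores protocol overhead).
nic_gbps = 100   # one 100 Gbps IB HCA per node
ranks = 8        # 8 containers / NCCL ranks

shared = nic_gbps / ranks   # all 8 ranks behind one HCA on the same node
dedicated = nic_gbps        # one rank per node, each with its own HCA

print(f"shared HCA:    {shared:.1f} Gbps/rank (~{shared / 8:.2f} GB/s)")
print(f"dedicated HCA: {dedicated:.1f} Gbps/rank (~{dedicated / 8:.2f} GB/s)")
```

With everything behind a single HCA, each rank sees roughly 1/8 of the bandwidth it gets when every rank has its own NIC, which matches the slowdown described above.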