
k8s nccl ib slow #1536

Open

JunjieLl opened this issue Dec 8, 2024 · 2 comments
JunjieLl commented Dec 8, 2024

In our Kubernetes (k8s) environment, each node is equipped with a single InfiniBand (IB) card with 100 Gbps of bandwidth. When 8 containers are scheduled on the same node, NCCL communication over IB is significantly slower than when the same 8 containers are distributed across 8 different nodes. What could be the cause of this issue, and how can it be resolved?

Based on the log below, I am confident that NCCL is using the IB network.



```
hostname-0:29:29 [0] NCCL INFO cudaDriverVersion 12020
hostname-0:29:29 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond1,bond0,eth0
hostname-0:29:29 [0] NCCL INFO Bootstrap : Using eth0:10.252.129.107<0>
hostname-0:29:29 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hostname-0:29:65 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hostname-0:29:65 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond1,bond0,eth0
hostname-0:29:65 [0] NCCL INFO NCCL_IB_HCA set to mlx5
hostname-0:29:65 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [RO]; OOB eth0:10.252.129.107<0>
hostname-0:29:65 [0] NCCL INFO Using non-device net plugin version 0
hostname-0:29:65 [0] NCCL INFO Using network IB
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nranks 8 cudaDev 0 nvmlDev 0 busId 47000 commId 0x829dc54d307a72a5 - Init START
hostname-0:29:65 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to LOC
hostname-0:29:65 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nRanks 8 nNodes 8 localRanks 1 localRank 0 MNNVL 0
hostname-0:29:65 [0] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/0/-1->1->3
hostname-0:29:65 [0] NCCL INFO P2P Chunksize set to 131072
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Connected all rings
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 3[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 3[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 2[0] -> 1[0] [receive] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
hostname-0:29:65 [0] NCCL INFO Connected all trees
hostname-0:29:65 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hostname-0:29:65 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
hostname-0:29:65 [0] NCCL INFO comm 0xbc17190 rank 1 nranks 8 cudaDev 0 nvmlDev 0 busId 47000 commId 0x829dc54d307a72a5 - Init COMPLETE
```
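For reference, a minimal bandwidth probe along these lines can be run once with all 8 containers packed on one node and once with them spread across nodes, to compare the two placements directly. This is a sketch, assuming PyTorch with NCCL support, one GPU per container (as in the log), and the usual `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` environment provided by the launcher:

```python
# Minimal NCCL all_reduce bandwidth probe (sketch; assumes PyTorch with
# NCCL support and RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT set by the
# launcher, e.g. torchrun or the k8s job template).
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # env:// rendezvous
    rank = dist.get_rank()
    torch.cuda.set_device(0)  # one GPU per container in this setup

    nbytes = 256 * 1024 * 1024  # 256 MiB payload
    x = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(5):  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    # Ring all-reduce moves ~2*(n-1)/n of the payload per rank
    # (the "bus bandwidth" convention used by nccl-tests).
    n = dist.get_world_size()
    busbw = nbytes * 2 * (n - 1) / n / elapsed / 1e9
    if rank == 0:
        print(f"avg {elapsed * 1e3:.1f} ms/iter, bus bandwidth ~{busbw:.1f} GB/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()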

sjeaugey (Member) commented Dec 9, 2024

I'm not sure I understand where your surprise comes from. If you have 8 containers on the same node sharing a single NIC, then performance should be worse than when the 8 containers are on 8 different nodes using 8 different NICs, right?
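A back-of-the-envelope calculation makes the gap concrete (a sketch; it assumes the NIC is the bottleneck and its bandwidth splits evenly across ranks, which real traffic patterns only approximate):

```python
# Per-rank share of NIC bandwidth under the two placements (sketch;
# assumes the 100 Gbps link is the bottleneck and is split evenly).
link_gbps = 100                      # single IB HCA per node
ranks_per_nic = 8                    # 8 containers sharing one HCA

shared = link_gbps / ranks_per_nic   # ~12.5 Gbps per rank, same node
dedicated = link_gbps                # 100 Gbps per rank, 8 nodes
print(f"shared: {shared:.1f} Gbps/rank vs dedicated: {dedicated} Gbps/rank")
```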

JunjieLl (Author) commented Dec 9, 2024

> I'm not sure I understand where your surprise comes from. If you have 8 containers on the same node sharing a single NIC, then performance should be worse than when the 8 containers are on 8 different nodes using 8 different NICs, right?

yes!
