I have two servers, each with 8 GPU cards, two RoCE NICs (mlx5_0:1, mlx5_2:1), and one Ethernet card (ens5f0np0). The two RoCE NICs are connected to the same fast switch.
My test command is:
Are you sure that the two RoCE NICs can communicate with each other (i.e., that sending via mlx5_0 and receiving via mlx5_2 works)? You should verify that with low-level tests such as ib_write_bw (see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#networking-issues), not only between the two nodes but also between the two NICs on the same node, because it looks like NCCL thinks that communicating between GPUs attached to two different CPUs on the same node will be faster over the network than via shared memory. You can also try running with NCCL_CROSS_NIC=0 to avoid cross-NIC traffic.
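For reference, a rough sketch of that kind of low-level check, assuming the perftest package (which provides ib_write_bw) is installed on both nodes; the receiver IP is a placeholder, and on RoCE you may additionally need to pick the RoCEv2 GID with `-x <gid_index>`:

```
# One side acts as the server, bound to a specific NIC:
ib_write_bw -d mlx5_0

# From the other node (or from the other NIC on the same node), connect to the
# server's IP. Repeat for every NIC pair, including mlx5_0 <-> mlx5_2 within one node.
ib_write_bw -d mlx5_2 <receiver_ip>

# If the raw bandwidth looks healthy, rerun the NCCL workload with cross-NIC
# traffic disabled:
NCCL_CROSS_NIC=0 <your original test command>
```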
log2.txt
How do I use two RoCE NICs? I ran into the same issue when running DeepSpeed training. Thanks.
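In case it helps, a minimal sketch of how two RoCE NICs are typically exposed to NCCL through standard environment variables; the device and interface names just mirror the hardware described above, and the launch command is a placeholder, not a confirmed fix:

```
# Restrict NCCL's RDMA traffic to the two RoCE NICs and its bootstrap/socket
# traffic to the Ethernet interface.
export NCCL_IB_HCA=mlx5_0,mlx5_2
export NCCL_SOCKET_IFNAME=ens5f0np0

# Optional: verbose init/net logging to confirm which NICs NCCL actually picks.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

deepspeed train.py ...   # placeholder for the actual DeepSpeed launch command
```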