NCCL crashes. Below is the call stack obtained by loading the core dump file into gdb. It looks like the crash is caused by the NET plugin library (libnccl-net.so).
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007fefc64b890f in nccl_p2p_ib_init (num_devs=0x7fefc64cca38 , ncclIbDevs=, ncclIbIfName=0x7fefc64ef090 "ibs22", ncclIbIfAddr=0x7fefc64ef070 , ncclIbAsyncThread=0x7fefc64ef020 , logFunction=) at p2p_plugin.c:315
#2 0x00007ff2369daf80 in ncclNet_v6_as_v7_init (logfn=) at net.cc:54
#3 0x00007ff2369db85a in netGetState (state=, i=0) at net.cc:322
#4 ncclNetInit (comm=comm@entry=0x557d0e33d530) at net.cc:351
#5 0x00007ff2369ca19c in commAlloc (comm=comm@entry=0x557d0e33d530, parent=parent@entry=0x0, ndev=, rank=) at init.cc:334
#6 0x00007ff2369d8ef8 in ncclCommInitRankFunc (job_=0x557d0e341290) at init.cc:1387
#7 0x00007ff2369c661c in ncclAsyncJobMain (arg=0x557d0e341290) at group.cc:62
#8 0x00007ff2364d7ac3 in start_thread (arg=) at ./nptl/pthread_create.c:442
#9 0x00007ff236569a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
It looks like this problem only happens when the number of NICs installed on the server exceeds 16. When I bring one NIC down, it works. Any ideas would be highly appreciated. Thanks.
xxxx@xxxx:~$ ibdev2netdev
mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_10 port 1 ==> ens18f0np0 (Up)
mlx5_11 port 1 ==> ens18f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)
mlx5_13 port 1 ==> ens16f0np0 (Up)
mlx5_14 port 1 ==> ens16f1np1 (Up)
mlx5_15 port 1 ==> ens15f0np0 (Up)
mlx5_16 port 1 ==> ens15f1np1 (Up)
mlx5_2 port 1 ==> ens14f0np0 (Up)
mlx5_3 port 1 ==> ens14f1np1 (Up)
mlx5_4 port 1 ==> ens12f0np0 (Up)
mlx5_5 port 1 ==> ens12f1np1 (Up)
mlx5_6 port 1 ==> ens11f0np0 (Up)
mlx5_7 port 1 ==> ens11f1np1 (Up)
mlx5_8 port 1 ==> ens17f0np0 (Up)
mlx5_9 port 1 ==> ens17f1np1 (Up)
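To make the device count easy to check, here is a minimal Python sketch that parses the ibdev2netdev listing above and compares the number of ports in the Up state with the threshold observed in this report. The helper names (`parse_ibdev2netdev`, `active_ports`) and the constant `SUSPECTED_LIMIT` are hypothetical, and the value 16 is only the reporter's observation, not a confirmed limit in NCCL or the plugin.

```python
import re

# Listing copied from the ibdev2netdev output in this report.
SAMPLE = """\
mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_10 port 1 ==> ens18f0np0 (Up)
mlx5_11 port 1 ==> ens18f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)
mlx5_13 port 1 ==> ens16f0np0 (Up)
mlx5_14 port 1 ==> ens16f1np1 (Up)
mlx5_15 port 1 ==> ens15f0np0 (Up)
mlx5_16 port 1 ==> ens15f1np1 (Up)
mlx5_2 port 1 ==> ens14f0np0 (Up)
mlx5_3 port 1 ==> ens14f1np1 (Up)
mlx5_4 port 1 ==> ens12f0np0 (Up)
mlx5_5 port 1 ==> ens12f1np1 (Up)
mlx5_6 port 1 ==> ens11f0np0 (Up)
mlx5_7 port 1 ==> ens11f1np1 (Up)
mlx5_8 port 1 ==> ens17f0np0 (Up)
mlx5_9 port 1 ==> ens17f1np1 (Up)
"""

# One entry per line: "<device> port <n> ==> <netdev> (<state>)"
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)$")

# Assumption: 16 is the threshold observed by the reporter, not a documented limit.
SUSPECTED_LIMIT = 16


def parse_ibdev2netdev(text):
    """Return a list of (device, port, netdev, state) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            dev, port, netdev, state = match.groups()
            entries.append((dev, int(port), netdev, state))
    return entries


def active_ports(entries):
    """Keep only the entries whose link state is reported as Up."""
    return [entry for entry in entries if entry[3] == "Up"]


entries = parse_ibdev2netdev(SAMPLE)
up = active_ports(entries)
print(f"{len(up)} active ports (suspected limit: {SUSPECTED_LIMIT})")
if len(up) > SUSPECTED_LIMIT:
    print("Active port count is above the suspected limit.")
```

On a live system the same functions could be fed the real command output instead of `SAMPLE`, for example through `subprocess.run(["ibdev2netdev"], capture_output=True, text=True).stdout`.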