NCCL Crashes when do NET initialization #1091

Open
yanminjia opened this issue Nov 27, 2023 · 2 comments

yanminjia commented Nov 27, 2023

NCCL crashes during NET initialization. Below is the call stack obtained by loading the core dump file into gdb. The crash appears to originate in the NET plugin library (libnccl-net.so).

(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007fefc64b890f in nccl_p2p_ib_init (num_devs=0x7fefc64cca38 , ncclIbDevs=, ncclIbIfName=0x7fefc64ef090 "ibs22", ncclIbIfAddr=0x7fefc64ef070 ,
ncclIbAsyncThread=0x7fefc64ef020 , logFunction=) at p2p_plugin.c:315
#2 0x00007ff2369daf80 in ncclNet_v6_as_v7_init (logfn=) at net.cc:54
#3 0x00007ff2369db85a in netGetState (state=, i=0) at net.cc:322
#4 ncclNetInit (comm=comm@entry=0x557d0e33d530) at net.cc:351
#5 0x00007ff2369ca19c in commAlloc (comm=comm@entry=0x557d0e33d530, parent=parent@entry=0x0, ndev=, rank=) at init.cc:334
#6 0x00007ff2369d8ef8 in ncclCommInitRankFunc (job_=0x557d0e341290) at init.cc:1387
#7 0x00007ff2369c661c in ncclAsyncJobMain (arg=0x557d0e341290) at group.cc:62
#8 0x00007ff2364d7ac3 in start_thread (arg=) at ./nptl/pthread_create.c:442
#9 0x00007ff236569a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The problem appears to happen only when the number of NICs installed on the server exceeds 16. When I bring one NIC down, it works. Any ideas would be highly appreciated. Thanks.

xxxx@xxxx:~$ ibdev2netdev
mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_10 port 1 ==> ens18f0np0 (Up)
mlx5_11 port 1 ==> ens18f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)
mlx5_13 port 1 ==> ens16f0np0 (Up)
mlx5_14 port 1 ==> ens16f1np1 (Up)
mlx5_15 port 1 ==> ens15f0np0 (Up)
mlx5_16 port 1 ==> ens15f1np1 (Up)
mlx5_2 port 1 ==> ens14f0np0 (Up)
mlx5_3 port 1 ==> ens14f1np1 (Up)
mlx5_4 port 1 ==> ens12f0np0 (Up)
mlx5_5 port 1 ==> ens12f1np1 (Up)
mlx5_6 port 1 ==> ens11f0np0 (Up)
mlx5_7 port 1 ==> ens11f1np1 (Up)
mlx5_8 port 1 ==> ens17f0np0 (Up)
mlx5_9 port 1 ==> ens17f1np1 (Up)
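
A quick way to check whether a host is over the observed threshold is to count the ports reported as Up. The following is a minimal sketch: `count_up_ports` is a hypothetical helper name, it reads `ibdev2netdev`-style text on stdin, and the limit of 16 is only the reporter's observation, not a documented constant of the plugin. A short sample taken from the listing above stands in for the real command output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines that report a port as "(Up)".
count_up_ports() {
    grep -c '(Up)'
}

# Assumption: 16 is the limit observed in this report, not a documented value.
OBSERVED_LIMIT=16

# Three lines copied from the listing above, used as sample input.
sample='mlx5_0 port 1 ==> ens13f0np0 (Up)
mlx5_1 port 1 ==> ens13f1np1 (Up)
mlx5_12 port 1 ==> ibs22 (Up)'

n=$(printf '%s\n' "$sample" | count_up_ports)
echo "Up ports: $n"
if [ "$n" -gt "$OBSERVED_LIMIT" ]; then
    echo "More than $OBSERVED_LIMIT ports are up"
fi
```

On a real host the sample would be replaced by `ibdev2netdev | count_up_ports`.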

@sjeaugey
Member

What version of NCCL are you using? We increased the maximum to 32 some time ago. Can you check with a newer version of NCCL?

If you don't want to upgrade, as a workaround, you can use NCCL_IB_HCA==mlx5_0,mlx5_1,... to restrict NCCL to the interfaces you really need.
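
As a sketch of this workaround, the snippet below builds an NCCL_IB_HCA value from a hand-picked device list. The leading `=` inside the value asks NCCL for exact name matching. Without it each token is treated as a prefix, so `mlx5_1` would also select `mlx5_10` through `mlx5_16` on this machine. The helper name `build_ib_hca` and the four devices chosen here are illustrative assumptions, not part of NCCL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join device names into an exact-match NCCL_IB_HCA value.
build_ib_hca() {
    local IFS=,
    # "$*" joins the arguments with the first character of IFS (a comma).
    printf '=%s\n' "$*"
}

# Example selection (an assumption; list the HCAs your job actually needs).
export NCCL_IB_HCA="$(build_ib_hca mlx5_0 mlx5_1 mlx5_2 mlx5_3)"
echo "NCCL_IB_HCA=${NCCL_IB_HCA}"
```

The variable has to be exported in the environment of every rank before the NCCL communicator is created.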

@yanminjia
Author

Many thanks for your prompt response. We are using NCCL 2.19.3. It looks like the problem is caused by libnccl-net.so rather than by the NCCL code itself.
