FasterTransformer NcclAllReduceSum with 4 GPUs hangs #901
@sjeaugey Please help me with this question, thank you.
I'm not sure which question; I don't see any. Assuming you just want help debugging the hang, getting the backtrace is a good idea indeed, but we need all threads. Other than that, you should look at the log for any WARN messages.
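For reference, a typical way to capture backtraces from every thread of a running process with gdb (the PID placeholder is whatever each rank's process ID is):

gdb -p <pid> -batch -ex "thread apply all bt"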
I am running with 4 GPUs and a total of 4 processes, one per GPU. Executed commands:
Note: I set CUDA_LAUNCH_BLOCKING=1 to make it easier for you to analyze. Below is the output without CUDA_LAUNCH_BLOCKING set.
========rank0=================
========rank1=================
========rank2=================
========rank3=================
======nccl warn log=========
@sjeaugey thank you
This is the program stack and logs without CUDA_LAUNCH_BLOCKING=1 set.
===========rank 0 ===============
===========rank 1 ===============
===========rank 2===============
===========rank 3===============
======nccl warn log=========
@sjeaugey thank you
Ok, nothing in the backtrace nor in the log. It looks like the CUDA kernels are stuck. I see two ncclProxyService and ncclProxyProgress threads on each process, which indicates there are two NCCL communicators. Could it be that some operation on the other communicator was launched before the allreduceSum on some ranks but after it on other ranks, causing a deadlock?
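To make that failure mode concrete, here is a minimal hypothetical sketch (commA, commB, and the surrounding setup are assumptions for illustration, not code from this issue) of how rank-dependent ordering across two communicators deadlocks:

#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical illustration: two communicators shared by all ranks,
// but the two collectives are enqueued in a rank-dependent order.
void mismatchedOrder(int rank, float* buf, size_t n,
                     ncclComm_t commA, ncclComm_t commB, cudaStream_t stream)
{
    if (rank == 0) {
        // Rank 0 enqueues commA's allreduce first...
        ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, commA, stream);
        ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, commB, stream);
    }
    else {
        // ...while the other ranks enqueue commB's first. Each rank's
        // first kernel spins waiting for peers that are stuck in the
        // other collective, so every rank hangs.
        ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, commB, stream);
        ncclAllReduce(buf, buf, n, ncclFloat, ncclSum, commA, stream);
    }
}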
Thank you very much for your analysis. Do you have any good suggestions or ideas on how to troubleshoot the cause of deadlocks? This is not a fixed occurrence, but occasionally occurs, making it difficult to troubleshoot. I hope you can provide some suggestions. Or do you need me to provide something else 。Thank you @sjeaugey |
You may want to:
According to my code logic, I should only have one NCCL communicator. This is my code template:
Thank you very much. I have found the cause of the deadlock.
So what's the root cause in your case?
Please note that this hang during NcclAllReduceSum is not consistently reproducible; it may occur only after running hundreds of iterations.
The stack trace is the same for all four GPUs.
(gdb) bt
#0 0x00007fff84b8b6f4 in ?? ()
#1 0x00007fff84b8b954 in clock_gettime ()
#2 0x00007feff38a20b5 in __GI___clock_gettime (clock_id=4, tp=0x7fff84aa8880)
at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#3 0x00007feff1fe3aef in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4 0x00007feff1f7dd83 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5 0x00007feff21a960f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6 0x00007feff1f377bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7 0x00007feff213c750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8 0x00007feff1ed949f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9 0x00007feff1edbbaf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007feff1f947f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007feff3e3710b in _cudart803 () from /usr/local/lib/libnccl.so.2
#12 0x00007feff3e91fe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#13 0x00007feff3d9c7bd in ncclLaunchKernel (comm=comm@entry=0x55f0b72f6150, plan=plan@entry=0x55f0f7d4f768)
at enqueue.cc:1092
#14 0x00007feff3da1dd3 in doLaunches (head=<optimized out>) at group.cc:163
#15 groupLaunch (job=<optimized out>) at group.cc:325
#16 0x00007feff3da2908 in ncclGroupEndInternal () at group.cc:406
#17 ncclGroupEndInternal () at group.cc:361
#18 0x00007feff3da304b in ncclGroupEnd () at group.cc:96
#19 0x000055f0ac43c98f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()
NCCL log:
Total ranks: 4.
Device NVIDIA A800-SXM4-80GB
P1 is running with 1 GPU.
Device NVIDIA A800-SXM4-80GB
P2 is running with 2 GPU.
Device NVIDIA A800-SXM4-80GB
P3 is running with 3 GPU.
Device NVIDIA A800-SXM4-80GB
P0 is running with 0 GPU.
qygpu047:411929:411929 [1] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411931:411931 [3] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411930:411930 [2] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411928:411928 [0] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411928:411928 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.18.3+cuda11.6
qygpu047:411930:411930 [2] NCCL INFO cudaDriverVersion 11060
qygpu047:411931:411931 [3] NCCL INFO cudaDriverVersion 11060
qygpu047:411929:411929 [1] NCCL INFO cudaDriverVersion 11060
qygpu047:411930:411930 [2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f471e800000
qygpu047:411928:411928 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f0f84800000
qygpu047:411929:411929 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fef92800000
qygpu047:411930:411930 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411930:411930 [2] NCCL INFO P2P plugin IBext
qygpu047:411928:411928 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411928:411928 [0] NCCL INFO P2P plugin IBext
qygpu047:411929:411929 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411929:411929 [1] NCCL INFO P2P plugin IBext
qygpu047:411930:411930 [2] NCCL INFO NET/IB : No device found.
qygpu047:411930:411930 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411928:411928 [0] NCCL INFO NET/IB : No device found.
qygpu047:411928:411928 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411929:411929 [1] NCCL INFO NET/IB : No device found.
qygpu047:411929:411929 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411928:411928 [0] NCCL INFO NET/IB : No device found.
qygpu047:411930:411930 [2] NCCL INFO NET/IB : No device found.
qygpu047:411929:411929 [1] NCCL INFO NET/IB : No device found.
qygpu047:411928:411928 [0] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411928:411928 [0] NCCL INFO Using network Socket
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411929:411929 [1] NCCL INFO Using network Socket
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411930:411930 [2] NCCL INFO Using network Socket
qygpu047:411931:411931 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f7e16800000
qygpu047:411931:411931 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411931:411931 [3] NCCL INFO P2P plugin IBext
qygpu047:411931:411931 [3] NCCL INFO NET/IB : No device found.
qygpu047:411931:411931 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411931:411931 [3] NCCL INFO NET/IB : No device found.
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411931:411931 [3] NCCL INFO Using network Socket
qygpu047:411930:411930 [2] NCCL INFO comm 0x55c878bd84a0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 49000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411931:411931 [3] NCCL INFO comm 0x55e41389f340 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4f000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411929:411929 [1] NCCL INFO comm 0x55f0b72f6150 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 13000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411928:411928 [0] NCCL INFO comm 0x55bdc4245080 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId e000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 2 'lan4'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 3 'lan5'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 2 'lan4'
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO transport/p2p.cc:163 Cuda Alloc Size 2097152 pointer 0x7fef92a00000
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 3 'lan5'
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411929:411929 [1] NCCL INFO === System : maxBw 160.0 totalBw 160.0 ===
qygpu047:411929:411929 [1] NCCL INFO CPU/3 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/1000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/C000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/E000 (0)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/11000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/13000 (1)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/5000
qygpu047:411929:411929 [1] NCCL INFO CPU/1 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/3C000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/47000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/49000 (2)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/46000
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/4D000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/4F000 (3)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO CPU/7 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/7B000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/8C000
qygpu047:411929:411929 [1] NCCL INFO CPU/5 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/C7000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/D4000
qygpu047:411929:411929 [1] NCCL INFO ==========================================
qygpu047:411929:411929 [1] NCCL INFO GPU/E000 :GPU/E000 (0/5000.000000/LOC) GPU/13000 (2/160.000000/NVL) GPU/49000 (2/160.000000/NVL) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (3/24.000000/PHB) CPU/1 (4/24.000000/SYS) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/13000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (0/5000.000000/LOC) GPU/49000 (2/160.000000/NVL) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (3/24.000000/PHB) CPU/1 (4/24.000000/SYS) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/49000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (2/160.000000/NVL) GPU/49000 (0/5000.000000/LOC) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (4/24.000000/SYS) CPU/1 (3/24.000000/PHB) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/4F000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (2/160.000000/NVL) GPU/49000 (2/160.000000/NVL) GPU/4F000 (0/5000.000000/LOC) NVS/0 (1/160.000000/NVL) CPU/3 (4/24.000000/SYS) CPU/1 (3/24.000000/PHB) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
qygpu047:411928:411928 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 8, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
qygpu047:411929:411929 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 4 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 5 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 6 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 7 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 8, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
qygpu047:411929:411929 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
....
qygpu047:411930:411930 [2] NCCL INFO AllGather: opCount 0 sendbuff 0xa0486fa00 recvbuff 0xa04826200 count 150528 datatype 0 op 0 root 0 comm 0x55c878bd84a0 [nranks=4] stream 0x55c87ccdf400
qygpu047:411931:411931 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0xa04814200 recvbuff 0xa04814200 count 12288 datatype 6 op 0 root 0 comm 0x55e41389f340 [nranks=4] stream 0x55e415d64210
qygpu047:411928:411928 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0xa04814200 recvbuff 0xa04814200 count 12288 datatype 6 op 0 root 0 comm 0x55bdc4245080 [nranks=4] stream 0x55bdc8091da0
qygpu047:411930:411930 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0xa048be400 recvbuff 0xa048be400 count 122880 datatype 6 op 0 root 0 comm 0x55c878bd84a0 [nranks=4] stream 0x55c87ccdf400
qygpu047:411929:411929 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0xa048be400 recvbuff 0xa048be400 count 122880 datatype 6 op 0 root 0 comm 0x55f0b72f6150 [nranks=4] stream 0x55f0bb474ce0
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 PXB SYS SYS SYS 48-63,176-191 3
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 PXB SYS SYS SYS 48-63,176-191 3
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 SYS PXB SYS SYS 16-31,144-159 1
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 SYS PXB SYS SYS 16-31,144-159 1
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS PXB SYS 112-127,240-254 7
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS PXB SYS 112-127,240-254 7
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS PXB 80-95,208-223 5
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS PXB 80-95,208-223 5
mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X SYS SYS SYS
mlx5_1 SYS SYS PXB PXB SYS SYS SYS SYS SYS X SYS SYS
mlx5_2 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS X SYS
mlx5_3 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Code:
if (tensor_para_.world_size_ > 1) {
if (!use_custom_all_reduce_kernel) {
ftNcclAllReduceSum(attention_out,
attention_out,
batch_size * hidden_units,
tensor_para_,
GlmDecoderSelfAttentionLayer::stream_);
}
else {
custom_all_reduce_comm_->customAllReduce(batch_size * hidden_units, GlmDecoderSelfAttentionLayer::stream_);
}
sync_check_cuda_error();
}
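If the two branches above can disagree across ranks (use_custom_all_reduce_kernel true on some ranks, false on others), the ranks would launch different collectives and hang exactly as described. One way to check for such ordering mismatches is to capture per-rank collective launch logs, like the AllReduce/AllGather lines shown earlier, and diff them across ranks (the binary name and rank variable here are placeholders):

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL ./your_app 2> rank_${RANK}.log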
template<typename T>
void ftNcclAllReduceSum(const T* send_buf, T* recv_buf, const int data_size, NcclParam nccl_param, cudaStream_t stream)
{
#ifdef BUILD_MULTI_GPU
    ncclDataType_t nccl_data_type = getNcclDataType<T>();
    NCCLCHECK(ncclGroupStart());
    NCCLCHECK(ncclAllReduce(
        (const void*)send_buf, (void*)recv_buf, data_size, nccl_data_type, ncclSum, nccl_param.nccl_comm_, stream));
    NCCLCHECK(ncclGroupEnd());
#endif
}
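When a hang like this does occur, polling the communicator from a separate host thread can at least distinguish an asynchronous NCCL transport error from a pure kernel-level deadlock. A minimal sketch, assuming a single communicator to watch; ncclCommGetAsyncError and ncclGetErrorString are part of the public NCCL API:

#include <nccl.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Hedged sketch: poll for asynchronous errors while the main thread is
// (possibly) stuck in a collective. If the result stays ncclSuccess,
// the hang is an ordering/scheduling deadlock rather than a transport
// failure surfaced by NCCL.
void watchComm(ncclComm_t comm)
{
    for (;;) {
        ncclResult_t async_err = ncclSuccess;
        ncclCommGetAsyncError(comm, &async_err);
        if (async_err != ncclSuccess) {
            fprintf(stderr, "NCCL async error: %s\n", ncclGetErrorString(async_err));
            break;
        }
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
}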