FasterTransformer NcclAllReduceSum with 4 GPUs hangs #901

Closed
junior-zsy opened this issue Jun 29, 2023 · 10 comments

Comments

@junior-zsy

Note that this hang during NcclAllReduceSum is not consistently reproducible; it may only appear after hundreds of iterations.

The stack trace is the same for all four GPUs.
(gdb) bt
#0 0x00007fff84b8b6f4 in ?? ()
#1 0x00007fff84b8b954 in clock_gettime ()
#2 0x00007feff38a20b5 in __GI___clock_gettime (clock_id=4, tp=0x7fff84aa8880)
at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#3 0x00007feff1fe3aef in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4 0x00007feff1f7dd83 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5 0x00007feff21a960f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6 0x00007feff1f377bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7 0x00007feff213c750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8 0x00007feff1ed949f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9 0x00007feff1edbbaf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007feff1f947f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007feff3e3710b in __cudart803 () from /usr/local/lib/libnccl.so.2
#12 0x00007feff3e91fe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#13 0x00007feff3d9c7bd in ncclLaunchKernel (comm=comm@entry=0x55f0b72f6150, plan=plan@entry=0x55f0f7d4f768)
at enqueue.cc:1092
#14 0x00007feff3da1dd3 in doLaunches (head=<optimized out>) at group.cc:163
#15 groupLaunch (job_=<optimized out>) at group.cc:325
#16 0x00007feff3da2908 in ncclGroupEndInternal () at group.cc:406
#17 ncclGroupEndInternal () at group.cc:361
#18 0x00007feff3da304b in ncclGroupEnd () at group.cc:96
#19 0x000055f0ac43c98f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()

nccl log
Total ranks: 4.
Device NVIDIA A800-SXM4-80GB
P1 is runing with 1 GPU.
Device NVIDIA A800-SXM4-80GB
P2 is runing with 2 GPU.
Device NVIDIA A800-SXM4-80GB
P3 is runing with 3 GPU.
Device NVIDIA A800-SXM4-80GB
P0 is runing with 0 GPU.
qygpu047:411929:411929 [1] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411931:411931 [3] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411930:411930 [2] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411928:411928 [0] NCCL INFO Bootstrap : Using lan2:10.178.8.57<0>
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411929:411929 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411931:411931 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411930:411930 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
qygpu047:411928:411928 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
qygpu047:411928:411928 [0] NCCL INFO cudaDriverVersion 11060
NCCL version 2.18.3+cuda11.6
qygpu047:411930:411930 [2] NCCL INFO cudaDriverVersion 11060
qygpu047:411931:411931 [3] NCCL INFO cudaDriverVersion 11060
qygpu047:411929:411929 [1] NCCL INFO cudaDriverVersion 11060
qygpu047:411930:411930 [2] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f471e800000
qygpu047:411928:411928 [0] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f0f84800000
qygpu047:411929:411929 [1] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7fef92800000
qygpu047:411930:411930 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411930:411930 [2] NCCL INFO P2P plugin IBext
qygpu047:411928:411928 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411928:411928 [0] NCCL INFO P2P plugin IBext
qygpu047:411929:411929 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411929:411929 [1] NCCL INFO P2P plugin IBext
qygpu047:411930:411930 [2] NCCL INFO NET/IB : No device found.
qygpu047:411930:411930 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411928:411928 [0] NCCL INFO NET/IB : No device found.
qygpu047:411928:411928 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411929:411929 [1] NCCL INFO NET/IB : No device found.
qygpu047:411929:411929 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411928:411928 [0] NCCL INFO NET/IB : No device found.
qygpu047:411930:411930 [2] NCCL INFO NET/IB : No device found.
qygpu047:411929:411929 [1] NCCL INFO NET/IB : No device found.
qygpu047:411928:411928 [0] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411928:411928 [0] NCCL INFO Using network Socket
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411929:411929 [1] NCCL INFO Using network Socket
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411930:411930 [2] NCCL INFO Using network Socket
qygpu047:411931:411931 [3] NCCL INFO init.cc:1584 Cuda Host Alloc Size 4 pointer 0x7f7e16800000
qygpu047:411931:411931 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qygpu047:411931:411931 [3] NCCL INFO P2P plugin IBext
qygpu047:411931:411931 [3] NCCL INFO NET/IB : No device found.
qygpu047:411931:411931 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qygpu047:411931:411931 [3] NCCL INFO NET/IB : No device found.
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : Using [0]lan2:10.178.8.57<0> [1]lan3:10.178.8.117<0> [2]lan4:10.178.8.181<0> [3]lan5:10.178.8.245<0>
qygpu047:411931:411931 [3] NCCL INFO Using network Socket
qygpu047:411930:411930 [2] NCCL INFO comm 0x55c878bd84a0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 49000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411931:411931 [3] NCCL INFO comm 0x55e41389f340 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 4f000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411929:411929 [1] NCCL INFO comm 0x55f0b72f6150 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 13000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411928:411928 [0] NCCL INFO comm 0x55bdc4245080 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId e000 commId 0x85bd0489545ac5c4 - Init START
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 2 'lan4'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411929:411929 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 3 'lan5'
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 2 'lan4'
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO transport/p2p.cc:163 Cuda Alloc Size 2097152 pointer 0x7fef92a00000
qygpu047:411931:411931 [3] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 3 'lan5'
qygpu047:411930:411930 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'lan3'
qygpu047:411929:411929 [1] NCCL INFO === System : maxBw 160.0 totalBw 160.0 ===
qygpu047:411929:411929 [1] NCCL INFO CPU/3 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/1000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/C000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/E000 (0)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/11000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/13000 (1)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/5000
qygpu047:411929:411929 [1] NCCL INFO CPU/1 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/3C000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/47000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/49000 (2)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/46000
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/4D000 (1000c01010de13b8)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - GPU/4F000 (3)
qygpu047:411929:411929 [1] NCCL INFO + NVL[160.0] - NVS/0
qygpu047:411929:411929 [1] NCCL INFO CPU/7 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/5
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/7B000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/8C000
qygpu047:411929:411929 [1] NCCL INFO CPU/5 (1/2/-1)
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/3
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/1
qygpu047:411929:411929 [1] NCCL INFO + SYS[5000.0] - CPU/7
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - PCI/C7000 (1000c01010000000)
qygpu047:411929:411929 [1] NCCL INFO + PCI[24.0] - NIC/D4000
qygpu047:411929:411929 [1] NCCL INFO ==========================================
qygpu047:411929:411929 [1] NCCL INFO GPU/E000 :GPU/E000 (0/5000.000000/LOC) GPU/13000 (2/160.000000/NVL) GPU/49000 (2/160.000000/NVL) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (3/24.000000/PHB) CPU/1 (4/24.000000/SYS) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/13000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (0/5000.000000/LOC) GPU/49000 (2/160.000000/NVL) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (3/24.000000/PHB) CPU/1 (4/24.000000/SYS) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/49000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (2/160.000000/NVL) GPU/49000 (0/5000.000000/LOC) GPU/4F000 (2/160.000000/NVL) NVS/0 (1/160.000000/NVL) CPU/3 (4/24.000000/SYS) CPU/1 (3/24.000000/PHB) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO GPU/4F000 :GPU/E000 (2/160.000000/NVL) GPU/13000 (2/160.000000/NVL) GPU/49000 (2/160.000000/NVL) GPU/4F000 (0/5000.000000/LOC) NVS/0 (1/160.000000/NVL) CPU/3 (4/24.000000/SYS) CPU/1 (3/24.000000/PHB) CPU/7 (4/24.000000/SYS) CPU/5 (4/24.000000/SYS)
qygpu047:411929:411929 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
qygpu047:411928:411928 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lan2'
qygpu047:411929:411929 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 8, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
qygpu047:411929:411929 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 4 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 5 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 6 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO 7 : GPU/0 GPU/1 GPU/2 GPU/3
qygpu047:411929:411929 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 8, bw 20.000000/20.000000, type NVL/PIX, sameChannels 1
qygpu047:411929:411929 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3
....
qygpu047:411930:411930 [2] NCCL INFO AllGather: opCount 0 sendbuff 0xa0486fa00 recvbuff 0xa04826200 count 150528 datatype 0 op 0 root 0 comm 0x55c878bd84a0 [nranks=4] stream 0x55c87ccdf400
qygpu047:411931:411931 [3] NCCL INFO AllReduce: opCount 0 sendbuff 0xa04814200 recvbuff 0xa04814200 count 12288 datatype 6 op 0 root 0 comm 0x55e41389f340 [nranks=4] stream 0x55e415d64210
qygpu047:411928:411928 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0xa04814200 recvbuff 0xa04814200 count 12288 datatype 6 op 0 root 0 comm 0x55bdc4245080 [nranks=4] stream 0x55bdc8091da0
qygpu047:411930:411930 [2] NCCL INFO AllReduce: opCount 0 sendbuff 0xa048be400 recvbuff 0xa048be400 count 122880 datatype 6 op 0 root 0 comm 0x55c878bd84a0 [nranks=4] stream 0x55c87ccdf400
qygpu047:411929:411929 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0xa048be400 recvbuff 0xa048be400 count 122880 datatype 6 op 0 root 0 comm 0x55f0b72f6150 [nranks=4] stream 0x55f0bb474ce0
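
The collective-call lines at the end of the log can be tabulated per rank, which makes it easier to check whether every rank issues the same sequence of collectives with matching counts (NCCL expects this of all ranks in a communicator). A minimal sketch, assuming the INFO line layout shown above; the function name `collective_calls` is hypothetical and not part of NCCL. Since the log here is truncated, it is meant as a way to inspect a full log, not as a diagnosis:

```shell
# Sketch: print "rank collective opCount count" for each collective call
# found in an NCCL INFO log (layout assumed to match the lines above).
collective_calls() {
    awk '/NCCL INFO (AllReduce|AllGather|ReduceScatter|Broadcast|Reduce):/ {
        rank = $2; gsub(/\[|\]/, "", rank)   # "[2]" -> "2"
        coll = $5; sub(/:$/, "", coll)       # "AllGather:" -> "AllGather"
        for (i = 6; i < NF; i++) {
            if ($i == "opCount") op  = $(i + 1)
            if ($i == "count")   cnt = $(i + 1)
        }
        print rank, coll, op, cnt
    }' "$1"
}
```

Piping the result through `sort` groups the calls by rank so the sequences can be compared side by side.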

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity     NUMA Affinity
GPU0    X       NV8     NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     48-63,176-191    3
GPU1    NV8     X       NV8     NV8     NV8     NV8     NV8     NV8     PXB     SYS     SYS     SYS     48-63,176-191    3
GPU2    NV8     NV8     X       NV8     NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     16-31,144-159    1
GPU3    NV8     NV8     NV8     X       NV8     NV8     NV8     NV8     SYS     PXB     SYS     SYS     16-31,144-159    1
GPU4    NV8     NV8     NV8     NV8     X       NV8     NV8     NV8     SYS     SYS     PXB     SYS     112-127,240-254  7
GPU5    NV8     NV8     NV8     NV8     NV8     X       NV8     NV8     SYS     SYS     PXB     SYS     112-127,240-254  7
GPU6    NV8     NV8     NV8     NV8     NV8     NV8     X       NV8     SYS     SYS     SYS     PXB     80-95,208-223    5
GPU7    NV8     NV8     NV8     NV8     NV8     NV8     NV8     X       SYS     SYS     SYS     PXB     80-95,208-223    5
mlx5_0  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     X       SYS     SYS     SYS
mlx5_1  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     X       SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     X       SYS
mlx5_3  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

code:

if (tensor_para_.world_size_ > 1) {
    if (!use_custom_all_reduce_kernel) {
        ftNcclAllReduceSum(attention_out,
                           attention_out,
                           batch_size * hidden_units,
                           tensor_para_,
                           GlmDecoderSelfAttentionLayer<T>::stream_);
    }
    else {
        custom_all_reduce_comm_->customAllReduce(batch_size * hidden_units, GlmDecoderSelfAttentionLayer<T>::stream_);
    }
    sync_check_cuda_error();
}

template<typename T>
void ftNcclAllReduceSum(const T* send_buf, T* recv_buf, const int data_size, NcclParam nccl_param, cudaStream_t stream)
{
#ifdef BUILD_MULTI_GPU
    ncclDataType_t nccl_data_type = getNcclDataType<T>();
    NCCLCHECK(ncclGroupStart());
    NCCLCHECK(ncclAllReduce(
        (const void*)send_buf, (void*)recv_buf, data_size, nccl_data_type, ncclSum, nccl_param.nccl_comm_, stream));
    NCCLCHECK(ncclGroupEnd());
#endif
}

junior-zsy changed the title from "NcclAllReduceSum with 4 GPUs hangs" to "FasterTransformer NcclAllReduceSum with 4 GPUs hangs" on Jun 29, 2023
@junior-zsy
Author

@sjeaugey Please help me answer the question, thank you

@sjeaugey
Member

I'm not sure which question. I don't see any.

Assuming you just want help debugging the hang, getting the backtrace is a good idea indeed, but we need all threads. thread apply all bt would give us that; bt by itself only gets the backtrace of one thread and it doesn't tell us much.
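
Collecting this from every rank can be scripted with gdb's batch mode, which also avoids the interactive pager. A minimal sketch, assuming gdb is installed and the hung processes are still alive; the function name `dump_backtraces`, the `GDB` override variable, and the output file naming are hypothetical and not part of NCCL or FasterTransformer:

```shell
# Sketch: write "thread apply all bt" for each given PID to <out_dir>/bt_<pid>.txt.
# GDB is a hypothetical override so a different gdb binary can be substituted.
dump_backtraces() {
    out_dir="$1"
    shift
    for pid in "$@"; do
        # -batch runs the -ex commands and exits without prompting
        "${GDB:-gdb}" -p "$pid" -batch -ex "thread apply all bt" \
            > "$out_dir/bt_$pid.txt" 2>&1
    done
}
```

For example, `dump_backtraces /tmp 411928 411929 411930 411931` would attach to four rank processes in turn; those PIDs are the ones in the NCCL log prefixes above and are only illustrative.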

Other than that, you should look at the log and look for any NCCL WARN. That could indicate what went wrong.
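
If the job output has been saved to a file, the warnings can be pulled out with a plain text search. A minimal sketch, assuming warnings contain the literal string "NCCL WARN" as they do in NCCL's debug output; the function name `nccl_warnings` and the log file path are hypothetical:

```shell
# Sketch: print every NCCL WARN line with its line number,
# or a short notice when the log contains none.
nccl_warnings() {
    log_file="$1"
    if ! grep -n "NCCL WARN" "$log_file"; then
        echo "no NCCL WARN lines in $log_file"
    fi
}
```

The line numbers make it easier to look at the INFO lines just before a warning for context.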

@junior-zsy
Author

junior-zsy commented Jun 30, 2023

I am running on 4 GPUs with a total of 4 processes.

Executed commands:

export NCCL_DEBUG=WARN
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

mpirun -n 4 --allow-run-as-root ./bin/glm_example

Note: I set CUDA_LAUNCH_BLOCKING to 1 to make this easier for you to analyze. Below is the case where CUDA_LAUNCH_BLOCKING is not set.

========rank0=================

(gdb) thread apply all bt

Thread 10 (Thread 0x7f322e3fa000 (LWP 496156)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f322e7fb158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f322e7fb108, cond=0x7f322e7fb130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f322e7fb130, mutex=mutex@entry=0x7f322e7fb108) at pthread_cond_wait.c:647
#3  0x00007f326f3d9740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5603cb96d1e0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5603cb96d1e0) at proxy.cc:868
#5  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f32517fd000 (LWP 496152)):
#0  0x00007f326ef0b99f in __GI___poll (fds=fds@entry=0x7f32517f09e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f326f3d9da7 in poll (__timeout=500, __nfds=65, __fds=0x7f32517f09e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5603cb96d1e0) at proxy.cc:1437
#3  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f322f3fd000 (LWP 496146)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f322f7fe158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f322f7fe108, cond=0x7f322f7fe130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f322f7fe130, mutex=mutex@entry=0x7f322f7fe108) at pthread_cond_wait.c:647
#3  0x00007f326f3d9740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5603cceffa40) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5603cceffa40) at proxy.cc:868
#5  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f3251ffe000 (LWP 496143)):
#0  0x00007f326ef0b99f in __GI___poll (fds=fds@entry=0x7f3251ff19e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f326f3d9da7 in poll (__timeout=500, __nfds=65, __fds=0x7f3251ff19e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5603cceffa40) at proxy.cc:1437
#3  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f322ffff000 (LWP 496138)):
#0  0x00007f326ef0b99f in __GI___poll (fds=0x7f3228000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f326d617ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f326d62237a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f326d613606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f3264a0f000 (LWP 496132)):
#0  0x00007f326ef1846e in epoll_wait (epfd=55, events=events@entry=0x7f3264a04680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f32670d7400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f3264a047cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f32670b99d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f3264a047d0) at sys/event_set.c:198
#2  0x00007f32670b9b62 in ucs_async_thread_func (arg=0x5603c898c0c0) at async/thread.c:130
#3  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f3265d0e000 (LWP 496128)):
#0  0x00007f326ef0b99f in __GI___poll (fds=0x5603c8900250, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f326d617ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f326d62237a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f326d613606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f326c957000 (LWP 496123)):
#0  0x00007f326ef1846e in epoll_wait (epfd=12, events=events@entry=0x5603c87af8a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f326ec99541 in epoll_dispatch (base=0x5603c87af5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f326ec9c92d in opal_libevent2022_event_base_loop (base=0x5603c87af5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f326caa6666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f326d346000 (LWP 496118)):
#0  0x00007f326ef1846e in epoll_wait (epfd=8, events=events@entry=0x5603c878a960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f326ec99541 in epoll_dispatch (base=0x5603c878a6b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f326ec9c92d in opal_libevent2022_event_base_loop (base=0x5603c878a6b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f326ec54ff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f326f33f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f326ef18133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f326ebf0000 (LWP 496113)):
#0  0x00007f326d6c84c0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f326d55d75f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f326d7dd102 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f326d75d126 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f326d75e3b1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f326d5b1b56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f326d7dd60f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f326d56b7bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f326d770750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f326d50d49f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f326d50fbaf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f326d5c87f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#12 0x00007f326f46b10b in __cudart803 () from /usr/local/lib/libnccl.so.2
#13 0x00007f326f4c5fe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#14 0x00007f326f3d07bd in ncclLaunchKernel (comm=comm@entry=0x5603c8f33f80, plan=plan@entry=0x56040e7b2b80) at enqueue.cc:1092
#15 0x00007f326f3d5dd3 in doLaunches (head=<optimized out>) at group.cc:163
#16 groupLaunch (job_=<optimized out>) at group.cc:325
#17 0x00007f326f3d6908 in ncclGroupEndInternal () at group.cc:406
#18 ncclGroupEndInternal () at group.cc:361
#19 0x00007f326f3d704b in ncclGroupEnd () at group.cc:96
#20 0x00005603bdd9aa7f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()
#21 0x00005603bdc54210 in fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#22 0x00005603bdc3d8c5 in void glm_example<__half>(INIReader) ()
#23 0x00005603bdc22287 in main ()

========rank1=================

(gdb)  thread apply all bt

Thread 10 (Thread 0x7f052dfff000 (LWP 496155)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f054cbfc158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f054cbfc108, cond=0x7f054cbfc130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f054cbfc130, mutex=mutex@entry=0x7f054cbfc108) at pthread_cond_wait.c:647
#3  0x00007f0565281740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5646d1c9ad80) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5646d1c9ad80) at proxy.cc:868
#5  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f0559f96000 (LWP 496151)):
#0  0x00007f0564db399f in __GI___poll (fds=fds@entry=0x7f0559f899e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0565281da7 in poll (__timeout=500, __nfds=65, __fds=0x7f0559f899e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5646d1c9ad80) at proxy.cc:1437
#3  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f054d7fe000 (LWP 496148)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f05586aa158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f05586aa108, cond=0x7f05586aa130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f05586aa130, mutex=mutex@entry=0x7f05586aa108) at pthread_cond_wait.c:647
#3  0x00007f0565281740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5646d3f518c0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5646d3f518c0) at proxy.cc:868
#5  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f054dfff000 (LWP 496145)):
#0  0x00007f0564db399f in __GI___poll (fds=fds@entry=0x7f054dff29e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0565281da7 in poll (__timeout=500, __nfds=65, __fds=0x7f054dff29e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5646d3f518c0) at proxy.cc:1437
#3  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f055937c000 (LWP 496139)):
#0  0x00007f0564db399f in __GI___poll (fds=0x7f0528000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f05634bfca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f05634ca37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f05634bb606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f055a81b000 (LWP 496130)):
#0  0x00007f0564dc046e in epoll_wait (epfd=55, events=events@entry=0x7f055a810680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0560f87400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f055a8107cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f0560f699d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f055a8107d0) at sys/event_set.c:198
#2  0x00007f0560f69b62 in ucs_async_thread_func (arg=0x5646cf9df180) at async/thread.c:130
#3  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f0560b68000 (LWP 496126)):
#0  0x00007f0564db399f in __GI___poll (fds=0x5646cf953240, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f05634bfca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f05634ca37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f05634bb606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f05627ff000 (LWP 496121)):
#0  0x00007f0564dc046e in epoll_wait (epfd=12, events=events@entry=0x5646cf8038a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0564b41541 in epoll_dispatch (base=0x5646cf8035f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f0564b4492d in opal_libevent2022_event_base_loop (base=0x5646cf8035f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f056294e666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f05631ee000 (LWP 496117)):
#0  0x00007f0564dc046e in epoll_wait (epfd=8, events=events@entry=0x5646cf7de960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0564b41541 in epoll_dispatch (base=0x5646cf7de6b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f0564b4492d in opal_libevent2022_event_base_loop (base=0x5646cf7de6b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f0564afcff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f05651e7609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0564dc0133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f0564a98000 (LWP 496114)):
#0  0x00007ffe0477f6cb in ?? ()
#1  0x00007ffe0477f954 in clock_gettime ()
#2  0x00007f0564d7e0b5 in __GI___clock_gettime (clock_id=4, tp=0x7ffe0470ff50) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#3  0x00007f05634bfaef in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f0563459d83 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f056368560f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f05634137bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f0563618750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f05633b549f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f05633b7baf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f05634707f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f056531310b in __cudart803 () from /usr/local/lib/libnccl.so.2
#12 0x00007f056536dfe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#13 0x00007f05652787bd in ncclLaunchKernel (comm=comm@entry=0x5646cff85c00, plan=plan@entry=0x564715812b00) at enqueue.cc:1092
#14 0x00007f056527ddd3 in doLaunches (head=<optimized out>) at group.cc:163
#15 groupLaunch (job_=<optimized out>) at group.cc:325
#16 0x00007f056527e908 in ncclGroupEndInternal () at group.cc:406
#17 ncclGroupEndInternal () at group.cc:361
#18 0x00007f056527f04b in ncclGroupEnd () at group.cc:96
#19 0x00005646c59cca7f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()
#20 0x00005646c5883601 in fastertransformer::Glm<__half>::decode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*, int, bool) ()
#21 0x00005646c5886867 in fastertransformer::Glm<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#22 0x00005646c586f8da in void glm_example<__half>(INIReader) ()
#23 0x00005646c5854287 in main ()

========rank2=================

(gdb) thread apply all bt

Thread 10 (Thread 0x7f1681fff000 (LWP 496157)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f16887fb158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f16887fb108, cond=0x7f16887fb130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f16887fb130, mutex=mutex@entry=0x7f16887fb108) at pthread_cond_wait.c:647
#3  0x00007f16a0b96740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x562477a4b840) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x562477a4b840) at proxy.cc:868
#5  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f1695819000 (LWP 496153)):
#0  0x00007f16a06c899f in __GI___poll (fds=fds@entry=0x7f169580c9e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f16a0b96da7 in poll (__timeout=500, __nfds=65, __fds=0x7f169580c9e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x562477a4b840) at proxy.cc:1437
#3  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f16893fd000 (LWP 496149)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f16897fe158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f16897fe108, cond=0x7f16897fe130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f16897fe130, mutex=mutex@entry=0x7f16897fe108) at pthread_cond_wait.c:647
#3  0x00007f16a0b96740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x562479c5d2c0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x562479c5d2c0) at proxy.cc:868
#5  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f1689fff000 (LWP 496144)):
#0  0x00007f16a06c899f in __GI___poll (fds=fds@entry=0x7f1689ff29e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f16a0b96da7 in poll (__timeout=500, __nfds=65, __fds=0x7f1689ff29e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x562479c5d2c0) at proxy.cc:1437
#3  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f1695018000 (LWP 496141)):
#0  0x00007f16a06c899f in __GI___poll (fds=0x7f1660000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f169edd4ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f169eddf37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f169edd0606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f169601a000 (LWP 496131)):
#0  0x00007f16a06d546e in epoll_wait (epfd=55, events=events@entry=0x7f169600f680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f169c89c400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f169600f7cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f169c87e9d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f169600f7d0) at sys/event_set.c:198
#2  0x00007f169c87eb62 in ucs_async_thread_func (arg=0x5624756ea160) at async/thread.c:130
#3  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f169704d000 (LWP 496127)):
#0  0x00007f16a06c899f in __GI___poll (fds=0x56247565e220, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f169edd4ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f169eddf37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f169edd0606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f169e114000 (LWP 496124)):
#0  0x00007f16a06d546e in epoll_wait (epfd=12, events=events@entry=0x56247550e8a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f16a0456541 in epoll_dispatch (base=0x56247550e5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f16a045992d in opal_libevent2022_event_base_loop (base=0x56247550e5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f169e263666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f169eb03000 (LWP 496119)):
#0  0x00007f16a06d546e in epoll_wait (epfd=8, events=events@entry=0x5624754e9960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f16a0456541 in epoll_dispatch (base=0x5624754e96b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f16a045992d in opal_libevent2022_event_base_loop (base=0x5624754e96b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f16a0411ff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f16a0afc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f16a06d5133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f16a03ad000 (LWP 496115)):
#0  0x00007f169ee85534 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f169ed1a75f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f169ef9a102 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f169ef1a126 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f169ef1b3b1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f169ed6eb56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f169ef9a60f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f169ed287bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f169ef2d750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f169ecca49f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f169ecccbaf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f169ed857f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#12 0x00007f16a0c2810b in __cudart803 () from /usr/local/lib/libnccl.so.2
#13 0x00007f16a0c82fe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#14 0x00007f16a0b8d7bd in ncclLaunchKernel (comm=comm@entry=0x562475c915f0, plan=plan@entry=0x5624bb5223f0) at enqueue.cc:1092
#15 0x00007f16a0b92dd3 in doLaunches (head=<optimized out>) at group.cc:163
#16 groupLaunch (job_=<optimized out>) at group.cc:325
#17 0x00007f16a0b93908 in ncclGroupEndInternal () at group.cc:406
#18 ncclGroupEndInternal () at group.cc:361
#19 0x00007f16a0b9404b in ncclGroupEnd () at group.cc:96
#20 0x000056246badaa7f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()
#21 0x000056246b991601 in fastertransformer::Glm<__half>::decode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*, int, bool) ()
#22 0x000056246b994867 in fastertransformer::Glm<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#23 0x000056246b97d8da in void glm_example<__half>(INIReader) ()
#24 0x000056246b962287 in main ()

========rank3=================

(gdb) thread apply all bt

Thread 10 (Thread 0x7f3152ffc000 (LWP 496154)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f31533fd158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f31533fd108, cond=0x7f31533fd130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f31533fd130, mutex=mutex@entry=0x7f31533fd108) at pthread_cond_wait.c:647
#3  0x00007f31a7690740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x555a36c34180) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x555a36c34180) at proxy.cc:868
#5  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f3191fff000 (LWP 496150)):
#0  0x00007f31a71c299f in __GI___poll (fds=fds@entry=0x7f3191ff29e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f31a7690da7 in poll (__timeout=500, __nfds=65, __fds=0x7f3191ff29e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x555a36c34180) at proxy.cc:1437
#3  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f3153fff000 (LWP 496147)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f31905fc158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f31905fc108, cond=0x7f31905fc130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f31905fc130, mutex=mutex@entry=0x7f31905fc108) at pthread_cond_wait.c:647
#3  0x00007f31a7690740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x555a381055e0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x555a381055e0) at proxy.cc:868
#5  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f3190dfd000 (LWP 496142)):
#0  0x00007f31a71c299f in __GI___poll (fds=fds@entry=0x7f3190df09e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f31a7690da7 in poll (__timeout=500, __nfds=65, __fds=0x7f3190df09e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x555a381055e0) at proxy.cc:1437
#3  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f31917fe000 (LWP 496140)):
#0  0x00007f31a71c299f in __GI___poll (fds=0x7f3174000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f31a58ceca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f31a58d937a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f31a58ca606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f319ccc5000 (LWP 496129)):
#0  0x00007f31a71cf46e in epoll_wait (epfd=55, events=events@entry=0x7f319ccba680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f319f395400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f319ccba7cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f319f3779d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f319ccba7d0) at sys/event_set.c:198
#2  0x00007f319f377b62 in ucs_async_thread_func (arg=0x555a33b99080) at async/thread.c:130
#3  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f319dfc4000 (LWP 496125)):
#0  0x00007f31a71c299f in __GI___poll (fds=0x555a33b0d220, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f31a58ceca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f31a58d937a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f31a58ca606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f31a4c0e000 (LWP 496122)):
#0  0x00007f31a71cf46e in epoll_wait (epfd=12, events=events@entry=0x555a339bd8a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f31a6f50541 in epoll_dispatch (base=0x555a339bd5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f31a6f5392d in opal_libevent2022_event_base_loop (base=0x555a339bd5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f31a4d5d666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f31a55fd000 (LWP 496120)):
#0  0x00007f31a71cf46e in epoll_wait (epfd=8, events=events@entry=0x555a33998960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f31a6f50541 in epoll_dispatch (base=0x555a339986b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f31a6f5392d in opal_libevent2022_event_base_loop (base=0x555a339986b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f31a6f0bff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f31a75f6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f31a71cf133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f31a6ea7000 (LWP 496116)):
#0  0x00007f31a58147ed in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f31a5a94102 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f31a5a14126 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f31a5a153b1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f31a5868b56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f31a5a9460f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f31a58227bc in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f31a5a27750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f31a57c449f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f31a57c6baf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f31a587f7f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#11 0x00007f31a772210b in __cudart803 () from /usr/local/lib/libnccl.so.2
#12 0x00007f31a777cfe6 in cudaLaunchKernel () from /usr/local/lib/libnccl.so.2
#13 0x00007f31a76877bd in ncclLaunchKernel (comm=comm@entry=0x555a34139b10, plan=plan@entry=0x555a79a14c30) at enqueue.cc:1092
#14 0x00007f31a768cdd3 in doLaunches (head=<optimized out>) at group.cc:163
#15 groupLaunch (job_=<optimized out>) at group.cc:325
#16 0x00007f31a768d908 in ncclGroupEndInternal () at group.cc:406
#17 ncclGroupEndInternal () at group.cc:361
#18 0x00007f31a768e04b in ncclGroupEnd () at group.cc:96
#19 0x0000555a29028a7f in void fastertransformer::ftNcclAllReduceSum<__half>(__half const*, __half*, int, fastertransformer::NcclParam, CUstream_st*) ()
#20 0x0000555a28edf601 in fastertransformer::Glm<__half>::decode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*, int, bool) ()
#21 0x0000555a28ee2867 in fastertransformer::Glm<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#22 0x0000555a28ecb8da in void glm_example<__half>(INIReader) ()
#23 0x0000555a28eb0287 in main ()

======nccl warn log=========

Total ranks: 4.
Device NVIDIA A800-SXM4-80GB
P1 is runing with 1 GPU.
Device NVIDIA A800-SXM4-80GB
P3 is runing with 3 GPU.
Device NVIDIA A800-SXM4-80GB
P2 is runing with 2 GPU.
Device NVIDIA A800-SXM4-80GB
P0 is runing with 0 GPU.
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6

@sjeaugey thank you

@junior-zsy
Author

junior-zsy commented Jun 30, 2023

Below are the program stacks and logs from a run without CUDA_LAUNCH_BLOCKING=1 set.

===========rank 0 ===============

(gdb) thread apply all bt

Thread 10 (Thread 0x7f535e7fb000 (LWP 500879)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f535ebfc158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f535ebfc108, cond=0x7f535ebfc130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f535ebfc130, mutex=mutex@entry=0x7f535ebfc108) at pthread_cond_wait.c:647
#3  0x00007f539f454740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x55d847bc00c0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x55d847bc00c0) at proxy.cc:868
#5  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f53817fd000 (LWP 500875)):
#0  0x00007f539ef8699f in __GI___poll (fds=fds@entry=0x7f53817f09e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f539f454da7 in poll (__timeout=500, __nfds=65, __fds=0x7f53817f09e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x55d847bc00c0) at proxy.cc:1437
#3  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f535f7fe000 (LWP 500870)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f538043c158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f538043c108, cond=0x7f538043c130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f538043c130, mutex=mutex@entry=0x7f538043c108) at pthread_cond_wait.c:647
#3  0x00007f539f454740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x55d849152920) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x55d849152920) at proxy.cc:868
#5  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f5381ffe000 (LWP 500866)):
#0  0x00007f539ef8699f in __GI___poll (fds=fds@entry=0x7f5381ff19e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f539f454da7 in poll (__timeout=500, __nfds=65, __fds=0x7f5381ff19e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x55d849152920) at proxy.cc:1437
#3  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f535ffff000 (LWP 500861)):
#0  0x00007f539ef8699f in __GI___poll (fds=0x7f5358000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f539d692ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f539d69d37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f539d68e606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f5394a89000 (LWP 500855)):
#0  0x00007f539ef9346e in epoll_wait (epfd=55, events=events@entry=0x7f5394a7e680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f539711f400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f5394a7e7cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f53971019d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f5394a7e7d0) at sys/event_set.c:198
#2  0x00007f5397101b62 in ucs_async_thread_func (arg=0x55d844bdf0f0) at async/thread.c:130
#3  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f5395d88000 (LWP 500852)):
#0  0x00007f539ef8699f in __GI___poll (fds=0x55d844b53290, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f539d692ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f539d69d37a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f539d68e606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f539c9d2000 (LWP 500847)):
#0  0x00007f539ef9346e in epoll_wait (epfd=12, events=events@entry=0x55d844a028a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f539ed14541 in epoll_dispatch (base=0x55d844a025f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f539ed1792d in opal_libevent2022_event_base_loop (base=0x55d844a025f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f539cb21666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f539d3c1000 (LWP 500842)):
#0  0x00007f539ef9346e in epoll_wait (epfd=8, events=events@entry=0x55d8449dd960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f539ed14541 in epoll_dispatch (base=0x55d8449dd6b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f539ed1792d in opal_libevent2022_event_base_loop (base=0x55d8449dd6b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f539eccfff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f539f3ba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f539ef93133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f539ec6b000 (LWP 500836)):
#0  0x00007f539ef7671b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f539d5d9321 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f539d7d8214 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f539d76fb9c in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f539d770024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f539d5e52f0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f539d7eb750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f539d58849f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f539d58abaf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f539d6437f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f53af19333c in ?? () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#11 0x00007f53af1e8cb6 in cudaLaunchKernel () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#12 0x000055d83a04afcb in void fastertransformer::transpose<__half2>(__half2*, __half2*, int, int, int, int) ()
#13 0x000055d83a055cfb in void fastertransformer::invokeTransposeQKV<__half>(__half*, __half*, int, int, int, int, CUstream_st*) ()
#14 0x000055d83a0443b4 in fastertransformer::GlmContextAttentionLayer<__half>::forward(std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> >*, std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> > const*, fastertransformer::AttentionWeight<__half> const*) ()
#15 0x000055d83a03f3ae in fastertransformer::TensorParallelGlmContextAttentionLayer<__half>::forward(std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> >*, std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> > const*, fastertransformer::AttentionWeight<__half> const*) ()
#16 0x000055d83a01f317 in fastertransformer::GlmContextDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*, fastertransformer::LayerNormWeight<__half> const*) ()
#17 0x000055d839ef8d29 in fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#18 0x000055d839ee28c5 in void glm_example<__half>(INIReader) ()
#19 0x000055d839ec7287 in main ()

===========rank 1 ===============

(gdb) thread apply all bt

Thread 10 (Thread 0x7f8cb0bfb000 (LWP 500880)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f8cb0ffc158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f8cb0ffc108, cond=0x7f8cb0ffc130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f8cb0ffc130, mutex=mutex@entry=0x7f8cb0ffc108) at pthread_cond_wait.c:647
#3  0x00007f8ccd54b740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x55c52e390010) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x55c52e390010) at proxy.cc:868
#5  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f8cc2260000 (LWP 500876)):
#0  0x00007f8ccd07d99f in __GI___poll (fds=fds@entry=0x7f8cc22539e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f8ccd54bda7 in poll (__timeout=500, __nfds=65, __fds=0x7f8cc22539e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x55c52e390010) at proxy.cc:1437
#3  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f8cb1bfe000 (LWP 500871)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f8cb1fff158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f8cb1fff108, cond=0x7f8cb1fff130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f8cb1fff130, mutex=mutex@entry=0x7f8cb1fff108) at pthread_cond_wait.c:647
#3  0x00007f8ccd54b740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x55c530646b50) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x55c530646b50) at proxy.cc:868
#5  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f8cc0974000 (LWP 500867)):
#0  0x00007f8ccd07d99f in __GI___poll (fds=fds@entry=0x7f8cc09679e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f8ccd54bda7 in poll (__timeout=500, __nfds=65, __fds=0x7f8cc09679e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x55c530646b50) at proxy.cc:1437
#3  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f8cc1646000 (LWP 500863)):
#0  0x00007f8ccd07d99f in __GI___poll (fds=0x7f8c94000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f8ccb789ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f8ccb79437a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f8ccb785606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f8cc2b7d000 (LWP 500854)):
#0  0x00007f8ccd08a46e in epoll_wait (epfd=55, events=events@entry=0x7f8cc2b72680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f8cc9251400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f8cc2b727cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f8cc92339d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f8cc2b727d0) at sys/event_set.c:198
#2  0x00007f8cc9233b62 in ucs_async_thread_func (arg=0x55c52c0da160) at async/thread.c:130
#3  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f8cc3e7c000 (LWP 500850)):
#0  0x00007f8ccd07d99f in __GI___poll (fds=0x55c52c04e220, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f8ccb789ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f8ccb79437a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f8ccb785606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f8ccaac9000 (LWP 500845)):
#0  0x00007f8ccd08a46e in epoll_wait (epfd=12, events=events@entry=0x55c52befe8a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f8ccce0b541 in epoll_dispatch (base=0x55c52befe5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f8ccce0e92d in opal_libevent2022_event_base_loop (base=0x55c52befe5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f8ccac18666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f8ccb4b8000 (LWP 500840)):
#0  0x00007f8ccd08a46e in epoll_wait (epfd=8, events=events@entry=0x55c52bed9960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f8ccce0b541 in epoll_dispatch (base=0x55c52bed96b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f8ccce0e92d in opal_libevent2022_event_base_loop (base=0x55c52bed96b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f8cccdc6ff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f8ccd4b1609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f8ccd08a133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f8cccd62000 (LWP 500837)):
#0  0x00007f8ccd06d71b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f8ccb6d0321 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f8ccb8cf214 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f8ccb866b9c in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f8ccb867024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f8ccb6dc2f0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f8ccb8e2750 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f8ccb67f49f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f8ccb681baf in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f8ccb73a7f2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#10 0x00007f8cdd28a33c in ?? () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#11 0x00007f8cdd2dfcb6 in cudaLaunchKernel () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#12 0x000055c521446fcb in void fastertransformer::transpose<__half2>(__half2*, __half2*, int, int, int, int) ()
#13 0x000055c521451cfb in void fastertransformer::invokeTransposeQKV<__half>(__half*, __half*, int, int, int, int, CUstream_st*) ()
#14 0x000055c5214403b4 in fastertransformer::GlmContextAttentionLayer<__half>::forward(std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> >*, std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> > const*, fastertransformer::AttentionWeight<__half> const*) ()
#15 0x000055c52143b3ae in fastertransformer::TensorParallelGlmContextAttentionLayer<__half>::forward(std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> >*, std::vector<fastertransformer::Tensor, std::allocator<fastertransformer::Tensor> > const*, fastertransformer::AttentionWeight<__half> const*) ()
#16 0x000055c52141b317 in fastertransformer::GlmContextDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*, fastertransformer::LayerNormWeight<__half> const*) ()
#17 0x000055c5212f4d29 in fastertransformer::Glm<__half>::encode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#18 0x000055c5212de8c5 in void glm_example<__half>(INIReader) ()
#19 0x000055c5212c3287 in main ()

===========rank 2===============

(gdb) thread apply all bt

Thread 10 (Thread 0x7fd4b17fd000 (LWP 500878)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd4b1bfe158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fd4b1bfe108, cond=0x7fd4b1bfe130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7fd4b1bfe130, mutex=mutex@entry=0x7fd4b1bfe108) at pthread_cond_wait.c:647
#3  0x00007fd4cbed7740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5636e7803740) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5636e7803740) at proxy.cc:868
#5  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7fd4c0bef000 (LWP 500874)):
#0  0x00007fd4cba0999f in __GI___poll (fds=fds@entry=0x7fd4c0be29e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fd4cbed7da7 in poll (__timeout=500, __nfds=65, __fds=0x7fd4c0be29e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5636e7803740) at proxy.cc:1437
#3  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7fd4b4aef000 (LWP 500872)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd4b4ef0158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fd4b4ef0108, cond=0x7fd4b4ef0130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7fd4b4ef0130, mutex=mutex@entry=0x7fd4b4ef0108) at pthread_cond_wait.c:647
#3  0x00007fd4cbed7740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x5636e9a151c0) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x5636e9a151c0) at proxy.cc:868
#5  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7fd4b56f1000 (LWP 500868)):
#0  0x00007fd4cba0999f in __GI___poll (fds=fds@entry=0x7fd4b56e49e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fd4cbed7da7 in poll (__timeout=500, __nfds=65, __fds=0x7fd4b56e49e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x5636e9a151c0) at proxy.cc:1437
#3  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fd4b5fff000 (LWP 500862)):
#0  0x00007fd4cba0999f in __GI___poll (fds=0x7fd494000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fd4ca115ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fd4ca12037a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fd4ca111606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fd4c150c000 (LWP 500853)):
#0  0x00007fd4cba1646e in epoll_wait (epfd=55, events=events@entry=0x7fd4c1501680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fd4c3ae3400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7fd4c15017cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7fd4c3ac59d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7fd4c15017d0) at sys/event_set.c:198
#2  0x00007fd4c3ac5b62 in ucs_async_thread_func (arg=0x5636e54a2160) at async/thread.c:130
#3  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fd4c2805000 (LWP 500849)):
#0  0x00007fd4cba0999f in __GI___poll (fds=0x5636e5416220, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fd4ca115ca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fd4ca12037a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fd4ca111606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fd4c9455000 (LWP 500844)):
#0  0x00007fd4cba1646e in epoll_wait (epfd=12, events=events@entry=0x5636e52c68a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fd4cb797541 in epoll_dispatch (base=0x5636e52c65f0, tv=<optimized out>) at epoll.c:407
#2  0x00007fd4cb79a92d in opal_libevent2022_event_base_loop (base=0x5636e52c65f0, flags=flags@entry=1) at event.c:1630
#3  0x00007fd4c95a4666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fd4c9e44000 (LWP 500841)):
#0  0x00007fd4cba1646e in epoll_wait (epfd=8, events=events@entry=0x5636e52a1960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fd4cb797541 in epoll_dispatch (base=0x5636e52a16b0, tv=<optimized out>) at epoll.c:407
#2  0x00007fd4cb79a92d in opal_libevent2022_event_base_loop (base=0x5636e52a16b0, flags=flags@entry=1) at event.c:1630
#3  0x00007fd4cb752ff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007fd4cbe3d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fd4cba16133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fd4cb6ee000 (LWP 500838)):
#0  0x00007fd4ca0afb41 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007fd4ca0915c2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fd4ca2d31b5 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007fd4ca0c6ebd in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007fd4dbc14c90 in ?? () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#5  0x00007fd4dbc6ba38 in cudaStreamSynchronize () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#6  0x00005636d9e7ed18 in fastertransformer::GlmDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*) ()
#7  0x00005636d9e70929 in fastertransformer::Glm<__half>::decode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*, int, bool) ()
#8  0x00005636d9e75867 in fastertransformer::Glm<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#9  0x00005636d9e5e8da in void glm_example<__half>(INIReader) ()
#10 0x00005636d9e43287 in main ()

===========rank 3===============

(gdb) thread apply all bt

Thread 10 (Thread 0x7f0c417fd000 (LWP 500877)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0c41bfe158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0c41bfe108, cond=0x7f0c41bfe130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f0c41bfe130, mutex=mutex@entry=0x7f0c41bfe108) at pthread_cond_wait.c:647
#3  0x00007f0c77c7c740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x561fc7403830) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x561fc7403830) at proxy.cc:868
#5  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f0c6c992000 (LWP 500873)):
#0  0x00007f0c777ae99f in __GI___poll (fds=fds@entry=0x7f0c6c9859e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0c77c7cda7 in poll (__timeout=500, __nfds=65, __fds=0x7f0c6c9859e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x561fc7403830) at proxy.cc:1437
#3  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f0c6086e000 (LWP 500869)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0c60c6f158) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0c60c6f108, cond=0x7f0c60c6f130) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7f0c60c6f130, mutex=mutex@entry=0x7f0c60c6f108) at pthread_cond_wait.c:647
#3  0x00007f0c77c7c740 in ncclProxyGetPostedOps (added=<synthetic pointer>, proxyState=0x561fc88d4c90) at proxy.cc:713
#4  ncclProxyProgress (proxyState_=0x561fc88d4c90) at proxy.cc:868
#5  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f0c61470000 (LWP 500865)):
#0  0x00007f0c777ae99f in __GI___poll (fds=fds@entry=0x7f0c614639e0, nfds=nfds@entry=65, timeout=timeout@entry=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0c77c7cda7 in poll (__timeout=500, __nfds=65, __fds=0x7f0c614639e0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  ncclProxyService (_args=0x561fc88d4c90) at proxy.cc:1437
#3  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f0c61fff000 (LWP 500864)):
#0  0x00007f0c777ae99f in __GI___poll (fds=0x7f0c3c000c10, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0c75ebaca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f0c75ec537a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f0c75eb6606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f0c6d2af000 (LWP 500851)):
#0  0x00007f0c777bb46e in epoll_wait (epfd=55, events=events@entry=0x7f0c6d2a4680, maxevents=16, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0c6f91f400 in ucs_event_set_wait (event_set=<optimized out>, num_events=num_events@entry=0x7f0c6d2a47cc, timeout_ms=<optimized out>, event_set_handler=event_set_handler@entry=0x7f0c6f9019d0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f0c6d2a47d0) at sys/event_set.c:198
#2  0x00007f0c6f901b62 in ucs_async_thread_func (arg=0x561fc4362160) at async/thread.c:130
#3  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f0c6e5a8000 (LWP 500848)):
#0  0x00007f0c777ae99f in __GI___poll (fds=0x561fc42d6220, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f0c75ebaca1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f0c75ec537a in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f0c75eb6606 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f0c751fa000 (LWP 500846)):
#0  0x00007f0c777bb46e in epoll_wait (epfd=12, events=events@entry=0x561fc41868a0, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0c7753c541 in epoll_dispatch (base=0x561fc41865f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f0c7753f92d in opal_libevent2022_event_base_loop (base=0x561fc41865f0, flags=flags@entry=1) at event.c:1630
#3  0x00007f0c75349666 in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f0c75be9000 (LWP 500843)):
#0  0x00007f0c777bb46e in epoll_wait (epfd=8, events=events@entry=0x561fc4161960, maxevents=32, timeout=timeout@entry=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0c7753c541 in epoll_dispatch (base=0x561fc41616b0, tv=<optimized out>) at epoll.c:407
#2  0x00007f0c7753f92d in opal_libevent2022_event_base_loop (base=0x561fc41616b0, flags=flags@entry=1) at event.c:1630
#3  0x00007f0c774f7ff6 in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f0c77be2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0c777bb133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f0c77493000 (LWP 500839)):
#0  0x00007f0c75f6b534 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f0c75e0075f in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f0c76080102 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f0c76000126 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f0c760013b1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#5  0x00007f0c75e54b56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#6  0x00007f0c75e365c2 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#7  0x00007f0c760781b5 in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#8  0x00007f0c75e6bebd in ?? () from /lib/x86_64-linux-gnu/libcuda.so
#9  0x00007f0c879b9c90 in ?? () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#10 0x00007f0c87a10a38 in cudaStreamSynchronize () from /usr/local/cuda-11.6/lib64/libcudart.so.11.0
#11 0x0000561fb9d81d18 in fastertransformer::GlmDecoder<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, std::vector<fastertransformer::GlmDecoderLayerWeight<__half>*, std::allocator<fastertransformer::GlmDecoderLayerWeight<__half>*> > const*) ()
#12 0x0000561fb9d73929 in fastertransformer::Glm<__half>::decode(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*, int, bool) ()
#13 0x0000561fb9d78867 in fastertransformer::Glm<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GlmWeight<__half> const*) ()
#14 0x0000561fb9d618da in void glm_example<__half>(INIReader) ()
#15 0x0000561fb9d46287 in main ()

======nccl warn log=========

Total ranks: 4.
Device NVIDIA A800-SXM4-80GB
P1 is runing with 1 GPU.
Device NVIDIA A800-SXM4-80GB
P3 is runing with 3 GPU.
Device NVIDIA A800-SXM4-80GB
P2 is runing with 2 GPU.
Device NVIDIA A800-SXM4-80GB
P0 is runing with 0 GPU.
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6
NCCL version 2.18.3+cuda11.6

@sjeaugey thank you

@sjeaugey
Member

OK, there is nothing in the backtrace or in the log. It looks like the CUDA kernels are stuck.

I see two ncclProxyService and ncclProxyProgress threads on each process. That indicates there are two NCCL communicators. Could some operation on the other communicator have been launched before the allreduceSum on some ranks but after the allreduceSum on other ranks, causing a deadlock?
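
To make that hypothesis concrete, here is a minimal host-side sketch (no NCCL or GPU involved) that compares the order in which each rank issued its collectives. It assumes you have logged the per-rank sequence yourself; `IssuedOp`, `sameOp`, and `firstOrderMismatch` are hypothetical names for illustration, not part of NCCL or FasterTransformer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One collective as issued by a rank. Hypothetical record type:
// "comm" and "op" are free-form labels you choose when logging.
struct IssuedOp {
    std::string comm;  // which communicator, e.g. "commA"
    std::string op;    // which operation, e.g. "allreduce"
};

inline bool sameOp(const IssuedOp& a, const IssuedOp& b)
{
    return a.comm == b.comm && a.op == b.op;
}

// Compares every rank's issue order against rank 0.
// Returns the earliest index at which any rank diverges, or -1 if all agree.
// A length difference counts as a divergence at the shorter length.
int firstOrderMismatch(const std::vector<std::vector<IssuedOp>>& perRank)
{
    assert(!perRank.empty());
    const std::vector<IssuedOp>& ref = perRank[0];
    int first = -1;
    for (std::size_t r = 1; r < perRank.size(); ++r) {
        const std::vector<IssuedOp>& seq = perRank[r];
        const std::size_t common = std::min(ref.size(), seq.size());
        std::size_t i = 0;
        while (i < common && sameOp(ref[i], seq[i])) {
            ++i;
        }
        const bool differs = (i < common) || (ref.size() != seq.size());
        if (differs && (first < 0 || static_cast<int>(i) < first)) {
            first = static_cast<int>(i);
        }
    }
    return first;
}
```

A result of -1 means every rank issued the same sequence; any other value is the first position worth inspecting, for example rank 0 issuing commA then commB while rank 1 issues commB then commA.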

@junior-zsy
Author

Thank you very much for your analysis. Do you have any suggestions on how to troubleshoot the cause of the deadlock? It does not happen every time, only occasionally, which makes it difficult to troubleshoot. I hope you can provide some suggestions. Or is there anything else you need me to provide? Thank you @sjeaugey

@sjeaugey
Member

You may want to:

  1. Confirm that you have two NCCL communicators (e.g. to communicate in multiple dimensions).
  2. Check how those two communicators are used. Can their operations overlap? There could be options in the way you launch things that affect how the different communication patterns apply and when.
  3. Run cuda-gdb to see the running kernels. That could confirm which kernel is running on which GPU, and maybe show that one GPU is running a different operation than the others.
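
For the first two steps, one low-tech option is to record which communicator handle each collective call receives and then count the distinct handles. The sketch below is host-only and takes an opaque `const void*` in place of a real communicator handle; `CommUsageLog`, `record`, `distinctComms`, and `callsOn` are hypothetical names, not NCCL or FasterTransformer APIs, and where you call `record` in your own code is up to you.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: remembers which operations were issued on which
// communicator handle, so you can count how many distinct handles are in use.
class CommUsageLog {
public:
    // Call next to each collective launch, passing the communicator
    // handle as an opaque pointer plus a label for the operation.
    void record(const void* comm, const std::string& op)
    {
        assert(comm != nullptr);
        callsByComm_[comm].push_back(op);
    }

    // Number of distinct communicator handles seen so far.
    std::size_t distinctComms() const { return callsByComm_.size(); }

    // Number of operations recorded on one handle (0 if never seen).
    std::size_t callsOn(const void* comm) const
    {
        auto it = callsByComm_.find(comm);
        return it == callsByComm_.end() ? 0 : it->second.size();
    }

private:
    std::map<const void*, std::vector<std::string>> callsByComm_;
};
```

If `distinctComms()` reports more than one handle on a rank, that rank really is driving several communicators, and the per-handle lists show which operations go where.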

@junior-zsy
Author

According to my code logic, I should have only one NCCL communicator. This is my code:
if (tensor_para_.world_size_ > 1) {
    if (!use_custom_all_reduce_kernel) {
        ftNcclAllReduceSum(attention_out,
                           attention_out,
                           batch_size * hidden_units,
                           tensor_para_,
                           GlmDecoderSelfAttentionLayer::stream_);
    }
    else {
        custom_all_reduce_comm_->customAllReduce(batch_size * hidden_units, GlmDecoderSelfAttentionLayer::stream_);
    }
    sync_check_cuda_error();
}

template<typename T>
void ftNcclAllReduceSum(const T* send_buf, T* recv_buf, const int data_size, NcclParam nccl_param, cudaStream_t stream)
{
#ifdef BUILD_MULTI_GPU
    ncclDataType_t nccl_data_type = getNcclDataType<T>();
    NCCLCHECK(ncclGroupStart());
    NCCLCHECK(ncclAllReduce(
        (const void*)send_buf, (void*)recv_buf, data_size, nccl_data_type, ncclSum, nccl_param.nccl_comm_, stream));
    NCCLCHECK(ncclGroupEnd());
#endif
}

Shouldn't this function, ftNcclAllReduceSum, use only one NCCL communicator? Thanks @sjeaugey

@junior-zsy
Author

Thank you very much. I have found the issue with the deadlock

@tuanzhangCS

Thank you very much. I have found the issue with the deadlock

So what's the root cause in your case?
