
NCCL allreduce hangs when cudaFreeHost #48

Closed
fmana opened this issue Sep 22, 2016 · 4 comments


fmana commented Sep 22, 2016

Hi NCCL team,

I have integrated your library into my application, in a multi-threaded scenario. Each thread calls allreduce, and the allreduce sits inside a loop: the first part of the loop body computes intermediate data, and at the end of the body the thread enters allreduce.

It works perfectly, but from time to time it falls into a deadlock. Attaching gdb to the process, I can see that (N-1) threads are inside cudaStreamSynchronize() (each allreduce has its own custom CUDA stream), while one thread is inside cuMemFreeHost() (I use the CUDA allocators for both GPU and pinned CPU memory).
What happens is that, during the first part of the loop body, one thread needs to reallocate some memory before doing its processing, while the other (N-1) threads finish their processing and enter the NCCL allreduce.
From time to time this creates a deadlock. My guess is that there is a timing condition under which the threads' actions produce the deadlock. It is not deterministic: the reallocation happens deterministically after a certain number of iterations, but it does not always produce a deadlock.

Could you help me in some way?

It is not clear to me whether this is a CUDA issue, an NCCL/CUDA bug, or a CUDA limitation.
Does any memory-management action (alloc/free, on the CPU or the GPU) require that the GPUs be idle?

I rely on NCCL's GPU-based synchronization; there is no CPU-based barrier before entering allreduce(). Do I need to add a CPU-based barrier? Is there any safe C/C++ code I could use in that case?

Thanks a lot,
Franco

Below are some details from gdb. Backtrace of the thread stuck in cuMemFreeHost():
(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc41538ffde in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415412916 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415412fa8 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153793fc in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007fc415347392 in cuMemFreeHost () from /usr/lib64/libcuda.so.1
#9 0x00007fc41ac6284d in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#10 0x00007fc41ac4782c in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5

Backtrace of one of the (N-1) threads stuck in cudaStreamSynchronize():

(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc415414e33 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415414f89 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415388c87 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153610c2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00007fc41ac40d90 in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#9 0x00007fc41ac781fd in cudaStreamSynchronize () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5


fmana commented Sep 22, 2016

I forgot to add some info:

1] I'm using NCCL 1.1.1.
2] My application is multi-threaded and each thread drives one GPU (so it is multi-GPU as well).
3] When the application deadlocks, the load on (N-1) GPUs is 100% and on 1 GPU it is 0%.
Probably the (N-1) GPUs are polling, waiting for the handshake with the missing one,
while that 1 GPU is stuck in cuMemFreeHost (queued up somewhere).


CFAndy commented Oct 17, 2016

Hi fmana, any update?


fmana commented Oct 17, 2016

The issue still remains; to work around it I need to use a CPU-based barrier to protect the NCCL allreduce primitive.
So I would need the CUDA programming-model limitation noted in #37 to be fixed.

Any news on issue #49 about future NCCL deployment?

sjeaugey (Member) commented

Should be fixed since NCCL 2.1. Please re-open if still a problem.

minsii added a commit to minsii/nccl that referenced this issue Mar 22, 2024
Summary: Pull Request resolved: facebookresearch#48

Differential Revision: D55174222
minsii added a commit to minsii/nccl that referenced this issue Mar 24, 2024