
NCCL allreduce hangs when cudaFreeHost #48

Closed
fmana opened this issue Sep 22, 2016 · 4 comments


fmana commented Sep 22, 2016

Hi NCCL team,

I have integrated your library into my application, in a multi-threaded scenario. Each thread calls allreduce, and the allreduce sits inside a loop: the first part of the loop body computes intermediate data, and at the end of the body the thread enters allreduce.

It works perfectly, but from time to time it falls into a deadlock. Attaching gdb to the process, I can see that (N-1) threads are inside cudaStreamSynchronize() (each allreduce has its own custom CUDA stream), while one thread is inside cuMemFreeHost() (I use the CUDA allocators for both GPU and pinned CPU memory).
What happens is that, during the first part of the loop body, one thread needs to reallocate some memory before doing its processing, while the other (N-1) threads finish their processing and enter the NCCL allreduce.
From time to time this creates a deadlock. My guess is that there is a timing condition under which the threads' actions produce the deadlock. It is not deterministic: the reallocation happens deterministically after a certain number of iterations, but it does not always produce a deadlock.

Could you help me in some way?

It is not clear to me whether this is a CUDA issue, an NCCL/CUDA bug, or a CUDA limitation.
Does any memory-management action (alloc/free, on the CPU or the GPU) require that the GPUs be idle?

I rely on NCCL's GPU-based synchronization; there is no CPU-based barrier before entering allreduce(). Do I need to add a CPU-based barrier? Is there any safe C/C++ code I could use in that case?

Thanks a lot,
Franco

Below are some details from gdb. Backtrace of the thread stuck in cuMemFreeHost():
(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc41538ffde in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415412916 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415412fa8 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153793fc in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007fc415347392 in cuMemFreeHost () from /usr/lib64/libcuda.so.1
#9 0x00007fc41ac6284d in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#10 0x00007fc41ac4782c in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5

Backtrace of one of the (N-1) threads stuck in cudaStreamSynchronize():

(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc415414e33 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415414f89 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415388c87 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153610c2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00007fc41ac40d90 in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#9 0x00007fc41ac781fd in cudaStreamSynchronize () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5


fmana commented Sep 22, 2016

I forgot to add some info:

1] I'm using NCCL 1.1.1.
2] My application is multi-threaded and each thread drives one GPU (so it is multi-GPU as well).
3] When the application deadlocks, the load on (N-1) GPUs is 100% and on 1 GPU it is 0%.
Probably the (N-1) GPUs are polling, waiting for the handshake with the missing one,
while that 1 GPU is stuck in cuMemFreeHost (queued up somewhere).


CFAndy commented Oct 17, 2016

Hi fmana, any update?


fmana commented Oct 17, 2016

The issue still remains; to work around it I need to use a CPU-based barrier to protect the NCCL allreduce primitive.
So I would need the CUDA programming-model limitation noted in #37 to be fixed.

Any news on issue #49 about future NCCL deployment?

sjeaugey (Member) commented

Should be fixed since NCCL 2.1. Please re-open if still a problem.

minsii added a commit to minsii/nccl that referenced this issue Mar 22, 2024
Summary: Pull Request resolved: facebookresearch#48

Differential Revision: D55174222
minsii added a commit to minsii/nccl that referenced this issue Mar 24, 2024