
Can nccl be used with multiple processes per GPU? #32

Closed
CCorfield opened this issue Jun 22, 2016 · 13 comments

Comments

@CCorfield

The header file suggests that the number of ranks (or tasks) must be less than (or equal to) the number of devices. However, it would be convenient to have, say, two processes training their own copies of a neural net on the same GPU and then using the reduce and bcast functionality to transfer data between the models during an update. Specifically, using reduce to sum all the gradient parameters onto the master nnet, and then, after updating the master nnet's parameters, using bcast to send the updated parameters to the slave nnets. Is this already possible, or do I need to wait for an enhancement?

@nluehr
Contributor

nluehr commented Jun 22, 2016

In general what you suggest is already possible, but keep in mind that all ranks must run the reduction/bcast kernels simultaneously. You can cause deadlocks, for example, by running more ranks on a GPU than its SM resources can accommodate (a non-issue for a handful of ranks on a "big" GPU). Similarly, all ranks need to be managed within a single CUDA context (kernels from different contexts don't overlap). So either all ranks need to be controlled from a single process, or, if MPI is required, you'll need to use MPS.

The single-process tests included with NCCL can be used to test multiple ranks on a single GPU via the optional device flags. The following will run an all-reduce with three of the four ranks (and their buffers) on GPU 0 and the fourth on GPU 1.

$ nccl/build/test/single/all_reduce_test 1000000 4 0 0 0 1
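For illustration only, here is a minimal single-process sketch of the same idea written directly against the public NCCL C API (not code from the NCCL tests). It assumes the NCCL 1.x behavior discussed in this thread, where a device id may appear more than once in the device list so that several ranks share one GPU:

```c
/* Hedged sketch: several NCCL ranks managed from a single process.
 * Assumes a device id may repeat in the list (ranks sharing a GPU),
 * as in the test invocation above. */
#include <nccl.h>
#include <stdio.h>

int main(void) {
    const int nRanks = 4;
    int devs[4] = {0, 0, 0, 1};      /* three ranks on GPU 0, one on GPU 1 */
    ncclComm_t comms[4];

    /* All communicators are created inside this one process/CUDA context. */
    ncclResult_t ret = ncclCommInitAll(comms, nRanks, devs);
    if (ret != ncclSuccess) {
        fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(ret));
        return 1;
    }

    /* ... enqueue ncclAllReduce/ncclBcast on a per-rank stream here ... */

    for (int i = 0; i < nRanks; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```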

@CCorfield
Author

Many thanks for your quick reply. The particular use case I have in mind is to use the Parallel package in Torch to handle the fork/exec/synchronization between a parent and its worker processes. Once the workers (including the parent) have pushed data through their nnet models, they synchronize, which (I hope) meets your requirement for simultaneity: that would be the point for them to call nccl reduce to accumulate their gradient data in the parent's grad parameters, followed by an nccl bcast in which the parent broadcasts its updated nnet parameters to the workers. In the case of two workers per GPU on, say, an AWS gpu2-large instance, it sounds like each GPU could handle two worker processes, and hence two ranks resident on the GPU (correct?).

Let me check another assumption: after the initial fork and exec to set up the parent and 7 child workers (assuming 4 GPUs and two processes per GPU), am I correct that each process then calls initrank to get tied into the data-sharing mechanism? And the fact that the number of ranks (processes) is twice the number of devices won't cause indigestion in the nccl implementation -- I'm assuming the ranks will be numbered 0-7 while the number of devices is 4, hence some ranks have ids > ndev-1. A note that I am new to nccl (and MPI), so if my preceding explanation reveals a glaring misunderstanding, please feel free to chime in and point me at appropriate reading material.

-CC

@cliffwoolley
Collaborator

This should work as far as I know, assuming you have CUDA MPS enabled.

That said, I'm not quite sure I understand the motivation for multiple ranks per GPU in this case. Any one rank in Torch (and, by extension, cuDNN) should normally be able to fill up a GPU (even a big GPU) on its own...

Thanks,
Cliff

@nluehr
Contributor

nluehr commented Jun 23, 2016

A few minor clarifications. 1) A "device" in NCCL parlance is basically a "rank" in MPI speak, so in the case of 4 physical GPUs with two communicators/ranks per GPU, ndev will be 8, and ids are [0, ndev-1]. 2) I believe that Torch's Parallel package is implemented using threads within a single process, so MPS should not be necessary.
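To make that numbering concrete, here is a tiny hypothetical helper (the name and mapping are illustrative, not part of NCCL) for the 4-GPU, two-ranks-per-GPU case:

```c
/* Hypothetical helper, not part of NCCL: map a communicator rank onto a
 * physical GPU when several ranks share each GPU. */
int deviceForRank(int rank, int ranksPerGpu) {
    return rank / ranksPerGpu;   /* ranks 0,1 -> GPU 0; ranks 2,3 -> GPU 1; ... */
}

/* With 4 GPUs and 2 ranks per GPU: ndev = 8, rank ids run 0..7, and each
 * rank would call cudaSetDevice(deviceForRank(rank, 2)) before NCCL init. */
```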

@CCorfield
Author

Thanks for the clarification regarding devices and ranks -- that resolves a head-scratcher. By the way, there are two Torch packages that use the name "Parallel" but are actually different: "Data Parallel Table" and "Parallel". The former does indeed use a threading model; the latter uses bona fide fork/exec to set up separate processes and uses ZeroMQ for IPC (inter-process communication). As to Cliff's comment about a single process keeping a GPU busy, it appears that my use case requires two processes to keep the K520 GPU maxed out. However, I will no doubt find additional opportunities for optimization.
Thanks for the assistance!
-CC

@CCorfield
Author

There is something else I am missing about setting up the communicators:
In process #1: (1) set device; (2) ncclCommInitRank(comms, 2, uniqueID, 0)
In process #2: (1) set device; (2) ncclCommInitRank(comms, 2, uniqueID, 1)
Both of these calls hang and never return, so there is something about the synchronization model that I don't understand. The call to ncclGetUniqueId() produced identical ncclUniqueIds in both processes, which leads me to think that the uniqueID is intended to be the same for all communicators (correct?). Any suggestions? If you have an example chunk of code, by all means send it my way -- it is usually easier to start from something that works.
Thanks in advance,
-CC

@nluehr
Contributor

nluehr commented Jun 28, 2016

Each call to ncclGetUniqueId produces a different value. The idea is that you designate one rank as the master and use some mechanism (e.g. MPI_Bcast, shared memory, etc.) to send the master's identifier to all other ranks. With this identifier all ranks are able to open a common communication channel.

It sounds like you are using subtly different IDs for each rank, so each rank is waiting for the others in a different initialization window.

The reason for this mechanism is that we assume multi-process applications will already use some communication library (e.g. OpenSHMEM), so we don't want NCCL to bring in a dependency on some other framework (e.g. MPI) that might conflict.
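A minimal sketch of that pattern in C, using MPI_Bcast purely as the out-of-band transport for the ID (any mechanism that copies the bytes to the other processes works: a pipe, shared memory, ZeroMQ). The two-ranks-per-GPU mapping is an assumption taken from this thread's scenario, not something NCCL requires:

```c
/* Hedged sketch of the ID-exchange pattern: one rank generates the unique
 * ID, every rank receives that same ID, then all ranks init together.
 * Note from earlier in the thread: separate processes sharing one GPU
 * also need CUDA MPS. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Only rank 0 generates the ID; everyone else gets a copy of it. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Assumed mapping for this thread: two ranks share each physical GPU. */
    cudaSetDevice(rank / 2);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);   /* all ranks pass the SAME id */

    /* ... collectives on a per-rank stream ... */

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```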

@CCorfield
Author

That worked. The various processes all returned from their separate calls to ncclCommInitRank(). I am using Torch, which in turn uses Lua, and there is a convenient Lua wrapper for the NCCL C library. Once I get through this exercise, I will see if I can get a multi-GPU Torch example up on GitHub, and make some changes to the existing nccl wrapper. Some of the terminology I have encountered, such as the word 'rank', will be unfamiliar to folks who come to this from other types of software; it might help in documentation and marketing material to add a parenthetical "process or thread identifier" to relate it to something more familiar.
Once again, thank you for your timely response.

@CCorfield
Author

And now to the next issue: calls to ncclReduce() and ncclAllReduce() hang and eventually get terminated.

I have two processes which each perform ncclCommInitRank() successfully.
The two processes next attempt a reduce operation, and they hang. So one question is whether anything else needs to be done after the call to ncclCommInitRank() and before calling ncclReduce()/ncclAllReduce().

Below are script excerpts from the two processes. In case the Torch/Lua wrappings need explaining: ":data()" returns the underlying C data of a Torch tensor (here a pointer to float) for passing into C functions; ffi is the generic Lua way of interfacing to C; and "nccl.C..." provides access to the C routines and struct/enum definitions in the nccl header file.

Parent process:
cutorch.setDevice(parentGPU)
parallel.children:join() -- Parent synchronizes with child process
nccl.C.ncclAllReduce(dparams:data(), recvbuff:data(), dparams:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, comms[rank], ffi.C.THCState_getCurrentStream(cutorch.getState()))

Child process:
cutorch.setDevice(childGPU)
parallel.yield() -- Child synchronizes with parent process
nccl.C.ncclAllReduce(dparams:data(), recvbuff:data(), dparams:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, comms[rank], ffi.C.THCState_getCurrentStream(cutorch.getState()))

When I have tried ncclAllReduce() and ncclReduce(), there are no complaints about the supplied arguments. The calls eventually time out, and an error is reported from within a low-level Torch routine, cudaMemcpy(), which looks like it is trying to copy from the device back to the host. This makes me wonder if there is some kind of synchronization I need to do for the GPU cards.
As ever, any pointers/suggestions much appreciated.
-CC

@sjeaugey
Member

There is no timeout in NCCL that I know of. So, unless Torch has such timeouts, a hang shouldn't produce an error; it would just block until you hit Ctrl-C. It may be worth looking at the cudaMemcpy error.

@nluehr
Contributor

nluehr commented Jun 28, 2016

Is it possible that there are errant luajit processes waiting around from a previous run that didn't get completely cleaned up? See the "Last, but not least" comment on https://github.com/clementfarabet/lua---parallel#api-in-very-short

@CCorfield
Author

It turned out that the answer to my issue lay in the "ffi" module of Lua (and hence Torch). This module provides a C-binding interface, so that C and Lua variables (and functions) can be converted back and forth. There are subtleties in how this interface works. To get an appropriate "comm" variable for use with ncclCommInitRank() and a subsequent call to ncclReduce(), you need to do the following (in Torch/Lua):

local comm = ffi.new('ncclComm_t[?]', 1) -- allocate a one-element ncclComm_t array; it decays to a pointer when passed to C

And then pass it into the relevant function calls as follows:

1.) ncclCommInitRank(comm, ndev, uniqueId, rank)
2.) ncclReduce(sendbuff:data(), recvbuff:data(), sendbuff:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, 0, comm[0], ffi.C.THCState_getCurrentStream(cutorch.getState()))

Best,
-CC

@CCorfield
Author

This issue can be closed.
I have a multi-process, multi-GPU training harness successfully running under CUDA 8.0 RC on both GTX 1080s and GTX 970s. No MPI or MPS required. The vanilla inter-process communication runs through the Torch/parallel package, and the data exchange between the network clones runs through nccl (actually a Torch wrapper around the nccl header file).

minsii pushed a commit to minsii/nccl that referenced this issue Nov 7, 2023
Summary:

Adding three new APIs to support basic one-sided communication functionality (i.e., similar to MPI):
1. `ncclWinAllocShared`: create a window with memory allocation to be shared among GPUs within the node
  - each GPU allocates a CUDA buffer (taken from the free memory pool if available) and gets an IPC memory handle
  - use bootstrap allgather to exchange IPC handles
  - open IPC handles to get remote addresses to be used later

2. `ncclWinSharedQuery`: query call to get the remote address for performing IPC-like communication

3. `ncclWinFree`: close IPC handles and free allocated resources

Limitations:
- only supports the intra-node case
- relies on NCCL's bootstrapAllgather (an SHM-based function could be implemented as well)
- the application must query the remote address by calling `ncclWinSharedQuery`, since the shared memory region is not contiguous among GPUs (it might be possible to make it contiguous using CUDA driver APIs, which can be explored later)

Differential Revision: D50910075