
Can nccl be used with multiple processes per GPU? #32

Closed
CCorfield opened this issue Jun 22, 2016 · 13 comments

Comments

@CCorfield

The header file suggests that the number of ranks (or tasks) must be less than (or equal to) the number of devices. However, it would be convenient to have, say, two processes training their own copies of a neural net on the same GPU and then using the reduce and bcast functionality to transfer data between the models during an update. Specifically, using reduce to sum all the gradient parameters onto the master nnet, and then, after updating the master nnet's parameters, using bcast to send the updated parameters to the slave nnets. Is this already possible, or do I need to wait for an enhancement?

@nluehr
Contributor

nluehr commented Jun 22, 2016

In general what you suggest is already possible, but keep in mind that all ranks must run the reduction/bcast kernels simultaneously. You can cause deadlocks, for example, by running more ranks on a GPU than its SM resources can accommodate (a non-issue for a handful of ranks on a "big" GPU). Similarly, all ranks need to be managed within a single CUDA context (kernels from different contexts don't overlap). So either all ranks need to be controlled from a single process, or, if MPI is required, you'll need to use MPS.

The single-process tests included with NCCL can be used to test multiple ranks on a single GPU via the optional device flags. The following will run an all-reduce with three of the four ranks (and their buffers) on GPU 0 and the fourth on GPU 1.

$ nccl/build/test/single/all_reduce_test 1000000 4 0 0 0 1
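For illustration only, here is a minimal single-process sketch of the same idea written directly against the public NCCL C API (not code from the NCCL tests). It assumes the NCCL 1.x behavior discussed in this thread, where a device id may appear more than once in the device list so that several ranks share one GPU:

```c
/* Hedged sketch: several NCCL ranks managed from a single process.
 * Assumes a device id may repeat in the list (ranks sharing a GPU),
 * as in the test invocation above. */
#include <nccl.h>
#include <stdio.h>

int main(void) {
    const int nRanks = 4;
    int devs[4] = {0, 0, 0, 1};      /* three ranks on GPU 0, one on GPU 1 */
    ncclComm_t comms[4];

    /* All communicators are created inside this one process/CUDA context. */
    ncclResult_t ret = ncclCommInitAll(comms, nRanks, devs);
    if (ret != ncclSuccess) {
        fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(ret));
        return 1;
    }

    /* ... enqueue ncclAllReduce/ncclBcast on a per-rank stream here ... */

    for (int i = 0; i < nRanks; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```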

@CCorfield
Author

Many thanks for your quick reply. The particular use case I have in mind is to use the Parallel package in Torch to handle the fork/exec/synchronization between a parent and its worker processes. Once the workers (including the parent) have pushed data through their nnet models, they synchronize, which (I hope) meets your requirement for simultaneity: that would be the point for them to call nccl reduce to accumulate their gradient data in the parent's grad parameters, followed by an nccl bcast in which the parent broadcasts its updated nnet parameters to the workers. In the case of two workers per GPU on, say, an AWS gpu2-large instance, it sounds like each GPU could handle two worker processes, and hence two ranks resident on the GPU (correct?).

Let me check another assumption: after the initial fork and exec to set up the parent and 7 child workers (assuming 4 GPUs and two processes per GPU), am I correct that each process then calls initrank to get tied into the data-sharing mechanism? And the fact that the number of ranks (processes) is twice the number of devices won't cause indigestion in the nccl implementation -- I'm assuming the ranks will be numbered 0-7 while the number of devices is 4, hence some ranks have ids > ndev-1. A note that I am new to nccl (and MPI), so if my preceding explanation reveals a glaring misunderstanding, please feel free to chime in and point me at appropriate reading material.

-CC

@cliffwoolley
Collaborator

This should work as far as I know, assuming you have CUDA MPS enabled.

That said, I'm not quite sure I understand the motivation for multiple ranks per GPU in this case. Any one rank in Torch (and, by extension, cuDNN) should normally be able to fill up a GPU (even a big GPU) on its own...

Thanks,
Cliff

@nluehr
Contributor

nluehr commented Jun 23, 2016

A few minor clarifications. 1) A "device" in NCCL parlance is basically a "rank" in MPI speak, so in the case of 4 physical GPUs with two communicators/ranks per GPU, ndev will be 8, and ids are [0, ndev-1]. 2) I believe that Torch's Parallel package is implemented using threads within a single process, so MPS should not be necessary.
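To make that numbering concrete, here is a tiny hypothetical helper (the name and mapping are illustrative, not part of NCCL) for the 4-GPU, two-ranks-per-GPU case:

```c
/* Hypothetical helper, not part of NCCL: map a communicator rank onto a
 * physical GPU when several ranks share each GPU. */
int deviceForRank(int rank, int ranksPerGpu) {
    return rank / ranksPerGpu;   /* ranks 0,1 -> GPU 0; ranks 2,3 -> GPU 1; ... */
}

/* With 4 GPUs and 2 ranks per GPU: ndev = 8, rank ids run 0..7, and each
 * rank would call cudaSetDevice(deviceForRank(rank, 2)) before NCCL init. */
```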

@CCorfield
Author

Thanks for the clarification regarding devices and ranks -- that resolves a head-scratcher. By the way, there are two Torch packages that use the name "Parallel" but are actually different: "Data Parallel Table" and "Parallel". The former does indeed use a threading model; the latter uses bona fide fork/exec to set up separate processes and uses ZeroMQ for IPC (inter-process communication). As to Cliff's comment about a single process keeping a GPU busy, it appears that my use case requires two processes to keep the K520 GPU maxed out. However, I will no doubt find additional opportunities for optimization.
Thanks for the assistance!
-CC

@CCorfield
Author

There is something else I am missing about setting up the communicators:
In process #1: (1) set device; (2) ncclCommInitRank(comms, 2, uniqueID, 0)
In process #2: (1) set device; (2) ncclCommInitRank(comms, 2, uniqueID, 1)
Both of these calls hang and never return, so there is something about the synchronization model that I don't understand. The call to ncclGetUniqueId() produced identical ncclUniqueIds in both processes, which leads me to think that the uniqueID is intended to be the same for all communicators (correct?). Any suggestions? If you have an example chunk of code, by all means send it my way -- it is usually easier to start from something that works.
Thanks in advance,
-CC

@nluehr
Contributor

nluehr commented Jun 28, 2016

Each call to ncclGetUniqueId produces a different value. The idea is that you designate one rank as the master and use some mechanism (e.g. MPI_Bcast, shared memory, etc.) to send the master's identifier to all other ranks. With this identifier all ranks are able to open a common communication channel.

It sounds like you are using subtly different IDs for each rank, so each rank is waiting for the others in a different initialization window.

The reason for this mechanism is that we assume multi-process applications will already use some communication library (e.g. OpenSHMEM), so we don't want NCCL to bring in a dependency on some other framework (e.g. MPI) that might conflict.
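A minimal sketch of that pattern in C, using MPI_Bcast purely as the out-of-band transport for the ID (any mechanism that copies the bytes to the other processes works: a pipe, shared memory, ZeroMQ). The two-ranks-per-GPU mapping is an assumption taken from this thread's scenario, not something NCCL requires:

```c
/* Hedged sketch of the ID-exchange pattern: one rank generates the unique
 * ID, every rank receives that same ID, then all ranks init together.
 * Note from earlier in the thread: separate processes sharing one GPU
 * also need CUDA MPS. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Only rank 0 generates the ID; everyone else gets a copy of it. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    /* Assumed mapping for this thread: two ranks share each physical GPU. */
    cudaSetDevice(rank / 2);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);   /* all ranks pass the SAME id */

    /* ... collectives on a per-rank stream ... */

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```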

@CCorfield
Author

That worked. The various processes all returned from their separate calls to ncclCommInitRank(). I am using Torch, which in turn uses Lua, and there is a convenient Lua wrapper for the NCCL C library. Once I get through this exercise, I will see if I can get a multi-GPU Torch example up on GitHub, and make some changes to the existing nccl wrapper. Some of the terminology I have encountered, such as the word 'rank', will be unfamiliar to folks who come to this from other types of software; it might help in documentation and marketing material to add a parenthetical "process or thread identifier" to relate it to something more familiar.
Once again, thank you for your timely response.

@CCorfield
Author

And now to the next issue: calls to ncclReduce() and ncclAllReduce() hang and eventually get terminated.

I have two processes which each perform ncclCommInitRank() successfully.
The two processes next attempt a reduce operation, and they hang. So one question is whether anything else needs to be done after the call to ncclCommInitRank() and before calling ncclReduce()/ncclAllReduce().

Below are script excerpts from the two processes. In case the Torch/Lua wrappings need explaining: ":data()" returns the underlying C data of a Torch tensor (here a pointer to float) for passing into C functions; ffi is the generic Lua way of interfacing to C; and "nccl.C..." provides access to the C routines and struct/enum definitions in the nccl header file.

Parent process:
cutorch.setDevice(parentGPU)
parallel.children:join() -- Parent synchronizes with child process
nccl.C.ncclAllReduce(dparams:data(), recvbuff:data(), dparams:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, comms[rank], ffi.C.THCState_getCurrentStream(cutorch.getState()))

Child process:
cutorch.setDevice(childGPU)
parallel.yield() -- Child synchronizes with parent process
nccl.C.ncclAllReduce(dparams:data(), recvbuff:data(), dparams:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, comms[rank], ffi.C.THCState_getCurrentStream(cutorch.getState()))

When I have tried ncclAllReduce() and ncclReduce(), there are no complaints about the supplied arguments. The calls eventually time out, and an error is reported from within a low-level Torch routine, cudaMemcpy(), which looks like it is trying to copy from the device back to the host. This makes me wonder if there is some kind of synchronization I need to do for the GPU cards.
As ever, any pointers/suggestions much appreciated.
-CC

@sjeaugey
Member

There is no timeout in NCCL that I know of. So, unless Torch has such timeouts, a hang shouldn't produce an error; it would just block until you hit Ctrl-C. It may be worth looking at the cudaMemcpy error.

@nluehr
Contributor

nluehr commented Jun 28, 2016

Is it possible that there are errant luajit processes waiting around from a previous run that didn't get completely cleaned up? See the "Last, but not least" comment on https://github.com/clementfarabet/lua---parallel#api-in-very-short

@CCorfield
Author

It turned out that the answer to my issue lay in the "ffi" module of Lua (and hence Torch). This module provides a C-binding interface, so that C and Lua variables (and functions) can be converted back and forth. There are subtleties in how this interface works. To get an appropriate "comm" variable for use with ncclCommInitRank() and a subsequent call to ncclReduce(), you need to do the following (in Torch/Lua):

local comm = ffi.new('ncclComm_t[?]', 1) -- allocate a one-element ncclComm_t array; it decays to a pointer when passed to C

And then pass it into the relevant function calls as follows:

1.) ncclCommInitRank(comm, ndev, uniqueId, rank)
2.) ncclReduce(sendbuff:data(), recvbuff:data(), sendbuff:size(1), nccl.C.ncclFloat, nccl.C.ncclSum, 0, comm[0], ffi.C.THCState_getCurrentStream(cutorch.getState()))

Best,
-CC

@CCorfield
Author

This issue can be closed.
I have a multi-process, multi-GPU training harness successfully running under CUDA 8.0 RC on both GTX 1080s and GTX 970s. No MPI or MPS required. The vanilla inter-process communication runs through the Torch/parallel package, and the data exchange between the network clones runs through nccl (actually a Torch wrapper around the nccl header file).

minsii pushed a commit to minsii/nccl that referenced this issue Nov 7, 2023
Summary:

Adding three new APIs to support basic one-sided communication functionality (i.e., similar to MPI):
1. `ncclWinAllocShared`: create a window with memory allocation to be shared among GPUs within the node
  - each GPU allocates a CUDA buffer (taken from the free memory pool if available) and gets an IPC memory handle
  - use bootstrap allgather to exchange IPC handles
  - open IPC handles to get remote addresses to be used later

2. `ncclWinSharedQuery`: query call to get the remote address for performing IPC-like communication

3. `ncclWinFree`: close IPC handles and free allocated resources

Limitations:
- only supports the intra-node case
- relies on NCCL's bootstrapAllgather (an SHM-based function could be implemented as well)
- the application must query the remote address by calling `ncclWinSharedQuery`, since the shared memory region is not contiguous among GPUs (it might be possible to make it contiguous using CUDA driver APIs, which can be explored later)

Differential Revision: D50910075