Can nccl be used with multiple processes per GPU? #32
The header file suggests that the number of ranks (or tasks) must be less than (or equal to) the number of devices. However, it would be convenient to have, say, two processes training their own copies of a neural net on the same GPU and then using the reduce and bcast functionality to transfer data between the models during an update. Specifically, using reduce to sum all the gradient parameters onto the master nnet, and then, after updating the master nnet's parameters, using bcast to send the updated parameters to the slave nnets. Is this already possible, or do I need to wait for an enhancement?
Comments
In general what you suggest is already possible, but keep in mind that all ranks must run the reduction/bcast kernels simultaneously. You can cause deadlocks, for example, by running more ranks on a GPU than it has SM resources for (a non-issue for a handful of ranks on a "big" GPU). Similarly, all ranks need to be managed within a single CUDA context (kernels from different contexts don't overlap). So either all ranks need to be controlled from a single process, or, if MPI is required, you'll need to use MPS. The single-process tests included with NCCL can be used to test multiple ranks on a single GPU using the optional device flags. The following will run an all-reduce with three buffers on GPU 0 and the fourth on GPU 1:

$ nccl/build/test/single/all_reduce_test 1000000 4 0 0 0 1
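For readers who prefer to see the same thing from the API side, here is a minimal single-process sketch of what that test exercises, assuming the NCCL 1.x `ncclCommInitAll` interface and that repeated entries in the device list are accepted, as the command above implies; the collective launches and most error handling are elided.

```c
/* Sketch: four NCCL ranks in one process, three sharing GPU 0 and one on GPU 1,
 * mirroring the device list of the test command above (assumes NCCL 1.x). */
#include <stdio.h>
#include <nccl.h>

int main(void) {
  const int nranks = 4;
  int devlist[4] = {0, 0, 0, 1};      /* ranks 0-2 on GPU 0, rank 3 on GPU 1 */
  ncclComm_t comms[4];

  ncclResult_t res = ncclCommInitAll(comms, nranks, devlist);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return 1;
  }

  /* ... enqueue one collective per communicator, each on its own stream ... */

  for (int i = 0; i < nranks; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```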
Many thanks for your quick reply. The particular use-case I have in mind is to use the Parallel package in Torch to handle the fork/exec/synchronization between a parent and its worker processes. Once the workers (including the parent) have pushed data through their nnet models, they synchronize, which (I hope) meets your requirement for simultaneity: that would be the point for them to call nccl reduce to accumulate their gradient data into the parent's grad parameters, followed by a nccl bcast in which the parent broadcasts its updated nnet parameters to the workers. In the case of two workers per GPU on, say, an AWS gpu2-large instance, it sounds like each GPU could host two worker processes, and hence two ranks resident on the GPU (correct?). Let me check another assumption: after the initial fork and exec to set up the parent and 7 child workers (assuming 4 GPUs and two processes per GPU), am I correct that each process then calls initrank to get tied into the data-sharing mechanism? And the fact that the number of ranks (processes) is twice the number of devices won't cause indigestion in the nccl implementation? I'm assuming the ranks will be numbered 0-7 while the number of devices is 4, hence some ranks have ids > ndev-1. A note that I am new to nccl (and MPI), so if my preceding explanation reveals a glaring misunderstanding, please feel free to chime in and point me at appropriate reading material. -CC
This should work as far as I know, assuming you have CUDA MPS enabled. That said, I'm not quite sure I understand the motivation for multiple processes per GPU; a single process is usually enough to keep a GPU busy. Thanks,
A few minor clarifications. 1) A "device" in NCCL parlance is basically a "rank" in MPI-speak, so in the case of 4 physical GPUs with two communicators/ranks per GPU, ndev will be 8 and ids are in [0, ndev-1]. 2) I believe that Torch's Parallel package is implemented using threads within a single process, so MPS should not be necessary.
Thanks for the clarification regarding devices and ranks; that resolves a head-scratcher. Btw, there are two Torch packages which use the name "Parallel" but are actually different: "Data Parallel Table" and "Parallel". The former does indeed support a threading model; the latter uses bona fide fork/exec to set up separate processes and uses ZeroMQ for IPC (inter-process communication). As to Cliff's comment about a single process keeping a GPU busy, it appears that my use case requires two processes to keep the K520 GPU maxed out. However, I will no doubt find additional opportunities for optimization.
There is something else I am missing about setting up the communicators: each process calls ncclGetUniqueId and then ncclCommInitRank, but the init calls never return.
Each call to ncclGetUniqueId produces a different value. The idea is that you designate one rank as the master and use some mechanism (e.g. MPI_Bcast, shared memory, etc.) to send the master's identifier to all other ranks. With this identifier, all ranks are able to open a common communication channel. It sounds like you are using subtly different Ids for each rank, so that each is waiting for the others in different initialization windows. The reason for this mechanism is that we assume multi-process applications will use some communication library (e.g. OpenSHMEM), so we don't want nccl to bring in dependencies on some other framework (e.g. MPI) that might conflict.
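A minimal sketch of that handshake, assuming MPI is the out-of-band transport used to distribute the identifier (any other mechanism such as a pipe, a file, or shared memory works the same way) and a hypothetical 4-GPU node with two ranks per GPU; error checking is omitted for brevity:

```c
/* Sketch: the master creates the NCCL unique id, every rank receives the same id
 * over MPI, and then all ranks join one communicator. MPS should be enabled when
 * several processes share a GPU, as noted earlier in the thread. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);                      /* only the master creates it */
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  /* everyone gets the same id */

  cudaSetDevice(rank % 4);            /* hypothetical mapping: two ranks per GPU, 4 GPUs */

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);   /* blocks until all nranks have joined */

  /* ... reduce / bcast / allreduce calls go here ... */

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```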
That worked. The various processes all returned from their separate calls to ncclCommInitRank(). I am using Torch, which in turn uses Lua, which has a convenient wrapper for the nccl C library. Once I get through this exercise, I will see if I can get a multi-GPU Torch example up on GitHub and make some changes to the existing nccl wrapper. Some of the terminology I have encountered, such as the word 'rank', will be unfamiliar to folks who come to this from other types of software; it might help in documentation and marketing material to add a parenthetical "process or thread identifier" to relate it to something more familiar.
And now to the next issue: calls to ncclReduce() and ncclAllReduce() hang and get terminated. I have two processes which each perform ncclCommInitRank() successfully. Below are script excerpts from the two processes. In case the Torch/Lua wrappings need explaining: ":data()" accesses the C data of Torch tensors for passing into C functions, in this case a pointer to float; ffi is the generic Lua way of interfacing to C; and the "nccl.C..." calls provide access to various C routines and struct/enum definitions in the nccl header file. Parent process: Child process: When I have tried ncclAllReduce() and ncclReduce(), there are no complaints about the supplied arguments. The calls eventually time out and an error is reported from within a low-level Torch routine, cudaMemcpy(), which looks like it is trying to copy from the device back to the host, which makes me wonder if there is some kind of synchronization I need to do for the GPU cards.
There is no timeout in NCCL that I know of. So, unless Torch has such timeouts, if there is a hang you shouldn't get an error; it would just block until you hit Ctrl-C. It may be worth looking at the cudaMemcpy error.
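On the synchronization question raised above: the collectives are enqueued asynchronously on a CUDA stream, so each rank has to synchronize that stream before copying results back to the host. A minimal, generic sketch of that ordering (plain C against the NCCL 1.x signatures, not the Torch/Lua wrapper; buffer allocation is assumed to have happened elsewhere):

```c
/* Sketch: enqueue the all-reduce on a stream, wait for it, then copy the result
 * back to the host. Every rank in the communicator must make the same collective
 * call, otherwise the others block indefinitely. */
#include <cuda_runtime.h>
#include <nccl.h>

void allreduce_and_fetch(ncclComm_t comm, cudaStream_t stream,
                         const float* d_send, float* d_recv,
                         float* h_result, int count) {
  ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum, comm, stream);

  /* The collective is asynchronous: wait for the stream before reading the result. */
  cudaStreamSynchronize(stream);

  /* Only now does the device-to-host copy see the reduced values. */
  cudaMemcpy(h_result, d_recv, count * sizeof(float), cudaMemcpyDeviceToHost);
}
```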
Is it possible that there are errant luajit processes waiting around from a previous run that didn't get completely cleaned up? See the "Last, but not least" comment on https://github.com/clementfarabet/lua---parallel#api-in-very-short |
It turned out that the answer to my issue lay in the "ffi" module of Lua (and hence Torch). This module provides a C-binding interface, so that C and Lua variables (and functions) can be converted back and forth. There are subtleties in how this interface works: the "comm" variable has to be created through ffi in the proper C form before it can be passed into the relevant nccl function calls. Best,
This issue can be closed. |
Summary: Adding three new APIs to support basic functionality of one-sided communication (similar to MPI):
1. `ncclWinAllocShared`: create a window with memory allocation to be shared among GPUs within the node
   - each GPU allocates a CUDA buffer (taken from the free memory pool if available) and gets an IPC memory handle
   - use bootstrap allgather to exchange IPC handles
   - open IPC handles to get remote addresses to be used later
2. `ncclWinSharedQuery`: query call to get the remote address for performing IPC-like communication
3. `ncclWinFree`: close IPC handles and free allocated resources

Limitations:
- only supports the intra-node case
- relies on nccl bootstrapAllgather (a SHM function could be implemented as well)
- the application must query the remote address by calling `ncclWinSharedQuery`, since the shared memory region is not contiguous among GPUs (it might be possible to make them contiguous using CUDA driver APIs, which can be explored later)

Differential Revision: D50910075