Skip to content

Commit

Permalink
Fix segmentation fault in the proxy profiler
Browse files Browse the repository at this point in the history
NCCL can be built with `-DPROFILE_PROXY -DENABLE_TIMER` and run
with `NCCL_PROXY_PROFILE=trace.json` environment variable to dump
the timeline of all proxy operations to `trace.json` using the
so-called Trace Event Format.

However, currently NCCL segfaults even when trying to profile very
simple operations. Just running any binary from `nccl-tests` is
enough to trigger the issue, for example:

$ cat run.sh
NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1 \
NCCL_PROXY_PROFILE="proxy_${OMPI_COMM_WORLD_RANK}.json" \
./build/all_reduce_perf -g 2
$ mpirun -n 2 ./run.sh
...
./run.sh: line 4: 372891 Segmentation fault [...]
...
$ gdb build/all_reduce_perf core
...
Program terminated with signal SIGSEGV, Segmentation fault.
57        event->timestamp[state%8] = gettime()-profilingStart;

Note that the only point of `NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1`
is to trigger network communication (and hence proxying) between the
two GPUs even when they are on the same host. If the ranks are on
different hosts, you don't need these variables to trigger the issue.

The root cause of the problem is that the proxy profiler tries to track
only `NCCL_STEPS` (i.e., 8) steps per sub at a time, since there can
never be more than `NCCL_STEPS` in flight at any given moment.
So, for any tracked step `e`, the profiler will:
  a) fill `e->timestamps[ncclProxyProfileBegin]` at the very beginning
     of the sub, allocating a slot in `args->subs[sub].profilingEvents[step%NCCL_STEPS]`
  b) fill `e->timestamps[x]` for `x < ncclProxyProfileEnd` (e.g.,
     x = ncclProxyProfileSendGPUWait, x = ncclProxyProfileSendWait, etc.)
     when `x` happens for the step
  c) fill `e->timestamps[ncclProxyProfileEnd]` when the step is finished
  d) set `args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL` to clear this
     slot corresponding to `e`, so that the next steps can be safely
     written to `slot`

The problem happens in a): the profiler tries to set `e->timestamps[ncclProxyProfileBegin]`
for *ALL* steps at once, ignoring the fact that we can only track 8
steps at at time. As as result, we essentially only allocate slots for the first 8 steps.
When b) first happens for e.g. step NVIDIA#8 (take a look at the gdb stack
trace above), the corresponding slot is still set to `NULL`, since we cleared
the slot after step #0 finished, but never re-allocated it.

This commit fixes this by first storing the timestamp when profiling began in the proxy args.
When b) first happens, we allocate a slot for the step, and copy the stored timestamp
to `e->timestamp[ncclProxyProfileBegin]`. This way, the generated
timeline is exactly the same as before, but we never try processing
more than `NCCL_STEPS` steps at once.
  • Loading branch information
dfyz committed Nov 24, 2023
1 parent b6d7438 commit 1913273
Show file tree
Hide file tree
Showing 3 changed files with 24 additions and 6 deletions.
1 change: 1 addition & 0 deletions src/include/proxy.h
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ struct ncclProxySubArgs {
uint64_t end;
void* requests[NCCL_STEPS];
void* profilingEvents[NCCL_STEPS];
double profileBeginTimestamp;
void* recvRequestsCache[NCCL_STEPS];
int recvRequestsSubCount;
};
Expand Down
21 changes: 17 additions & 4 deletions src/misc/profiler.cc
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#ifdef PROFILE_PROXY
#include "timer.h"
#include "alloc.h"
#include <assert.h>

static const char* profilingStateSendStr[] = { "BufferWait", "GPUWait", "SendWait", "", "End" };
static const char* profilingStateRecvStr[] = { "BufferWait", "RecvWait", "FlushWait", "GPUWait", "End" };
Expand All @@ -34,21 +35,33 @@ ncclResult_t ncclProfilingRecord(struct ncclProxyArgs* args, int sub, int step,
NCCLCHECK(ncclCalloc(&profilingEvents, MAX_EVENTS));
profilingStart = gettime();
}
struct ncclProxyProfileEvent* event = NULL;
if (state%8 == 0) {
if (state == ncclProxyProfileBegin) {
double profileBeginTimestamp = gettime()-profilingStart;
// This record is created when the proxy starts profiling the steps of this sub.
// Any future step will have `e->timestamps[ncclProxyProfileBegin] == profileBeginTimestamp`;
// however, at this point, the proxy may not have started processing the step yet.
// Therefore, here we store `profileBeginTimestamp` in a field of `ncclProxySubArgs`,
// so that it can be copied to `e->timestamps[ncclProxyProfileBegin]` below.
args->subs[sub].profileBeginTimestamp = profileBeginTimestamp;
return ncclSuccess;
}
struct ncclProxyProfileEvent* event = (struct ncclProxyProfileEvent*)args->subs[sub].profilingEvents[step%NCCL_STEPS];
if (event == nullptr || state%8 == 0) {
if (profilingIndex == MAX_EVENTS) return ncclSuccess;
args->subs[sub].profilingEvents[step%NCCL_STEPS] = event = profilingEvents+profilingIndex++;
if (state == ncclProxyProfileBegin) {
if (state < 8) {
// Proxy operation information
// `state` should be the very first state in a step (i.e., either a send GPU wait, or a recv wait).
assert(state == ncclProxyProfileSendGPUWait || state == ncclProxyProfileRecvWait);
event->opCount = args->opCount;
event->channel = args->subs[sub].channelId;
event->peer = args->subs[sub].peer;
event->type = args->pattern;
event->step = step;
event->opIndex = (((uint64_t)args)/sizeof(struct ncclProxyArgs))%256;
event->timestamp[ncclProxyProfileBegin] = args->subs[sub].profileBeginTimestamp;
} else event->peer = -state;
} else {
event = (struct ncclProxyProfileEvent*)args->subs[sub].profilingEvents[step%NCCL_STEPS];
if (state == ncclProxyProfileEnd) args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL;
if (state == ncclProxyProfileAppendEnd) event->opCount = args->opCount;
}
Expand Down
8 changes: 6 additions & 2 deletions src/transport/net.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1023,7 +1023,9 @@ static ncclResult_t sendProxyProgress(struct ncclProxyState* proxyState, struct
// Round to next multiple of sliceSteps
sub->base = ROUNDUP(resources->step, args->chunkSteps);
sub->posted = sub->transmitted = sub->done = 0;
for (uint64_t step=0; step<sub->nsteps; step++) ncclProfilingRecord(args, s, step, ncclProxyProfileBegin);
// The `ncclProxyProfileBegin` state doesn't correspond to any particular step,
// so we pass `0` as the step number.
ncclProfilingRecord(args, s, 0, ncclProxyProfileBegin);
}
args->state = ncclProxyOpProgress;
}
Expand Down Expand Up @@ -1173,7 +1175,9 @@ static ncclResult_t recvProxyProgress(struct ncclProxyState* proxyState, struct
sub->base = ROUNDUP(resources->step, args->chunkSteps);
sub->posted = sub->received = sub->transmitted = sub->done = 0;
for (int i=0; i<groupSize; i++) sub[-i].groupSize = groupSize;
for (uint64_t step=0; step<sub->nsteps; step++) ncclProfilingRecord(args, s, step, ncclProxyProfileBegin);
// The `ncclProxyProfileBegin` state doesn't correspond to any particular step,
// so we pass `0` as the step number.
ncclProfilingRecord(args, s, 0, ncclProxyProfileBegin);
}
args->state = ncclProxyOpProgress;
}
Expand Down

0 comments on commit 1913273

Please sign in to comment.