Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix segmentation fault in the proxy profiler
NCCL can be built with `-DPROFILE_PROXY -DENABLE_TIMER` and run with `NCCL_PROXY_PROFILE=trace.json` environment variable to dump the timeline of all proxy operations to `trace.json` using the so-called Trace Event Format. However, currently NCCL segfaults even when trying to profile very simple operations. Just running any binary from `nccl-tests` is enough to trigger the issue, for example: $ cat run.sh NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1 \ NCCL_PROXY_PROFILE="proxy_${OMPI_COMM_WORLD_RANK}.json" \ ./build/all_reduce_perf -g 2 $ mpirun -n 2 ./run.sh ... ./run.sh: line 4: 372891 Segmentation fault [...] ... $ gdb build/all_reduce_perf core ... Program terminated with signal SIGSEGV, Segmentation fault. 57 event->timestamp[state%8] = gettime()-profilingStart; Note that the only point of `NCCL_P2P_DISABLE=1 NCCL_P2P_DIRECT_DISABLE=1 NCCL_SHM_DISABLE=1` is to trigger network communication (and hence proxying) between the two GPUs even when they are on the same host. If the ranks are on different hosts, you don't need these variables to trigger the issue. The root cause of the problem is that the proxy profiler tries to track only `NCCL_STEPS` (i.e., 8) steps per sub at a time, since there can never be more than `NCCL_STEPS` in flight at any given moment. So, for any tracked step `e`, the profiler will: a) fill `e->timestamps[ncclProxyProfileBegin]` at the very beginning of the sub, allocating a slot in `args->subs[sub].profilingEvents[step%NCCL_STEPS]` b) fill `e->timestamps[x]` for `x < ncclProxyProfileEnd` (e.g., x = ncclProxyProfileSendGPUWait, x = ncclProxyProfileSendWait, etc.) when `x` happens for the step c) fill `e->timestamps[ncclProxyProfileEnd]` when the step is finished d) set `args->subs[sub].profilingEvents[step%NCCL_STEPS] = NULL` to clear this slot corresponding to `e`, so that the next steps can be safely written to `slot` The problem happens in a): the profiler tries to set `e->timestamps[ncclProxyProfileBegin]` for *ALL* steps at once, ignoring the fact that we can only track 8 steps at at time. As as result, we essentially only allocate slots for the first 8 steps. When b) first happens for e.g. step NVIDIA#8 (take a look at the gdb stack trace above), the corresponding slot is still set to `NULL`, since we cleared the slot after step #0 finished, but never re-allocated it. This commit fixes this by first storing the timestamp when profiling began in the proxy args. When b) first happens, we allocate a slot for the step, and copy the stored timestamp to `e->timestamp[ncclProxyProfileBegin]`. This way, the generated timeline is exactly the same as before, but we never try processing more than `NCCL_STEPS` steps at once.
- Loading branch information