suggest profiler to invoke sub-events when parent handle is null #1544

Open
fishautumn opened this issue Dec 20, 2024 · 1 comment

The NCCL profiler defines a hierarchical event structure: context -> group -> task -> op -> step.

I found that when the parent handle is not set, sub-events are never invoked. I suggest invoking a sub-event whenever eActivationMask is set, regardless of whether the parent handle is set.

For instance, in the example profiler:

  1. On the start group event, if the plugin's groupPool is fully used, the plugin leaves *eHandle as NULL, so plan->groupEventHandle will be null:
__hidden ncclResult_t watchdogProfilerStartEvent(void* context, void** eHandle, ncclProfilerEventDescr_v1_t* eDescr) {
  *eHandle = NULL;
  struct context* ctx = (struct context *)context;
  if (eDescr->type == ncclProfileGroup) {
    struct group* event;
    int groupId = __atomic_fetch_add(&ctx->groupPoolIndex, 1, __ATOMIC_RELAXED);
    if ((groupId - __atomic_load_n(&ctx->groupPoolBase, __ATOMIC_RELAXED)) < groupPoolSize) {
      // if there are available group events grab one
      // ...
    } else {
      // else drop this event
      __atomic_fetch_sub(&ctx->groupPoolIndex, 1, __ATOMIC_RELAXED);
      return ncclSuccess;  // *eHandle is never set on this path, so it stays NULL
    }
  2. On the start task event, because plan->groupEventHandle is null, the plugin's startEvent is never invoked for the tasks:
ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan) {
  TIME_START_EVENT(taskStart);
  if (__builtin_expect(ncclProfiler != NULL, 0)) {
    int enable = eActivationMaskGroup & (ncclProfileProxyOp | ncclProfileProxyStep | ncclProfileColl);
    if (plan->groupEventHandle && enable) {   //---------------------- this condition is false
      struct ncclTaskColl* ct = ncclIntruQueueHead(&plan->collTaskQueue);
      while (ct) {
        // ...
        ncclProfiler->startEvent(plan->comm->profilerContext, &ct->eventHandle, &eDescr); // ----------- plugin method not called

        // update collective task with group event activation mask
        ct->eActivationMask = eActivationMaskGroup;  //---------------------- activation mask won't be passed down
        ct = ct->next;
      }
      struct ncclTaskP2p* pt = ncclIntruQueueHead(&plan->p2pTaskQueue);
      while (pt) {
        // ...
        ncclProfiler->startEvent(plan->comm->profilerContext, &pt->eventHandle, &eDescr); // ----------- plugin method not called

        // update collective task with group event activation mask
        pt->eActivationMask = eActivationMaskGroup;  //---------------------- activation mask won't be passed down
        pt = pt->next;
      }
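
The two steps above, and the suggested change, can be modelled in a small standalone sketch. Everything in it is a hypothetical stand-in rather than NCCL's actual code: `struct groupPool`, `struct plan`, the `EV_*` bits, and the `startGroupEvent` / `startTaskEventsCurrent` / `startTaskEventsSuggested` functions are invented for illustration, and the pool uses plain integers instead of atomics, so it is single-threaded only. It only aims to show how gating on the parent handle differs from gating on the activation mask alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event-type bits; stand-ins for the ncclProfile* constants. */
enum { EV_GROUP = 1, EV_COLL = 2, EV_PROXY_OP = 4, EV_PROXY_STEP = 8 };

#define GROUP_POOL_SIZE 2

/* Hypothetical plugin-side pool of group events (no atomics: single-threaded sketch). */
struct groupPool {
  int index;                    /* next slot to hand out */
  int base;                     /* oldest slot still in use */
  int slots[GROUP_POOL_SIZE];   /* placeholder event storage */
};

/* Hypothetical stand-in for the part of ncclKernelPlan that matters here. */
struct plan {
  void *groupEventHandle;       /* NULL when the plugin dropped the group event */
  int nTasks;                   /* number of queued tasks */
};

/* Step 1: hand out a group event, or return NULL when the pool is exhausted. */
void *startGroupEvent(struct groupPool *pool) {
  int id = pool->index++;
  if (id - pool->base < GROUP_POOL_SIZE) {
    return &pool->slots[id % GROUP_POOL_SIZE];
  }
  pool->index--;                /* drop this event, as the example plugin does */
  return NULL;
}

/* Step 2, as quoted in the issue: tasks start only if the parent handle is set. */
int startTaskEventsCurrent(const struct plan *p, int mask) {
  int enable = mask & (EV_PROXY_OP | EV_PROXY_STEP | EV_COLL);
  int started = 0;
  assert(p->nTasks >= 0);
  if (p->groupEventHandle && enable) {
    for (int i = 0; i < p->nTasks; i++) started++;  /* stands in for startEvent() */
  }
  return started;
}

/* Suggested variant: gate on the activation mask only. */
int startTaskEventsSuggested(const struct plan *p, int mask) {
  int enable = mask & (EV_PROXY_OP | EV_PROXY_STEP | EV_COLL);
  int started = 0;
  assert(p->nTasks >= 0);
  if (enable) {
    for (int i = 0; i < p->nTasks; i++) started++;  /* stands in for startEvent() */
  }
  return started;
}
```

With a pool of two slots, the third call to `startGroupEvent` returns NULL; a plan carrying that handle starts no task events under the current condition, while the suggested condition still starts one per task as long as the mask enables them. One consequence, assuming the real code went this way, is that the plugin would presumably receive task events whose parent is NULL and would need to tolerate that.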

gcongiu commented Dec 20, 2024

Thank you for the suggestion, @fishautumn. That makes sense.
