In the Megatron-LM repo (https://github.com/NVIDIA/Megatron-LM/blob/v3.0.2/megatron/mpu/initialize.py#L62), there are three places where process groups are created via torch.distributed.new_group.
If I set os.environ["NCCL_SHARP_DISABLE"] = "1" after the data parallel group is created, the expected result is that the data parallel pg allocates SHARP resources, while the model parallel pg and the tensor parallel pg do not.
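A minimal sketch of the sequencing I have in mind is below. The ranks passed to new_group are hypothetical placeholders; in the real megatron/mpu/initialize.py the group members come from the data/model/tensor parallel topology.

```python
import os
import torch

# Hypothetical sketch of the intended ordering (launch with torchrun);
# rank lists are placeholders, not the real Megatron-LM topology.
torch.distributed.init_process_group(backend="nccl")
world_size = torch.distributed.get_world_size()
rank = torch.distributed.get_rank()

# 1) Data parallel group is created while SHARP is still enabled,
#    so this communicator is expected to allocate SHARP resources.
data_parallel_group = torch.distributed.new_group(ranks=list(range(world_size)))

# 2) Disable SHARP before creating the remaining groups.
os.environ["NCCL_SHARP_DISABLE"] = "1"

# 3) Model parallel / tensor parallel groups are created afterwards and
#    are expected NOT to allocate SHARP resources.
model_parallel_group = torch.distributed.new_group(ranks=[rank])
```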
But judging from https://github.com/Mellanox/nccl-rdma-sharp-plugins/blob/master/src/sharp_plugin.c#L252 and from my experiment, the debug log reports "SHARP: Set to disable on this communicator" and none of the pgs allocate SHARP resources, which is not in line with my expectation.
Could you check this problem?