
[BUG REPORT] NCCL_SHARP_DISABLE env variable does not take effect #116

Open
wangxiaoyang-dev opened this issue Aug 2, 2023 · 0 comments

In the Megatron-LM repo, https://github.com/NVIDIA/Megatron-LM/blob/v3.0.2/megatron/mpu/initialize.py#L62, there are three places where process groups are created through torch.distributed.new_group.

If I set os.environ["NCCL_SHARP_DISABLE"] = "1" after the data parallel groups are created, the expected result is that the data parallel process group allocates SHARP resources, while the model parallel and tensor parallel process groups do not.
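For reference, here is a minimal sketch of the setup I mean (simplified from Megatron-LM's mpu/initialize.py; the rank layout, group sizes, and variable names are illustrative, and torch.distributed is assumed to be initialized already):

```python
import os
import torch.distributed as dist

def initialize_groups(tensor_model_parallel_size=2):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    data_parallel_size = world_size // tensor_model_parallel_size

    # 1) Data parallel groups: created BEFORE NCCL_SHARP_DISABLE is set,
    #    so these are expected to allocate SHARP resources.
    for i in range(tensor_model_parallel_size):
        ranks = list(range(i, world_size, tensor_model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            data_parallel_group = group

    # Disable SHARP for communicators created from this point on.
    os.environ["NCCL_SHARP_DISABLE"] = "1"

    # 2) Tensor / model parallel groups: created AFTER the env variable
    #    is set, so these are expected NOT to allocate SHARP resources.
    for i in range(data_parallel_size):
        ranks = list(range(i * tensor_model_parallel_size,
                           (i + 1) * tensor_model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            tensor_model_parallel_group = group

    return data_parallel_group, tensor_model_parallel_group
```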

But based on the plugin code at https://github.com/Mellanox/nccl-rdma-sharp-plugins/blob/master/src/sharp_plugin.c#L252 and my own experiment, the debug log reports "SHARP: Set to disable on this communicator" and none of the process groups allocate SHARP resources, which is not what I expected.

Could you check this problem?
