Update some files to enhance robustness. #1164

Open
wants to merge 2 commits into
base: master

Conversation

wangfakang

No description provided.

@wangfakang changed the title from "update some files to enhance robustness." to "Update some files to enhance robustness." on Feb 1, 2024
@wangfakang
Author

wangfakang commented Feb 2, 2024

@sjeaugey @jbachan @borisfom Could you please review this? Thanks.

@sjeaugey
Member

sjeaugey commented Feb 2, 2024

Changes look good to me. We'll try to merge them soon (not sure which version will reflect them though).

@wangfakang
Author

> Changes look good to me. We'll try to merge them soon (not sure which version will reflect them though).

@sjeaugey The merge conflicts have been resolved. Thank you.

@wangfakang
Author

ping @sjeaugey @jbachan @borisfom

@sjeaugey
Member

sjeaugey commented Apr 2, 2024

This was merged in NCCL 2.21. To be released soon.

sjeaugey added a commit that referenced this pull request Apr 4, 2024
Add support for IB SHARP 1PPN operation with user buffers.
Improve support for MNNVL, add NVLS support and multi-clique support.
 * Detect the NVLS clique through NVML
 * Exchange XML between peers in the same NVLS clique and fuse XMLs
   before creating the topology graph.
 * Rework bootstrap allgather algorithms to allow for large allgather
   operations intra-node (XML exchange).
Net/IB: add support for dynamic GID detection.
 * Automatically select the RoCEv2/IPv4 interface by default. Allow
   selecting IPv6 or even the network/mask.
Reduce NVLS memory usage.
 * Add stepSize as property of a connection to allow for different
   sizes on different peers; set it to 128K for NVLink SHARP.
Improve tuner loading.
 * Look for more paths, be more consistent with the network device
   plugin.
 * Also search for tuner support inside the net plugin.
Improve tuner API.
 * Add context to support multi-device per process.
Add magic number around comm object to detect comm corruption.
 * Add some basic checks around communicators so that we can report a
   problem when a communicator gets corrupted or a wrong comm pointer
   is passed to NCCL.
Fix net/IB error path. GitHub PR #1164
Fix collnet rail mapping with split comm.
Fix packet reordering issue causing bootstrap mismatch.
 * Use a different tag in ncclTransportP2pSetup for the connectInfo
   exchange and the following barrier.
Fix hang when crossNic is inconsistent between ranks.
Fix minCompCap/maxCompCap computation. GitHub issue #1184
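
For readers unfamiliar with the "magic number around comm object" item in the list above, here is a minimal C sketch of the general technique. It is an illustration only, not NCCL's code: the `Comm` struct, `COMM_MAGIC` value, `CommStatus` enum, and the `commInit`/`commCheck` helpers are all hypothetical names chosen for this example. The idea is to place a known constant before and after the object's payload and verify both on entry to each API call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard value; any unlikely-to-occur constant would do. */
#define COMM_MAGIC 0x0123456789ABCDEFULL

/* Hypothetical communicator layout with a guard word on each side. */
typedef struct {
  uint64_t startMagic; /* guard before the payload */
  int rank;
  int nRanks;
  uint64_t endMagic;   /* guard after the payload */
} Comm;

typedef enum { COMM_OK = 0, COMM_NULL = 1, COMM_CORRUPT = 2 } CommStatus;

/* Fill in the payload and set both guard words. */
static void commInit(Comm* comm, int rank, int nRanks) {
  comm->startMagic = COMM_MAGIC;
  comm->rank = rank;
  comm->nRanks = nRanks;
  comm->endMagic = COMM_MAGIC;
}

/* Report a null pointer, or a pointer whose guard words do not match. */
static CommStatus commCheck(const Comm* comm) {
  if (comm == NULL) return COMM_NULL;
  if (comm->startMagic != COMM_MAGIC || comm->endMagic != COMM_MAGIC) {
    return COMM_CORRUPT;
  }
  return COMM_OK;
}
```

A check like this cannot prove that a pointer is valid, but it catches common mistakes cheaply: an overwritten guard word or a pointer to an unrelated object will usually fail the comparison, so the library can return an error instead of crashing later.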
2 participants