Fix numpy and XGMI 1-hop detection #67

Merged 2 commits on Jun 25, 2024

mawong-amd

Temporarily pin numpy < 2.0.0, as NumPy 2.0 breaks ROCm PyTorch. Counterpart to vllm-project#5582.
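To make the constraint concrete, here is a minimal sketch of a check that mirrors the `numpy < 2.0.0` pin. The function name `numpy_version_is_supported` is hypothetical and not part of this PR; the actual pin presumably lives in the project's requirements file, which is not shown here.

```python
def numpy_version_is_supported(version: str) -> bool:
    """Return True if a NumPy version string satisfies the `< 2.0.0` pin.

    Hypothetical helper for illustration only. It looks at the major
    version alone, so pre-releases such as "2.0.0rc1" are rejected too.
    """
    major = int(version.split(".")[0])
    return major < 2
```

For example, `numpy_version_is_supported("1.26.4")` is accepted, while `numpy_version_is_supported("2.0.0")` is rejected.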

Fix XGMI 1-hop detection. The previous version has the following problems:

  1. It ignores the device_ids passed in.
  2. Each device only checks whether it is itself 1-hop XGMI connected to every other device. Instead, each device should check that every device is 1-hop XGMI connected to every other device. This prevents the odd case where some devices are 1-hop XGMI connected to all other devices but others are not, in which only a subset of devices would enable custom_all_reduce, resulting in a deadlock.
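The difference between the two checks can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: the names `device_sees_all_peers`, `all_devices_fully_connected`, `is_one_hop`, and the `links` topology are all hypothetical, and a real implementation would query the ROCm topology rather than a hard-coded set.

```python
from itertools import combinations
from typing import Callable, Sequence


def device_sees_all_peers(
    device_id: int,
    device_ids: Sequence[int],
    is_one_hop: Callable[[int, int], bool],
) -> bool:
    """Per-device check (the problematic one): only this device's links."""
    return all(
        is_one_hop(device_id, other)
        for other in device_ids
        if other != device_id
    )


def all_devices_fully_connected(
    device_ids: Sequence[int],
    is_one_hop: Callable[[int, int], bool],
) -> bool:
    """Global check: every pair among the given device_ids is 1-hop connected.

    Every device computes the same answer, so all of them make the same
    decision about enabling the custom all-reduce path.
    """
    return all(is_one_hop(a, b) for a, b in combinations(device_ids, 2))


# Hypothetical topology: device 0 is linked to everyone, but 3 only reaches 0.
links = {frozenset(p) for p in [(0, 1), (0, 2), (0, 3), (1, 2)]}


def is_one_hop(a: int, b: int) -> bool:
    # Assumed stand-in for a real XGMI hop-count query.
    return frozenset((a, b)) in links
```

In this made-up topology, the per-device check passes on device 0 but fails on device 3, so the two devices would disagree about enabling the custom path. The global check gives the same answer everywhere: it fails for devices `[0, 1, 2, 3]` and passes when restricted to `[0, 1, 2]`, which also shows why honoring the passed-in device_ids matters.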

@mawong-amd mawong-amd merged commit 3e7b0b6 into main Jun 25, 2024
9 of 13 checks passed