Unable to use multiple NICs #1519
Comments
ip a output
graph dump file
@thecodingwizard it looks like you're not using the tcpxo plugin provided by GCP, which is necessary to get good performance on A3 Mega. Here is the official user guide: https://cloud.google.com/cluster-toolkit/docs/machine-learning/a3-mega-enable-gpudirect-tcpxo The fastsocket plugin is for GPU instances earlier than A3 / A3 Mega.
Thanks @wenbilliams! I'm deliberately not using GPUDirect-TCPXO because I'm using normal compute instances and I don't want to have to use Slurm/GKE. I know I won't get optimal performance, but is it possible to get NCCL to use all the NICs without GPUDirect? If each NIC is 100 Gbps, shouldn't we still be able to get close to 800 Gbps of inter-node bandwidth?
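For reference, the 800 Gbps figure above is just the per-NIC rate multiplied by the NIC count. Here is a minimal sketch of that arithmetic, assuming the 100 Gbps per-NIC number from the comment and ignoring protocol overhead; the function names are made up for illustration.

```python
def aggregate_bandwidth_gbps(num_nics: int, per_nic_gbps: float) -> float:
    """Theoretical ceiling if traffic is spread evenly over every NIC."""
    return num_nics * per_nic_gbps


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


# Assumed inputs: 8 NICs at 100 Gbps each, as described in the comment above.
total_gbps = aggregate_bandwidth_gbps(8, 100.0)
print(total_gbps, "Gbps =", gbps_to_gigabytes_per_s(total_gbps), "GB/s")
```

This is only an upper bound. With a single NIC in use, the same arithmetic gives a ceiling of 100 Gbps, which is one way to tell from measured throughput whether more than one NIC is carrying traffic.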
I switched to a new OS but kept the same networking setup. Now it seems to use 2 NICs (I think eth0 and eth5, but I could be mistaken), but it still doesn't use all 8 available to it. The most obvious difference I can see is the topology dump on my new OS now has a topology
graph
debug logs
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
Please check your PCI and GDR settings, and make sure NCCL gets the right system topology. When GDR is disabled, the graph can be built in a completely wrong way. If NCCL thinks the bandwidth of a required path has already been used up, the other NICs will go unused.
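One way to act on this is to list the network devices NCCL recorded in its topology dump and check how many it found and whether GDR is flagged on each. The sketch below assumes the dump is XML with `<net>` elements carrying `name`, `speed`, and `gdr` attributes; the sample XML is invented for illustration and is not taken from the files attached to this issue.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample; a real dump has more elements and attributes.
SAMPLE_TOPO = """<system version="1">
  <cpu numaid="0">
    <pci busid="0000:06:00.0">
      <nic>
        <net name="eth1" dev="0" speed="100000" gdr="0"/>
      </nic>
    </pci>
    <pci busid="0000:0c:00.0">
      <nic>
        <net name="eth2" dev="1" speed="100000" gdr="0"/>
      </nic>
    </pci>
  </cpu>
</system>"""


def list_nets(topo_xml: str) -> List[Dict[str, object]]:
    """Return one record per <net> element found anywhere in the dump."""
    root = ET.fromstring(topo_xml)
    nets = []
    for net in root.iter("net"):
        nets.append({
            "name": net.get("name"),
            # Assumption: speed is reported in Mbps.
            "speed_mbps": int(net.get("speed", "0")),
            "gdr": net.get("gdr") == "1",
        })
    return nets


for entry in list_nets(SAMPLE_TOPO):
    print(entry)
```

If fewer devices show up than there are NICs on the machine, the problem is likely in detection rather than in the graph search.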
I'm running NCCL on two GCP a3-megagpu-8g instances with 8 NICs attached, but NCCL is only using one of them. Do you know what I might be doing wrong / how I can troubleshoot this?
nccl's topo file
nccl debug logs with fastsocket
(for one process only)
nccl debug logs without fastsocket
(for one process only)
Benchmarking script
Please let me know if there's any other information I can provide that might be helpful. Thank you in advance for your help!
Some notes:
- a3-megagpu-8g.
- I'm using bwm-ng to monitor NIC traffic.
- Setting NCCL_SOCKET_IFNAME does change which NIC is used.
- I'm using NCCL_ALGO=ring for benchmarking since I'm only using two nodes. However, removing NCCL_ALGO=ring still only causes one NIC to be used.
- Setting NCCL_CROSS_NIC=0/1/2 did not change the number of NICs used.
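As a cross-check on the traffic monitoring mentioned in the notes, the per-interface byte counters in `/proc/net/dev` can be sampled before and during a benchmark run to see which NICs actually move data. This is a rough sketch, not what bwm-ng does internally; the function names and the 1 MB threshold are arbitrary choices for illustration.

```python
import time
from typing import Dict, List, Tuple


def parse_proc_net_dev(text: str) -> Dict[str, Tuple[int, int]]:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        fields = rest.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def active_interfaces(before: Dict[str, Tuple[int, int]],
                      after: Dict[str, Tuple[int, int]],
                      min_bytes: int = 1_000_000) -> List[str]:
    """Interfaces whose rx+tx byte count grew by at least min_bytes."""
    active = []
    for name, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(name, (0, 0))
        if (rx1 - rx0) + (tx1 - tx0) >= min_bytes:
            active.append(name)
    return sorted(active)


def sample_active(interval_s: float = 5.0) -> List[str]:
    """Take two snapshots interval_s apart and report the busy interfaces."""
    with open("/proc/net/dev") as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval_s)
    with open("/proc/net/dev") as f:
        after = parse_proc_net_dev(f.read())
    return active_interfaces(before, after)
```

Calling `sample_active()` on each node while the benchmark is running returns the names of the interfaces that crossed the threshold, which makes it easy to compare runs with different NCCL_SOCKET_IFNAME or NCCL_CROSS_NIC settings.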