Unable to use multiple NICs #1519

Open
thecodingwizard opened this issue Nov 19, 2024 · 5 comments

thecodingwizard commented Nov 19, 2024

I'm running NCCL on two GCP a3-megagpu-8g instances with 8 NICs attached, but NCCL is only using one of them. Do you know what I might be doing wrong, or how I can troubleshoot this?

NCCL's topo file (dumped with NCCL_TOPO_DUMP_FILE=topo.xml)
<system version="1">
  <cpu numaid="0" affinity="0000,00000000,0fffffff,ffffff00,00000000,000fffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:00:0c.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="" link_width="0">
      <nic>
        <net name="enp0s12" dev="0" speed="200000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
      </nic>
    </pci>
    <pci busid="0000:02:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:04:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="0" sm="90" rank="0" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:05:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="1" sm="90" rank="1" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:06:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp6s0f0" dev="1" speed="200000" port="0" latency="0.000000" guid="0x1" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:07:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp7s0f0" dev="2" speed="200000" port="0" latency="0.000000" guid="0x2" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:09:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:0b:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="2" sm="90" rank="2" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:0c:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="3" sm="90" rank="3" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:0d:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp13s0f0" dev="3" speed="200000" port="0" latency="0.000000" guid="0x3" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:0e:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp14s0f0" dev="4" speed="200000" port="0" latency="0.000000" guid="0x4" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="ffff,ffffffff,f0000000,000000ff,ffffffff,fff00000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:82:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:84:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="4" sm="90" rank="4" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:85:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="5" sm="90" rank="5" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:86:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp134s0f0" dev="5" speed="200000" port="0" latency="0.000000" guid="0x5" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:87:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp135s0f0" dev="6" speed="200000" port="0" latency="0.000000" guid="0x6" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:89:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:8b:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="6" sm="90" rank="6" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:8c:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="7" sm="90" rank="7" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:8d:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="enp141s0f0" dev="7" speed="200000" port="0" latency="0.000000" guid="0x7" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>
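
To check how the NICs in the dump relate to the GPUs, the XML can be summarized per `<net>` entry. The following is a minimal sketch, not part of NCCL: `summarize_nics` and `SAMPLE` are hypothetical names, and `SAMPLE` is a trimmed excerpt of the topo file above with most attributes dropped. It uses only Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the topo file above (most attributes dropped for brevity).
SAMPLE = """
<system version="1">
  <cpu numaid="0">
    <pci busid="0000:00:0c.0" class="0x020000">
      <nic><net name="enp0s12" dev="0" speed="200000"/></nic>
    </pci>
    <pci busid="0000:02:00.0" class="0x060400">
      <pci busid="0000:04:00.0" class="0x030200">
        <gpu dev="0" sm="90" rank="0" gdr="1"/>
      </pci>
      <pci busid="0000:06:00.0" class="0x020000">
        <nic><net name="enp6s0f0" dev="1" speed="200000"/></nic>
      </pci>
    </pci>
  </cpu>
</system>
"""


def summarize_nics(xml_text):
    """Return one dict per <net>: NUMA node, parent PCI switch, sibling GPUs."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map once.
    parent = {child: node for node in root.iter() for child in node}
    rows = []
    for net in root.iter("net"):
        nic_pci = parent[parent[net]]   # <net> -> <nic> -> the NIC's <pci>
        above = parent[nic_pci]         # a PCI switch, or <cpu> if attached directly
        under_switch = above.tag == "pci"
        # Walk up to the enclosing <cpu> to find the NUMA node.
        node = above
        while node.tag != "cpu":
            node = parent[node]
        rows.append({
            "name": net.get("name"),
            "dev": int(net.get("dev")),
            "numa": int(node.get("numaid")),
            "switch": above.get("busid") if under_switch else None,
            "gpus": [int(g.get("dev")) for g in above.iter("gpu")] if under_switch else [],
        })
    return rows


if __name__ == "__main__":
    for row in summarize_nics(SAMPLE):
        print(row)
```

Run against the full dump (for example `summarize_nics(open("topo.xml").read())`), this should show which NICs sit under the same PCI switch as which GPUs, and that `enp0s12` hangs directly off the CPU rather than a switch.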
NCCL debug logs with FastSocket

(for one process only)

(base) ec2-user@nathan-h100-1:~$ cat debug.nathan-h100-1.14492
nathan-h100-1:14492:14492 [0] NCCL INFO Bootstrap : Using enp0s12:10.0.0.6<0>
nathan-h100-1:14492:14492 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
nathan-h100-1:14492:14492 [0] NCCL INFO NET/Plugin: Loaded net plugin FastSocket (v6)
nathan-h100-1:14492:14492 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nathan-h100-1:14492:14492 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nathan-h100-1:14492:14492 [0] NCCL INFO cudaDriverVersion 12040
nathan-h100-1:14492:14492 [0] NCCL INFO NCCL version 2.20.5+cuda12.4
nathan-h100-1:14492:14492 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7f4231e00000
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : queue skip: 0
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : Using [0]enp0s12:10.0.0.6<0> [1]enp6s0f0:10.0.1.4<0> [2]enp7s0f0:10.0.2.4<0> [3]enp13s0f0:10.0.3.2<0> [4]enp14s0f0:10.0.5.2<0> [5]enp134s0f0:10.0.6.2<0> [6]enp135s0f0:10.0.7.2<0> [7]enp141s0f0:10.0.8.2<0>
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket plugin initialized
nathan-h100-1:14492:14605 [0] NCCL INFO Using non-device net plugin version 0
nathan-h100-1:14492:14605 [0] NCCL INFO Using network FastSocket
nathan-h100-1:14492:14605 [0] NCCL INFO comm 0x55d35c3a9ec0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 4000 commId 0xf9c7d9d9f296cb1f - Init START
nathan-h100-1:14492:14605 [0] NCCL INFO MNNVL busId 0x4000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:14492:14605 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_speed, ignoring
nathan-h100-1:14492:14605 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_speed, ignoring
nathan-h100-1:14492:14605 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_width, ignoring
nathan-h100-1:14492:14605 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_width, ignoring
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 0 'enp0s12'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 1 'enp6s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 2 'enp7s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 3 'enp13s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 4 'enp14s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 5 'enp134s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 6 'enp135s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 7 'enp141s0f0'
nathan-h100-1:14492:14605 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to topo.xml
nathan-h100-1:14492:14605 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
nathan-h100-1:14492:14605 [0] NCCL INFO === System : maxBw 24.0 totalBw 370.8 ===
nathan-h100-1:14492:14605 [0] NCCL INFO CPU/0 (1/1/2)
nathan-h100-1:14492:14605 [0] NCCL INFO + PCI[24.0] - PCI/2000 (10b5879610b58796)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/4000 (0)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/5000 (1)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/6000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/1 (1/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/7000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/2 (2/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO + PCI[24.0] - PCI/9000 (10b5879610b58796)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/B000 (2)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/C000 (3)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/D000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/3 (3/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/E000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/4 (4/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO + PCI[12.0] - NIC/C0
nathan-h100-1:14492:14605 [0] NCCL INFO               + NET[25.0] - NET/0 (0/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO + SYS[10.0] - CPU/1
nathan-h100-1:14492:14605 [0] NCCL INFO CPU/1 (1/1/2)
nathan-h100-1:14492:14605 [0] NCCL INFO + PCI[24.0] - PCI/82000 (10b5879610b58796)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/84000 (4)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/85000 (5)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/86000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/5 (5/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/87000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/6 (6/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO + PCI[24.0] - PCI/89000 (10b5879610b58796)
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/8B000 (6)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - GPU/8C000 (7)
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:14492:14605 [0] NCCL INFO               + PCI[24.0] - NIC/8D000
nathan-h100-1:14492:14605 [0] NCCL INFO                             + NET[25.0] - NET/7 (7/0/25.000000)
nathan-h100-1:14492:14605 [0] NCCL INFO + SYS[10.0] - CPU/0
nathan-h100-1:14492:14605 [0] NCCL INFO ==========================================
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/4000 :GPU/4000 (0/5000.000000/LOC) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/5000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (0/5000.000000/LOC) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/B000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (0/5000.000000/LOC) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/C000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (0/5000.000000/LOC) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/84000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (0/5000.000000/LOC) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/85000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (0/5000.000000/LOC) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/8B000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (0/5000.000000/LOC) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO GPU/8C000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (0/5000.000000/LOC) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/0 :GPU/4000 (4/12.000000/PHB) GPU/5000 (4/12.000000/PHB) GPU/B000 (4/12.000000/PHB) GPU/C000 (4/12.000000/PHB) GPU/84000 (5/10.000000/SYS) GPU/85000 (5/10.000000/SYS) GPU/8B000 (5/10.000000/SYS) GPU/8C000 (5/10.000000/SYS) CPU/0 (2/12.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (5/12.000000/PHB) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) NET/4 (5/12.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/1 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (4/24.000000/PIX) NET/3 (6/24.000000/PHB) NET/4 (6/24.000000/PHB) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/2 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (4/24.000000/PIX) NET/2 (0/5000.000000/LOC) NET/3 (6/24.000000/PHB) NET/4 (6/24.000000/PHB) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/3 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (6/24.000000/PHB) NET/3 (0/5000.000000/LOC) NET/4 (4/24.000000/PIX) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/4 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (6/24.000000/PHB) NET/3 (4/24.000000/PIX) NET/4 (0/5000.000000/LOC) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/5 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (0/5000.000000/LOC) NET/6 (4/24.000000/PIX) NET/7 (6/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/6 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (4/24.000000/PIX) NET/6 (0/5000.000000/LOC) NET/7 (6/24.000000/PHB)
nathan-h100-1:14492:14605 [0] NCCL INFO NET/7 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (6/24.000000/PHB) NET/6 (6/24.000000/PHB) NET/7 (0/5000.000000/LOC)
nathan-h100-1:14492:14605 [0] NCCL INFO Setting affinity for GPU 0 to 0fffffff,ffffff00,00000000,000fffff,ffffffff
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS multicast support is available on dev 0
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 1, bw 20.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/3
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 40.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 5, crossNic 0, nChannels 7, bw 3.000000/3.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/1
nathan-h100-1:14492:14605 [0] NCCL INFO  1 : NET/3 GPU/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/3
nathan-h100-1:14492:14605 [0] NCCL INFO  2 : NET/2 GPU/2 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/2
nathan-h100-1:14492:14605 [0] NCCL INFO  3 : NET/4 GPU/3 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/4
nathan-h100-1:14492:14605 [0] NCCL INFO  4 : NET/6 GPU/4 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/6
nathan-h100-1:14492:14605 [0] NCCL INFO  5 : NET/7 GPU/5 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/7
nathan-h100-1:14492:14605 [0] NCCL INFO  6 : NET/5 GPU/6 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/5
nathan-h100-1:14492:14605 [0] NCCL INFO comm 0x55d35c3a9ec0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  0:  0  8
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  1:  1  9
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  2:  2 10
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  3:  3 11
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  4:  4 12
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  5:  5 13
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Head  6:  6 14
nathan-h100-1:14492:14605 [0] NCCL INFO NVLS Trees : 17/8->0->-1 17/-1->0->8
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 00/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 01/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 02/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 03/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 04/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 05/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 06/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 07/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 08/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 09/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 10/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 11/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 12/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 13/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 14/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Channel 15/16 :    0   7   6   5   4   3   1   2   8  15  14  13  12  11   9  10
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 00 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 01 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 02 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 03 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 04 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 05 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 06 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 07 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 08 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 09 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 10 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 11 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 12 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 13 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 14 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Ring 15 : 10 -> 0 -> 7
nathan-h100-1:14492:14605 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->7 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7 [8] 1/-1/-1->0->7 [9] 1/-1/-1->0->7 [10] 1/-1/-1->0->7 [11] 1/-1/-1->0->7 [12] 1/-1/-1->0->7 [13] 1/-1/-1->0->7 [14] 1/-1/-1->0->7 [15] 1/-1/-1->0->7
nathan-h100-1:14492:14605 [0] NCCL INFO P2P Chunksize set to 131072
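
One way to see which NICs the topology search actually selected is to pull the `Pattern ...` lines and the per-channel `NET/x ... NET/y` lines out of the log. The following is a minimal sketch that assumes the line layout shown in this log (it may differ in other NCCL versions). `nics_per_pattern` and `SAMPLE` are hypothetical names, and `SAMPLE` is an excerpt of the lines above, not the full log.

```python
import re

# Excerpt of the log above: three search results with some of their channel lines.
SAMPLE = """\
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 4, crossNic 1, nChannels 1, bw 20.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/3
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 40.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
nathan-h100-1:14492:14605 [0] NCCL INFO Pattern 5, crossNic 0, nChannels 7, bw 3.000000/3.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:14492:14605 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/1
nathan-h100-1:14492:14605 [0] NCCL INFO  1 : NET/3 GPU/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/3
"""

PATTERN_RE = re.compile(r"NCCL INFO Pattern (\d+), crossNic \d+, nChannels (\d+)")
# Channel lines look like "NCCL INFO  0 : NET/1 GPU/0 ... NET/3".
CHANNEL_RE = re.compile(r"NCCL INFO\s+\d+ : (.*)$")
NET_RE = re.compile(r"NET/(\d+)")


def nics_per_pattern(log_text):
    """Group the NET indices listed on channel lines under the preceding Pattern line."""
    results = []
    current = None
    for line in log_text.splitlines():
        m = PATTERN_RE.search(line)
        if m:
            current = {
                "pattern": int(m.group(1)),
                "nChannels": int(m.group(2)),
                "nics": set(),
            }
            results.append(current)
            continue
        m = CHANNEL_RE.search(line)
        if m and current is not None:
            current["nics"].update(int(n) for n in NET_RE.findall(m.group(1)))
    return results


if __name__ == "__main__":
    for entry in nics_per_pattern(SAMPLE):
        print(entry["pattern"], entry["nChannels"], sorted(entry["nics"]))
```

In the log above, the `Pattern 4` and `Pattern 1` results each report `nChannels 1`, while `Pattern 5` reports `nChannels 7` spread across NET/1 through NET/7, so the number of NICs in use depends on which search result the selected algorithm relies on.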

# truncated for length

nathan-h100-1:14492:14611 [0] NCCL INFO proxyProgressAsync opId=0x7f41fd8e2da0 op.type=4 op.reqBuff=0x7f4240102620 op.respSize=21040 done
nathan-h100-1:14492:14605 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f41fd8e2da0
nathan-h100-1:14492:14611 [0] NCCL INFO Received and initiated operation=Connect res=0
nathan-h100-1:14492:14605 [0] NCCL INFO resp.opId=0x7f41fd8e2da0 matches expected opId=0x7f41fd8e2da0
nathan-h100-1:14492:14605 [0] NCCL INFO recvConnect ncclPollProxyResponse opId=0x7f41fd8e2da0
nathan-h100-1:14492:14605 [0] NCCL INFO Connected NVLS tree
nathan-h100-1:14492:14605 [0] NCCL INFO NCCL_ALGO set by environment to ring
nathan-h100-1:14492:14605 [0] NCCL INFO   Algorithm   |                            Tree                  |                            Ring                  |                   CollNetDirect                  |
nathan-h100-1:14492:14605 [0] NCCL INFO   Protocol    |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |
nathan-h100-1:14492:14605 [0] NCCL INFO  Max NThreads |            512 |            640 |            512 |            512 |            640 |            512 |              0 |              0 |            640 |
nathan-h100-1:14492:14605 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     9.3/   5.0 |    18.0/   0.0 |    22.4/  20.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     9.3/   5.0 |    18.0/   0.0 |    22.4/  20.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    23.3/   5.3 |    44.6/   0.0 |    70.0/  21.3 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    23.3/   5.3 |    44.6/   0.0 |    70.0/  21.3 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO     AllReduce |    25.2/   0.0 |    48.5/   0.0 |   448.0/   0.0 |    43.4/   2.7 |    79.4/   0.0 |   152.8/  10.7 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO   Algorithm   |                    CollNetChain                  |                            NVLS                  |                        NVLSTree                  |
nathan-h100-1:14492:14605 [0] NCCL INFO   Protocol    |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |
nathan-h100-1:14492:14605 [0] NCCL INFO  Max NThreads |              0 |              0 |            640 |              0 |              0 |            640 |              0 |              0 |            640 |
nathan-h100-1:14492:14605 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    41.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    41.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO     AllReduce |     0.0/   0.0 |     0.0/   0.0 |    69.2/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    41.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    51.0/   0.0 |
nathan-h100-1:14492:14605 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
nathan-h100-1:14492:14605 [0] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
nathan-h100-1:14492:14611 [0] NCCL INFO New proxy send connection 112 from local rank 0, transport 2
nathan-h100-1:14492:14611 [0] NCCL INFO proxyProgressAsync opId=0x7f41fcddbe40 op.type=1 op.reqBuff=0x7f42401ad980 op.respSize=16 done
nathan-h100-1:14492:14611 [0] NCCL INFO Received and initiated operation=Init res=0
nathan-h100-1:14492:14605 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f41fcddbe40
nathan-h100-1:14492:14605 [0] NCCL INFO resp.opId=0x7f41fcddbe40 matches expected opId=0x7f41fcddbe40
nathan-h100-1:14492:14605 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x7f42400083b0
nathan-h100-1:14492:14611 [0] NCCL INFO Allocated shareable buffer 0xa3a000000 size 33554432 ipcDesc 0x7f424022f5f8
nathan-h100-1:14492:14611 [0] NCCL INFO proxyProgressAsync opId=0x7f41fcddbe40 op.type=2 op.reqBuff=0x7f42401079c0 op.respSize=0 done
nathan-h100-1:14492:14611 [0] NCCL INFO Received and initiated operation=SharedInit res=0
nathan-h100-1:14492:14605 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f41fcddbe40
nathan-h100-1:14492:14605 [0] NCCL INFO resp.opId=0x7f41fcddbe40 matches expected opId=0x7f41fcddbe40
nathan-h100-1:14492:14605 [0] NCCL INFO init.cc:401 Cuda Alloc Size 8784 pointer 0x7f41cd400000
nathan-h100-1:14492:14605 [0] NCCL INFO init.cc:429 Cuda Host Alloc Size 33554432 pointer 0x7f41ca000000
nathan-h100-1:14492:14605 [0] NCCL INFO init.cc:435 Cuda Host Alloc Size 128 pointer 0x7f4231e00200
nathan-h100-1:14492:14605 [0] NCCL INFO comm 0x55d35c3a9ec0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 4000 commId 0xf9c7d9d9f296cb1f - Init COMPLETE
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f40d8800200 recvbuff 0x7f40d8800200 count 1 datatype 1 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
nathan-h100-1:14492:14492 [0] NCCL INFO 1 Bytes -> Algo 1 proto 0 time 43.400372
nathan-h100-1:14492:14761 [0] NCCL INFO Comm 0x7f4240079580 thread 1 started
nathan-h100-1:14492:14761 [0] NCCL INFO Comm 0x7f4240079580 thread 1 binding to core -2
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14766 [0] NCCL INFO Comm 0x7f424007d930 thread 1 started
nathan-h100-1:14492:14766 [0] NCCL INFO Comm 0x7f424007d930 thread 1 binding to core -2
nathan-h100-1:14492:14767 [0] NCCL INFO Comm 0x7f424007d930 thread 2 started
nathan-h100-1:14492:14767 [0] NCCL INFO Comm 0x7f424007d930 thread 2 binding to core -2
nathan-h100-1:14492:14769 [0] NCCL INFO Comm 0x7f424007d930 thread 3 started
nathan-h100-1:14492:14769 [0] NCCL INFO Comm 0x7f424007d930 thread 3 binding to core -2
nathan-h100-1:14492:14772 [0] NCCL INFO Comm 0x7f424007d930 thread 0 started
nathan-h100-1:14492:14772 [0] NCCL INFO Comm 0x7f424007d930 thread 0 binding to core -2
nathan-h100-1:14492:14773 [0] NCCL INFO Comm 0x7f4240079580 thread 2 started
nathan-h100-1:14492:14773 [0] NCCL INFO Comm 0x7f4240079580 thread 2 binding to core -2
nathan-h100-1:14492:14774 [0] NCCL INFO Comm 0x7f4240079580 thread 3 started
nathan-h100-1:14492:14774 [0] NCCL INFO Comm 0x7f4240079580 thread 3 binding to core -2
nathan-h100-1:14492:14775 [0] NCCL INFO Comm 0x7f4240079580 thread 0 started
nathan-h100-1:14492:14775 [0] NCCL INFO Comm 0x7f4240079580 thread 0 binding to core -2
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 5 sendbuff 0x7f40d8800000 recvbuff 0x7f40d8800000 count 1 datatype 1 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 1 Bytes -> Algo 1 proto 0 time 43.400372
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 6 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 7 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 8 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 9 sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount a sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount b sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount c sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount d sendbuff 0x7f3fea000000 recvbuff 0x7f3fea000000 count 1000000000 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4000000000 Bytes -> Algo 1 proto 2 time 375213.906250
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount e sendbuff 0x7f40d8800200 recvbuff 0x7f40d8800200 count 1 datatype 1 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 1 Bytes -> Algo 1 proto 0 time 43.400372
nathan-h100-1:14492:14492 [0] NCCL INFO Reduce: opCount f sendbuff 0x7f40d8800200 recvbuff 0x7f40d8800200 count 1 datatype 7 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 4 Bytes -> Algo 1 proto 0 time 9.300800
nathan-h100-1:14492:14492 [0] NCCL INFO AllReduce: opCount 10 sendbuff 0x7f40d8800000 recvbuff 0x7f40d8800000 count 1 datatype 1 op 0 root 0 comm 0x55d35c3a9ec0 [nranks=16] stream 0x55d35c3ae4d0
nathan-h100-1:14492:14492 [0] NCCL INFO 1 Bytes -> Algo 1 proto 0 time 43.400372
nathan-h100-1:14492:14611 [0] NCCL INFO [Service thread] Connection closed by localRank 0
nathan-h100-1:14492:14614 [0] NCCL INFO [Proxy Service UDS] exit: stop 1 abortFlag 1
nathan-h100-1:14492:14611 [0] NCCL INFO All bytes: 0
nathan-h100-1:14492:14782 [0] NCCL INFO NVLS Unbind MC handle 7f41fd9cdee0 size 1610612736 dev 0
nathan-h100-1:14492:14782 [0] NCCL INFO NVLS Unmap mem UC handle 0x7f41fd9ce700(0xa40000000) MC handle 0x7f41fd9cdee0(0xaa0000000)
nathan-h100-1:14492:14782 [0] NCCL INFO comm 0x55d35c3a9ec0 rank 0 nranks 16 cudaDev 0 busId 4000 - Abort COMPLETE
nccl debug logs without fastsocket

(for one process only)

nathan-h100-1:11902:11902 [7] NCCL INFO cudaDriverVersion 12040
nathan-h100-1:11902:11902 [7] NCCL INFO Bootstrap : Using enp0s12:10.0.0.6<0>
nathan-h100-1:11902:11902 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
nathan-h100-1:11902:11902 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin symbol (>= v5). ncclNetPlugin symbols v4 and lower are not supported.
nathan-h100-1:11902:11902 [7] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7f7a89e00000
nathan-h100-1:11902:11975 [7] NCCL INFO NET/IB : No device found.
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : Using [0]enp0s12:10.0.0.6<0> [1]enp6s0f0:10.0.1.4<0> [2]enp7s0f0:10.0.2.4<0> [3]enp13s0f0:10.0.3.2<0> [4]enp14s0f0:10.0.5.2<0> [5]enp134s0f0:10.0.6.2<0> [6]enp135s0f0:10.0.7.2<0> [7]enp141s0f0:10.0.8.2<0>
nathan-h100-1:11902:11975 [7] NCCL INFO Using non-device net plugin version 0
nathan-h100-1:11902:11975 [7] NCCL INFO Using network Socket
nathan-h100-1:11902:11975 [7] NCCL INFO comm 0x562d61d79a30 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId 8c000 commId 0xec21f1f46e3c366f - Init START
nathan-h100-1:11902:11975 [7] NCCL INFO MNNVL busId 0x8c000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:11902:11975 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_speed, ignoring
nathan-h100-1:11902:11975 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_speed, ignoring
nathan-h100-1:11902:11975 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_width, ignoring
nathan-h100-1:11902:11975 [7] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_width, ignoring
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'enp0s12'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 1 'enp6s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 2 'enp7s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 3 'enp13s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 4 'enp14s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 5 'enp134s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 6 'enp135s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 7 'enp141s0f0'
nathan-h100-1:11902:11975 [7] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
nathan-h100-1:11902:11975 [7] NCCL INFO === System : maxBw 24.0 totalBw 370.8 ===
nathan-h100-1:11902:11975 [7] NCCL INFO CPU/0 (1/1/2)
nathan-h100-1:11902:11975 [7] NCCL INFO + PCI[24.0] - PCI/2000 (10b5879610b58796)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/4000 (0)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/5000 (1)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/6000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/1 (1/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/7000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/2 (2/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO + PCI[24.0] - PCI/9000 (10b5879610b58796)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/B000 (2)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/C000 (3)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/D000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/3 (3/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/E000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/4 (4/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO + PCI[12.0] - NIC/C0
nathan-h100-1:11902:11975 [7] NCCL INFO               + NET[25.0] - NET/0 (0/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO + SYS[10.0] - CPU/1
nathan-h100-1:11902:11975 [7] NCCL INFO CPU/1 (1/1/2)
nathan-h100-1:11902:11975 [7] NCCL INFO + PCI[24.0] - PCI/82000 (10b5879610b58796)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/84000 (4)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/85000 (5)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/86000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/5 (5/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/87000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/6 (6/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO + PCI[24.0] - PCI/89000 (10b5879610b58796)
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/8B000 (6)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - GPU/8C000 (7)
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:11902:11975 [7] NCCL INFO               + PCI[24.0] - NIC/8D000
nathan-h100-1:11902:11975 [7] NCCL INFO                             + NET[25.0] - NET/7 (7/0/25.000000)
nathan-h100-1:11902:11975 [7] NCCL INFO + SYS[10.0] - CPU/0
nathan-h100-1:11902:11975 [7] NCCL INFO ==========================================
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/4000 :GPU/4000 (0/5000.000000/LOC) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/5000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (0/5000.000000/LOC) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/B000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (0/5000.000000/LOC) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/C000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (0/5000.000000/LOC) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (2/24.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (4/12.000000/PHB) NET/1 (5/24.000000/PHB) NET/2 (5/24.000000/PHB) NET/3 (5/24.000000/PHB) NET/4 (5/24.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/84000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (0/5000.000000/LOC) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/85000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (0/5000.000000/LOC) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/8B000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (0/5000.000000/LOC) GPU/8C000 (2/370.800018/NVL) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO GPU/8C000 :GPU/4000 (2/370.800018/NVL) GPU/5000 (2/370.800018/NVL) GPU/B000 (2/370.800018/NVL) GPU/C000 (2/370.800018/NVL) GPU/84000 (2/370.800018/NVL) GPU/85000 (2/370.800018/NVL) GPU/8B000 (2/370.800018/NVL) GPU/8C000 (0/5000.000000/LOC) NVS/0 (1/370.800018/NVL) CPU/0 (3/10.000000/SYS) CPU/1 (2/24.000000/PHB) NET/0 (5/10.000000/SYS) NET/1 (6/10.000000/SYS) NET/2 (6/10.000000/SYS) NET/3 (6/10.000000/SYS) NET/4 (6/10.000000/SYS) NET/5 (5/24.000000/PHB) NET/6 (5/24.000000/PHB) NET/7 (5/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/0 :GPU/4000 (4/12.000000/PHB) GPU/5000 (4/12.000000/PHB) GPU/B000 (4/12.000000/PHB) GPU/C000 (4/12.000000/PHB) GPU/84000 (5/10.000000/SYS) GPU/85000 (5/10.000000/SYS) GPU/8B000 (5/10.000000/SYS) GPU/8C000 (5/10.000000/SYS) CPU/0 (2/12.000000/PHB) CPU/1 (3/10.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (5/12.000000/PHB) NET/2 (5/12.000000/PHB) NET/3 (5/12.000000/PHB) NET/4 (5/12.000000/PHB) NET/5 (6/10.000000/SYS) NET/6 (6/10.000000/SYS) NET/7 (6/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/1 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (0/5000.000000/LOC) NET/2 (4/24.000000/PIX) NET/3 (6/24.000000/PHB) NET/4 (6/24.000000/PHB) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/2 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (4/24.000000/PIX) NET/2 (0/5000.000000/LOC) NET/3 (6/24.000000/PHB) NET/4 (6/24.000000/PHB) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/3 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (6/24.000000/PHB) NET/3 (0/5000.000000/LOC) NET/4 (4/24.000000/PIX) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/4 :GPU/4000 (5/24.000000/PHB) GPU/5000 (5/24.000000/PHB) GPU/B000 (5/24.000000/PHB) GPU/C000 (5/24.000000/PHB) GPU/84000 (6/10.000000/SYS) GPU/85000 (6/10.000000/SYS) GPU/8B000 (6/10.000000/SYS) GPU/8C000 (6/10.000000/SYS) CPU/0 (3/24.000000/PHB) CPU/1 (4/10.000000/SYS) NET/0 (5/12.000000/PHB) NET/1 (6/24.000000/PHB) NET/2 (6/24.000000/PHB) NET/3 (4/24.000000/PIX) NET/4 (0/5000.000000/LOC) NET/5 (7/10.000000/SYS) NET/6 (7/10.000000/SYS) NET/7 (7/10.000000/SYS)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/5 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (0/5000.000000/LOC) NET/6 (4/24.000000/PIX) NET/7 (6/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/6 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (4/24.000000/PIX) NET/6 (0/5000.000000/LOC) NET/7 (6/24.000000/PHB)
nathan-h100-1:11902:11975 [7] NCCL INFO NET/7 :GPU/4000 (6/10.000000/SYS) GPU/5000 (6/10.000000/SYS) GPU/B000 (6/10.000000/SYS) GPU/C000 (6/10.000000/SYS) GPU/84000 (5/24.000000/PHB) GPU/85000 (5/24.000000/PHB) GPU/8B000 (5/24.000000/PHB) GPU/8C000 (5/24.000000/PHB) CPU/0 (4/10.000000/SYS) CPU/1 (3/24.000000/PHB) NET/0 (6/10.000000/SYS) NET/1 (7/10.000000/SYS) NET/2 (7/10.000000/SYS) NET/3 (7/10.000000/SYS) NET/4 (7/10.000000/SYS) NET/5 (6/24.000000/PHB) NET/6 (6/24.000000/PHB) NET/7 (0/5000.000000/LOC)
nathan-h100-1:11902:11975 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffffff,f0000000,000000ff,ffffffff,fff00000,00000000
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS multicast support is available on dev 7
nathan-h100-1:11902:11975 [7] NCCL INFO Pattern 4, crossNic 1, nChannels 1, bw 20.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:11902:11975 [7] NCCL INFO  0 : NET/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 GPU/1 GPU/2 NET/3
nathan-h100-1:11902:11975 [7] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 40.000000/20.000000, type NVL/PHB, sameChannels 1
nathan-h100-1:11902:11975 [7] NCCL INFO  0 : NET/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
nathan-h100-1:11902:11975 [7] NCCL INFO Pattern 5, crossNic 0, nChannels 7, bw 3.000000/3.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:11902:11975 [7] NCCL INFO  0 : NET/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/1
nathan-h100-1:11902:11975 [7] NCCL INFO  1 : NET/3 GPU/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/3
nathan-h100-1:11902:11975 [7] NCCL INFO  2 : NET/2 GPU/2 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/2
nathan-h100-1:11902:11975 [7] NCCL INFO  3 : NET/4 GPU/3 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/4
nathan-h100-1:11902:11975 [7] NCCL INFO  4 : NET/6 GPU/4 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/6
nathan-h100-1:11902:11975 [7] NCCL INFO  5 : NET/7 GPU/5 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/7
nathan-h100-1:11902:11975 [7] NCCL INFO  6 : NET/5 GPU/6 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/5
nathan-h100-1:11902:11975 [7] NCCL INFO comm 0x562d61d79a30 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  0:  0  8
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  1:  1  9
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  2:  2 10
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  3:  3 11
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  4:  4 12
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  5:  5 13
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Head  6:  6 14
nathan-h100-1:11902:11975 [7] NCCL INFO NVLS Trees : 16/-1->7->-1 16/-1->7->-1
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 00 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 01 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 02 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 03 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 04 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 05 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 06 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 07 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 08 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 09 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 10 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 11 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 12 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 13 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 14 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Ring 15 : 0 -> 7 -> 6
nathan-h100-1:11902:11975 [7] NCCL INFO Trees [0] 0/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] 0/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/-1/-1->7->6 [8] 0/-1/-1->7->6 [9] 0/-1/-1->7->6 [10] 0/-1/-1->7->6 [11] 0/-1/-1->7->6 [12] 0/-1/-1->7->6 [13] 0/-1/-1->7->6 [14] 0/-1/-1->7->6 [15] 0/-1/-1->7->6
nathan-h100-1:11902:11975 [7] NCCL INFO P2P Chunksize set to 131072
nathan-h100-1:11902:11975 [7] NCCL INFO UDS: Creating service thread comm 0x562d61d79a30 rank 7

# truncated

nathan-h100-1:11902:11902 [7] NCCL INFO AllReduce: opCount 10 sendbuff 0x7f7946800000 recvbuff 0x7f7946800000 count 1 datatype 1 op 0 root 0 comm 0x562d61d79a30 [nranks=16] stream 0x562d61d7d1d0
nathan-h100-1:11902:12007 [7] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
nathan-h100-1:11902:12004 [7] NCCL INFO [Service thread] Connection closed by localRank 7
nathan-h100-1:11902:12257 [7] NCCL INFO NVLS Unbind MC handle 7f7a4997f7c0 size 1610612736 dev 7
nathan-h100-1:11902:12257 [7] NCCL INFO NVLS Unmap mem UC handle 0x7f7a4997ffe0(0xa40000000) MC handle 0x7f7a4997f7c0(0xaa0000000)
nathan-h100-1:11902:12257 [0] NCCL INFO comm 0x562d61d79a30 rank 7 nranks 16 cudaDev 7 busId 8c000 - Abort COMPLETE
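
To make the graph search results in a log like the one above easier to compare between runs, here is a minimal sketch that pulls out each `Pattern ...` line and the NET devices named on the channel lines that follow it. The function name `summarize_graphs` and the regular expressions are my own, written against the line format visible in this log only; other NCCL versions may print these lines differently. The two `bw` numbers are stored as printed, without interpretation.

```python
import re

# Matches e.g. "... Pattern 4, crossNic 1, nChannels 1, bw 20.000000/20.000000, ..."
PATTERN_RE = re.compile(
    r"Pattern (\d+), crossNic (\d+), nChannels (\d+), bw ([\d.]+)/([\d.]+)"
)
# Matches the per-channel lines that follow, e.g. "... NCCL INFO  0 : NET/1 GPU/0 ... NET/3"
CHANNEL_RE = re.compile(r"NCCL INFO\s+(\d+) : (.*)$")


def summarize_graphs(log_text):
    """Return one dict per 'Pattern' line, with the NET devices its channels mention."""
    graphs = []
    for line in log_text.splitlines():
        m = PATTERN_RE.search(line)
        if m:
            graphs.append({
                "pattern": int(m.group(1)),
                "cross_nic": int(m.group(2)),
                "n_channels": int(m.group(3)),
                "bw": (float(m.group(4)), float(m.group(5))),  # as printed
                "nets": set(),
            })
            continue
        m = CHANNEL_RE.search(line)
        if m and graphs:
            # Attach NET ids from a channel line to the most recent Pattern line.
            graphs[-1]["nets"].update(re.findall(r"NET/(\d+)", m.group(2)))
    return graphs
```

Calling `summarize_graphs(open("debug.<host>.<pid>").read())` on a debug file (the file name here is a placeholder) gives a quick view of how many channels each pattern got and which NET devices they touch.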
Benchmarking script
# bench.py

import os
import torch
import torch.distributed as dist

WARMUP_ITERS, TRIALS = 5, 50

# payload dimensions: the tensor created below holds N * M fp32 values (N * M * 4 bytes)
N = 500000
M = 2000


def sync_all():
    torch.cuda.synchronize()
    dist.barrier()


def timed_allreduce(mat, start_event, end_event, warmup_iters, iters):
    sync_all()
    for _ in range(warmup_iters):
        dist.all_reduce(mat)
    sync_all()

    start_event.record()
    for _ in range(iters):
        dist.all_reduce(mat)
    end_event.record()

    sync_all()
    duration = start_event.elapsed_time(end_event) / 1000
    avg_duration = duration / iters

    n = dist.get_world_size()
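    size = M * N * 4  # 4 bytes per fp32 element
    # note that this is following the same math as NVIDIA/nccl-tests
    # keep the result on the same GPU as `mat` rather than relying on the module-level `local_rank`
    algbw = torch.tensor([size / avg_duration], device=mat.device)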

    # calculate mean across all ranks
    dist.reduce(algbw, dst=0, op=dist.ReduceOp.SUM)
    algbw /= n

    return algbw.item()

def run(local_rank):
    is_global_rank_0 = dist.get_rank() == 0

    mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    algbw = timed_allreduce(mat, start_event, end_event, warmup_iters=WARMUP_ITERS, iters=TRIALS)

    # the 2*(n-1)/n busbw correction factor specific to all-reduce is explained here:
    # https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#allreduce
    # busbw reflects how optimally the hardware is used
    n = dist.get_world_size()
    busbw = algbw * (2*(n - 1) / n)

    if is_global_rank_0:
        print(f"The average bandwidth of all_reduce with a {M*N*4/1e9}GB payload ({TRIALS} trials, {n} ranks):\n",
              f"algbw: {algbw/1e9:.3f} GBps ({algbw*8/1e9:.1f} Gbps)\n",
              f"busbw: {busbw/1e9:.3f} GBps ({busbw*8/1e9:.1f} Gbps)\n",
        )

def init_processes(local_rank, fn, backend='nccl'):
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend, device_id=torch.device(f"cuda:{local_rank}"))
    if dist.get_rank() == 0:
        print("Starting benchmark...")

    fn(local_rank)

    sync_all()
    dist.destroy_process_group()


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    init_processes(local_rank=local_rank, fn=run)

NCCL_ALGO=ring NCCL_TOPO_DUMP_FILE=topo.xml NCCL_DEBUG_FILE='debug.%h.%p' NCCL_DEBUG=info NCCL_DEBUG_SUBSYS=ALL LD_LIBRARY_PATH=$(pwd)/nccl-fastsocket/bazel-bin:/opt/conda/lib/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.0.0.6 --master_port 29500 --max_restarts 0 bench.py
The average bandwidth of all_reduce with a 4.0GB payload (8 trials, 16 ranks):
 algbw: 5.271 GBps (42.2 Gbps)
 busbw: 9.883 GBps (79.1 Gbps)
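
As a quick sanity check on the numbers above, the busbw figure can be reproduced from the reported algbw with the same 2*(n-1)/n factor that bench.py applies. This is only a sketch of the arithmetic; the helper name `busbw_from_algbw` is made up for illustration and is not part of the script.

```python
def busbw_from_algbw(algbw, n_ranks):
    """Scale algorithm bandwidth by the all-reduce factor 2*(n-1)/n."""
    return algbw * (2 * (n_ranks - 1) / n_ranks)


# Values copied from the run above: 5.271 GBps algbw on 16 ranks.
algbw_gbytes = 5.271
busbw_gbytes = busbw_from_algbw(algbw_gbytes, n_ranks=16)
print(f"busbw: {busbw_gbytes:.3f} GBps ({busbw_gbytes * 8:.1f} Gbps)")
# prints: busbw: 9.883 GBps (79.1 Gbps)
```

With 16 ranks the factor is 1.875, so the printed busbw is consistent with the printed algbw.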

Please let me know if there's any other information I can provide that might be helpful. Thank you in advance for your help!

Some notes:

  • I am not running on GCP's GKE / Slurm setup, so I do not have RDMA enabled. I am just running on two spot instances of a3-megagpu-8g.
  • The NICs are all in distinct subnets (I think).
  • I'm using gVNIC 1.4.4.
  • I tried this with and without nccl-fastsocket. There is some performance improvement when using nccl-fastsocket, but it still only uses one NIC.
  • I'm using bwm-ng to monitor NIC traffic.
  • Setting NCCL_SOCKET_IFNAME does change which NIC is used.
  • I set NCCL_ALGO=ring for benchmarking since I'm only using two nodes. However, even without NCCL_ALGO=ring, only one NIC is used.
  • Setting NCCL_CROSS_NIC=0/1/2 did not change the number of NICs used.
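
For anyone who wants to reproduce the per-NIC traffic check without bwm-ng, here is a rough stand-in that reads the Linux byte counters in /proc/net/dev twice and prints the rate per interface. It is a sketch under assumptions, not what bwm-ng does internally: the function names are hypothetical, and it assumes the standard /proc/net/dev layout (two header lines, then received bytes in column 0 and transmitted bytes in column 8 after the interface name).

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_gbps(before, after, seconds):
    """Return {interface: (rx_gbps, tx_gbps)} between two counter snapshots."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        rx0, tx0 = before.get(name, (rx1, tx1))
        rates[name] = ((rx1 - rx0) * 8 / seconds / 1e9,
                       (tx1 - tx0) * 8 / seconds / 1e9)
    return rates


def sample_rates(interval=1.0, path="/proc/net/dev"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_net_dev(f.read())
    return rates_gbps(before, after, interval)
```

Calling `sample_rates(1.0)` on each node while the benchmark runs gives one (rx, tx) pair in Gbps per interface, which is enough to see whether traffic is spread across the NICs or concentrated on one.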
@thecodingwizard
Author

thecodingwizard commented Nov 19, 2024

ip a output
(base) ec2-user@nathan-h100-1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:00:06 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.6/32 brd 10.0.0.6 scope global dynamic enp0s12
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:6/64 scope link
       valid_lft forever preferred_lft forever
3: enp6s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:01:04 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.4/32 brd 10.0.1.4 scope global dynamic enp6s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:104/64 scope link
       valid_lft forever preferred_lft forever
4: enp7s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:02:04 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.4/32 brd 10.0.2.4 scope global dynamic enp7s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:204/64 scope link
       valid_lft forever preferred_lft forever
5: enp13s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:03:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.2/32 brd 10.0.3.2 scope global dynamic enp13s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:302/64 scope link
       valid_lft forever preferred_lft forever
6: enp14s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:05:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.2/32 brd 10.0.5.2 scope global dynamic enp14s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:502/64 scope link
       valid_lft forever preferred_lft forever
7: enp134s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:06:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.6.2/32 brd 10.0.6.2 scope global dynamic enp134s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:602/64 scope link
       valid_lft forever preferred_lft forever
8: enp135s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:07:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.7.2/32 brd 10.0.7.2 scope global dynamic enp135s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:702/64 scope link
       valid_lft forever preferred_lft forever
9: enp141s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1894 qdisc mq state UP group default qlen 1000
    link/ether 42:01:0a:00:08:02 brd ff:ff:ff:ff:ff:ff
    inet 10.0.8.2/32 brd 10.0.8.2 scope global dynamic enp141s0f0
       valid_lft 85953sec preferred_lft 85953sec
    inet6 fe80::4001:aff:fe00:802/64 scope link
       valid_lft forever preferred_lft forever
10: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:3a:5a:cc:3f brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
graph dump file
<graphs version="1">
  <graph id="0" pattern="4" crossnic="1" nchannels="1" speedintra="20" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="1">
    <channel>
      <net dev="1"/>
      <gpu dev="0"/>
      <gpu dev="7"/>
      <gpu dev="6"/>
      <gpu dev="5"/>
      <gpu dev="4"/>
      <gpu dev="3"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <net dev="3"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="1" speedintra="40" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="1">
    <channel>
      <net dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
      <gpu dev="4"/>
      <gpu dev="5"/>
      <gpu dev="6"/>
      <gpu dev="7"/>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <net dev="1"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
  <graph id="3" pattern="5" crossnic="0" nchannels="7" speedintra="3" speedinter="3" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="0">
    <channel>
      <net dev="1"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="1"/>
    </channel>
    <channel>
      <net dev="3"/>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="3"/>
    </channel>
    <channel>
      <net dev="2"/>
      <gpu dev="2"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="2"/>
    </channel>
    <channel>
      <net dev="4"/>
      <gpu dev="3"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="4"/>
    </channel>
    <channel>
      <net dev="6"/>
      <gpu dev="4"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="6"/>
    </channel>
    <channel>
      <net dev="7"/>
      <gpu dev="5"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="7"/>
    </channel>
    <channel>
      <net dev="5"/>
      <gpu dev="6"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <gpu dev="0"/>
      <net dev="5"/>
    </channel>
  </graph>
</graphs>

@wenbilliams
Collaborator

@thecodingwizard it looks like you're not using the tcpxo plugin provided by GCP, which is necessary to get good performance on A3 mega. Here is the official user guide: https://cloud.google.com/cluster-toolkit/docs/machine-learning/a3-mega-enable-gpudirect-tcpxo

The fastsocket plugin is for earlier GPU instances than A3 / A3 mega.

@thecodingwizard
Author

Thanks @wenbilliams! I'm deliberately not using GPUDirect-TCPXO because I'm using normal compute instances and I don't want to have to use Slurm/GKE.

I know I won't get optimal performance, but is it possible to get NCCL to use all the NICs without GPUDirect? If each NIC is 100Gbps, shouldn't we still be able to get close to 800Gbps of inter-node bandwidth?

@thecodingwizard
Author

thecodingwizard commented Nov 20, 2024

I switched to a new OS but kept the same networking setup. Now it seems to use 2 NICs (I think eth0 and eth5, but I could be mistaken), but it still doesn't use all 8 available to it.

The most obvious difference I can see is that the topology dump on my new OS now has a host_hash attribute that wasn't present on my old OS (and the graph is very different).

topology
<system version="1">
  <cpu host_hash="0xc0c32b4e9fcb30ed" numaid="0" affinity="0000,00000000,0fffffff,ffffff00,00000000,000fffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:00:0c.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="" link_width="0">
      <nic>
        <net name="eth0" dev="0" speed="200000" port="0" latency="0.000000" guid="0x0" maxconn="65536" gdr="0"/>
      </nic>
    </pci>
    <pci busid="0000:02:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:04:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="0" sm="90" rank="0" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:05:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="1" sm="90" rank="1" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:06:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth1" dev="1" speed="200000" port="0" latency="0.000000" guid="0x1" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:07:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth2" dev="2" speed="200000" port="0" latency="0.000000" guid="0x2" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:09:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:0b:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="2" sm="90" rank="2" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:0c:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="3" sm="90" rank="3" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:0d:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth3" dev="3" speed="200000" port="0" latency="0.000000" guid="0x3" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:0e:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth4" dev="4" speed="200000" port="0" latency="0.000000" guid="0x4" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
    <nic>
      <net name="modalsvc0" dev="8" speed="10000" port="0" latency="0.000000" guid="0x8" maxconn="65536" gdr="0"/>
      <net name="tailscale0" dev="9" speed="10000" port="0" latency="0.000000" guid="0x9" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
  <cpu host_hash="0xc0c32b4e9fcb30ed" numaid="1" affinity="ffff,ffffffff,f0000000,000000ff,ffffffff,fff00000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:82:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:84:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="4" sm="90" rank="4" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:85:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="5" sm="90" rank="5" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:86:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth5" dev="5" speed="200000" port="0" latency="0.000000" guid="0x5" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
      <pci busid="0000:87:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth6" dev="6" speed="200000" port="0" latency="0.000000" guid="0x6" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:89:00.0" class="0x060400" vendor="0x10b5" device="0x8796" subsystem_vendor="0x10b5" subsystem_device="0x8796" link_speed="16.0 GT/s PCIe" link_width="16">
      <pci busid="0000:8b:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="6" sm="90" rank="6" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:8c:00.0" class="0x030200" vendor="0x10de" device="0x2330" subsystem_vendor="0x10de" subsystem_device="0x16c1" link_speed="16.0 GT/s PCIe" link_width="16">
        <gpu dev="7" sm="90" rank="7" gdr="1">
          <nvlink target="fffffff:ff:ff.0" count="18" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:8d:00.0" class="0x020000" vendor="0x1ae0" device="0x0042" subsystem_vendor="0x1ae0" subsystem_device="0x0058" link_speed="16.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="eth7" dev="7" speed="200000" port="0" latency="0.000000" guid="0x7" maxconn="65536" gdr="0"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>
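
To make a dump like the one above easier to scan, the NICs can be pulled out with a few lines of Python. This is a minimal sketch: `list_nics` is a hypothetical helper, `SAMPLE_TOPO` is a heavily trimmed excerpt of the dump (most attributes and devices omitted), and reading `speed` as Mb/s is an assumption based on 200000 lining up with the NET[25.0] entries in the debug log.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the topology dump: one NIC per NUMA node, most attributes dropped.
SAMPLE_TOPO = """
<system version="1">
  <cpu numaid="0">
    <pci busid="0000:06:00.0">
      <nic>
        <net name="eth1" dev="1" speed="200000"/>
      </nic>
    </pci>
  </cpu>
  <cpu numaid="1">
    <pci busid="0000:86:00.0">
      <nic>
        <net name="eth5" dev="5" speed="200000"/>
      </nic>
    </pci>
  </cpu>
</system>
"""


def list_nics(topo_xml):
    """Return (numa_id, name, dev, speed) for every <net> element in the dump."""
    root = ET.fromstring(topo_xml)
    nics = []
    for cpu in root.iter("cpu"):
        numa = int(cpu.get("numaid"))
        for net in cpu.iter("net"):  # searches all nesting levels under this cpu
            nics.append((numa, net.get("name"),
                         int(net.get("dev")), int(net.get("speed"))))
    return nics


for numa, name, dev, speed in list_nics(SAMPLE_TOPO):
    print(f"numa={numa} dev={dev} name={name} speed={speed}")
```

Running the same function on the full topo.xml should list every interface NCCL detected, grouped by the NUMA node it hangs off.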
graph
<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="2" speedintra="20" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="0">
    <channel>
      <net dev="0x1"/>
      <gpu dev="0x2"/>
      <gpu dev="0x1"/>
      <gpu dev="0"/>
      <gpu dev="0x7"/>
      <gpu dev="0x6"/>
      <gpu dev="0x5"/>
      <gpu dev="0x4"/>
      <gpu dev="0x3"/>
      <net dev="0x1"/>
    </channel>
    <channel>
      <net dev="0x6"/>
      <gpu dev="0x6"/>
      <gpu dev="0x5"/>
      <gpu dev="0x4"/>
      <gpu dev="0x3"/>
      <gpu dev="0x2"/>
      <gpu dev="0x1"/>
      <gpu dev="0"/>
      <gpu dev="0x7"/>
      <net dev="0x6"/>
    </channel>
  </graph>
  <graph id="1" pattern="1" crossnic="0" nchannels="2" speedintra="40" speedinter="20" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="0">
    <channel>
      <net dev="0x1"/>
      <gpu dev="0x2"/>
      <gpu dev="0x3"/>
      <gpu dev="0x4"/>
      <gpu dev="0x5"/>
      <gpu dev="0x6"/>
      <gpu dev="0x7"/>
      <gpu dev="0"/>
      <gpu dev="0x1"/>
      <net dev="0x1"/>
    </channel>
    <channel>
      <net dev="0x6"/>
      <gpu dev="0x6"/>
      <gpu dev="0x7"/>
      <gpu dev="0"/>
      <gpu dev="0x1"/>
      <gpu dev="0x2"/>
      <gpu dev="0x3"/>
      <gpu dev="0x4"/>
      <gpu dev="0x5"/>
      <net dev="0x6"/>
    </channel>
  </graph>
  <graph id="2" pattern="3" crossnic="0" nchannels="0" speedintra="0" speedinter="0" latencyinter="0" typeintra="LOC" typeinter="LOC" samechannels="0"/>
  <graph id="3" pattern="5" crossnic="0" nchannels="7" speedintra="3" speedinter="3" latencyinter="0" typeintra="NVL" typeinter="PHB" samechannels="0">
    <channel>
      <net dev="0x1"/>
      <gpu dev="0"/>
      <net dev="0x1"/>
    </channel>
    <channel>
      <net dev="0x3"/>
      <gpu dev="0x1"/>
      <net dev="0x3"/>
    </channel>
    <channel>
      <net dev="0x2"/>
      <gpu dev="0x2"/>
      <net dev="0x2"/>
    </channel>
    <channel>
      <net dev="0x4"/>
      <gpu dev="0x3"/>
      <net dev="0x4"/>
    </channel>
    <channel>
      <net dev="0x6"/>
      <gpu dev="0x4"/>
      <net dev="0x6"/>
    </channel>
    <channel>
      <net dev="0x7"/>
      <gpu dev="0x5"/>
      <net dev="0x7"/>
    </channel>
    <channel>
      <net dev="0x5"/>
      <gpu dev="0x6"/>
      <net dev="0x5"/>
    </channel>
  </graph>
</graphs>
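
A small script can summarize which net devices each graph in a dump like this references, which is a quicker way to compare runs than reading the XML by eye. This is a sketch under assumptions: `nets_per_graph` is a hypothetical helper, `SAMPLE_GRAPH` is a trimmed version of graph 0 above with the gpu entries left out, and `int(value, 0)` is used so that both the plain ("1") and hex ("0x1") dev spellings seen in the two dumps parse.

```python
import xml.etree.ElementTree as ET

# Trimmed version of graph id 0 from the dump above (gpu entries omitted).
SAMPLE_GRAPH = """
<graphs version="1">
  <graph id="0" pattern="4" nchannels="2">
    <channel>
      <net dev="0x1"/>
      <net dev="0x1"/>
    </channel>
    <channel>
      <net dev="0x6"/>
      <net dev="0x6"/>
    </channel>
  </graph>
</graphs>
"""


def nets_per_graph(graph_xml):
    """Map each graph id to the sorted list of distinct net dev indices it uses."""
    root = ET.fromstring(graph_xml)
    used = {}
    for graph in root.iter("graph"):
        # Base 0 lets int() accept both "3" and "0x3".
        devs = {int(net.get("dev"), 0) for net in graph.iter("net")}
        used[int(graph.get("id"))] = sorted(devs)
    return used


print(nets_per_graph(SAMPLE_GRAPH))
# prints: {0: [1, 6]}
```

In the full dump above, graphs 0 and 1 each list only net devs 0x1 and 0x6, while graph 3 lists devs 0x1 through 0x7.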
debug logs
nathan-h100-1:13878:13878 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7
nathan-h100-1:13878:13878 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7
nathan-h100-1:13878:13878 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.8<0>
nathan-h100-1:13878:13878 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
nathan-h100-1:13878:13878 [0] NCCL INFO NET/Plugin: Loaded net plugin FastSocket (v6)
nathan-h100-1:13878:13878 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
nathan-h100-1:13878:13878 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
nathan-h100-1:13878:13878 [0] NCCL INFO cudaDriverVersion 12040
nathan-h100-1:13878:13878 [0] NCCL INFO NCCL version 2.21.5+cuda12.4
nathan-h100-1:13878:13878 [0] NCCL INFO init.cc:1785 Cuda Host Alloc Size 4 pointer 0x7f149de00000
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : Tx CPU start: -2
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : Rx CPU start: -2
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : Flow placement enabled.
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : queue skip: 0
nathan-h100-1:13878:13953 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7
nathan-h100-1:13878:13953 [0] NCCL INFO NCCL_SOCKET_IFNAME set to eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : Using [0]eth0:10.0.0.8<0> [1]eth1:10.0.1.4<0> [2]eth2:10.0.2.4<0> [3]eth3:10.0.3.2<0> [4]eth4:10.0.5.2<0> [5]eth5:10.0.6.2<0> [6]eth6:10.0.7.2<0> [7]eth7:10.0.8.2<0>
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket plugin initialized
nathan-h100-1:13878:13953 [0] NCCL INFO Using non-device net plugin version 0
nathan-h100-1:13878:13953 [0] NCCL INFO Using network FastSocket
nathan-h100-1:13878:13953 [0] NCCL INFO ncclCommInitRank comm 0x5621491aec20 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 4000 commId 0x2ff060fc93260544 - Init START
nathan-h100-1:13878:13953 [0] NCCL INFO MNNVL busId 0x4000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Could not find real path of /sys/class/pci_bus/fffffff/../../fffffff:ff:f
nathan-h100-1:13878:13953 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_speed, ignoring
nathan-h100-1:13878:13953 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_speed, ignoring
nathan-h100-1:13878:13953 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/max_link_width, ignoring
nathan-h100-1:13878:13953 [0] NCCL INFO Topology detection : could not read /sys/devices/pci0000:00/0000:00:0c.0/../max_link_width, ignoring
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 0 'eth0'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 1 'eth1'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 2 'eth2'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 3 'eth3'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 4 'eth4'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 5 'eth5'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 6 'eth6'
nathan-h100-1:13878:13953 [0] NCCL INFO NET/FastSocket : GPU Direct RDMA Disabled for HCA 7 'eth7'
nathan-h100-1:13878:13953 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to topo.xml
nathan-h100-1:13878:13953 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
nathan-h100-1:13878:13953 [0] NCCL INFO === System : maxBw 24.0 totalBw 370.8 ===
nathan-h100-1:13878:13953 [0] NCCL INFO CPU/0-0 (1/1/2)
nathan-h100-1:13878:13953 [0] NCCL INFO + PCI[24.0] - PCI/0-2000 (10b5879610b58796)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-4000 (0)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-5000 (1)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-6000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/1 (1/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-7000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/2 (2/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO + PCI[24.0] - PCI/0-9000 (10b5879610b58796)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-b000 (2)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-c000 (3)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-d000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/3 (3/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-e000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/4 (4/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO + PCI[12.0] - NIC/0-c0
nathan-h100-1:13878:13953 [0] NCCL INFO               + NET[25.0] - NET/0 (0/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO + SYS[10.0] - CPU/1
nathan-h100-1:13878:13953 [0] NCCL INFO CPU/0-1 (1/1/2)
nathan-h100-1:13878:13953 [0] NCCL INFO + PCI[24.0] - PCI/0-82000 (10b5879610b58796)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-84000 (4)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-85000 (5)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-86000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/5 (5/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-87000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/6 (6/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO + PCI[24.0] - PCI/0-89000 (10b5879610b58796)
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-8b000 (6)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - GPU/0-8c000 (7)
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NVL[370.8] - NVS/0
nathan-h100-1:13878:13953 [0] NCCL INFO               + PCI[24.0] - NIC/0-8d000
nathan-h100-1:13878:13953 [0] NCCL INFO                             + NET[25.0] - NET/7 (7/0/25.000000)
nathan-h100-1:13878:13953 [0] NCCL INFO + SYS[10.0] - CPU/0
nathan-h100-1:13878:13953 [0] NCCL INFO ==========================================
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/4000 :GPU/0-4000 (0/5000.0/LOC) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (2/24.0/PHB) CPU/0-1 (3/10.0/SYS) NET/0-0 (4/12.0/PHB) NET/0-1 (5/24.0/PHB) NET/0-2 (5/24.0/PHB) NET/0-3 (5/24.0/PHB) NET/0-4 (5/24.0/PHB) NET/0-5 (6/10.0/SYS) NET/0-6 (6/10.0/SYS) NET/0-7 (6/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/5000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (0/5000.0/LOC) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (2/24.0/PHB) CPU/0-1 (3/10.0/SYS) NET/0-0 (4/12.0/PHB) NET/0-1 (5/24.0/PHB) NET/0-2 (5/24.0/PHB) NET/0-3 (5/24.0/PHB) NET/0-4 (5/24.0/PHB) NET/0-5 (6/10.0/SYS) NET/0-6 (6/10.0/SYS) NET/0-7 (6/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/B000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (0/5000.0/LOC) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (2/24.0/PHB) CPU/0-1 (3/10.0/SYS) NET/0-0 (4/12.0/PHB) NET/0-1 (5/24.0/PHB) NET/0-2 (5/24.0/PHB) NET/0-3 (5/24.0/PHB) NET/0-4 (5/24.0/PHB) NET/0-5 (6/10.0/SYS) NET/0-6 (6/10.0/SYS) NET/0-7 (6/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/C000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (0/5000.0/LOC) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (2/24.0/PHB) CPU/0-1 (3/10.0/SYS) NET/0-0 (4/12.0/PHB) NET/0-1 (5/24.0/PHB) NET/0-2 (5/24.0/PHB) NET/0-3 (5/24.0/PHB) NET/0-4 (5/24.0/PHB) NET/0-5 (6/10.0/SYS) NET/0-6 (6/10.0/SYS) NET/0-7 (6/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/84000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (0/5000.0/LOC) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (3/10.0/SYS) CPU/0-1 (2/24.0/PHB) NET/0-0 (5/10.0/SYS) NET/0-1 (6/10.0/SYS) NET/0-2 (6/10.0/SYS) NET/0-3 (6/10.0/SYS) NET/0-4 (6/10.0/SYS) NET/0-5 (5/24.0/PHB) NET/0-6 (5/24.0/PHB) NET/0-7 (5/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/85000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (0/5000.0/LOC) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (3/10.0/SYS) CPU/0-1 (2/24.0/PHB) NET/0-0 (5/10.0/SYS) NET/0-1 (6/10.0/SYS) NET/0-2 (6/10.0/SYS) NET/0-3 (6/10.0/SYS) NET/0-4 (6/10.0/SYS) NET/0-5 (5/24.0/PHB) NET/0-6 (5/24.0/PHB) NET/0-7 (5/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/8B000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (0/5000.0/LOC) GPU/0-8c000 (2/370.8/NVL) NVS/0-0 (1/370.8/NVL) CPU/0-0 (3/10.0/SYS) CPU/0-1 (2/24.0/PHB) NET/0-0 (5/10.0/SYS) NET/0-1 (6/10.0/SYS) NET/0-2 (6/10.0/SYS) NET/0-3 (6/10.0/SYS) NET/0-4 (6/10.0/SYS) NET/0-5 (5/24.0/PHB) NET/0-6 (5/24.0/PHB) NET/0-7 (5/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO GPU/8C000 :GPU/0-4000 (2/370.8/NVL) GPU/0-5000 (2/370.8/NVL) GPU/0-b000 (2/370.8/NVL) GPU/0-c000 (2/370.8/NVL) GPU/0-84000 (2/370.8/NVL) GPU/0-85000 (2/370.8/NVL) GPU/0-8b000 (2/370.8/NVL) GPU/0-8c000 (0/5000.0/LOC) NVS/0-0 (1/370.8/NVL) CPU/0-0 (3/10.0/SYS) CPU/0-1 (2/24.0/PHB) NET/0-0 (5/10.0/SYS) NET/0-1 (6/10.0/SYS) NET/0-2 (6/10.0/SYS) NET/0-3 (6/10.0/SYS) NET/0-4 (6/10.0/SYS) NET/0-5 (5/24.0/PHB) NET/0-6 (5/24.0/PHB) NET/0-7 (5/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/0 :GPU/0-4000 (4/12.0/PHB) GPU/0-5000 (4/12.0/PHB) GPU/0-b000 (4/12.0/PHB) GPU/0-c000 (4/12.0/PHB) GPU/0-84000 (5/10.0/SYS) GPU/0-85000 (5/10.0/SYS) GPU/0-8b000 (5/10.0/SYS) GPU/0-8c000 (5/10.0/SYS) CPU/0-0 (2/12.0/PHB) CPU/0-1 (3/10.0/SYS) NET/0-0 (0/5000.0/LOC) NET/0-1 (5/12.0/PHB) NET/0-2 (5/12.0/PHB) NET/0-3 (5/12.0/PHB) NET/0-4 (5/12.0/PHB) NET/0-5 (6/10.0/SYS) NET/0-6 (6/10.0/SYS) NET/0-7 (6/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/1 :GPU/0-4000 (5/24.0/PHB) GPU/0-5000 (5/24.0/PHB) GPU/0-b000 (5/24.0/PHB) GPU/0-c000 (5/24.0/PHB) GPU/0-84000 (6/10.0/SYS) GPU/0-85000 (6/10.0/SYS) GPU/0-8b000 (6/10.0/SYS) GPU/0-8c000 (6/10.0/SYS) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) NET/0-0 (5/12.0/PHB) NET/0-1 (0/5000.0/LOC) NET/0-2 (4/24.0/PIX) NET/0-3 (6/24.0/PHB) NET/0-4 (6/24.0/PHB) NET/0-5 (7/10.0/SYS) NET/0-6 (7/10.0/SYS) NET/0-7 (7/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/2 :GPU/0-4000 (5/24.0/PHB) GPU/0-5000 (5/24.0/PHB) GPU/0-b000 (5/24.0/PHB) GPU/0-c000 (5/24.0/PHB) GPU/0-84000 (6/10.0/SYS) GPU/0-85000 (6/10.0/SYS) GPU/0-8b000 (6/10.0/SYS) GPU/0-8c000 (6/10.0/SYS) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) NET/0-0 (5/12.0/PHB) NET/0-1 (4/24.0/PIX) NET/0-2 (0/5000.0/LOC) NET/0-3 (6/24.0/PHB) NET/0-4 (6/24.0/PHB) NET/0-5 (7/10.0/SYS) NET/0-6 (7/10.0/SYS) NET/0-7 (7/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/3 :GPU/0-4000 (5/24.0/PHB) GPU/0-5000 (5/24.0/PHB) GPU/0-b000 (5/24.0/PHB) GPU/0-c000 (5/24.0/PHB) GPU/0-84000 (6/10.0/SYS) GPU/0-85000 (6/10.0/SYS) GPU/0-8b000 (6/10.0/SYS) GPU/0-8c000 (6/10.0/SYS) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) NET/0-0 (5/12.0/PHB) NET/0-1 (6/24.0/PHB) NET/0-2 (6/24.0/PHB) NET/0-3 (0/5000.0/LOC) NET/0-4 (4/24.0/PIX) NET/0-5 (7/10.0/SYS) NET/0-6 (7/10.0/SYS) NET/0-7 (7/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/4 :GPU/0-4000 (5/24.0/PHB) GPU/0-5000 (5/24.0/PHB) GPU/0-b000 (5/24.0/PHB) GPU/0-c000 (5/24.0/PHB) GPU/0-84000 (6/10.0/SYS) GPU/0-85000 (6/10.0/SYS) GPU/0-8b000 (6/10.0/SYS) GPU/0-8c000 (6/10.0/SYS) CPU/0-0 (3/24.0/PHB) CPU/0-1 (4/10.0/SYS) NET/0-0 (5/12.0/PHB) NET/0-1 (6/24.0/PHB) NET/0-2 (6/24.0/PHB) NET/0-3 (4/24.0/PIX) NET/0-4 (0/5000.0/LOC) NET/0-5 (7/10.0/SYS) NET/0-6 (7/10.0/SYS) NET/0-7 (7/10.0/SYS)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/5 :GPU/0-4000 (6/10.0/SYS) GPU/0-5000 (6/10.0/SYS) GPU/0-b000 (6/10.0/SYS) GPU/0-c000 (6/10.0/SYS) GPU/0-84000 (5/24.0/PHB) GPU/0-85000 (5/24.0/PHB) GPU/0-8b000 (5/24.0/PHB) GPU/0-8c000 (5/24.0/PHB) CPU/0-0 (4/10.0/SYS) CPU/0-1 (3/24.0/PHB) NET/0-0 (6/10.0/SYS) NET/0-1 (7/10.0/SYS) NET/0-2 (7/10.0/SYS) NET/0-3 (7/10.0/SYS) NET/0-4 (7/10.0/SYS) NET/0-5 (0/5000.0/LOC) NET/0-6 (4/24.0/PIX) NET/0-7 (6/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/6 :GPU/0-4000 (6/10.0/SYS) GPU/0-5000 (6/10.0/SYS) GPU/0-b000 (6/10.0/SYS) GPU/0-c000 (6/10.0/SYS) GPU/0-84000 (5/24.0/PHB) GPU/0-85000 (5/24.0/PHB) GPU/0-8b000 (5/24.0/PHB) GPU/0-8c000 (5/24.0/PHB) CPU/0-0 (4/10.0/SYS) CPU/0-1 (3/24.0/PHB) NET/0-0 (6/10.0/SYS) NET/0-1 (7/10.0/SYS) NET/0-2 (7/10.0/SYS) NET/0-3 (7/10.0/SYS) NET/0-4 (7/10.0/SYS) NET/0-5 (4/24.0/PIX) NET/0-6 (0/5000.0/LOC) NET/0-7 (6/24.0/PHB)
nathan-h100-1:13878:13953 [0] NCCL INFO NET/7 :GPU/0-4000 (6/10.0/SYS) GPU/0-5000 (6/10.0/SYS) GPU/0-b000 (6/10.0/SYS) GPU/0-c000 (6/10.0/SYS) GPU/0-84000 (5/24.0/PHB) GPU/0-85000 (5/24.0/PHB) GPU/0-8b000 (5/24.0/PHB) GPU/0-8c000 (5/24.0/PHB) CPU/0-0 (4/10.0/SYS) CPU/0-1 (3/24.0/PHB) NET/0-0 (6/10.0/SYS) NET/0-1 (7/10.0/SYS) NET/0-2 (7/10.0/SYS) NET/0-3 (7/10.0/SYS) NET/0-4 (7/10.0/SYS) NET/0-5 (6/24.0/PHB) NET/0-6 (6/24.0/PHB) NET/0-7 (0/5000.0/LOC)
nathan-h100-1:13878:13953 [0] NCCL INFO Setting affinity for GPU 0 to 0fffffff,ffffff00,00000000,000fffff,ffffffff
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS multicast support is available on dev 0
nathan-h100-1:13878:13953 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 20.000000/20.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:13878:13953 [0] NCCL INFO  0 : NET/1 GPU/2 GPU/1 GPU/0 GPU/7 GPU/6 GPU/5 GPU/4 GPU/3 NET/1
nathan-h100-1:13878:13953 [0] NCCL INFO  1 : NET/6 GPU/6 GPU/5 GPU/4 GPU/3 GPU/2 GPU/1 GPU/0 GPU/7 NET/6
nathan-h100-1:13878:13953 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 2, bw 40.000000/20.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:13878:13953 [0] NCCL INFO  0 : NET/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7 GPU/0 GPU/1 NET/1
nathan-h100-1:13878:13953 [0] NCCL INFO  1 : NET/6 GPU/6 GPU/7 GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 NET/6
nathan-h100-1:13878:13953 [0] NCCL INFO Pattern 5, crossNic 0, nChannels 7, bw 3.000000/3.000000, type NVL/PHB, sameChannels 0
nathan-h100-1:13878:13953 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/1
nathan-h100-1:13878:13953 [0] NCCL INFO  1 : NET/3 GPU/1 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/3
nathan-h100-1:13878:13953 [0] NCCL INFO  2 : NET/2 GPU/2 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/2
nathan-h100-1:13878:13953 [0] NCCL INFO  3 : NET/4 GPU/3 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/4
nathan-h100-1:13878:13953 [0] NCCL INFO  4 : NET/6 GPU/4 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/6
nathan-h100-1:13878:13953 [0] NCCL INFO  5 : NET/7 GPU/5 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/7
nathan-h100-1:13878:13953 [0] NCCL INFO  6 : NET/5 GPU/6 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 GPU/0 NET/5
nathan-h100-1:13878:13953 [0] NCCL INFO NCCL_GRAPH_DUMP_FILE set by environment to graph.xml
nathan-h100-1:13878:13953 [0] NCCL INFO comm 0x5621491aec20 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  0:  0  8
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  1:  1  9
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  2:  2 10
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  3:  3 11
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  4:  4 12
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  5:  5 13
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Head  6:  6 14
nathan-h100-1:13878:13953 [0] NCCL INFO NVLS Trees : 17/8/-1->0->-1 17/-1/-1->0->8
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 00/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 01/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 02/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 03/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 04/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 05/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 06/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 07/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 08/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 09/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 10/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 11/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 12/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 13/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 14/16 :    0   7   6   5   4   3  10   9   8  15  14  13  12  11   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Channel 15/16 :    0   7  14  13  12  11  10   9   8  15   6   5   4   3   2   1
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 00 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 01 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 02 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 03 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 04 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 05 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 06 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 07 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 08 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 09 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 10 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 11 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 12 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 13 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 14 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Ring 15 : 1 -> 0 -> 7
nathan-h100-1:13878:13953 [0] NCCL INFO Trees [0] 1/-1/-1->0->7 [1] 1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->7 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7 [8] 1/-1/-1->0->7 [9] 1/-1/-1->0->7 [10] 1/-1/-1->0->7 [11] 1/-1/-1->0->7 [12] 1/-1/-1->0->7 [13] 1/-1/-1->0->7 [14] 1/-1/-1->0->7 [15] 1/-1/-1->0->7
nathan-h100-1:13878:13953 [0] NCCL INFO P2P Chunksize set to 131072
nathan-h100-1:13878:13953 [0] NCCL INFO UDS: Creating service thread comm 0x5621491aec20 rank 0

...

nathan-h100-1:13878:13975 [0] NCCL INFO transport/net.cc:894 Cuda Host Alloc Size 9641984 pointer 0x7f1400a00000
nathan-h100-1:13878:13975 [0] NCCL INFO proxyProgressAsync opId=0x7f146e98fbd0 op.type=4 op.reqBuff=0x7f14b815f580 op.respSize=21040 done
nathan-h100-1:13878:13953 [0] NCCL INFO ncclPollProxyResponse Received new opId=0x7f146e98fbd0
nathan-h100-1:13878:13975 [0] NCCL INFO Received and initiated operation=Connect res=0
nathan-h100-1:13878:13953 [0] NCCL INFO resp.opId=0x7f146e98fbd0 matches expected opId=0x7f146e98fbd0
nathan-h100-1:13878:13953 [0] NCCL INFO recvConnect ncclPollProxyResponse opId=0x7f146e98fbd0
nathan-h100-1:13878:13953 [0] NCCL INFO Connected NVLS tree
nathan-h100-1:13878:13953 [0] NCCL INFO NCCL_ALGO set by environment to ring
nathan-h100-1:13878:13953 [0] NCCL INFO   Algorithm   |                            Tree                  |                            Ring                  |                   CollNetDirect                  |
nathan-h100-1:13878:13953 [0] NCCL INFO   Protocol    |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |
nathan-h100-1:13878:13953 [0] NCCL INFO  Max NThreads |            512 |            640 |            512 |            512 |            640 |            512 |              0 |              0 |            640 |
nathan-h100-1:13878:13953 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    49.8/  20.0 |    78.0/   0.0 |   456.4/  40.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    49.8/  20.0 |    78.0/   0.0 |   456.4/  40.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    23.3/  21.3 |    44.6/   0.0 |    70.0/  42.7 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    23.3/  21.3 |    44.6/   0.0 |    70.0/  42.7 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO     AllReduce |    25.2/   0.0 |    48.5/   0.0 |   448.0/   0.0 |    43.4/  10.7 |    79.4/   0.0 |   152.8/  21.3 |     5.6/   0.0 |     5.6/   0.0 |    44.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO   Algorithm   |                    CollNetChain                  |                            NVLS                  |                        NVLSTree                  |
nathan-h100-1:13878:13953 [0] NCCL INFO   Protocol    |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |             LL |          LL128 |         Simple |
nathan-h100-1:13878:13953 [0] NCCL INFO  Max NThreads |              0 |              0 |            640 |              0 |              0 |            640 |              0 |              0 |            640 |
nathan-h100-1:13878:13953 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    43.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    43.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO     AllReduce |     0.0/   0.0 |     0.0/   0.0 |    69.2/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    43.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    53.0/   0.0 |
nathan-h100-1:13878:13953 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
nathan-h100-1:13878:13953 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer

...

nathan-h100-1:13878:14154 [0] NCCL INFO NVLS Unbind MC handle 7f146ea29290 size 1073741824 dev 0
nathan-h100-1:13878:14154 [0] NCCL INFO NVLS Unmap mem UC handle 0x7f146ea29ab0(0xa20000000) MC handle 0x7f146ea29290(0xa60000000)
nathan-h100-1:13878:14154 [0] NCCL INFO comm 0x5621491aec20 rank 0 nranks 16 cudaDev 0 busId 4000 - Abort COMPLETE
</details>

<details>
<summary>ip a</summary>

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:00:08 brd ff:ff:ff:ff:ff:ff
altname enp0s12
inet 10.0.0.8/32 scope global dynamic noprefixroute eth0
valid_lft 83262sec preferred_lft 83262sec
inet6 fe80::11ff:bc57:5e67:302/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:01:04 brd ff:ff:ff:ff:ff:ff
altname enp6s0f0
inet 10.0.1.4/32 scope global dynamic noprefixroute eth1
valid_lft 83262sec preferred_lft 83262sec
inet6 fe80::d90:4dbb:a718:e2a7/64 scope link noprefixroute
valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:02:04 brd ff:ff:ff:ff:ff:ff
altname enp7s0f0
inet 10.0.2.4/32 scope global dynamic noprefixroute eth2
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::ebc6:71b6:e1fd:b705/64 scope link noprefixroute
valid_lft forever preferred_lft forever
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:03:02 brd ff:ff:ff:ff:ff:ff
altname enp13s0f0
inet 10.0.3.2/32 scope global dynamic noprefixroute eth3
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::350b:9cf9:5194:ffe/64 scope link noprefixroute
valid_lft forever preferred_lft forever
6: eth4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:05:02 brd ff:ff:ff:ff:ff:ff
altname enp14s0f0
inet 10.0.5.2/32 scope global dynamic noprefixroute eth4
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::fbaa:6101:1034:a9a2/64 scope link noprefixroute
valid_lft forever preferred_lft forever
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:06:02 brd ff:ff:ff:ff:ff:ff
altname enp134s0f0
inet 10.0.6.2/32 scope global dynamic noprefixroute eth5
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::dc81:2515:4b79:1148/64 scope link noprefixroute
valid_lft forever preferred_lft forever
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:07:02 brd ff:ff:ff:ff:ff:ff
altname enp135s0f0
inet 10.0.7.2/32 scope global dynamic noprefixroute eth6
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::5ef2:9dc6:4ecf:1f7e/64 scope link noprefixroute
valid_lft forever preferred_lft forever
9: eth7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8896 qdisc mq state UP group default qlen 1000
link/ether 42:01:0a:00:08:02 brd ff:ff:ff:ff:ff:ff
altname enp141s0f0
inet 10.0.8.2/32 scope global dynamic noprefixroute eth7
valid_lft 83265sec preferred_lft 83265sec
inet6 fe80::938d:2a15:a64e:dbbe/64 scope link noprefixroute
valid_lft forever preferred_lft forever
10: modalsvc0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether da:68:5b:d3:f1:ff brd ff:ff:ff:ff:ff:ff
inet 172.21.0.1/24 scope global modalsvc0
valid_lft forever preferred_lft forever
11: tailscale0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1280 qdisc fq_codel state UNKNOWN group default qlen 500
link/none
inet6 fe80::783f:6c53:752e:f806/64 scope link stable-privacy proto kernel_ll
valid_lft forever preferred_lft forever

</details>

@GeofferyGeng

Please check your PCI and GDR settings, and make sure NCCL detects the correct system topology. When GDR is disabled, the graph is built in a completely wrong way. If NCCL thinks the bandwidth of a required path is already used up, the other NICs will go unused.
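
One way to check the GDR setting is to read it out of the topology dump NCCL writes (for example via `NCCL_TOPO_DUMP_FILE`). The sketch below is only an illustration: `list_nics` is a hypothetical helper name, and `SAMPLE` is a trimmed excerpt of the topo file posted in this issue, so the attribute set is an assumption about what your own dump contains.

```python
import xml.etree.ElementTree as ET


def list_nics(topo_xml: str):
    """Return one dict per <net> element found in an NCCL topology dump."""
    root = ET.fromstring(topo_xml)
    nics = []
    # Walk every <pci> node so each <net> is reported with its PCI bus ID.
    for pci in root.iter("pci"):
        for net in pci.findall("./nic/net"):
            nics.append({
                "name": net.get("name"),
                "busid": pci.get("busid"),
                "speed": int(net.get("speed", "0")),
                "gdr": net.get("gdr") == "1",  # "1" means GDR is enabled
            })
    return nics


# Trimmed excerpt of the topo file from this issue (attributes shortened).
SAMPLE = """<system version="1">
  <cpu numaid="0">
    <pci busid="0000:00:0c.0" class="0x020000">
      <nic>
        <net name="enp0s12" dev="0" speed="200000" port="0" gdr="0"/>
      </nic>
    </pci>
  </cpu>
</system>"""

if __name__ == "__main__":
    for nic in list_nics(SAMPLE):
        flag = "GDR on" if nic["gdr"] else "GDR off"
        print(f'{nic["name"]} @ {nic["busid"]}: speed={nic["speed"]}, {flag}')
```

To run it against a real dump, pass the file contents to `list_nics` instead of `SAMPLE`. Any NIC reported with GDR off is worth a closer look before reading too much into the graph search output.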
