When the number of nodes increases, the bandwidth performance of alltoall is unstable #1531

Open

fj1425fj opened this issue Dec 5, 2024 · 0 comments

fj1425fj commented Dec 5, 2024

Hi, when I ran the alltoall test I found that the bandwidth with 8 nodes was unstable, and the effect became more pronounced as the number of nodes increased. Is this normal?
When I used 8 nodes, there was a bandwidth drop around the 1 GB data size:

# nThread 1 nGpus 1 minBytes 8388608 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2324954 on GPU-NODE09 device  0 [0x0a] NVIDIA H800
#  Rank  1 Group  0 Pid 2324955 on GPU-NODE09 device  1 [0x18] NVIDIA H800
#  Rank  2 Group  0 Pid 2324958 on GPU-NODE09 device  2 [0x3b] NVIDIA H800
......
#  Rank 61 Group  0 Pid 2278519 on GPU-NODE16 device  5 [0x90] NVIDIA H800
#  Rank 62 Group  0 Pid 2278520 on GPU-NODE16 device  6 [0xb8] NVIDIA H800
#  Rank 63 Group  0 Pid 2278521 on GPU-NODE16 device  7 [0xc1] NVIDIA H800
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608         32768     float    none      -1    337.8   24.83   24.44      0    295.6   28.38   27.94    N/A
    16777216         65536     float    none      -1    502.8   33.37   32.85      0    495.0   33.89   33.36    N/A
    33554432        131072     float    none      -1    971.0   34.56   34.02      0    957.8   35.03   34.49    N/A
    67108864        262144     float    none      -1   1717.2   39.08   38.47      0   1682.1   39.89   39.27    N/A
   134217728        524288     float    none      -1   3256.9   41.21   40.57      0   3222.5   41.65   41.00    N/A
   268435456       1048576     float    none      -1   6362.4   42.19   41.53      0   6288.4   42.69   42.02    N/A
   536870912       2097152     float    none      -1    12843   41.80   41.15      0    12683   42.33   41.67    N/A
  1073741824       4194304     float    none      -1    25732   41.73   41.08      0    26580   40.40   39.77    N/A
  2147483648       8388608     float    none      -1    52985   40.53   39.90      0    48832   43.98   43.29    N/A
  4294967296      16777216     float    none      -1    85928   49.98   49.20      0    83922   51.18   50.38    N/A
  8589934592      33554432     float    none      -1   158953   54.04   53.20      0   157000   54.71   53.86    N/A
 17179869184      67108864     float    none      -1   313862   54.74   53.88      0   315642   54.43   53.58    N/A
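
For reference, in this output algbw is the message size divided by the measured time, and busbw appears to be algbw scaled by (nranks − 1) / nranks for alltoall, which the numbers above confirm: for the first out-of-place row, 8388608 B / 337.8 µs ≈ 24.83 GB/s algbw, and 24.83 × 63/64 ≈ 24.44 GB/s busbw. The drop around 1–2 GB is therefore the same in both columns; only the measured time changes.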

When I use 16 nodes, the bandwidth drop is even more pronounced:

# nThread 1 nGpus 1 minBytes 8388608 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2329420 on GPU-NODE09 device  0 [0x0a] NVIDIA H800
#  Rank  1 Group  0 Pid 2329421 on GPU-NODE09 device  1 [0x18] NVIDIA H800
#  Rank  2 Group  0 Pid 2329422 on GPU-NODE09 device  2 [0x3b] NVIDIA H800
......
#  Rank 125 Group  0 Pid 1489787 on GPU-NODE24 device  5 [0x90] NVIDIA H800
#  Rank 126 Group  0 Pid 1489788 on GPU-NODE24 device  6 [0xb8] NVIDIA H800
#  Rank 127 Group  0 Pid 1489789 on GPU-NODE24 device  7 [0xc1] NVIDIA H800
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608         16384     float    none      -1   1001.7    8.37    8.31      0    486.4   17.25   17.11    N/A
    16777216         32768     float    none      -1    583.0   28.78   28.55      0    585.6   28.65   28.43    N/A
    33554432         65536     float    none      -1   1009.7   33.23   32.97      0   1014.3   33.08   32.82    N/A
    67108864        131072     float    none      -1   1861.1   36.06   35.78      0   1852.7   36.22   35.94    N/A
   134217728        262144     float    none      -1   2940.4   45.65   45.29      0   2897.7   46.32   45.96    N/A
   268435456        524288     float    none      -1   5509.6   48.72   48.34      0   5490.2   48.89   48.51    N/A
   536870912       1048576     float    none      -1    10676   50.29   49.89      0    10645   50.43   50.04    N/A
  1073741824       2097152     float    none      -1    24074   44.60   44.25      0    32737   32.80   32.54    N/A
  2147483648       4194304     float    none      -1    59920   35.84   35.56      0    49861   43.07   42.73    N/A
  4294967296       8388608     float    none      -1    94033   45.68   45.32      0    94450   45.47   45.12    N/A
  8589934592      16777216     float    none      -1   177425   48.41   48.04      0   181963   47.21   46.84    N/A
 17179869184      33554432     float    none      -1   340233   50.49   50.10      0   339337   50.63   50.23    N/A

The test command is as follows:

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=1 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 8M -e 16G -f 2 -i 0 -g 1 -n 20 -w 5
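
To narrow in on the unstable point, the sweep can be restricted to a single size with more timed iterations. A minimal sketch under the same environment (the host list and debug file path are placeholders as above, and the iteration counts are illustrative):

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=1 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 1G -e 1G -g 1 -n 100 -w 10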

Other information:
Network configuration: 8 × 400 Gb/s
CUDA version: 12.4
Driver version: 550.54.15
Network type: RoCE
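
Assuming the 8 × 400 Gb/s links are per node (matching the eight HCAs listed in NCCL_IB_HCA), that is roughly 400 GB/s of injection bandwidth per node, or about 50 GB/s per GPU. At the largest sizes the ~54 GB/s busbw with 8 nodes corresponds to roughly 54 × 56/64 ≈ 47 GB/s of inter-node traffic per GPU (56 of every 64 peers are remote), so the plateau is already close to the per-GPU line rate; the 16-node peak works out similarly (≈ 50.5 × 120/128 ≈ 47 GB/s).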

Increasing the number of queue pairs per connection (NCCL_IB_QPS_PER_CONNECTION) alleviates the problem slightly.
I would like to know why this happens, and I look forward to your reply!
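
For reference, a sketch of the same launch with more queue pairs per connection (the value 4 here is illustrative, not the exact setting that was used):

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 8M -e 16G -f 2 -i 0 -g 1 -n 20 -w 5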
