When the number of nodes increases, the bandwidth performance of alltoall is unstable #1531

Open

fj1425fj opened this issue Dec 5, 2024 · 0 comments

fj1425fj commented Dec 5, 2024

Hi, when I ran the alltoall test I found that the bandwidth with 8 nodes was unstable, and the effect became more pronounced as the number of nodes increased. Is this normal?
When I used 8 nodes, there was a bandwidth drop around the 1 GB data size:

# nThread 1 nGpus 1 minBytes 8388608 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2324954 on GPU-NODE09 device  0 [0x0a] NVIDIA H800
#  Rank  1 Group  0 Pid 2324955 on GPU-NODE09 device  1 [0x18] NVIDIA H800
#  Rank  2 Group  0 Pid 2324958 on GPU-NODE09 device  2 [0x3b] NVIDIA H800
......
#  Rank 61 Group  0 Pid 2278519 on GPU-NODE16 device  5 [0x90] NVIDIA H800
#  Rank 62 Group  0 Pid 2278520 on GPU-NODE16 device  6 [0xb8] NVIDIA H800
#  Rank 63 Group  0 Pid 2278521 on GPU-NODE16 device  7 [0xc1] NVIDIA H800
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608         32768     float    none      -1    337.8   24.83   24.44      0    295.6   28.38   27.94    N/A
    16777216         65536     float    none      -1    502.8   33.37   32.85      0    495.0   33.89   33.36    N/A
    33554432        131072     float    none      -1    971.0   34.56   34.02      0    957.8   35.03   34.49    N/A
    67108864        262144     float    none      -1   1717.2   39.08   38.47      0   1682.1   39.89   39.27    N/A
   134217728        524288     float    none      -1   3256.9   41.21   40.57      0   3222.5   41.65   41.00    N/A
   268435456       1048576     float    none      -1   6362.4   42.19   41.53      0   6288.4   42.69   42.02    N/A
   536870912       2097152     float    none      -1    12843   41.80   41.15      0    12683   42.33   41.67    N/A
  1073741824       4194304     float    none      -1    25732   41.73   41.08      0    26580   40.40   39.77    N/A
  2147483648       8388608     float    none      -1    52985   40.53   39.90      0    48832   43.98   43.29    N/A
  4294967296      16777216     float    none      -1    85928   49.98   49.20      0    83922   51.18   50.38    N/A
  8589934592      33554432     float    none      -1   158953   54.04   53.20      0   157000   54.71   53.86    N/A
 17179869184      67108864     float    none      -1   313862   54.74   53.88      0   315642   54.43   53.58    N/A
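
For reference, in this output algbw is the message size divided by the measured time, and busbw appears to be algbw scaled by (nranks − 1) / nranks for alltoall, which the numbers above confirm: for the first out-of-place row, 8388608 B / 337.8 µs ≈ 24.83 GB/s algbw, and 24.83 × 63/64 ≈ 24.44 GB/s busbw. The drop around 1–2 GB is therefore the same in both columns; only the measured time changes.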

When I use 16 nodes, the bandwidth drop is even more pronounced:

# nThread 1 nGpus 1 minBytes 8388608 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2329420 on GPU-NODE09 device  0 [0x0a] NVIDIA H800
#  Rank  1 Group  0 Pid 2329421 on GPU-NODE09 device  1 [0x18] NVIDIA H800
#  Rank  2 Group  0 Pid 2329422 on GPU-NODE09 device  2 [0x3b] NVIDIA H800
......
#  Rank 125 Group  0 Pid 1489787 on GPU-NODE24 device  5 [0x90] NVIDIA H800
#  Rank 126 Group  0 Pid 1489788 on GPU-NODE24 device  6 [0xb8] NVIDIA H800
#  Rank 127 Group  0 Pid 1489789 on GPU-NODE24 device  7 [0xc1] NVIDIA H800
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608         16384     float    none      -1   1001.7    8.37    8.31      0    486.4   17.25   17.11    N/A
    16777216         32768     float    none      -1    583.0   28.78   28.55      0    585.6   28.65   28.43    N/A
    33554432         65536     float    none      -1   1009.7   33.23   32.97      0   1014.3   33.08   32.82    N/A
    67108864        131072     float    none      -1   1861.1   36.06   35.78      0   1852.7   36.22   35.94    N/A
   134217728        262144     float    none      -1   2940.4   45.65   45.29      0   2897.7   46.32   45.96    N/A
   268435456        524288     float    none      -1   5509.6   48.72   48.34      0   5490.2   48.89   48.51    N/A
   536870912       1048576     float    none      -1    10676   50.29   49.89      0    10645   50.43   50.04    N/A
  1073741824       2097152     float    none      -1    24074   44.60   44.25      0    32737   32.80   32.54    N/A
  2147483648       4194304     float    none      -1    59920   35.84   35.56      0    49861   43.07   42.73    N/A
  4294967296       8388608     float    none      -1    94033   45.68   45.32      0    94450   45.47   45.12    N/A
  8589934592      16777216     float    none      -1   177425   48.41   48.04      0   181963   47.21   46.84    N/A
 17179869184      33554432     float    none      -1   340233   50.49   50.10      0   339337   50.63   50.23    N/A

The test command is as follows:

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=1 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 8M -e 16G -f 2 -i 0 -g 1 -n 20 -w 5
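
To narrow in on the unstable point, the sweep can be restricted to a single size with more timed iterations. A minimal sketch under the same environment (the host list and debug file path are placeholders as above, and the iteration counts are illustrative):

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=1 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 1G -e 1G -g 1 -n 100 -w 10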

Other information:
Network configuration: 8 × 400 Gb/s
CUDA version: 12.4
Driver version: 550.54.15
Network type: RoCE
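
Assuming the 8 × 400 Gb/s links are per node (matching the eight HCAs listed in NCCL_IB_HCA), that is roughly 400 GB/s of injection bandwidth per node, or about 50 GB/s per GPU. At the largest sizes the ~54 GB/s busbw with 8 nodes corresponds to roughly 54 × 56/64 ≈ 47 GB/s of inter-node traffic per GPU (56 of every 64 peers are remote), so the plateau is already close to the per-GPU line rate; the 16-node peak works out similarly (≈ 50.5 × 120/128 ≈ 47 GB/s).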

Increasing the number of queue pairs per connection (NCCL_IB_QPS_PER_CONNECTION) alleviates the problem slightly.
I would like to know why this happens, and I look forward to your reply!
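
For reference, a sketch of the same launch with more queue pairs per connection (the value 4 here is illustrative, not the exact setting that was used):

mpirun --allow-run-as-root --host xxxx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=xxx -x NCCL_SOCKET_IFNAME=bond_inband -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x UCX_NET_DEVICES=bond_inband /home/nccl-test/build/alltoall_perf -b 8M -e 16G -f 2 -i 0 -g 1 -n 20 -w 5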
