
NCCL hang issue #394

Closed

nachtsky1077 opened this issue Sep 28, 2020 · 15 comments


@nachtsky1077

Environment:

NCCL version 2.5.7 + CUDA 10.0
40 ranks with one GPU per node; each rank is a docker container

Observation:

NCCL hangs during the initialization process; rank 0 didn't finish initialization.

Attached are the pstack log captured during the hang and the NCCL log:
nccl.log
pstack.log

Any idea about what might cause the hang? Thanks!

@sjeaugey
Member

It seems we indeed have the logs of all ranks except rank 0, which only reports the NCCL version and nothing else. This is quite strange, and I'm not sure why that would be the case; at the very least we should see the beginning of the traces from ncclCommInitRank.

The only explanation I can see would be if mpirun were passing NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL properly to all ranks except rank 0, which is spawned locally. Now if you use mpirun -x ... that should not happen, so I'm not sure why that would be the case.

That said, it seems all ranks exit from ncclCommInitRank and start calling broadcast, so rank 0 has somehow participated in that init phase; otherwise all ranks would be stuck in init as well. And the pstack.log you attached confirms that we are stuck in ncclBroadcast, if that stack was taken from rank 0.

But it would be good to fix that to make sure the environment is indeed the same on all ranks since that could be a cause for the hang.
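One quick way to check that is a tiny MPI program, launched with the exact same mpirun command line as the real job, that prints the NCCL debug variables as each rank sees them. This is only an illustrative sketch (the file name env_check.c is arbitrary), not part of NCCL or nccl-tests:

// env_check.c -- print NCCL-related environment variables on every rank so a
// missing NCCL_DEBUG on rank 0 becomes obvious. Build with: mpicc env_check.c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const char *dbg = getenv("NCCL_DEBUG");
  const char *sub = getenv("NCCL_DEBUG_SUBSYS");
  printf("rank %d: NCCL_DEBUG=%s NCCL_DEBUG_SUBSYS=%s\n",
         rank, dbg ? dbg : "(unset)", sub ? sub : "(unset)");
  MPI_Finalize();
  return 0;
}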

@nachtsky1077
Author

Thanks, sjeaugey. Also, I forgot to mention that GPU utilization was somehow at 100% during the hang.

So if I got it right, all the ranks had to successfully exit from ncclCommInitRank, which means they all finished the initialization process, before the broadcast could happen. Based on this, there's no need to check whether rank 0 was stuck somewhere in initialization, although it is weird that we don't see any NCCL log showing that rank 0 finished initialization. Am I right?

The environment should be the same as far as I know, but I'll double-check and get back if I find anything strange.

Thanks!

@nachtsky1077
Author

Hi Sylvain,

Regarding the rank 0 log issue, I double-checked some working cases and they don't print logs on rank 0 either in my setup, so I guess it has nothing to do with the hang issue; it is weird, though.

I did some more testing and noticed that in the hanging cases I'm getting the message "transport/net_ib.cc:80 NCCL WARN NET/IB : Got async event : GID table change" on some of the ranks. I did some searching and found that it corresponds to the IB async event IBV_EVENT_GID_CHANGE, which means the GID table changed. I'm wondering: could this cause a hang, does NCCL handle the case where the GID changes, or is this event harmless?
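For reference, this event can be observed outside NCCL with a small standalone watcher built around ibv_get_async_event, similar in spirit to NCCL's async thread in transport/net_ib.cc. This is just an illustrative sketch (it monitors the first device and has minimal error handling), not NCCL code:

// gid_watch.c -- watch for IBV_EVENT_GID_CHANGE on the first IB device.
// Link with -libverbs.
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
  int num = 0;
  struct ibv_device **devs = ibv_get_device_list(&num);
  if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }
  struct ibv_context *ctx = ibv_open_device(devs[0]);
  if (!ctx) { fprintf(stderr, "open failed\n"); return 1; }
  for (;;) {
    struct ibv_async_event ev;
    if (ibv_get_async_event(ctx, &ev)) break;   // blocks until an event arrives
    if (ev.event_type == IBV_EVENT_GID_CHANGE)
      printf("GID table change on %s\n", ibv_get_device_name(devs[0]));
    ibv_ack_async_event(&ev);                   // every event must be acked
  }
  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}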

Thanks!

nachtsky1077 changed the title from "NCCL hang during initialization" to "NCCL hang issue" on Sep 29, 2020
@weberxie

weberxie commented Sep 30, 2020

Same issue here. I ran nccl-tests with the following command:

mpirun --hostfile hostfile \
-bind-to none \
-map-by slot \
--display-map --tag-output --timestamp-output \
--mca pml ob1 --mca btl_vader_single_copy_mechanism none --mca btl_openib_cpc_include rdmacm --mca btl_openib_rroce_enable 1 --mca btl_tcp_if_exclude lo,docker0 --mca orte_base_help_aggregate 0 --mca btl_openib_receive_queues P,256,256::S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
--mca btl openib,self,vader \
-x NCCL_SOCKET_IFNAME=^lo,docker0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0:1 -x NCCL_IB_GID_INDEX=3 -x HOROVOD_MPI_THREADS_DISABLE=1 -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH -x NCCL_NET_GDR_READ=0 \
broadcast_perf -b 8 -e 128M -f 2 -g 1

The test then hangs, and the pstack log of rank 0 is:

Thread 7 (Thread 0x7f637d9ec700 (LWP 2490)):
#0  0x00007f63abd2f727 in sched_yield () from /usr/lib64/libc.so.6
#1  0x00007f63ace8a533 in persistentThread (comm_=0x7f6378000d90) at proxy.cc:234
#2  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7f637effd700 (LWP 2488)):
#0  0x00007f63abd3fbed in poll () from /usr/lib64/libc.so.6
#1  0x00007f639762f323 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#2  0x00007f6397691acd in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#3  0x00007f6397631988 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7f637ffff700 (LWP 2486)):
#0  0x00007f63aca4b71d in read () from /usr/lib64/libpthread.so.0
#1  0x00007f63a0d54926 in ibv_get_async_event () from /usr/local/nvidia/cpu_lib/libibverbs.so.1
#2  0x00007f63ace8d2b2 in wrap_ibv_get_async_event (context=context@entry=0x7f639adc5310, event=event@entry=0x7f637fff9670) at misc/ibvwrap.cc:220
#3  0x00007f63ace9783e in ncclIbAsyncThreadMain (args=0x7f639adc5310) at transport/net_ib.cc:69
#4  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7f63911d2700 (LWP 2485)):
#0  0x00007f63abd4bd1f in accept4 () from /usr/lib64/libc.so.6
#1  0x00007f63976302ca in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#2  0x00007f63976228dd in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#3  0x00007f6397631988 in ?? () from /usr/local/nvidia/cpu_lib/libcuda.so.1
#4  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7f63a37eb700 (LWP 2475)):
#0  0x00007f63abd4ae63 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f63ab703ba3 in epoll_dispatch (base=0x98c7f0, tv=<optimized out>) at epoll.c:407
#2  0x00007f63ab7075f0 in opal_libevent2022_event_base_loop (base=0x98c7f0, flags=1) at event.c:1630
#3  0x00007f63a99c82be in progress_engine () from /usr/local/ompi/lib/openmpi/mca_pmix_pmix3x.so
#4  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7f63aa44d700 (LWP 2474)):
#0  0x00007f63abd3fbed in poll () from /usr/lib64/libc.so.6
#1  0x00007f63ab70f936 in poll_dispatch (base=0x9340f0, tv=0x7f63aa447620) at poll.c:165
#2  0x00007f63ab7075f0 in opal_libevent2022_event_base_loop (base=0x9340f0, flags=1) at event.c:1630
#3  0x00007f63ab6c2fce in progress_engine () from /usr/local/ompi/lib/libopen-pal.so.40
#4  0x00007f63aca44e65 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f63abd4a88d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7f63b62e9fc0 (LWP 2473)):
#0  0x00007f63a06850e4 in mlx5_poll_cq_1 () from /usr/local/nvidia/cpu_lib/libmlx5-rdmav2.so
#1  0x00007f63a1399413 in poll_device () from /usr/local/ompi/lib/openmpi/mca_btl_openib.so
#2  0x00007f63a139a1c2 in btl_openib_component_progress () from /usr/local/ompi/lib/openmpi/mca_btl_openib.so
#3  0x00007f63ab6bd14c in opal_progress () from /usr/local/ompi/lib/libopen-pal.so.40
#4  0x00007f63b59917c5 in ompi_request_default_wait () from /usr/local/ompi/lib/libmpi.so.40
#5  0x00007f63b59e9df9 in ompi_coll_base_barrier_intra_bruck () from /usr/local/ompi/lib/libmpi.so.40
#6  0x00007f63b59a6337 in PMPI_Barrier () from /usr/local/ompi/lib/libmpi.so.40
#7  0x0000000000406bc8 in Barrier (args=<optimized out>) at common.cu:265
#8  BenchTime (args=args@entry=0x7fff0cb0c7e0, type=type@entry=ncclFloat32, op=op@entry=ncclSum, root=root@entry=0, in_place=in_place@entry=0) at common.cu:406
#9  0x0000000000407afd in TimeTest (args=args@entry=0x7fff0cb0c7e0, type=ncclFloat32, typeName=0x40d677 "float", op=op@entry=ncclSum, opName=opName@entry=0x40d298 "", root=root@entry=0) at common.cu:507
#10 0x00000000004032c8 in BroadcastRunTest (args=0x7fff0cb0c7e0, root=<optimized out>, type=ncclFloat32, typeName=0x40d677 "float", op=<optimized out>, opName=<optimized out>) at broadcast.cu:109
#11 0x00000000004039ea in threadRunTests (args=0x7fff0cb0c7e0) at common.cu:520
#12 0x0000000000409661 in run () at common.cu:843
#13 0x0000000000402170 in main (argc=9, argv=0x7fff0cb0e028) at common.cu:696

and only part of the nccl-tests output was produced:

                                                     out-of-place                       in-place
       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
          8             2   float     sum    122.0    0.00    0.00  2e-07    147.1    0.00    0.00  2e-07
         16             4   float     sum    115.7    0.00    0.00  2e-07    223.8    0.00    0.00  2e-07
         32             8   float     sum    111.9    0.00    0.00  2e-07    121.4    0.00    0.00  4e-07
         64            16   float     sum    109.7    0.00    0.00  4e-07    111.3    0.00    0.00  4e-07
        128            32   float     sum    111.5    0.00    0.00  4e-07    106.0    0.00    0.00  4e-07
        256            64   float     sum    109.8    0.00    0.00  4e-07    110.6    0.00    0.00  4e-07
        512           128   float     sum    111.0    0.00    0.01  4e-07    118.0    0.00    0.01  2e-07
       1024           256   float     sum    119.6    0.01    0.02  5e-07    120.3    0.01    0.02  5e-07
       2048           512   float     sum    127.5    0.02    0.03  5e-07    131.6    0.02    0.03  5e-07
       4096          1024   float     sum    154.0    0.03    0.05  5e-07    153.8    0.03    0.05  5e-07
       8192          2048   float     sum    183.8    0.04    0.09  5e-07    178.6    0.05    0.09  5e-07

@sjeaugey
Member

> I did some more testing and noticed that in the hanging cases I'm getting the message "transport/net_ib.cc:80 NCCL WARN NET/IB : Got async event : GID table change" on some of the ranks.

This could indeed cause a hang. It would be good to understand what changed in the GID table and why. Did some interfaces appear/disappear? Or did some IP address change?
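One way to see whether a GID entry actually changes while the job is running is to poll it with ibv_query_gid and compare snapshots. This is only a rough diagnostic sketch, not NCCL code; the device (first one found), port 1 and GID index 3 are examples borrowed from the settings used elsewhere in this thread and should match the device, port and NCCL_IB_GID_INDEX actually in use:

// gid_poll.c -- snapshot a GID entry once per second and report when it
// changes. Link with -libverbs.
#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  int num = 0;
  struct ibv_device **devs = ibv_get_device_list(&num);
  if (!devs || num == 0) return 1;
  struct ibv_context *ctx = ibv_open_device(devs[0]);
  if (!ctx) return 1;
  union ibv_gid prev, cur;
  if (ibv_query_gid(ctx, 1, 3, &prev)) return 1;   // port 1, GID index 3
  for (;;) {
    sleep(1);
    if (ibv_query_gid(ctx, 1, 3, &cur)) break;
    if (memcmp(&prev, &cur, sizeof(cur)) != 0) {
      printf("GID index 3 on port 1 changed\n");
      prev = cur;
    }
  }
  ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}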

@nachtsky1077
Author

> > I did some more testing and noticed that in the hanging cases I'm getting the message "transport/net_ib.cc:80 NCCL WARN NET/IB : Got async event : GID table change" on some of the ranks.
>
> This could indeed cause a hang. It would be good to understand what changed in the GID table and why. Did some interfaces appear/disappear? Or did some IP address change?

Unfortunately, I haven't been able to root-cause the GID change yet (I didn't see any change in the files under /sys/class/infiniband/mlx5_..., though I'm still digging into it). One thing I noticed is that there are some errors under the hw_counters folder:
bash-4.2# grep "" /sys/class/infiniband/mlx5_0/ports/1/hw_counters/*
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/duplicate_request:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/implied_nak_seq_err:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan:10
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/local_ack_timeout_err:91316
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/np_cnp_sent:17269947
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/np_ecn_marked_roce_packets:11293938
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_buffer:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence:6140236
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/packet_seq_err:929974
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_cqe_error:2
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_cqe_flush_error:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_remote_access_errors:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/req_remote_invalid_request:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/resp_cqe_error:20540
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/resp_cqe_flush_error:20443
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/resp_local_length_error:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/resp_remote_access_errors:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_handled:11571119
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_ignored:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rx_atomic_requests:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rx_dct_connect:0
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rx_icrc_encapsulated:32
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rx_read_requests:83
/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rx_write_requests:989976260

In the meantime, I tried out NCCL v2.7.8-1 in my case and the hang issue is gone. Hence my question: are there differences between v2.5.7-1 and v2.7.8-1 that might explain the different behavior?

@sjeaugey
Member

sjeaugey commented Oct 9, 2020

I do not see any change between 2.5 and 2.7 which would explain a different behavior upon a GID_INDEX change.

@weberxie

I have reproduced the hang issue with GDR enabled; it seems that the GID_INDEX change isn't the root cause.

Environment:

2 nodes with 8 x V100 each
NCCL Version: 2.7.8-1
CUDA Version: 10.0
Driver Version: 450.51.06

If I run the nccl-tests with 2 docker containers, 8 GPUs per container, it runs normally.
If I run the nccl-tests with 16 docker containers on the same 2 nodes, 1 GPU per container, it hangs; GPU utilization is 100% and CPU is 200%. What is interesting is that it hangs at size 16384 every time.

The nccl-tests log is attached as a screenshot (Screen Shot 2020-10-21 at 7:14:21 PM).

If I run in the same mode but disable GDR by setting NCCL_NET_GDR_READ=0 and NCCL_NET_GDR_LEVEL=0, then the nccl-tests run successfully.

@sjeaugey
Member

@weberxie it would seem GPU Direct RDMA is not functional on your setup. This could be due to ACS being enabled, or something else causing your PCI switches to not correctly process PCI peer-to-peer requests.

@sjeaugey
Member

Actually @weberxie GPU Direct being broken may not explain why it works when launching a single container with 8 GPUs, but not when launching 1 GPU per container.

Could you try applying the attached patch on top of 2.7.8 and see if it fixes the issue?

Alternatively, as a workaround, can you try setting NCCL_PROTO=^LL128?

patch.394.txt
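If it is easier than changing the mpirun command line, the NCCL_PROTO workaround can also be applied from inside the application, as long as the variable is set before the first NCCL call so the library sees it when it initializes. A minimal sketch (not from NCCL or nccl-tests):

// Illustrative sketch: apply the NCCL_PROTO=^LL128 workaround programmatically
// instead of via "mpirun -x NCCL_PROTO=^LL128".
#include <stdlib.h>

int main(void) {
  setenv("NCCL_PROTO", "^LL128", 1 /* overwrite any existing value */);
  // ... create the NCCL communicator and run collectives as usual ...
  return 0;
}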

@weberxie

@sjeaugey Thanks.

Both of these fix the issue. Could you explain why they work?

Also, the patch may be missing the line allGather3Data[rank].ring.typeInter = ringGraph.typeInter;.

@sjeaugey
Member

Thanks for the confirmation. Attaching the fixed patch for reference. I forgot one line when porting it back to 2.7.
patch.394.2.txt

The explanation is a bit complicated, but in short: when using 1 process per node, we may enable LL128 for inter-node communication, but only for the GPUs which are close to the NIC, i.e. those using GPU Direct RDMA. So half the ranks would have LL128 enabled and the other half would not, causing a protocol mismatch and a hang.
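Conceptually, the protocol choice has to be agreed upon by all ranks before it is applied. The sketch below is not NCCL's actual code (NCCL exchanges this information through its init-time allgather, e.g. the typeInter field mentioned above); it just illustrates the idea of reducing a per-rank capability flag to a single global decision:

// Illustrative sketch: each rank computes a local "LL128 is usable" flag (here
// a placeholder), and all ranks take the minimum so the protocol is either
// enabled everywhere or nowhere.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Placeholder: in reality this depends on whether this rank's GPU can use
  // GPU Direct RDMA (i.e. how close it is to the NIC).
  int ll128_ok_local = (rank % 2 == 0) ? 1 : 0;

  int ll128_ok_global = 0;
  MPI_Allreduce(&ll128_ok_local, &ll128_ok_global, 1, MPI_INT, MPI_MIN,
                MPI_COMM_WORLD);

  if (rank == 0)
    printf("LL128 enabled globally: %d\n", ll128_ok_global);
  MPI_Finalize();
  return 0;
}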

sjeaugey added a commit that referenced this issue Nov 6, 2020

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all channels.
Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node communication.
Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using less GPUs than described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using one GPU per node.
@kehuanfeng

kehuanfeng commented Nov 13, 2020

> In the meantime, I tried out NCCL v2.7.8-1 in my case and the hang issue is gone. Hence my question: are there differences between v2.5.7-1 and v2.7.8-1 that might explain the different behavior?

@nachtsky1077 we are seeing a similar issue: the GID table changed during training and the process hung.
Have you figured out why the GID table changes? When you upgraded to v2.7.8-1, did it fix the hang itself, or did it fix the GID change? In other words, with v2.7.8-1, does the GID never change again? (GID index shown by 'show_gids')

node01: VM-0-14-centos:106:196 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change
node01:
node01: VM-0-14-centos:107:194 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change
node01:
node01: VM-0-14-centos:108:593 [0] transport/net_ib.cc:816 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32633, vendor err 129
node01: VM-0-14-centos:108:593 [0] NCCL INFO include/net.h:28 -> 2
node01: VM-0-14-centos:108:593 [0] NCCL INFO transport/net.cc:295 -> 2
node01: VM-0-14-centos:108:593 [0] NCCL INFO transport.cc:179 -> 2 [Proxy Thread]
node01:
node01: VM-0-14-centos:109:594 [0] transport/net_ib.cc:816 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32566, vendor err 129
node01: VM-0-14-centos:109:594 [0] NCCL INFO include/net.h:28 -> 2
node01: VM-0-14-centos:109:594 [0] NCCL INFO transport/net.cc:345 -> 2
node01: VM-0-14-centos:109:594 [0] NCCL INFO transport.cc:179 -> 2 [Proxy Thread]
node02:
node02: VM-0-9-centos:80:562 [0] transport/net_ib.cc:816 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32542, vendor err 129
node02: VM-0-9-centos:80:562 [0] NCCL INFO include/net.h:28 -> 2
node02: VM-0-9-centos:80:562 [0] NCCL INFO transport/net.cc:345 -> 2
node02: VM-0-9-centos:80:562 [0] NCCL INFO transport.cc:179 -> 2 [Proxy Thread]
node02:
node02: VM-0-9-centos:80:141 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change
node02:
node02: VM-0-9-centos:77:149 [0] transport/net_ib.cc:73 NCCL WARN NET/IB : Got async event : GID table change
node02:

@sjeaugey Is there any operation in NCCL that would trigger the GID to change?

@sjeaugey
Member

No, NCCL should not cause GID changes.
The only cause I can see for a GID change would be the IP address of a NIC changing, which quite logically causes a timeout since we would no longer be able to reach the remote IP address.

@nachtsky1077
Author

> @nachtsky1077 we are seeing a similar issue: the GID table changed during training and the process hung.
> Have you figured out why the GID table changes? When you upgraded to v2.7.8-1, did it fix the hang itself, or did it fix the GID change? In other words, with v2.7.8-1, does the GID never change again?

The fix in the version upgrade was related to the GDR issue mentioned by @weberxie, which has nothing to do with the GID table change. I haven't had a chance to dig into what caused the GID table change.

mackrorysd pushed a commit to mackrorysd/nccl that referenced this issue Apr 13, 2021 (same commit message as above)
yinwaii pushed a commit to yinwaii/nccl that referenced this issue Nov 17, 2022 (same commit message as above)