
Local access violation work queue error when upgrading to v2.20.3-1 #1524

Open
gangxie112 opened this issue Nov 26, 2024 · 11 comments

@gangxie112

Hello,

I'm running the NCCL tests with my mlx5 NIC, and I found that after upgrading NCCL to v2.20.3-1 the tests break with the errors below; all earlier versions are OK. Is there a breaking change related to RDMA in this version? How can I use the latest version? Is there a compatibility issue here?

```
ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94652 [0] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 192.168.1.2<57888> with status=5 opcode=1 len=0 vendor err 249 (Recv) localGid fe80::e61d:2dff:fef2:9c94 remoteGidsfe80::e61d:2dff:fef2:9fa0
ubuntu20-server-2:94639:94652 [0] NCCL INFO transport/net.cc:1298 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:694 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
```

My NIC:

```
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.26.1040
    Hardware version: 0
    Node GUID: 0xe41d2d0300f29fa0
    System image GUID: 0xe41d2d0300f29fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffef29fa0
        Link layer: Ethernet
```

@sjeaugey
Member

It's a bit weird, but maybe older versions were not using GPU Direct RDMA while more recent ones try to use it, and GPU Direct RDMA is broken because ACS is enabled on your system?

Can you set NCCL_NET_GDR_LEVEL=0 and see if it makes the problem disappear?

@gangxie112
Author

  1. With `NCCL_NET_GDR_LEVEL=0` it still doesn't work.
  2. Upgrading the mlx driver to the latest version doesn't help either.
  3. I dumped the traffic and compared it with a capture from a working run: the failing traffic has no Global Route Header. Not sure whether that's related.

Failing capture:

```
Frame 278: 1098 bytes on wire (8784 bits), 1098 bytes captured (8784 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 192.168.1.3
User Datagram Protocol, Src Port: 58239, Dst Port: 4791
InfiniBand
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x00033e
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007f4c05930000
        Remote Key: 0x0018006c
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0x3c94db23
    Data (1024 bytes)
```

Working capture:

```
Frame 279: 1110 bytes on wire (8880 bits), 1110 bytes captured (8880 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
InfiniBand
    Global Route Header
        0110 .... = IP Version: 6
        .... 0000 0010 .... = Traffic Class: 2
        .... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0
        Payload Length: 1056
        Next Header: 27
        Hop Limit: 255
        Source GID: fe80::e61d:2dff:fef2:9fa0
        Destination GID: fe80::e61d:2dff:fef2:9c94
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x0003a6
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007ff407930000
        Remote Key: 0x00180084
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0xbc200433
    Data (1024 bytes)
```

@gangxie112
Author

Attaching the log for more information:
nccl.log

@sjeaugey
Member

Hmm, maybe it's tied to ECE (Enhanced Connection Establishment), which was added in 2.19 and for which your system is misconfigured.

Could you try NCCL 2.23 and set NCCL_ECE_ENABLE=0?

@sjeaugey
Member

Looks like query_ece thinks it is supported here:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 430 mtu 3 query_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0} GID 0 (80FE/949CF2FEFF2D1DE6) fifoRkey=0x178c7a fifoLkey=0x178c7a
```

But then it fails:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95
```

So later we see ECE not being supported:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 419 set_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0}
```

Not sure what's happening exactly but it looks like a probable root cause.

@gangxie112
Author

> Looks like query_ece thinks it is supported here:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 430 mtu 3 query_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0} GID 0 (80FE/949CF2FEFF2D1DE6) fifoRkey=0x178c7a fifoLkey=0x178c7a
> ```
>
> But then it fails:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95
> ```
>
> So later we see ECE not being supported:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 419 set_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0}
> ```
>
> Not sure what's happening exactly but it looks like a probable root cause.

Not the root cause. I noticed this failure from the very beginning, so I compared with a working run and found the same message there as well.
The attached log is from the working run: nccl-ok.log

@gangxie112
Author

The GID difference in the log is not an issue either; I tried all the GIDs.

@gangxie112
Author

After bisecting the commits, I found that b647562 introduced the issue (the older b6d7438 is OK).
@sjeaugey, this commit contains a lot of changes; any idea about the possible cause?

@gangxie112
Author

gangxie112 commented Nov 27, 2024

@sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above: there is a change that no longer adjusts the MTU. My two servers were misconfigured with different MTUs; after correcting this, it works.

So why was the MTU adjustment removed? I'd suggest at least logging a warning when a difference is detected.

```diff
-  // Adjust the MTU
-  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);
+  // Copy remDevInfo for things like remGidInfo, remFifoAddr, etc.
+  for (int i = 0; i < remMeta.ndevs; i++) {
+    rComm->base.remDevs[i] = remMeta.devs[i];
+    rComm->base.remDevs[i].remoteGid.global.interface_id  = rComm->base.remDevs[i].iid;
+    rComm->base.remDevs[i].remoteGid.global.subnet_prefix = rComm->base.remDevs[i].spn;
+  }
```

@guunergooner

> @sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above: there is a change that no longer adjusts the MTU. My two servers were misconfigured with different MTUs; after correcting this, it works.
>
> So why was the MTU adjustment removed? I'd suggest at least logging a warning when a difference is detected.
>
> ```diff
> -  // Adjust the MTU
> -  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);
> +  // Copy remDevInfo for things like remGidInfo, remFifoAddr, etc.
> +  for (int i = 0; i < remMeta.ndevs; i++) {
> +    rComm->base.remDevs[i] = remMeta.devs[i];
> +    rComm->base.remDevs[i].remoteGid.global.interface_id  = rComm->base.remDevs[i].iid;
> +    rComm->base.remDevs[i].remoteGid.global.subnet_prefix = rComm->base.remDevs[i].spn;
> +  }
> ```

We have also encountered the same error. The MTU (Maximum Transmission Unit) of one server is 1500, while that of the other server is 4200. Tests with NCCL 2.20.5 + CUDA 12.4 report errors, but there are no errors with NCCL 2.18.1 + CUDA 12.1.

@sjeaugey
Member

sjeaugey commented Dec 3, 2024

Thanks for the feedback, we'll check and fix that.
