
Local access violation work queue error when upgrading to v2.20.3-1 #1524

Open
gangxie112 opened this issue Nov 26, 2024 · 11 comments

@gangxie112

Hello,

I'm running the NCCL tests with my mlx5 NIC, and I found that after upgrading NCCL to v2.20.3-1 the tests break with the errors below; all earlier versions are OK. Is there a breaking change related to RDMA in this version? How can I use the latest version? Is there a compatibility issue here?

```
ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94652 [0] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 192.168.1.2<57888> with status=5 opcode=1 len=0 vendor err 249 (Recv) localGid fe80::e61d:2dff:fef2:9c94 remoteGidsfe80::e61d:2dff:fef2:9fa0
ubuntu20-server-2:94639:94652 [0] NCCL INFO transport/net.cc:1298 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:694 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]
```

My NIC:

```
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.26.1040
    Hardware version: 0
    Node GUID: 0xe41d2d0300f29fa0
    System image GUID: 0xe41d2d0300f29fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffef29fa0
        Link layer: Ethernet
```

@sjeaugey
Member

It's a bit weird, but maybe older versions were not using GPU Direct RDMA while more recent ones try to use it, and GPU Direct RDMA is broken because ACS is enabled on your system?

Can you set NCCL_NET_GDR_LEVEL=0 and see if it makes the problem disappear?

@gangxie112
Author

  1. With `NCCL_NET_GDR_LEVEL=0` it still doesn't work.
  2. Upgrading the mlx driver to the latest version doesn't help either.
  3. I dumped the traffic and compared it with a capture from a working run: the failing traffic has no Global Route Header. Not sure whether that's related.

Failing capture:

```
Frame 278: 1098 bytes on wire (8784 bits), 1098 bytes captured (8784 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 192.168.1.3
User Datagram Protocol, Src Port: 58239, Dst Port: 4791
InfiniBand
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x00033e
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007f4c05930000
        Remote Key: 0x0018006c
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0x3c94db23
    Data (1024 bytes)
```

Working capture:

```
Frame 279: 1110 bytes on wire (8880 bits), 1110 bytes captured (8880 bits)
Ethernet II, Src: MellanoxTech_f2:9f:a0 (e4:1d:2d:f2:9f:a0), Dst: MellanoxTech_f2:9c:94 (e4:1d:2d:f2:9c:94)
InfiniBand
    Global Route Header
        0110 .... = IP Version: 6
        .... 0000 0010 .... = Traffic Class: 2
        .... .... .... 0000 0000 0000 0000 0000 = Flow Label: 0
        Payload Length: 1056
        Next Header: 27
        Hop Limit: 255
        Source GID: fe80::e61d:2dff:fef2:9fa0
        Destination GID: fe80::e61d:2dff:fef2:9c94
    Base Transport Header
        Opcode: Reliable Connection (RC) - RDMA WRITE First (6)
        0... .... = Solicited Event: False
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved: 00
        Destination Queue Pair: 0x0003a6
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    RETH - RDMA Extended Transport Header
        Virtual Address: 0x00007ff407930000
        Remote Key: 0x00180084
        DMA Length: 524288 (0x00080000)
    Invariant CRC: 0xbc200433
    Data (1024 bytes)
```

@gangxie112
Author

Attaching the log for more information:
nccl.log

@sjeaugey
Member

Hmm, maybe it's tied to ECE (Enhanced Connection Establishment), which was added in 2.19 and for which your system is misconfigured.

Could you try NCCL 2.23 and set NCCL_ECE_ENABLE=0?

@sjeaugey
Member

Looks like query_ece thinks it is supported here:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 430 mtu 3 query_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0} GID 0 (80FE/949CF2FEFF2D1DE6) fifoRkey=0x178c7a fifoLkey=0x178c7a
```

But then it fails:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95
```

So later we see ECE not being supported:

```
ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 419 set_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0}
```

Not sure what's happening exactly but it looks like a probable root cause.

@gangxie112
Author

> Looks like query_ece thinks it is supported here:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 430 mtu 3 query_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0} GID 0 (80FE/949CF2FEFF2D1DE6) fifoRkey=0x178c7a fifoLkey=0x178c7a
> ```
>
> But then it fails:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO Call to ibv_set_ece failed with error Operation not supported errno 95
> ```
>
> So later we see ECE not being supported:
>
> ```
> ubuntu20-server-2:3782:3793 [0] NCCL INFO NET/IB: IbDev 0 Port 1 qpn 419 set_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0}
> ```
>
> Not sure what's happening exactly but it looks like a probable root cause.

Not the root cause. I noticed this failure from the very beginning, so I compared with a working run and found the same message there as well.
The attached log is from the working run: nccl-ok.log

@gangxie112
Author

The GID difference in the log is not an issue either; I tried all the GIDs.

@gangxie112
Author

After bisecting the commits, I found that b647562 introduced the issue (the older b6d7438 is OK).
@sjeaugey, this commit contains a lot of changes; any idea about the possible cause?

@gangxie112
Author

gangxie112 commented Nov 27, 2024

@sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above: there is a change that no longer adjusts the MTU. My two servers were misconfigured with different MTUs; after correcting this, it works.

So why was the MTU adjustment removed? I'd suggest at least logging a warning when a difference is detected.

```diff
-  // Adjust the MTU
-  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);
+  // Copy remDevInfo for things like remGidInfo, remFifoAddr, etc.
+  for (int i = 0; i < remMeta.ndevs; i++) {
+    rComm->base.remDevs[i] = remMeta.devs[i];
+    rComm->base.remDevs[i].remoteGid.global.interface_id  = rComm->base.remDevs[i].iid;
+    rComm->base.remDevs[i].remoteGid.global.subnet_prefix = rComm->base.remDevs[i].spn;
+  }
```

@guunergooner

> @sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above: there is a change that no longer adjusts the MTU. My two servers were misconfigured with different MTUs; after correcting this, it works.
>
> So why was the MTU adjustment removed? I'd suggest at least logging a warning when a difference is detected.
>
> ```diff
> -  // Adjust the MTU
> -  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);
> +  // Copy remDevInfo for things like remGidInfo, remFifoAddr, etc.
> +  for (int i = 0; i < remMeta.ndevs; i++) {
> +    rComm->base.remDevs[i] = remMeta.devs[i];
> +    rComm->base.remDevs[i].remoteGid.global.interface_id  = rComm->base.remDevs[i].iid;
> +    rComm->base.remDevs[i].remoteGid.global.subnet_prefix = rComm->base.remDevs[i].spn;
> +  }
> ```

We have also encountered the same error. The MTU (Maximum Transmission Unit) of one server is 1500, while that of the other server is 4200. Tests with NCCL 2.20.5 + CUDA 12.4 report errors, but there are no errors with NCCL 2.18.1 + CUDA 12.1.

@sjeaugey
Member

sjeaugey commented Dec 3, 2024

Thanks for the feedback, we'll check and fix that.
