local access violation work queue error when upgrade to v2.20.3-1 #1524
Comments
It's a bit weird, but maybe older versions were not using GPU Direct RDMA, and more recent ones are trying to use it. Can you set |
Not work Work |
Attaching the log for more information: |
Hum, maybe it's tied to ECE being added in 2.19, for which your system is misconfigured. Could you try with NCCL 2.23 and set |
It looks like query_ece thinks it is supported here:
But then it fails:
So later we see ECE not being supported:
Not sure what's happening exactly but it looks like a probable root cause. |
That is not the root cause. I noticed this failure at the very beginning, so I compared it with the log from the run that worked and found that this message still appears there. |
The GID difference in the log is not an issue either. I tried all the GIDs, |
@sjeaugey I think I found the root cause after reviewing the diff between the two commits mentioned above. There is a change to no longer adjust the MTU. My two servers were misconfigured with different MTUs. After correcting this, it works. So why not adjust the MTU? I suggest at least logging a warning when a difference is found. |
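To illustrate the suggestion above, here is a minimal sketch of "adjust and warn" behavior: both peers settle on the smaller of the two MTUs and a warning is logged when they differ. This is not NCCL's code; `negotiate_mtu` and the logger name are hypothetical and exist only for this example.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("mtu_check")


def negotiate_mtu(local_mtu: int, remote_mtu: int) -> int:
    """Return the MTU both peers can use, warning when they differ.

    Illustrative only: this does not reproduce how NCCL handles MTU.
    """
    if local_mtu <= 0 or remote_mtu <= 0:
        raise ValueError("MTU values must be positive")

    agreed = min(local_mtu, remote_mtu)
    if local_mtu != remote_mtu:
        # Surface the misconfiguration instead of failing later
        # with an opaque work queue error.
        logger.warning(
            "MTU mismatch: local=%d remote=%d, using %d",
            local_mtu, remote_mtu, agreed,
        )
    return agreed
```

Calling `negotiate_mtu(1024, 4096)` returns 1024 and logs a warning, while equal values pass through silently.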
We have also encountered the same error. The MTU (Maximum Transmission Unit) of one server is 1500, while that of the other server is 4200. Tests with NCCL 2.20.5 + CUDA 12.4 report errors, but tests with NCCL 2.18.1 + CUDA 12.1 do not. |
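As a rough sketch of why 1500 and 4200 end up incompatible: on RoCE the RDMA path MTU is one of the fixed InfiniBand sizes (256 to 4096 bytes) that fits in the Ethernet MTU after protocol headers. The code below assumes the values above are Ethernet interface MTUs. The 96-byte overhead and the name `roce_path_mtu` are assumptions for this example, not something stated in this thread.

```python
# InfiniBand path MTU sizes in bytes, largest first.
IB_MTU_SIZES = (4096, 2048, 1024, 512, 256)

# Assumed per-packet header overhead for RoCE, in bytes.
# This is an assumption for the sketch; check your kernel/driver.
ASSUMED_ROCE_OVERHEAD = 96


def roce_path_mtu(eth_mtu: int, overhead: int = ASSUMED_ROCE_OVERHEAD) -> int:
    """Return the largest IB MTU size that fits in an Ethernet MTU."""
    payload = eth_mtu - overhead
    for size in IB_MTU_SIZES:
        if payload >= size:
            return size
    raise ValueError(f"Ethernet MTU {eth_mtu} is too small for any IB MTU")


if __name__ == "__main__":
    for eth_mtu in (1500, 4200):
        print(eth_mtu, "->", roce_path_mtu(eth_mtu))
```

Under these assumptions the two servers would come up with path MTUs of 1024 and 4096, so one side can send packets larger than the other expects to receive.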
Thanks for the feedback, we'll check and fix that. |
hello,
I'm running the NCCL tests with my mlx 455 NIC. After upgrading NCCL to v2.20.3-1, the test breaks with the following errors; all earlier versions work. Is there a breaking change related to RDMA in this version? How can I use the latest version? Are there any compatibility requirements?
`ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94647 [0] transport/net_ib.cc:100 NCCL WARN NET/IB : mlx5_0:1 Got async event : local access violation work queue error
ubuntu20-server-2:94639:94652 [0] transport/net_ib.cc:1696 NCCL WARN NET/IB : Got completion from peer 192.168.1.2<57888> with status=5 opcode=1 len=0 vendor err 249 (Recv) localGid fe80::e61d:2dff:fef2:9c94 remoteGidsfe80::e61d:2dff:fef2:9fa0
ubuntu20-server-2:94639:94652 [0] NCCL INFO transport/net.cc:1298 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:694 -> 6
ubuntu20-server-2:94639:94652 [0] NCCL INFO proxy.cc:874 -> 6 [Progress Thread]`
My NIC:
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.26.1040
    Hardware version: 0
    Node GUID: 0xe41d2d0300f29fa0
    System image GUID: 0xe41d2d0300f29fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xe61d2dfffef29fa0
        Link layer: Ethernet