Runtime ERROR: NCCL WARN Cuda failure 'named symbol not found'. unhandled cuda error (run with NCCL_DEBUG=INFO for details) #1528

Open
Seqaeon opened this issue Dec 2, 2024 · 4 comments

Seqaeon commented Dec 2, 2024

Using 4 GPUs!
c76d92cccb6e:72:72 [0] NCCL INFO cudaDriverVersion 12060
c76d92cccb6e:72:72 [0] NCCL INFO Bootstrap : Using eth0:172.19.2.2<0>
c76d92cccb6e:72:72 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
c76d92cccb6e:72:72 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin symbol (>= v5). ncclNetPlugin symbols v4 and lower are not supported.
NCCL version 2.20.5+cuda12.3

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:457 [1] NCCL INFO group.cc:64 -> 1 [Async thread]

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:459 [3] NCCL INFO group.cc:64 -> 1 [Async thread]

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:456 [0] NCCL INFO group.cc:64 -> 1 [Async thread]

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

c76d92cccb6e:72:458 [2] enqu

RuntimeError Traceback (most recent call last)
Cell In[22], line 35
32 X_batch,y_batch,w_batch = xi,yi,wi
34 optimizer.zero_grad()
---> 35 y_pred = model(X_batch).squeeze()
37 loss = (criterion(y_pred, y_batch) * w_batch).mean()
38 loss.backward()

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:185, in DataParallel.forward(self, *inputs, **kwargs)
183 if len(self.device_ids) == 1:
184 return self.module(*inputs[0], **module_kwargs[0])
--> 185 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
186 outputs = self.parallel_apply(replicas, inputs, module_kwargs)
187 return self.gather(outputs, self.output_device)

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:190, in DataParallel.replicate(self, module, device_ids)
189 def replicate(self, module: T, device_ids: Sequence[Union[int, torch.device]]) -> List[T]:
--> 190 return replicate(module, device_ids, not torch.is_grad_enabled())

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/replicate.py:110, in replicate(network, devices, detach)
108 params = list(network.parameters())
109 param_indices = {param: idx for idx, param in enumerate(params)}
--> 110 param_copies = _broadcast_coalesced_reshape(params, devices, detach)
112 buffers = list(network.buffers())
113 buffers_rg: List[torch.Tensor] = []

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/replicate.py:83, in _broadcast_coalesced_reshape(tensors, devices, detach)
80 else:
81 # Use the autograd function to broadcast if not detach
82 if len(tensors) > 0:
---> 83 tensor_copies = Broadcast.apply(devices, *tensors)
84 return [tensor_copies[i:i + len(tensors)]
85 for i in range(0, len(tensor_copies), len(tensors))]
86 else:

File /opt/conda/lib/python3.10/site-packages/torch/autograd/function.py:574, in Function.apply(cls, *args, **kwargs)
571 if not torch._C._are_functorch_transforms_active():
572 # See NOTE: [functorch vjp and autograd interaction]
573 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 574 return super().apply(*args, **kwargs) # type: ignore[misc]
576 if not is_setup_ctx_defined:
577 raise RuntimeError(
578 "In order to use an autograd.Function with functorch transforms "
579 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
580 "staticmethod. For more details, please see "
581 "https://pytorch.org/docs/main/notes/extending.func.html"
582 )

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:23, in Broadcast.forward(ctx, target_gpus, *inputs)
21 ctx.num_inputs = len(inputs)
22 ctx.input_device = inputs[0].get_device()
---> 23 outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
24 non_differentiables = []
25 for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):

File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py:58, in broadcast_coalesced(tensors, devices, buffer_size)
56 devices = [_get_device_index(d) for d in devices]
57 tensors = [_handle_complex(t) for t in tensors]
---> 58 return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

sjeaugey (Member) commented Dec 2, 2024

This error happens when NCCL has not been compiled for your GPU's architecture.

Seqaeon (Author) commented Dec 2, 2024

> This error happens when NCCL has not been compiled for your GPU's architecture.

Thanks. How can I solve that?

sjeaugey (Member) commented Dec 2, 2024

Recompile NCCL for your GPU's architecture, or use a different NCCL version which supports your GPU. What is your GPU and where does your NCCL version come from?
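
The two facts asked for here (which GPU, and which NCCL build) can usually be read from the same Python session that raised the error. The following is a minimal sketch, assuming the NCCL library in use is the one bundled with PyTorch, as the traceback suggests. `describe_gpu_stack` is a hypothetical helper name, and it leaves fields empty when PyTorch or CUDA is not available.

```python
import importlib.util


def describe_gpu_stack():
    """Collect the GPU model, compute capability, and NCCL version seen by PyTorch."""
    info = {"torch": None, "cuda": None, "nccl": None, "gpus": []}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment

    import torch

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA toolkit PyTorch was built against
    if not torch.cuda.is_available():
        return info

    try:
        from torch.cuda import nccl

        info["nccl"] = nccl.version()  # NCCL version bundled with this PyTorch build
    except Exception:
        pass  # builds without NCCL support leave this field as None

    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        info["gpus"].append((torch.cuda.get_device_name(index), f"sm_{major}{minor}"))
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_stack().items():
        print(f"{key}: {value}")
```

The `sm_XY` value reported for each device is the architecture that the NCCL binary would need to include.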

Seqaeon (Author) commented Dec 2, 2024

> Recompile NCCL for your GPU's architecture, or use a different NCCL version which supports your GPU. What is your GPU and where does your NCCL version come from?

The GPU is an L4, and the NCCL version comes from a Kaggle environment.
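
For reference, the L4 is an Ada Lovelace GPU with compute capability 8.9 (`sm_89`). If rebuilding NCCL from source is the chosen route, the NCCL README describes passing an `NVCC_GENCODE` value to `make` to target a specific architecture. The sketch below only formats that value from a compute capability. `nvcc_gencode` is a hypothetical helper, and whether a rebuild is practical inside a Kaggle environment is not established in this thread.

```python
def nvcc_gencode(major: int, minor: int) -> str:
    """Format an nvcc -gencode flag for one compute capability, e.g. (8, 9) for an L4."""
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    # Assumption: the target GPU is an L4 (compute capability 8.9).
    flag = nvcc_gencode(8, 9)
    print(f'make -j src.build NVCC_GENCODE="{flag}"')
```

The printed command would be run from the root of an NCCL source checkout. The other option raised above, switching to an NCCL build that already supports the GPU, avoids compiling anything.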
