Using 4 GPUs!
c76d92cccb6e:72:72 [0] NCCL INFO cudaDriverVersion 12060
c76d92cccb6e:72:72 [0] NCCL INFO Bootstrap : Using eth0:172.19.2.2<0>
c76d92cccb6e:72:72 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
c76d92cccb6e:72:72 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin symbol (>= v5). ncclNetPlugin symbols v4 and lower are not supported.
NCCL version 2.20.5+cuda12.3
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:457 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:459 [3] NCCL INFO group.cc:64 -> 1 [Async thread]
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:456 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/replicate.py:83, in _broadcast_coalesced_reshape(tensors, devices, detach)
80 else:
81 # Use the autograd function to broadcast if not detach
82 if len(tensors) > 0:
---> 83 tensor_copies = Broadcast.apply(devices, *tensors)
84 return [tensor_copies[i:i + len(tensors)]
85 for i in range(0, len(tensor_copies), len(tensors))]
86 else:
File /opt/conda/lib/python3.10/site-packages/torch/autograd/function.py:574, in Function.apply(cls, *args, **kwargs)
571 if not torch._C._are_functorch_transforms_active():
572 # See NOTE: [functorch vjp and autograd interaction]
573 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 574 return super().apply(*args, **kwargs) # type: ignore[misc]
576 if not is_setup_ctx_defined:
577 raise RuntimeError(
578 "In order to use an autograd.Function with functorch transforms "
579 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
580 "staticmethod. For more details, please see "
581 "https://pytorch.org/docs/main/notes/extending.func.html"
582 )
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:23, in Broadcast.forward(ctx, target_gpus, *inputs)
21 ctx.num_inputs = len(inputs)
22 ctx.input_device = inputs[0].get_device()
---> 23 outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
24 non_differentiables = []
25 for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py:58, in broadcast_coalesced(tensors, devices, buffer_size)
56 devices = [_get_device_index(d) for d in devices]
57 tensors = [_handle_complex(t) for t in tensors]
---> 58 return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Recompile NCCL for your GPU's architecture, or use a different NCCL version which supports your GPU. What is your GPU and where does your NCCL version come from?
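To answer the version question from the log itself, here is a small sketch that condenses `NCCL_DEBUG=INFO` output into the NCCL version, the CUDA version NCCL was built against, the driver's CUDA version, and a deduplicated warning count per device. The function name `summarize_nccl_log` and the regular expressions are my own assumptions, based only on the line shapes in the log above; other NCCL releases may format these lines differently.

```python
import re
from collections import Counter


def summarize_nccl_log(text):
    """Collect version info and deduplicated warnings from NCCL debug output.

    Hypothetical helper: the patterns below are inferred from the log lines
    posted in this thread, not from any NCCL specification.
    """
    summary = {
        "nccl": None,         # e.g. "2.20.5"
        "nccl_cuda": None,    # CUDA version NCCL was built with, e.g. "12.3"
        "driver_cuda": None,  # CUDA version supported by the driver, e.g. "12.6"
        "warnings": Counter(),
    }
    for line in text.splitlines():
        m = re.search(r"NCCL version (\d+\.\d+\.\d+)\+cuda(\d+\.\d+)", line)
        if m:
            summary["nccl"], summary["nccl_cuda"] = m.group(1), m.group(2)
            continue
        m = re.search(r"cudaDriverVersion (\d+)", line)
        if m:
            # CUDA encodes versions as 1000 * major + 10 * minor.
            v = int(m.group(1))
            summary["driver_cuda"] = f"{v // 1000}.{(v % 1000) // 10}"
            continue
        m = re.search(r"\[(\d+)\] (\S+:\d+) NCCL WARN (.*)", line)
        if m:
            # Key: (device index, source location, message).
            key = (int(m.group(1)), m.group(2), m.group(3))
            summary["warnings"][key] += 1
    return summary


# A few lines copied from the log in this issue.
SAMPLE_LOG = """\
c76d92cccb6e:72:72 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.3
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:457 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
"""

if __name__ == "__main__":
    info = summarize_nccl_log(SAMPLE_LOG)
    print(info["nccl"], info["nccl_cuda"], info["driver_cuda"])
    for (device, location, message), count in sorted(info["warnings"].items()):
        print(f"device {device} {location}: {message} (x{count})")
```

Pasting the condensed summary together with the GPU model is usually easier to read than hundreds of repeated `enqueue.cc` warnings. The summary does not identify the GPU architecture, so that still has to be reported separately.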
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:459 [3] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:459 [3] NCCL INFO group.cc:64 -> 1 [Async thread]
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:456 [0] NCCL INFO init.cc:1475 -> 1
c76d92cccb6e:72:456 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
c76d92cccb6e:72:458 [2] enqu
RuntimeError Traceback (most recent call last)
Cell In[22], line 35
32 X_batch,y_batch,w_batch = xi,yi,wi
34 optimizer.zero_grad()
---> 35 y_pred = model(X_batch).squeeze()
37 loss = (criterion(y_pred, y_batch) * w_batch).mean()
38 loss.backward()
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:185, in DataParallel.forward(self, *inputs, **kwargs)
183 if len(self.device_ids) == 1:
184 return self.module(*inputs[0], **module_kwargs[0])
--> 185 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
186 outputs = self.parallel_apply(replicas, inputs, module_kwargs)
187 return self.gather(outputs, self.output_device)
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:190, in DataParallel.replicate(self, module, device_ids)
189 def replicate(self, module: T, device_ids: Sequence[Union[int, torch.device]]) -> List[T]:
--> 190 return replicate(module, device_ids, not torch.is_grad_enabled())
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/replicate.py:110, in replicate(network, devices, detach)
108 params = list(network.parameters())
109 param_indices = {param: idx for idx, param in enumerate(params)}
--> 110 param_copies = _broadcast_coalesced_reshape(params, devices, detach)
112 buffers = list(network.buffers())
113 buffers_rg: List[torch.Tensor] = []
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/replicate.py:83, in _broadcast_coalesced_reshape(tensors, devices, detach)
80 else:
81 # Use the autograd function to broadcast if not detach
82 if len(tensors) > 0:
---> 83 tensor_copies = Broadcast.apply(devices, *tensors)
84 return [tensor_copies[i:i + len(tensors)]
85 for i in range(0, len(tensor_copies), len(tensors))]
86 else:
File /opt/conda/lib/python3.10/site-packages/torch/autograd/function.py:574, in Function.apply(cls, *args, **kwargs)
571 if not torch._C._are_functorch_transforms_active():
572 # See NOTE: [functorch vjp and autograd interaction]
573 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 574 return super().apply(*args, **kwargs) # type: ignore[misc]
576 if not is_setup_ctx_defined:
577 raise RuntimeError(
578 "In order to use an autograd.Function with functorch transforms "
579 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
580 "staticmethod. For more details, please see "
581 "https://pytorch.org/docs/main/notes/extending.func.html"
582 )
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:23, in Broadcast.forward(ctx, target_gpus, *inputs)
21 ctx.num_inputs = len(inputs)
22 ctx.input_device = inputs[0].get_device()
---> 23 outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
24 non_differentiables = []
25 for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
File /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/comm.py:58, in broadcast_coalesced(tensors, devices, buffer_size)
56 devices = [_get_device_index(d) for d in devices]
57 tensors = [_handle_complex(t) for t in tensors]
---> 58 return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
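The log above is mostly the same two warnings repeated for each device. A minimal sketch for condensing it into per-device counts follows, assuming the bracketed number in each line is the CUDA device index and that warning lines follow the `host:pid:tid [dev] file:line NCCL WARN message` layout seen here. The helper name `summarize_nccl_warnings` is hypothetical and is not part of NCCL or PyTorch.

```python
import re
from collections import Counter

# Assumed layout of a warning line, taken from the log above:
# c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
WARN_RE = re.compile(
    r"^\S+:\d+:\d+ \[(?P<dev>\d+)\] (?P<site>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def summarize_nccl_warnings(log_text):
    """Count NCCL WARN lines per (device index, source location, message).

    Hypothetical helper: lines that do not match the assumed layout
    (INFO lines, truncated lines, traceback text) are skipped.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            key = (int(match["dev"]), match["site"], match["msg"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "c76d92cccb6e:72:459 [3] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'",
        "c76d92cccb6e:72:459 [3] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'",
        "c76d92cccb6e:72:459 [3] NCCL INFO init.cc:1475 -> 1",
    ])
    for (dev, site, msg), n in sorted(summarize_nccl_warnings(sample).items()):
        print(f"device {dev} {site} x{n}: {msg}")
```

Running this over the full output should make it easier to see whether every device reports the same failure at the same two `enqueue.cc` locations before `init.cc` returns the error.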