NCCL 2.22.3 core dump when specifying NCCL_IB_ROCE_VERSION_NUM #1538

Open
nkflash opened this issue Dec 12, 2024 · 22 comments

nkflash commented Dec 12, 2024

I am testing all-reduce over a RoCE v2 network across 2 nodes (each node has 8 GPUs and 8 RoCE v2 NICs). NCCL core dumps with NCCL_IB_ROCE_VERSION_NUM=2.
When I use NCCL_IB_GID_INDEX instead of NCCL_IB_ROCE_VERSION_NUM, it works well.
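
As I understand it, NCCL_IB_ROCE_VERSION_NUM is supposed to make NCCL pick a GID index whose entry matches the requested RoCE version automatically, while NCCL_IB_GID_INDEX pins the index by hand. A minimal sketch to list which GID indexes on a device actually carry RoCE v2 entries (assuming the standard Linux RDMA sysfs layout; mlx5_0 port 1 is just an example):

# Sketch: list GID table entries of an RDMA device that are RoCE v2,
# i.e. candidate values for NCCL_IB_GID_INDEX.
import os

def roce_v2_gid_indexes(dev="mlx5_0", port=1):
    base = f"/sys/class/infiniband/{dev}/ports/{port}"
    types_dir = os.path.join(base, "gid_attrs", "types")
    found = []
    for name in sorted(os.listdir(types_dir), key=int):
        try:
            with open(os.path.join(types_dir, name)) as f:
                gid_type = f.read().strip()   # "IB/RoCE v1" or "RoCE v2"
            with open(os.path.join(base, "gids", name)) as f:
                gid = f.read().strip()
        except OSError:
            continue                          # unpopulated table slot
        if gid_type == "RoCE v2" and gid != "0000:0000:0000:0000:0000:0000:0000:0000":
            found.append((int(name), gid))
    return found

if __name__ == "__main__":
    for idx, gid in roce_v2_gid_indexes():
        print(f"GID index {idx}: {gid}")

An index reported by this (e.g. 7 on this setup) is what the NCCL_IB_GID_INDEX workaround pins.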

NCCL version:

ldconfig -v | grep "libnccl.so" | tail -n1 | sed -r 's/^.*\.so\.//'

2.22.3

Environment variables:

export CUDA_DEVICE_MAX_CONNECTIONS=1;
export NCCL_IB_DISABLE=0;
export NCCL_IB_CUDA_SUPPORT=1;
#export NCCL_IB_GID_INDEX=7;
export NCCL_DEBUG=INFO;
export NCCL_IB_ROCE_VERSION_NUM=2;
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7;
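
bandwidth.py itself is not shown here; a minimal stand-in that exercises the same ncclAllReduce path (hypothetical repro script, launched with torchrun --nnodes=2 --nproc_per_node=8 under the environment above) could look like this:

# Hypothetical minimal repro (stand-in for bandwidth.py, not the original script).
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    # NCCL reads the NCCL_IB_* environment variables during communicator init.
    dist.init_process_group(backend="nccl")
    x = torch.ones(1024 * 1024, device="cuda")
    dist.all_reduce(x)   # per the report, this segfaults with NCCL_IB_ROCE_VERSION_NUM=2
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("all_reduce ok:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()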

NCCL core dump backtrace:

[job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:4682 :0:4682] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   4677) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000050dc2 pncclRedOpDestroy()  ???:0
 2 0x0000000000051527 pncclRedOpDestroy()  ???:0
 3 0x0000000000052703 pncclRedOpDestroy()  ???:0
 4 0x000000000004c818 ncclRecv()  ???:0
 5 0x0000000000042990 ncclAllReduce()  ???:0
 6 0x00000000010db161 c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
 7 0x00000000010dbfc0 c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
 8 0x00000000010e8f1e c10d::ProcessGroupNCCL::barrier()  ???:0
 9 0x000000000515fa7f c10d::ops::(anonymous namespace)::barrierCUDA()  Ops.cpp:0
10 0x000000000516f02e c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long> >, false>::call()  :0
11 0x0000000004836ba0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
12 0x000000000517c242 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), void>::call()  :0
13 0x000000000517cf59 c10d::ProcessGroup::barrier()  :0
14 0x0000000000d3b602 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
15 0x000000000046c0f7 pybind11::cpp_function::dispatcher()  :0
16 0x000000000015adae PyObject_CallFunctionObjArgs()  ???:0
17 0x000000000015152b _PyObject_MakeTpCall()  ???:0
18 0x000000000016952b PyMethod_New()  ???:0
19 0x0000000000144c16 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
21 0x0000000000145ca9 _PyEval_EvalFrameDefault()  ???:0
22 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
23 0x0000000000149742 _PyEval_EvalFrameDefault()  ???:0
24 0x0000000000140096 _PyArg_ParseTuple_SizeT()  ???:0
25 0x0000000000235f66 PyEval_EvalCode()  ???:0
26 0x0000000000260e98 PyUnicode_Tailmatch()  ???:0
27 0x000000000025a79b PyInit__collections()  ???:0
28 0x0000000000260be5 PyUnicode_Tailmatch()  ???:0
29 0x00000000002600c8 _PyRun_SimpleFileObject()  ???:0
30 0x000000000025fd13 _PyRun_AnyFileObject()  ???:0
31 0x000000000025270e Py_RunMain()  ???:0
32 0x0000000000228dfd Py_BytesMain()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000228cf5 _start()  ???:0
=================================
==== backtrace (tid:   4678) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000050dc2 pncclRedOpDestroy()  ???:0
 2 0x0000000000051527 pncclRedOpDestroy()  ???:0
 3 0x0000000000052703 pncclRedOpDestroy()  ???:0
 4 0x000000000004c818 ncclRecv()  ???:0
 5 0x0000000000042990 ncclAllReduce()  ???:0
 6 0x00000000010db161 c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
 7 0x00000000010dbfc0 c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
 8 0x00000000010e8f1e c10d::ProcessGroupNCCL::barrier()  ???:0
 9 0x000000000515fa7f c10d::ops::(anonymous namespace)::barrierCUDA()  Ops.cpp:0
10 0x000000000516f02e c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long> >, false>::call()  :0
11 0x0000000004836ba0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
12 0x000000000517c242 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), void>::call()  :0
13 0x000000000517cf59 c10d::ProcessGroup::barrier()  :0
14 0x0000000000d3b602 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
15 0x000000000046c0f7 pybind11::cpp_function::dispatcher()  :0
16 0x000000000015adae PyObject_CallFunctionObjArgs()  ???:0
17 0x000000000015152b _PyObject_MakeTpCall()  ???:0
18 0x000000000016952b PyMethod_New()  ???:0
19 0x0000000000144c16 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
21 0x0000000000145ca9 _PyEval_EvalFrameDefault()  ???:0
22 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
23 0x0000000000149742 _PyEval_EvalFrameDefault()  ???:0
24 0x0000000000140096 _PyArg_ParseTuple_SizeT()  ???:0
25 0x0000000000235f66 PyEval_EvalCode()  ???:0
26 0x0000000000260e98 PyUnicode_Tailmatch()  ???:0
27 0x000000000025a79b PyInit__collections()  ???:0
28 0x0000000000260be5 PyUnicode_Tailmatch()  ???:0
29 0x00000000002600c8 _PyRun_SimpleFileObject()  ???:0
30 0x000000000025fd13 _PyRun_AnyFileObject()  ???:0
31 0x000000000025270e Py_RunMain()  ???:0
32 0x0000000000228dfd Py_BytesMain()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000228cf5 _start()  ???:0
=================================
==== backtrace (tid:   4676) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000050dc2 pncclRedOpDestroy()  ???:0
 2 0x0000000000051527 pncclRedOpDestroy()  ???:0
 3 0x0000000000052703 pncclRedOpDestroy()  ???:0
 4 0x000000000004c818 ncclRecv()  ???:0
 5 0x0000000000042990 ncclAllReduce()  ???:0
 6 0x00000000010db161 c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
 7 0x00000000010dbfc0 c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
 8 0x00000000010e8f1e c10d::ProcessGroupNCCL::barrier()  ???:0
 9 0x000000000515fa7f c10d::ops::(anonymous namespace)::barrierCUDA()  Ops.cpp:0
10 0x000000000516f02e c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long> >, false>::call()  :0
11 0x0000000004836ba0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
12 0x000000000517c242 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), void>::call()  :0
13 0x000000000517cf59 c10d::ProcessGroup::barrier()  :0
14 0x0000000000d3b602 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
15 0x000000000046c0f7 pybind11::cpp_function::dispatcher()  :0
16 0x000000000015adae PyObject_CallFunctionObjArgs()  ???:0
17 0x000000000015152b _PyObject_MakeTpCall()  ???:0
18 0x000000000016952b PyMethod_New()  ???:0
19 0x0000000000144c16 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
21 0x0000000000145ca9 _PyEval_EvalFrameDefault()  ???:0
22 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
23 0x0000000000149742 _PyEval_EvalFrameDefault()  ???:0
24 0x0000000000140096 _PyArg_ParseTuple_SizeT()  ???:0
25 0x0000000000235f66 PyEval_EvalCode()  ???:0
26 0x0000000000260e98 PyUnicode_Tailmatch()  ???:0
27 0x000000000025a79b PyInit__collections()  ???:0
28 0x0000000000260be5 PyUnicode_Tailmatch()  ???:0
29 0x00000000002600c8 _PyRun_SimpleFileObject()  ???:0
30 0x000000000025fd13 _PyRun_AnyFileObject()  ???:0
31 0x000000000025270e Py_RunMain()  ???:0
32 0x0000000000228dfd Py_BytesMain()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000228cf5 _start()  ???:0
=================================
==== backtrace (tid:   4680) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000050dc2 pncclRedOpDestroy()  ???:0
 2 0x0000000000051527 pncclRedOpDestroy()  ???:0
 3 0x0000000000052703 pncclRedOpDestroy()  ???:0
 4 0x000000000004c818 ncclRecv()  ???:0
 5 0x0000000000042990 ncclAllReduce()  ???:0
 6 0x00000000010db161 c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
 7 0x00000000010dbfc0 c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
 8 0x00000000010e8f1e c10d::ProcessGroupNCCL::barrier()  ???:0
 9 0x000000000515fa7f c10d::ops::(anonymous namespace)::barrierCUDA()  Ops.cpp:0
10 0x000000000516f02e c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long> >, false>::call()  :0
11 0x0000000004836ba0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
12 0x000000000517c242 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), void>::call()  :0
13 0x000000000517cf59 c10d::ProcessGroup::barrier()  :0
14 0x0000000000d3b602 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
15 0x000000000046c0f7 pybind11::cpp_function::dispatcher()  :0
16 0x000000000015adae PyObject_CallFunctionObjArgs()  ???:0
17 0x000000000015152b _PyObject_MakeTpCall()  ???:0
18 0x000000000016952b PyMethod_New()  ???:0
19 0x0000000000144c16 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
21 0x0000000000145ca9 _PyEval_EvalFrameDefault()  ???:0
22 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
23 0x0000000000149742 _PyEval_EvalFrameDefault()  ???:0
24 0x0000000000140096 _PyArg_ParseTuple_SizeT()  ???:0
25 0x0000000000235f66 PyEval_EvalCode()  ???:0
26 0x0000000000260e98 PyUnicode_Tailmatch()  ???:0
27 0x000000000025a79b PyInit__collections()  ???:0
28 0x0000000000260be5 PyUnicode_Tailmatch()  ???:0
29 0x00000000002600c8 _PyRun_SimpleFileObject()  ???:0
30 0x000000000025fd13 _PyRun_AnyFileObject()  ???:0
31 0x000000000025270e Py_RunMain()  ???:0
32 0x0000000000228dfd Py_BytesMain()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000228cf5 _start()  ???:0
=================================
==== backtrace (tid:   4682) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000050dc2 pncclRedOpDestroy()  ???:0
 2 0x0000000000051527 pncclRedOpDestroy()  ???:0
 3 0x0000000000052703 pncclRedOpDestroy()  ???:0
 4 0x000000000004c818 ncclRecv()  ???:0
 5 0x0000000000042990 ncclAllReduce()  ???:0
 6 0x00000000010db161 c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
 7 0x00000000010dbfc0 c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
 8 0x00000000010e8f1e c10d::ProcessGroupNCCL::barrier()  ???:0
 9 0x000000000515fa7f c10d::ops::(anonymous namespace)::barrierCUDA()  Ops.cpp:0
10 0x000000000516f02e c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long> >, false>::call()  :0
11 0x0000000004836ba0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
12 0x000000000517c242 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> > const&, long), void>::call()  :0
13 0x000000000517cf59 c10d::ProcessGroup::barrier()  :0
14 0x0000000000d3b602 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, c10d::BarrierOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, c10d::BarrierOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, c10d::BarrierOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
15 0x000000000046c0f7 pybind11::cpp_function::dispatcher()  :0
16 0x000000000015adae PyObject_CallFunctionObjArgs()  ???:0
17 0x000000000015152b _PyObject_MakeTpCall()  ???:0
18 0x000000000016952b PyMethod_New()  ???:0
19 0x0000000000144c16 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
21 0x0000000000145ca9 _PyEval_EvalFrameDefault()  ???:0
22 0x000000000015b6ac _PyFunction_Vectorcall()  ???:0
23 0x0000000000149742 _PyEval_EvalFrameDefault()  ???:0
24 0x0000000000140096 _PyArg_ParseTuple_SizeT()  ???:0
25 0x0000000000235f66 PyEval_EvalCode()  ???:0
26 0x0000000000260e98 PyUnicode_Tailmatch()  ???:0
27 0x000000000025a79b PyInit__collections()  ???:0
28 0x0000000000260be5 PyUnicode_Tailmatch()  ???:0
29 0x00000000002600c8 _PyRun_SimpleFileObject()  ???:0
30 0x000000000025fd13 _PyRun_AnyFileObject()  ???:0
31 0x000000000025270e Py_RunMain()  ???:0
32 0x0000000000228dfd Py_BytesMain()  ???:0
33 0x0000000000029d90 __libc_init_first()  ???:0
34 0x0000000000029e40 __libc_start_main()  ???:0
35 0x0000000000228cf5 _start()  ???:0
=================================
W1212 03:27:32.718000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4675 closing signal SIGTERM
W1212 03:27:32.718000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4676 closing signal SIGTERM
W1212 03:27:32.718000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4678 closing signal SIGTERM
W1212 03:27:32.718000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4679 closing signal SIGTERM
W1212 03:27:32.719000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4680 closing signal SIGTERM
W1212 03:27:32.719000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4681 closing signal SIGTERM
W1212 03:27:32.719000 4609 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 4682 closing signal SIGTERM
E1212 03:27:33.598000 4609 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -11) local_rank: 2 (pid: 4677) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+b465a5843b.nv24.9', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
bandwidth.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-12_03:27:32
  host      : job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 4677)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 4677
============================================================
nkflash (Author) commented Dec 12, 2024

Here is a more detailed log:

root@job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-1:/mnt/share/flag_diagnose/toolkits/results/2024-12-12-04-06-57-061937/2024-12-12-04-06-57-074941InterserverAllReduce/172.24.180.73/workspace# cat case_log_outerr.txt
W1212 04:06:59.300000 5597 torch/distributed/run.py:793]
W1212 04:06:59.300000 5597 torch/distributed/run.py:793] *****************************************
W1212 04:06:59.300000 5597 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1212 04:06:59.300000 5597 torch/distributed/run.py:793] *****************************************
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5663 [0] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5663 [0] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5663 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5668 [5] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5668 [5] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5668 [5] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5666 [3] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5666 [3] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5666 [3] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5667 [4] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5667 [4] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5667 [4] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5665 [2] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5665 [2] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5665 [2] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5664 [1] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5664 [1] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5664 [1] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5669 [6] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5669 [6] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5669 [6] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5670 [7] NCCL INFO cudaDriverVersion 12060
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5670 [7] NCCL INFO Bootstrap : Using eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5670 [7] NCCL INFO NCCL version 2.22.3+cuda12.6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO P2P plugin v8 IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [4]mlx5_4:1/RoCE [5]mlx5_5:1/RoCE [6]mlx5_6:1/RoCE [7]mlx5_7:1/RoCE [RO]; OOB eth0:172.24.180.73<0>
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO Using network IBext_v8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO ncclCommInitRank comm 0x5557507fd3b0 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO ncclCommInitRank comm 0x55b13df9eb50 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO ncclCommInitRank comm 0x55d188b11810 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO ncclCommInitRank comm 0x556e6000c570 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO ncclCommInitRank comm 0x5605c2332e10 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO ncclCommInitRank comm 0x56057dbd2a10 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO ncclCommInitRank comm 0x55bfd6a5c820 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO ncclCommInitRank comm 0x55c3c4ac3ea0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0x168a89b9f9b43a94 - Init START
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO MNNVL busId 0x5d000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO MNNVL busId 0x18000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO MNNVL busId 0x9a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO MNNVL busId 0xab000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO MNNVL busId 0x2a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO MNNVL busId 0x3a000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO MNNVL busId 0xdb000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO MNNVL busId 0xba000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,00000000,0000ffff,ffffffff
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS multicast support is available on dev 3
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,00000000,0000ffff,ffffffff
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS multicast support is available on dev 4
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS multicast support is available on dev 1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,00000000,0000ffff,ffffffff
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS multicast support is available on dev 5
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS multicast support is available on dev 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS multicast support is available on dev 7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS multicast support is available on dev 6
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS multicast support is available on dev 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO comm 0x5605c2332e10 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO comm 0x55bfd6a5c820 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO comm 0x55c3c4ac3ea0 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO comm 0x55b13df9eb50 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO comm 0x5557507fd3b0 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO comm 0x556e6000c570 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO comm 0x55d188b11810 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO comm 0x56057dbd2a10 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  0:  0  8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  1:  1  9
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  2:  2 10
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  3:  3 11
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  4:  4 12
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  5:  5 13
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  6:  6 14
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] 0/-1/-1->7->6 [2] 0/-1/-1->7->6 [3] 0/-1/-1->7->6 [4] 0/-1/-1->7->6 [5] 0/-1/-1->7->6 [6] 0/-1/-1->7->6 [7] 0/15/-1->7->-1 [8] -1/-1/-1->7->6 [9] 0/-1/-1->7->6 [10] 0/-1/-1->7->6 [11] 0/-1/-1->7->6 [12] 0/-1/-1->7->6 [13] 0/-1/-1->7->6 [14] 0/-1/-1->7->6 [15] 0/-1/-1->7->15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO NVLS Head  7:  7 15
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/14/-1->6->-1 [7] -1/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->14 [15] -1/-1/-1->6->5
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/13/-1->5->-1 [6] -1/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->13 [14] -1/-1/-1->5->4 [15] 6/-1/-1->5->4
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/11/-1->3->-1 [4] -1/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->11 [12] -1/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/9/-1->1->-1 [2] -1/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->9 [10] -1/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/10/-1->2->-1 [3] -1/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->10 [11] -1/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 00/16 :    0   7   6   5   4   3   2   1   9  10  11  12  13  14  15   8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/12/-1->4->-1 [5] -1/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->12 [13] -1/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 01/16 :    0   8  15  14  13  12  11  10   9   1   2   3   4   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 02/16 :    0   7   6   5   4   3  11  12  13  14  15   8   9  10   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 03/16 :    0   1   2  10   9   8  15  14  13  12  11   3   4   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 04/16 :    0   7   6   5  13  14  15   8   9  10  11  12   4   3   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 05/16 :    0   1   2   3   4  12  11  10   9   8  15  14  13   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 06/16 :    0   7  15   8   9  10  11  12  13  14   6   5   4   3   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 07/16 :    0   1   2   3   4   5   6  14  13  12  11  10   9   8  15   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 08/16 :    0   7   6   5   4   3   2   1   9  10  11  12  13  14  15   8
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 09/16 :    0   8  15  14  13  12  11  10   9   1   2   3   4   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 10/16 :    0   7   6   5   4   3  11  12  13  14  15   8   9  10   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 11/16 :    0   1   2  10   9   8  15  14  13  12  11   3   4   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 12/16 :    0   7   6   5  13  14  15   8   9  10  11  12   4   3   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 13/16 :    0   1   2   3   4  12  11  10   9   8  15  14  13   5   6   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 14/16 :    0   7  15   8   9  10  11  12  13  14   6   5   4   3   2   1
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Channel 15/16 :    0   1   2   3   4   5   6  14  13  12  11  10   9   8  15   7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] -1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->7 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7 [8] 1/-1/-1->0->8 [9] -1/-1/-1->0->7 [10] 1/-1/-1->0->7 [11] 1/-1/-1->0->7 [12] 1/-1/-1->0->7 [13] 1/-1/-1->0->7 [14] 1/-1/-1->0->7 [15] 1/-1/-1->0->7
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO P2P Chunksize set to 131072
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO ncclCommInitRank comm 0x5605c2332e10 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO ncclCommInitRank comm 0x55bfd6a5c820 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO ncclCommInitRank comm 0x55b13df9eb50 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO ncclCommInitRank comm 0x5557507fd3b0 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5782 [7] NCCL INFO Init timings: rank 7 nranks 16 total 5.16 (kernels 0.91, bootstrap 3.04, allgathers 0.02, topo 0.95, graphs 0.01, connections 0.21, rest 0.00)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO ncclCommInitRank comm 0x55d188b11810 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5735 [6] NCCL INFO Init timings: rank 6 nranks 16 total 5.74 (kernels 0.35, bootstrap 4.19, allgathers 0.02, topo 0.95, graphs 0.01, connections 0.21, rest 0.01)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO ncclCommInitRank comm 0x556e6000c570 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO ncclCommInitRank comm 0x55c3c4ac3ea0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO ncclCommInitRank comm 0x56057dbd2a10 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x168a89b9f9b43a94 - Init COMPLETE
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5728 [3] NCCL INFO Init timings: rank 3 nranks 16 total 5.82 (kernels 0.42, bootstrap 4.20, allgathers 0.01, topo 0.95, graphs 0.02, connections 0.21, rest 0.00)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5734 [1] NCCL INFO Init timings: rank 1 nranks 16 total 5.74 (kernels 0.35, bootstrap 4.19, allgathers 0.01, topo 0.95, graphs 0.02, connections 0.21, rest 0.00)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5712 [0] NCCL INFO Init timings: rank 0 nranks 16 total 6.90 (kernels 0.22, bootstrap 5.48, allgathers 0.01, topo 0.95, graphs 0.02, connections 0.21, rest 0.01)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5733 [2] NCCL INFO Init timings: rank 2 nranks 16 total 5.75 (kernels 0.35, bootstrap 4.20, allgathers 0.01, topo 0.95, graphs 0.02, connections 0.21, rest 0.00)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5722 [5] NCCL INFO Init timings: rank 5 nranks 16 total 6.24 (kernels 0.42, bootstrap 4.62, allgathers 0.01, topo 0.95, graphs 0.02, connections 0.21, rest 0.01)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5730 [4] NCCL INFO Init timings: rank 4 nranks 16 total 5.78 (kernels 0.40, bootstrap 4.18, allgathers 0.01, topo 0.95, graphs 0.03, connections 0.21, rest 0.00)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 02/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 03/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 04/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 05/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 06/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 07/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 09/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 10/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 11/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 12/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 13/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 14/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 15/0 : 0[0] -> 7[7] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 06/0 : 14[6] -> 6[6] [receive] via NET/IBext_v8/6/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 14/0 : 14[6] -> 6[6] [receive] via NET/IBext_v8/6/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 06/0 : 6[6] -> 14[6] [send] via NET/IBext_v8/6/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 14/0 : 6[6] -> 14[6] [send] via NET/IBext_v8/6/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 02/0 : 10[2] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 05/0 : 13[5] -> 5[5] [receive] via NET/IBext_v8/5/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 04/0 : 12[4] -> 4[4] [receive] via NET/IBext_v8/4/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 10/0 : 10[2] -> 2[2] [receive] via NET/IBext_v8/2/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 13/0 : 13[5] -> 5[5] [receive] via NET/IBext_v8/5/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 12/0 : 12[4] -> 4[4] [receive] via NET/IBext_v8/4/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 02/0 : 2[2] -> 10[2] [send] via NET/IBext_v8/2/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 05/0 : 5[5] -> 13[5] [send] via NET/IBext_v8/5/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 10/0 : 2[2] -> 10[2] [send] via NET/IBext_v8/2/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 04/0 : 4[4] -> 12[4] [send] via NET/IBext_v8/4/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 13/0 : 5[5] -> 13[5] [send] via NET/IBext_v8/5/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 12/0 : 4[4] -> 12[4] [send] via NET/IBext_v8/4/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 01/0 : 9[1] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 03/0 : 11[3] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 09/0 : 9[1] -> 1[1] [receive] via NET/IBext_v8/1/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 01/0 : 1[1] -> 9[1] [send] via NET/IBext_v8/1/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 11/0 : 11[3] -> 3[3] [receive] via NET/IBext_v8/3/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 09/0 : 1[1] -> 9[1] [send] via NET/IBext_v8/1/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 03/0 : 3[3] -> 11[3] [send] via NET/IBext_v8/3/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 11/0 : 3[3] -> 11[3] [send] via NET/IBext_v8/3/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [receive] via NET/IBext_v8/0/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IBext_v8/0/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5809 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [send] via NET/IBext_v8/0/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 07/0 : 15[7] -> 7[7] [receive] via NET/IBext_v8/7/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 15/0 : 15[7] -> 7[7] [receive] via NET/IBext_v8/7/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 07/0 : 7[7] -> 15[7] [send] via NET/IBext_v8/7/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 15/0 : 7[7] -> 15[7] [send] via NET/IBext_v8/7/GDRDMA
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5807 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5813 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5808 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/CUMEM
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5798 [1] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5798 [1] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5798 [1] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5801 [2] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5801 [2] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5801 [2] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5797 [3] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5797 [3] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5666:5797 [3] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5802 [4] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5802 [4] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5802 [4] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5794 [5] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5794 [5] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5668:5794 [5] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5805 [0] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5805 [0] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5663:5805 [0] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5791 [7] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5791 [7] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5791 [7] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5793 [6] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5793 [6] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5669:5793 [6] NCCL INFO transport/net.cc:700 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO transport.cc:166 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO transport/generic.cc:29 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO group.cc:147 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5812 [1] NCCL INFO group.cc:70 -> 2 [Async thread]

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5798 [1] proxy.cc:1487 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664:5798 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
[job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5664 :0:5664] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] misc/ipcsocket.cc:221 NCCL WARN UDS: Sending data over socket /tmp/nccl-socket-1-810f5f749ef64895 failed : Connection refused (111)
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO proxy.cc:1052 -> 2

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] proxy.cc:1063 NCCL WARN ncclProxyCallBlockingUDS call to tpRank 1(810f5f749ef64895) failed : 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO proxy.cc:1073 -> 2

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] proxy.cc:1081 NCCL WARN ncclProxyClientGetFd call to tpRank 1 handle 0x7ff18c05b890 failed : 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO transport/p2p.cc:250 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO transport/p2p.cc:331 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO transport/p2p.cc:461 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO transport.cc:166 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO transport/generic.cc:29 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO group.cc:147 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5811 [2] NCCL INFO group.cc:70 -> 2 [Async thread]

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5801 [2] proxy.cc:1487 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665:5801 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Connect from rank 2, retcode 3
[job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5665 :0:5665] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO transport.cc:166 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO transport/generic.cc:29 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO group.cc:147 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5814 [4] NCCL INFO group.cc:70 -> 2 [Async thread]

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5802 [4] proxy.cc:1487 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667:5802 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Connect from rank 4, retcode 3
[job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5667 :0:5667] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO transport.cc:166 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO transport/generic.cc:29 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO group.cc:147 -> 2
job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5810 [7] NCCL INFO group.cc:70 -> 2 [Async thread]

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5791 [7] proxy.cc:1487 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-0:5670:5791 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Connect from rank 7, retcode 3

@sjeaugey

It would look like NCCL is failing to open a file like:

/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/<GID index>

Can you run ls /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/?

If you upgrade to a newer NCCL version, NCCL should print an explicit NCCL WARN message explaining why NCCL failed. That print was missing in 2.22 and was fixed later. Maybe there were other fixes as well, so upgrading seems like a good idea anyway.

@nkflash

nkflash commented Dec 12, 2024

Image

Setting only NCCL_IB_ROCE_VERSION_NUM doesn't work; if I specify NCCL_IB_GID_INDEX=7, it works well.
I will try upgrading NCCL.

@gcongiu

gcongiu commented Dec 12, 2024

That is strange. The internal IB plugin is going through the GID table looking for the ROCE version:

static ncclResult_t ncclIbRoceGetVersionNum(const char* deviceName, int portNum, int gidIndex, int* version) {
  char gidRoceVerStr[16] = { 0 };
  char roceTypePath[PATH_MAX] = { 0 };
  // Build the sysfs path that exposes the RoCE version of this GID entry
  sprintf(roceTypePath, "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d", deviceName, portNum, gidIndex);

  int fd = open(roceTypePath, O_RDONLY);
  if (fd == -1) {
    // open() failed: this surfaces as the ncclSystemError seen in the logs
    return ncclSystemError;
  }
  int ret = read(fd, gidRoceVerStr, 15);
  close(fd);
...
}

It looks like the plugin can't find the roceTypePath containing the ROCE version for that port. What is your OS?
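
For context, below is a minimal self-contained sketch of what that scan over the GID table amounts to. The helper name roceVersionAt and the fixed index range are illustrative assumptions, not the actual plugin code:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative stand-in for ncclIbRoceGetVersionNum above: returns 2 for
   "RoCE v2", 1 for v1, or -1 if the sysfs file cannot be opened or read. */
static int roceVersionAt(const char* dev, int port, int idx) {
  char path[PATH_MAX], buf[16] = { 0 };
  snprintf(path, sizeof(path),
           "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d", dev, port, idx);
  int fd = open(path, O_RDONLY);
  if (fd == -1) return -1;
  ssize_t n = read(fd, buf, sizeof(buf) - 1);
  close(fd);
  if (n <= 0) return -1;
  return strstr(buf, "v2") != NULL ? 2 : 1;
}

int main(void) {
  /* Scan the first 8 GID entries of mlx5_0 port 1, roughly what the plugin
     does when NCCL_IB_ROCE_VERSION_NUM=2 makes it search for a v2 entry. */
  for (int idx = 0; idx < 8; idx++)
    printf("gid index %d -> version %d\n", idx, roceVersionAt("mlx5_0", 1, idx));
  return 0;
}

If every index prints -1 when this runs inside your container, the plugin's scan would fail the same way.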

@gcongiu

gcongiu commented Dec 12, 2024

Could you also attach the output of show_gids?

@nkflash

nkflash commented Dec 12, 2024

Image

@nkflash

nkflash commented Dec 12, 2024

That is strange. The internal IB plugin is going through the GID table looking for the ROCE version: […] It looks like the plugin can't find the roceTypePath containing the ROCE version for that port. What is your OS?

Image

@gcongiu

gcongiu commented Dec 12, 2024

Nothing strange there. Can you also cat the content of /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/7 and attach it here?

@nkflash

nkflash commented Dec 12, 2024

root@job-3bb823c0-6d27-4312-8094-1c5910bf9b51-worker-1:/workspace# cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/7
RoCE v2

@gcongiu

gcongiu commented Dec 12, 2024

That looks normal too. Can you set NCCL_DEBUG_SUBSYS=NET and rerun, without setting NCCL_IB_GID_INDEX, and attach the logs?

@nkflash

nkflash commented Dec 12, 2024

With NCCL_IB_ROCE_VERSION_NUM=2 still set?

@nkflash

nkflash commented Dec 13, 2024

case_log_outerr.txt

@nkflash

nkflash commented Dec 13, 2024

env like:
export CUDA_DEVICE_MAX_CONNECTIONS=1;
export NCCL_IB_DISABLE=0;
export NCCL_IB_CUDA_SUPPORT=1;
#export NCCL_IB_GID_INDEX=7;
export NCCL_DEBUG=INFO;
unset NCCL_IB_GID_INDEX;
export NCCL_IB_ROCE_VERSION_NUM=2;
export NCCL_DEBUG_SUBSYS=NET;
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7;

@gcongiu

gcongiu commented Dec 13, 2024

Unfortunately, not much info in the logs either. Please try NCCL v2.23 (as Sylvain suggested) since it adds WARN messages when opening and reading the RoCE version file in /sys. At this point it is not clear whether the file can't be opened correctly or it is opened correctly but can't be read.
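
In the meantime, a tiny standalone probe can already distinguish the two cases from inside the container. This is just a sketch; the device name mlx5_0 and GID index 7 are assumptions, so adjust them to your setup:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  /* Same sysfs file NCCL reads for the RoCE version. */
  const char* path = "/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/7";
  int fd = open(path, O_RDONLY);
  if (fd == -1) {                       /* case 1: the file can't be opened */
    fprintf(stderr, "open failed: %s\n", strerror(errno));
    return 1;
  }
  char buf[16] = { 0 };
  ssize_t n = read(fd, buf, sizeof(buf) - 1);
  if (n <= 0) {                         /* case 2: opened, but can't be read */
    fprintf(stderr, "read failed: %s\n", n < 0 ? strerror(errno) : "empty file");
    close(fd);
    return 1;
  }
  close(fd);
  printf("content: %s\n", buf);         /* expect "RoCE v2" here */
  return 0;
}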

@nkflash

nkflash commented Dec 16, 2024

NCCL hangs when I try 2.23.4 with NCCL_IB_ROCE_VERSION_NUM=2. NCCL_IB_GID_INDEX=7 still works well.

show_gids

root@job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-master-0:/share/project# tools/show_gids
DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
mlx5_0	1	4	fe80:0000:0000:0000:c025:a1ff:fe31:2e81			v1	net1
mlx5_0	1	5	fe80:0000:0000:0000:c025:a1ff:fe31:2e81			v2	net1
mlx5_0	1	6	0000:0000:0000:0000:0000:ffff:0a9c:8184	10.156.129.132  	v1	net1
mlx5_0	1	7	0000:0000:0000:0000:0000:ffff:0a9c:8184	10.156.129.132  	v2	net1
mlx5_1	1	4	fe80:0000:0000:0000:58f6:6bff:fe9a:8027			v1	net2
mlx5_1	1	5	fe80:0000:0000:0000:58f6:6bff:fe9a:8027			v2	net2
mlx5_1	1	6	0000:0000:0000:0000:0000:ffff:0a9c:8584	10.156.133.132  	v1	net2
mlx5_1	1	7	0000:0000:0000:0000:0000:ffff:0a9c:8584	10.156.133.132  	v2	net2
mlx5_2	1	4	fe80:0000:0000:0000:c417:79ff:fe88:9ece			v1	net3
mlx5_2	1	5	fe80:0000:0000:0000:c417:79ff:fe88:9ece			v2	net3
mlx5_2	1	6	0000:0000:0000:0000:0000:ffff:0a9c:8984	10.156.137.132  	v1	net3
mlx5_2	1	7	0000:0000:0000:0000:0000:ffff:0a9c:8984	10.156.137.132  	v2	net3
mlx5_3	1	4	fe80:0000:0000:0000:98ff:16ff:fe64:3a6d			v1	net4
mlx5_3	1	5	fe80:0000:0000:0000:98ff:16ff:fe64:3a6d			v2	net4
mlx5_3	1	6	0000:0000:0000:0000:0000:ffff:0a9c:8d84	10.156.141.132  	v1	net4
mlx5_3	1	7	0000:0000:0000:0000:0000:ffff:0a9c:8d84	10.156.141.132  	v2	net4
mlx5_4	1	4	fe80:0000:0000:0000:6cb2:1aff:fe5c:e5a0			v1	net5
mlx5_4	1	5	fe80:0000:0000:0000:6cb2:1aff:fe5c:e5a0			v2	net5
mlx5_4	1	6	0000:0000:0000:0000:0000:ffff:0a9c:9184	10.156.145.132  	v1	net5
mlx5_4	1	7	0000:0000:0000:0000:0000:ffff:0a9c:9184	10.156.145.132  	v2	net5
mlx5_5	1	4	fe80:0000:0000:0000:cca6:4bff:fe43:5ba0			v1	net6
mlx5_5	1	5	fe80:0000:0000:0000:cca6:4bff:fe43:5ba0			v2	net6
mlx5_5	1	6	0000:0000:0000:0000:0000:ffff:0a9c:9584	10.156.149.132  	v1	net6
mlx5_5	1	7	0000:0000:0000:0000:0000:ffff:0a9c:9584	10.156.149.132  	v2	net6
mlx5_6	1	4	fe80:0000:0000:0000:8098:7eff:fe1a:b06b			v1	net7
mlx5_6	1	5	fe80:0000:0000:0000:8098:7eff:fe1a:b06b			v2	net7
mlx5_6	1	6	0000:0000:0000:0000:0000:ffff:0a9c:9984	10.156.153.132  	v1	net7
mlx5_6	1	7	0000:0000:0000:0000:0000:ffff:0a9c:9984	10.156.153.132  	v2	net7
mlx5_7	1	4	fe80:0000:0000:0000:d0f6:81ff:feb1:d28b			v1	net8
mlx5_7	1	5	fe80:0000:0000:0000:d0f6:81ff:feb1:d28b			v2	net8
mlx5_7	1	6	0000:0000:0000:0000:0000:ffff:0a9c:9d84	10.156.157.132  	v1	net8
mlx5_7	1	7	0000:0000:0000:0000:0000:ffff:0a9c:9d84	10.156.157.132  	v2	net8

env like:
export CUDA_DEVICE_MAX_CONNECTIONS=1;
export NCCL_IB_DISABLE=0;
export NCCL_IB_CUDA_SUPPORT=1;
export NCCL_DEBUG=INFO;
unset NCCL_IB_GID_INDEX;
export NCCL_IB_ROCE_VERSION_NUM=2;
export NCCL_DEBUG_SUBSYS=NET;
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7;

case_log_outerr.txt

@GeofferyGeng

GeofferyGeng commented Dec 16, 2024

Try running with NCCL_NET_PLUGIN=none. It seems you are running in a container; maybe the net plugin is incompatible.

With old plugin versions, NCCL_IB_GID_INDEX is necessary.

job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4179:4229 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4179:4229 [7] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4174:4232 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4174:4232 [2] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4175:4233 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4175:4233 [3] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4176:4236 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4176:4236 [4] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4173:4235 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4173:4235 [1] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4177:4234 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4177:4234 [5] NCCL INFO P2P plugin v8 IBext_v8

@nkflash
Author

nkflash commented Dec 16, 2024

Try running with NCCL_NET_PLUGIN=none. It seems you are running in a container; maybe the net plugin is incompatible.

env like:

export CUDA_DEVICE_MAX_CONNECTIONS=1;
export NCCL_IB_DISABLE=0;
export NCCL_IB_CUDA_SUPPORT=1;
export NCCL_DEBUG=INFO;
export NCCL_NET_PLUGIN=none;
unset NCCL_IB_GID_INDEX;
export NCCL_IB_ROCE_VERSION_NUM=2;
export NCCL_DEBUG_SUBSYS=NET;
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7;
job-4c1cc3a1-c58b-4ead-ae66-702484b6cbc4-master-0:801:935 [6] transport/net_ib.cc:328 NCCL WARN NET/IB: read failed in ncclIbRoceGetVersionNum: Invalid argument

job-4c1cc3a1-c58b-4ead-ae66-702484b6cbc4-master-0:797:797 [2] NCCL INFO collectives.cc:114 -> 2
[rank2]: Traceback (most recent call last):
[rank2]:   File "/share/project/toolkits/cases/InterserverAllReduce/nvidia/bandwidth.py", line 13, in <module>
[rank2]:     dist.barrier()
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 4190, in barrier
[rank2]:     work = group.barrier(opts=opts)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2828, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.23.4
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

case_log_outerr.txt

@nkflash
Author

nkflash commented Dec 16, 2024

Try running with NCCL_NET_PLUGIN=none. It seems you are running in a container; maybe the net plugin is incompatible.

With old plugin versions, NCCL_IB_GID_INDEX is necessary.

job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4179:4229 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4179:4229 [7] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4174:4232 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4174:4232 [2] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4175:4233 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4175:4233 [3] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4176:4236 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4176:4236 [4] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4173:4235 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4173:4235 [1] NCCL INFO P2P plugin v8 IBext_v8
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4177:4234 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
job-d01a8e06-2079-4ec3-a2fb-337517d5f9c7-worker-0:4177:4234 [5] NCCL INFO P2P plugin v8 IBext_v8

I think I caught this error. @GeofferyGeng @gcongiu

[screenshots of the relevant NCCL GID-handling source]

The top loop returns as soon as any GID hits such an error. The system has many GID indices (some of them invalid), so the top loop returns failure, which causes the connect to fail!

The GID indices, for example:
[screenshot: GID index listing]
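
To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not the actual NCCL source; the device name mlx5_0 and the 8-index scan are assumptions) of the difference between aborting on the first unreadable index and skipping it:

/* Reserved/unpopulated GID indices can have a gid_attrs/types/<n> file that
 * exists but fails to read with EINVAL. If the scan treats that as fatal,
 * one bad index kills the whole GID search. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 and sets *version on success, -1 on open/read failure. */
static int roce_version(const char *dev, int port, int idx, int *version) {
  char path[128], buf[16] = {0};
  snprintf(path, sizeof(path),
           "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d", dev, port, idx);
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* may fail with EINVAL */
  int saved = errno;
  close(fd);
  if (n < 0) { errno = saved; return -1; }
  *version = strstr(buf, "v2") ? 2 : 1;
  return 0;
}

int main(void) {
  int version;
  for (int idx = 0; idx < 8; idx++) {
    if (roce_version("mlx5_0", 1, idx, &version) != 0) {
      /* Fatal variant (the reported behavior): returning an error here means
       * the scan never reaches the valid v2 GIDs at indices 6/7. Skipping
       * keeps the scan going. The real code also filters via validGid(). */
      fprintf(stderr, "index %d unreadable (%s), skipping\n", idx, strerror(errno));
      continue;
    }
    if (version == 2) printf("usable RoCE v2 candidate at index %d\n", idx);
  }
  return 0;
}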

@GeofferyGeng
GeofferyGeng commented Dec 16, 2024

Yes, especially since your GIDs start from index 4, some unexpected error occurred.

@nkflash
Author

nkflash commented Dec 16, 2024

Yes, especially since your GIDs start from index 4, some unexpected error occurred.

I think this is a normal case, since RoCE reserves some indices.

@gcongiu

gcongiu commented Dec 16, 2024

The top loop returns as soon as any GID hits such an error. The system has many GID indices (some of them invalid), so the top loop returns failure, which causes the connect to fail!

The call to validGid() in the if above should account for invalid GIDs. Invalid GIDs are normally not shown by show_gids and are ignored by NCCL:

static bool configuredGid(union ibv_gid* gid) {
  const struct in6_addr *a = (struct in6_addr *)gid->raw;
  int trailer = (a->s6_addr32[1] | a->s6_addr32[2] | a->s6_addr32[3]);
  // Unconfigured: all-zero GID, or the bare fe80:: link-local prefix with a zero suffix.
  if (((a->s6_addr32[0] | trailer) == 0UL) || ((a->s6_addr32[0] == htonl(0xfe800000)) && (trailer == 0UL))) {
    return false;
  }
  return true;
}

static bool linkLocalGid(union ibv_gid* gid) {
  const struct in6_addr *a = (struct in6_addr *)gid->raw;
  // Link-local: fe80::/64 prefix.
  if (a->s6_addr32[0] == htonl(0xfe800000) && a->s6_addr32[1] == 0UL) {
    return true;
  }
  return false;
}

// A GID is usable only if it is configured and not link-local.
static bool validGid(union ibv_gid* gid) {
  return (configuredGid(gid) && !linkLocalGid(gid));
}

Thus, if the GID is configured, NCCL should never return with an error.

Instead, both the external and internal plugins are trying (and failing) to read the RoCE version from the /sys filesystem for a correctly configured GID. The WARN log does not say which file failed to read, but it should be /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/6. Adding the path to the WARN log could help confirm:

WARN("NET/IB: read of %s failed in ncclIbRoceGetVersionNum: %s", roceTypePath, strerror(errno));

I shall add the above to future releases as well.

EDIT: link-local GIDs are not considered valid by NCCL but are shown by show_gids.
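
To see the filtering in action, here is a small standalone harness (my own sketch; the ibv_gid stand-in replaces <infiniband/verbs.h> only for this test) that runs the validGid() logic above on two GIDs from the show_gids output earlier in this thread. The link-local entry at index 4 is rejected; the IPv4-mapped entry at index 7 passes:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

union ibv_gid { uint8_t raw[16]; };  /* minimal stand-in for the verbs type */

static bool configuredGid(union ibv_gid* gid) {
  const struct in6_addr *a = (struct in6_addr *)gid->raw;
  int trailer = (a->s6_addr32[1] | a->s6_addr32[2] | a->s6_addr32[3]);
  if (((a->s6_addr32[0] | trailer) == 0UL) || ((a->s6_addr32[0] == htonl(0xfe800000)) && (trailer == 0UL)))
    return false;
  return true;
}

static bool linkLocalGid(union ibv_gid* gid) {
  const struct in6_addr *a = (struct in6_addr *)gid->raw;
  return a->s6_addr32[0] == htonl(0xfe800000) && a->s6_addr32[1] == 0UL;
}

static bool validGid(union ibv_gid* gid) {
  return configuredGid(gid) && !linkLocalGid(gid);
}

int main(void) {
  union ibv_gid g;
  inet_pton(AF_INET6, "fe80::c025:a1ff:fe31:2e81", g.raw);  /* index 4 */
  printf("link-local GID:  %s\n", validGid(&g) ? "valid" : "rejected");
  inet_pton(AF_INET6, "::ffff:10.156.129.132", g.raw);      /* index 7 */
  printf("IPv4-mapped GID: %s\n", validGid(&g) ? "valid" : "rejected");
  return 0;
}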

@GeofferyGeng

@gcongiu Can we use the newer APIs (ibv_query_gid_ex/ibv_query_gid_table) to handle GIDs? Using them would be more convenient.

I understand that using the basic API improves compatibility, but in scenarios where the latest NCCL is used, the rdma-core/MLNX_OFED versions in use should also include these APIs.

If so, I can provide a patch for you to check out.
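
For reference, a sketch of what that could look like (assuming an rdma-core recent enough to provide ibv_query_gid_table() and struct ibv_gid_entry; error handling trimmed). Because the kernel returns only populated table entries, the reserved indices that trip up the /sys reads never appear:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Pick the first RoCE v2 GID on the given port straight from the GID table.
 * A real implementation would still apply NCCL's preferences on top
 * (address family, skipping link-local entries, matching the netdev). */
static int find_roce_v2_gid(struct ibv_context *ctx, uint32_t port, uint32_t *gid_index) {
  struct ibv_gid_entry entries[64];
  ssize_t n = ibv_query_gid_table(ctx, entries, 64, 0);
  if (n < 0) return -1;
  for (ssize_t i = 0; i < n; i++) {
    if (entries[i].port_num == port && entries[i].gid_type == IBV_GID_TYPE_ROCE_V2) {
      *gid_index = entries[i].gid_index;
      return 0;
    }
  }
  return -1;  /* no RoCE v2 GID on this port */
}

int main(void) {
  int num = 0;
  struct ibv_device **devs = ibv_get_device_list(&num);
  if (!devs || num == 0) return 1;
  struct ibv_context *ctx = ibv_open_device(devs[0]);
  uint32_t idx;
  if (ctx && find_roce_v2_gid(ctx, 1, &idx) == 0)
    printf("RoCE v2 GID index on %s port 1: %u\n", ibv_get_device_name(devs[0]), idx);
  if (ctx) ibv_close_device(ctx);
  ibv_free_device_list(devs);
  return 0;
}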
