NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287, invalid usage, NCCL version 2.17.1 ncclInvalidUsage: This usually reflects invalid usage of NCCL library. Last error: Duplicate GPU detected : rank 2 and rank 3 both on CUDA device e2000 #1241

mfdj2002 commented Mar 30, 2024

I got these errors while using Megatron-LM to pretrain a GPT model. Strangely, my setup worked fine on this cluster before, but this time I was assigned different nodes and some of them don't work. Below is a toy run with one node (the master node) that works and another that doesn't.

I started Docker with docker run --gpus all --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ... (other environment variables, etc.) nvcr.io/nvidia/pytorch:23.04-py3
NCCL version: 2.17.1
I ran the following with: TORCH_CPP_LOG_LEVEL=INFO, TORCH_DISTRIBUTED_DEBUG=INFO, TORCH_SHOW_CPP_STACKTRACES=1, NCCL_DEBUG=INFO

I'm wondering whether this is because I didn't meet some hardware requirement, since it only fails on some of the nodes.
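
In case it helps narrow this down, here is a minimal per-rank check I could run on each node (a diagnostic sketch of my own, not part of Megatron-LM; the file name gpu_check.py is made up, and it assumes the usual torchrun environment variables LOCAL_RANK and WORLD_SIZE are set):

# gpu_check.py -- hypothetical diagnostic; launch it with the same torchrun
# arguments as the real job, e.g. torchrun --nproc_per_node=2 gpu_check.py
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
visible = torch.cuda.device_count()

# Every process on a node should see all of that node's GPUs; if device_count()
# is smaller than the number of local ranks, two ranks end up bound to the same
# device and NCCL fails with "Duplicate GPU detected".
print(f"local_rank={local_rank} world_size={world_size} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r} "
      f"torch.cuda.device_count()={visible}")

if visible == 0:
    raise SystemExit("no CUDA devices visible inside the container")

# Mimic the usual local-rank -> cuda:<local_rank> binding so the mapping each
# rank ends up with is printed explicitly.
device = local_rank % visible
torch.cuda.set_device(device)
props = torch.cuda.get_device_properties(device)
print(f"local_rank={local_rank} -> cuda:{device} "
      f"({props.name}, {props.total_memory // (1024 ** 2)} MiB)")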

Problematic node:

[I debug.cpp:49] [c10d] The debug level is set to INFO.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:38812.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:38816.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:41288.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:41292.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:41304.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu014.clemson.cloudlab.us]:41312.
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 600000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 600000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 3] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:672] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:2470] Rank 3 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I ProcessGroupNCCL.cpp:2470] Rank 2 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Traceback (most recent call last):
  File "pretrain_gpt.py", line 207, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 177, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 156, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3369, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287, invalid usage, NCCL version 2.17.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 3 and rank 2 both on CUDA device e2000
Exception raised from getNCCLComm at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7efc9fe37efc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc83fa9 (0x7efca0b6efa9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xee41f2 (0x7efca0dcf1f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x35 (0x7efca0dd0675 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3eb (0x7efca0dd31eb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x890 (0x7efca0de2a40 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x52a118d (0x7efcdd2d418d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x52a7b4f (0x7efcdd2dab4f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x52ba8a1 (0x7efcdd2ed8a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xc06676 (0x7efce3ffa676 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3e48ad (0x7efce37d88ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: PyCFunction_Call + 0x59 (0x5f6489 in /usr/bin/python)
frame #12: _PyObject_MakeTpCall + 0x296 (0x5f7056 in /usr/bin/python)
frame #13: /usr/bin/python() [0x50b993]
frame #14: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #15: /usr/bin/python() [0x6b40bc]
frame #16: _PyEval_EvalFrameDefault + 0x57f2 (0x570ac2 in /usr/bin/python)
frame #17: _PyFunction_Vectorcall + 0x1b6 (0x5f6836 in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x72d (0x56b9fd in /usr/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #26: PyEval_EvalCode + 0x27 (0x68e7b7 in /usr/bin/python)
frame #27: /usr/bin/python() [0x680001]
frame #28: /usr/bin/python() [0x68007f]
frame #29: /usr/bin/python() [0x680121]
frame #30: PyRun_SimpleFileExFlags + 0x197 (0x680db7 in /usr/bin/python)
frame #31: Py_RunMain + 0x212 (0x6b8122 in /usr/bin/python)
frame #32: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #33: __libc_start_main + 0xf3 (0x7efd2c3f8083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
  File "pretrain_gpt.py", line 207, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 177, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 156, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3369, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287, invalid usage, NCCL version 2.17.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 3 both on CUDA device e2000
Exception raised from getNCCLComm at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f8b59687efc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc83fa9 (0x7f8b5a3befa9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xee41f2 (0x7f8b5a61f1f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x35 (0x7f8b5a620675 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3eb (0x7f8b5a6231eb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x890 (0x7f8b5a632a40 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x52a118d (0x7f8b96b2418d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x52a7b4f (0x7f8b96b2ab4f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x52ba8a1 (0x7f8b96b3d8a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xc06676 (0x7f8b9d84a676 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3e48ad (0x7f8b9d0288ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: PyCFunction_Call + 0x59 (0x5f6489 in /usr/bin/python)
frame #12: _PyObject_MakeTpCall + 0x296 (0x5f7056 in /usr/bin/python)
frame #13: /usr/bin/python() [0x50b993]
frame #14: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #15: /usr/bin/python() [0x6b40bc]
frame #16: _PyEval_EvalFrameDefault + 0x57f2 (0x570ac2 in /usr/bin/python)
frame #17: _PyFunction_Vectorcall + 0x1b6 (0x5f6836 in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x72d (0x56b9fd in /usr/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #26: PyEval_EvalCode + 0x27 (0x68e7b7 in /usr/bin/python)
frame #27: /usr/bin/python() [0x680001]
frame #28: /usr/bin/python() [0x68007f]
frame #29: /usr/bin/python() [0x680121]
frame #30: PyRun_SimpleFileExFlags + 0x197 (0x680db7 in /usr/bin/python)
frame #31: Py_RunMain + 0x212 (0x6b8122 in /usr/bin/python)
frame #32: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #33: __libc_start_main + 0xf3 (0x7f8be5c48083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x2e (0x5fb39e in /usr/bin/python)

[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 3] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 2] NCCL watchdog thread terminated normally
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 90) of binary: /usr/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 0.06961894035339355 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/agent/server/api.py", line 920, in _exit_barrier
    store_util.barrier(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Connection reset by peer
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_20:58:25
  host      : clgpu014.clemson.cloudlab.us
  rank      : 3 (local_rank: 1)
  exitcode  : 1 (pid: 91)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_20:58:25
  host      : clgpu014.clemson.cloudlab.us
  rank      : 2 (local_rank: 0)
  exitcode  : 1 (pid: 90)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Master node:

[I debug.cpp:49] [c10d] The debug level is set to INFO.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:6000.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:6000.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:48326.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:48326.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:38812.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:38816.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:48340.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:48340.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
Zarr-based strategies will not be registered because of missing packages
using world size: 4, data-parallel size: 4, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 4
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float16 for parameters ...
------------------------ arguments ------------------------
(printing arguments omitted)  
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:55190.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:55190.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:55194.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:55194.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:55202.
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 600000
USE_HIGH_PRIORITY_STREAM: 0
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:55202.
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (node0.mds.hetkv-pg0.clemson.cloudlab.us, 6000).
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu021.clemson.cloudlab.us]:55204.
[I socket.cpp:787] [c10d] The client socket has connected to [clgpu021.clemson.cloudlab.us]:6000 on [clgpu021.clemson.cloudlab.us]:55204.
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 600000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:41288.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:41292.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:41304.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu014.clemson.cloudlab.us]:41312.
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu010.clemson.cloudlab.us]:44508.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu010.clemson.cloudlab.us]:44514.
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:672] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:850] [Rank 1] NCCL watchdog thread started!
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
[I ProcessGroupNCCL.cpp:2470] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
make: Entering directory '/workspace/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.049 seconds
> compiling and loading fused kernels ...
[I ProcessGroupNCCL.cpp:2470] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu010.clemson.cloudlab.us]:44516.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:6000 has accepted a connection from [clgpu010.clemson.cloudlab.us]:44528.

(Sometimes the master node just hangs from here on. Other times, we get this:)

Traceback (most recent call last):
  File "pretrain_gpt.py", line 207, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 177, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 154, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3369, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287, remote process exited or there was a network error, NCCL version 2.17.1
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer clgpu014.clemson.cloudlab.us<60616>
Exception raised from getNCCLComm at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fb7fa0e7efc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc83fa9 (0x7fb7fae1efa9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xee41f2 (0x7fb7fb07f1f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x35 (0x7fb7fb080675 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3eb (0x7fb7fb0831eb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x890 (0x7fb7fb092a40 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x52a118d (0x7fb83758418d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x52a7b4f (0x7fb83758ab4f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x52ba8a1 (0x7fb83759d8a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xc06676 (0x7fb83e2aa676 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3e48ad (0x7fb83da888ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: PyCFunction_Call + 0x59 (0x5f6489 in /usr/bin/python)
frame #12: _PyObject_MakeTpCall + 0x296 (0x5f7056 in /usr/bin/python)
frame #13: /usr/bin/python() [0x50b993]
frame #14: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #15: /usr/bin/python() [0x6b40bc]
frame #16: _PyEval_EvalFrameDefault + 0x57f2 (0x570ac2 in /usr/bin/python)
frame #17: _PyFunction_Vectorcall + 0x1b6 (0x5f6836 in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x72d (0x56b9fd in /usr/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #26: PyEval_EvalCode + 0x27 (0x68e7b7 in /usr/bin/python)
frame #27: /usr/bin/python() [0x680001]
frame #28: /usr/bin/python() [0x68007f]
frame #29: /usr/bin/python() [0x680121]
frame #30: PyRun_SimpleFileExFlags + 0x197 (0x680db7 in /usr/bin/python)
frame #31: Py_RunMain + 0x212 (0x6b8122 in /usr/bin/python)
frame #32: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #33: __libc_start_main + 0xf3 (0x7fb8866a8083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x2e (0x5fb39e in /usr/bin/python)

[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
Traceback (most recent call last):
  File "pretrain_gpt.py", line 207, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 177, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 156, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3369, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287, remote process exited or there was a network error, NCCL version 2.17.1
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
Last error:
socketProgress: Connection closed by remote peer clgpu021.clemson.cloudlab.us<58718>
Exception raised from getNCCLComm at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1287 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f23e0d88efc in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xc83fa9 (0x7f23e1abffa9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xee41f2 (0x7f23e1d201f2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x35 (0x7f23e1d21675 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3eb (0x7f23e1d241eb in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x890 (0x7f23e1d33a40 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x52a118d (0x7f241e22518d in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x52a7b4f (0x7f241e22bb4f in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x52ba8a1 (0x7f241e23e8a1 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0xc06676 (0x7f2424f4b676 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x3e48ad (0x7f24247298ad in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: PyCFunction_Call + 0x59 (0x5f6489 in /usr/bin/python)
frame #12: _PyObject_MakeTpCall + 0x296 (0x5f7056 in /usr/bin/python)
frame #13: /usr/bin/python() [0x50b993]
frame #14: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #15: /usr/bin/python() [0x6b40bc]
frame #16: _PyEval_EvalFrameDefault + 0x57f2 (0x570ac2 in /usr/bin/python)
frame #17: _PyFunction_Vectorcall + 0x1b6 (0x5f6836 in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x72d (0x56b9fd in /usr/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #20: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #23: _PyFunction_Vectorcall + 0x393 (0x5f6a13 in /usr/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x1901 (0x56cbd1 in /usr/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x26a (0x569cea in /usr/bin/python)
frame #26: PyEval_EvalCode + 0x27 (0x68e7b7 in /usr/bin/python)
frame #27: /usr/bin/python() [0x680001]
frame #28: /usr/bin/python() [0x68007f]
frame #29: /usr/bin/python() [0x680121]
frame #30: PyRun_SimpleFileExFlags + 0x197 (0x680db7 in /usr/bin/python)
frame #31: Py_RunMain + 0x212 (0x6b8122 in /usr/bin/python)
frame #32: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #33: __libc_start_main + 0xf3 (0x7f246d349083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: _start + 0x2e (0x5fb39e in /usr/bin/python)

[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated normally
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-30_22:00:04
  host      : clgpu021.clemson.cloudlab.us
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 92)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_22:00:04
  host      : clgpu021.clemson.cloudlab.us
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 91)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

(The ranks don't quite match across the two logs because they come from different runs, but hopefully this illustrates the point.)

sjeaugey commented Apr 2, 2024

I think the message is quite explicit. It seems you're trying to launch multiple NCCL ranks on the same CUDA device, which NCCL doesn't support. Are you trying to launch more ranks per node than there are GPUs on the system? Maybe you should run nvidia-smi on the nodes you were assigned to see how many GPUs they have.
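
For anyone hitting the same error, a quick way to cross-check this (a minimal sketch, not an official tool; it assumes nvidia-smi is available inside the container, and the file name gpu_count_check.py is made up) is to compare what the driver reports with what PyTorch sees, then make sure the number of ranks launched per node does not exceed that count:

# gpu_count_check.py -- hypothetical helper, run once per node inside the container
import subprocess
import torch

# GPUs as reported by the driver: nvidia-smi -L prints one
# "GPU <n>: <name> (UUID: ...)" line per device.
smi = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
driver_gpus = [line for line in smi.stdout.splitlines() if line.startswith("GPU")]
print(f"nvidia-smi reports {len(driver_gpus)} GPU(s):")
for line in driver_gpus:
    print(" ", line)

# GPUs as seen by PyTorch inside the container; torchrun's --nproc_per_node
# must not exceed this number, otherwise two ranks share a device and NCCL
# raises "Duplicate GPU detected".
print(f"torch.cuda.device_count() = {torch.cuda.device_count()}")

If the two counts disagree, the container's --gpus flag or CUDA_VISIBLE_DEVICES is the likely culprit; if they agree but are smaller than the number of processes launched per node, the torchrun/Megatron launch arguments need to be reduced to match.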
