I ran into a rather quirky issue. I use 2 p4d.24xlarge instances (8x A100 each) on AWS to train my model. The bash script first downloads the data, and only when the download finishes does the training process start, by running:

torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
  --epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000 \
  --binary_path $binary_path --training_days 14 --valid_hour 23/00 \
  --test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR
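For reference, the surrounding wrapper looks roughly like this (a sketch only; the download command and data paths are placeholders, not the actual script):

#!/bin/bash
set -euo pipefail

# Step 1: download the training data. For the full dataset this takes well over 20 minutes.
aws s3 sync "$DATA_S3_URI" "$DATA_DIR"    # placeholder for the real download step

# Step 2: only after the download completes, launch distributed training on both nodes.
torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
  ... # remaining arguments as shown above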
When the data download takes more than 20 minutes, training fails with the following error:
2024-11-15T06:01:13
Traceback (most recent call last):
  File "/opt/ml/code/dlrm_main.py", line 954, in <module>
    invoke_main()
  File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
    main(sys.argv[1:])
  File "/opt/ml/code/dlrm_main.py", line 760, in main
    dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument

The other rank fails at the same eager_connect_single_device call with an identical traceback, but its last error is:
NET/OFI Error accessing endpoint. Endpoint has not been initialized.
PyTorch version: 2.5.0
CUDA version: 12.4
It seems that communication between GPUs on different nodes fails when process-group initialization happens more than roughly 20 minutes after the job starts (counting all initialization time). I also tested with a smaller download (finishing in under 20 minutes) and training runs without problems, and a single node with the larger dataset also runs fine. Please help, thanks a lot!
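In case it is useful, here is a minimal sketch of the init path from the traceback (the sleep stands in for the data download; the sleep and timeout values are assumptions, not the real ones):

import os
import sys
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main(argv):
    # Stand-in for the long data download before training starts (assumed ~25 min).
    time.sleep(25 * 60)

    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    # Passing device_id makes init_process_group eagerly create the NCCL communicator
    # (eager_connect_single_device in the traceback), which is where the crash happens.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=60),
        device_id=device,
    )

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main(sys.argv[1:])

Launched with the same torchrun arguments on both nodes, this should exercise the same init path without any model or data-loading code.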
I would suggest running again with NCCL_DEBUG=WARN and then looking for the log lines containing NCCL WARN. If the error happens inside the AWS OFI network plugin, you may want to open a ticket on the OFI NCCL plugin project instead of NCCL.
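For example, just prefix the existing launch command with the environment variable:

NCCL_DEBUG=WARN torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py ...

Optionally add NCCL_DEBUG_SUBSYS=INIT,NET as well to narrow the output down to initialization and network messages.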