
torch.distributed.DistBackendError: NCCL error #1517

Chevolier opened this issue Nov 18, 2024 · 1 comment

I ran into a rather quirky issue. I used 2 p4d.24xlarge instances (8x A100 each) on AWS to train my model. The bash script first downloads the data, and only after the download finishes does it start training by running:

torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
    --epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000 \
    --binary_path $binary_path --training_days 14 --valid_hour 23/00 \
    --test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR
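
(For reference, the content of DISTRIBUTED_ARGS is not shown above; on this two-node, 8-GPU-per-node setup it would typically look something like the sketch below. The rendezvous address and port are placeholders, not the actual values used in the job.)

# Hypothetical expansion of DISTRIBUTED_ARGS for 2 nodes x 8 GPUs.
# MASTER_ADDR and port 29500 are placeholders, not the real job values.
DISTRIBUTED_ARGS="--nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500"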

When the data downloading process takes more than 20 min, the training fails with the following error:

2024-11-15T06:01:13.169Z
Traceback (most recent call last):
  File "/opt/ml/code/dlrm_main.py", line 954, in <module>
    invoke_main()
  File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
    main(sys.argv[1:])
  File "/opt/ml/code/dlrm_main.py", line 760, in main
    dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument

A second rank fails at the same call, ending with:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Error accessing endpoint. Endpoint has not been initialized.

torch version: 2.5.0
CUDA version: 12.4

It seems that communication between GPUs on different nodes fails once roughly 20 minutes or more have elapsed before training starts (counting all initialization time). I also tested with less data, so the download takes under 20 minutes, and training runs without problems. Training on a single node with the larger dataset also works fine. Please help, thanks a lot!
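
(One way to check that it is the elapsed time rather than the amount of data: keep the small, fast-downloading dataset but artificially delay the launch past the ~20-minute mark. A rough sketch; the 25-minute sleep is an arbitrary value chosen only to exceed the observed threshold.)

# Sketch: small dataset, but launch delayed past the ~20-minute mark.
sleep 1500   # 25 minutes; arbitrary, just above the observed threshold
torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
    --epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000 \
    --binary_path $binary_path --training_days 14 --valid_hour 23/00 \
    --test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR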


sjeaugey commented Nov 18, 2024

I would suggest running again with NCCL_DEBUG=WARN, then looking for the log lines containing NCCL WARN. If the error happens inside the AWS OFI network plugin, you may want to open a ticket on the OFI NCCL plugin project instead of NCCL.
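
(Roughly, and assuming the same launch command as in the issue, that could look like the sketch below; training.log is a placeholder for wherever the job output is captured.)

# Sketch: rerun the unchanged torchrun command with WARN-level NCCL logging.
export NCCL_DEBUG=WARN          # NCCL reads this from each rank's environment
# ... rerun the same torchrun command as above ...
grep "NCCL WARN" training.log   # placeholder path for the captured job output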
