I ran into a rather quirky issue. I use 2 p4d.24xlarge instances (8x A100 each) on AWS to train my model. The bash script first downloads the data, and only when the download finishes does the training process start, by running:

torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
  --epochs 1 --embedding_dim 16 --batch_size 8192 --learning_rate 0.006 --adagrad --num_embeddings 1000000000 \
  --binary_path $binary_path --training_days 14 --valid_hour 23/00 \
  --test_hour 23/00 --num_workers 4 --prefetch_factor 8 --save_dir $SM_WORKING_DIR
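For reference, the surrounding wrapper looks roughly like this (a sketch only; the download command and data paths are placeholders, not the actual script):

#!/bin/bash
set -euo pipefail

# Step 1: download the training data. For the full dataset this takes well over 20 minutes.
aws s3 sync "$DATA_S3_URI" "$DATA_DIR"    # placeholder for the real download step

# Step 2: only after the download completes, launch distributed training on both nodes.
torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py --print_sharding_plan --model_type dnn \
  ... # remaining arguments as shown above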
When the data download takes more than 20 minutes, training fails with the following error:
2024-11-15T06:01:13
Traceback (most recent call last):
  File "/opt/ml/code/dlrm_main.py", line 954, in <module>
    invoke_main()
  File "/opt/ml/code/dlrm_main.py", line 951, in invoke_main
    main(sys.argv[1:])
  File "/opt/ml/code/dlrm_main.py", line 760, in main
    dist.init_process_group(backend=backend, timeout=timeout_duration, device_id=device)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 97, in wrapper
    func_return = func(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1527, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1867, in _new_process_group_helper
    eager_backend.eager_connect_single_device(device_id)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
ncclInternalError: Internal check failed.
Last error:
NET/OFI Couldn't open CQ. RC: -22, ERROR: Invalid argument

The other rank fails at the same eager_connect_single_device call with an identical traceback, but its last error is:
NET/OFI Error accessing endpoint. Endpoint has not been initialized.
PyTorch version: 2.5.0
CUDA version: 12.4
It seems that communication between GPUs on different nodes fails when process-group initialization happens more than roughly 20 minutes after the job starts (counting all initialization time). I also tested with a smaller download (finishing in under 20 minutes) and training runs without problems, and a single node with the larger dataset also runs fine. Please help, thanks a lot!
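In case it is useful, here is a minimal sketch of the init path from the traceback (the sleep stands in for the data download; the sleep and timeout values are assumptions, not the real ones):

import os
import sys
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main(argv):
    # Stand-in for the long data download before training starts (assumed ~25 min).
    time.sleep(25 * 60)

    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)
    torch.cuda.set_device(device)

    # Passing device_id makes init_process_group eagerly create the NCCL communicator
    # (eager_connect_single_device in the traceback), which is where the crash happens.
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=60),
        device_id=device,
    )

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main(sys.argv[1:])

Launched with the same torchrun arguments on both nodes, this should exercise the same init path without any model or data-loading code.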
I would suggest running again with NCCL_DEBUG=WARN and then looking for the log lines containing NCCL WARN. If the error happens inside the AWS OFI network plugin, you may want to open a ticket on the OFI NCCL plugin project instead of NCCL.
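For example, just prefix the existing launch command with the environment variable:

NCCL_DEBUG=WARN torchrun ${DISTRIBUTED_ARGS} ${WORKING_DIR}/dlrm_main.py ...

Optionally add NCCL_DEBUG_SUBSYS=INIT,NET as well to narrow the output down to initialization and network messages.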