"raise subprocess.CalledProcessError" shows when training with multiple GPUs in DDP mode #2294
Comments
@Jelly123456 we recently patched a DDP bug fix in #2295 that was introduced in #2292, but this was all in the last few hours, so if you are using older code, git pull now and DDP should work correctly. The only other thing you might consider is falling back to a Python 3.8 environment, as 3.9 is pretty new and possibly not as mature as 3.8. Docker is a great choice for DDP also; it's basically a guaranteed working environment, and we run most of our remote trainings via the Docker container.
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled).
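As a concrete starting point, the two suggestions above map to roughly the following commands (the Docker image name and flags follow the repo's Docker quickstart and are assumptions, not taken from this thread):
git pull                                   # pick up the DDP fix from #2295
docker pull ultralytics/yolov5:latest      # or switch to the verified Docker environment
docker run -it --ipc=host --gpus all ultralytics/yolov5:latest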
@glenn-jocher Thank you very much for your quick reply. I will try your recommendations. I will tentatively leave this issue "open" for now. After I try, if I find no problem, I will come back and close it.
@Jelly123456 sounds good
I was training the Transformer PR using DDP and got the same error when training ended (phew!)
Env: py3.9, torch 1.7.1
I think it's the commit behind: fab5085 , but I'm not sure if it's related.
Traceback (most recent call last):
File "/.conda/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.conda/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/.conda/envs/py39/bin/python', '-u', 'train.py', '--local_rank=3', '--data', 'coco.yaml', '--cfg', 'models/yolotrl.yaml', '--weights', '', '--batch-size', '128', '--device', '3,4,5,6', '--name', '4_5trlv3']' died with <Signals.SIGSEGV: 11>.
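For context, the CalledProcessError in this traceback is just the launcher surfacing a worker crash. Below is a simplified Python sketch of what torch.distributed.launch is doing here (a paraphrase, not the actual launch.py source; the script arguments are illustrative):

import subprocess
import sys

# One train.py process is spawned per GPU, each with its own --local_rank.
nproc_per_node = 4
cmd = []
processes = []
for local_rank in range(nproc_per_node):
    cmd = [sys.executable, "-u", "train.py", f"--local_rank={local_rank}",
           "--data", "coco.yaml", "--batch-size", "128"]  # illustrative args
    processes.append(subprocess.Popen(cmd))

# The launcher then waits on every worker and re-raises any failure.
for process in processes:
    process.wait()
    if process.returncode != 0:
        # A negative return code means the worker died from a signal,
        # e.g. -11 for SIGSEGV, which is the error reported above.
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)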
@NanoCode012 @Jelly123456 for DDP (actually for all trainings) I always use the Docker image, and I haven't seen an error like this, or any other, in the last few months. A segmentation fault may also be caused by an overloaded system rather than any GPU problem, i.e. a lack of CPU threads or RAM. Other than that I'm not sure what to say; these errors are usually not very repeatable, unfortunately.
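If an overloaded system is indeed the cause, one low-effort check is to relaunch with fewer dataloader workers and a smaller batch size. A sketch of such a command (this assumes train.py exposes a --workers argument; the dataset and weights arguments are illustrative):
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 32 --workers 4 --data coco.yaml --weights yolov5l.pt --device 0,1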
I tested with the latest commit and Python 3.8. There is no error after training for 50 epochs.
@Jelly123456 great! Same results here, no DDP errors with 2x or 4x GPUs after 50 epochs.
I ran into the same problem as you. |
❔Question
I run the command below to train on my own custom dataset.
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data data/Allcls_one.yaml --weights weights/yolov5l.pt --cfg models/yolov5l_1cls.yaml --epochs 1 --device 0,1
Then the following error shows:
wandb: Synced 5 W&B file(s), 47 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced exp17: https://wandb.ai/**/YOLOv5/runs/3bhvm9a3
Traceback (most recent call last):
File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/anaconda3/envs/YoLo_V5_n74/lib/python3.9/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['**/anaconda3/envs/YoLo_V5_n74/bin/python', '-u', 'train.py', '--local_rank=1', '--batch-size', '64', '--data', 'data/Allcls_one.yaml', '--weights', 'weights/yolov5l.pt', '--cfg', 'models/yolov5l_1cls.yaml', '--epochs', '1', '--device', '0,1']' died with <Signals.SIGSEGV: 11>
Training environment
OS: Linux
Two GPUs: Tesla V100S
Python version: 3.9
Other dependency versions: installed as specified in requirements.txt
Additional context
Previously I thought it could be a problem with my virtual environment, so I created a new virtual environment and retrained, and it trained without error. But after training several times, this error still shows up even if I remove the old environment and create a new one.
Reference files
I have also shared the configuration files I used for my training.
Support needed
Could any experienced person help solve this problem?
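For completeness, the workaround reported as successful earlier in this thread was the latest commit plus a Python 3.8 environment. A minimal sketch of setting that up with conda (the environment name is illustrative):
conda create -n yolov5-py38 python=3.8 -y
conda activate yolov5-py38
git pull                           # make sure the DDP fix from #2295 is included
pip install -r requirements.txt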