I have been comparing DDP with Fairscale Sharded DDP + OSS, and found the training progress of our model to be very different between the two setups.
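For reference, the two setups are wired up roughly as in the sketch below. This is not our exact training code: the model, optimizer class, learning rate, and `rank` argument are placeholders, but the DDP / OSS / ShardedDataParallel wrapping follows the standard fairscale usage.

```python
import torch
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim.oss import OSS


def wrap_ddp(model, rank):
    # Plain PyTorch DDP baseline: every rank holds the full optimizer state.
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    return ddp_model, optimizer


def wrap_sharded_ddp(model):
    # Fairscale OSS shards the optimizer state across ranks; ShardedDDP
    # reduces each gradient asynchronously to the rank that owns its shard.
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-3)
    sharded_model = ShardedDDP(model, optimizer)
    return sharded_model, optimizer
```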
After a bit of investigation, I suspected there was a race condition in the broadcasting of gradients in sharded DDP. To confirm this, I changed `ShardedDataParallel._try_consume_work_handles` to call `_consume_work_handles` instead. If I understand correctly, this just adds extra waits for all pending reduces to finish, and it should be a safe (no-op) change in the absence of races, since in that case the async reduces would already be finished by the time `_try_consume_work_handles` is called.
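Roughly speaking, the change amounts to the monkey-patch below. The internal method names come from the fairscale version we are running and may differ in other releases; this is only meant to illustrate the "extra syncs" condition.

```python
from fairscale.nn.data_parallel import ShardedDataParallel


def _always_consume(self):
    # Instead of only polling for reduces that happen to have finished,
    # block until every outstanding async reduce has completed.
    self._consume_work_handles()


# Applied only when running the "Sharded DDP (extra syncs)" condition.
ShardedDataParallel._try_consume_work_handles = _always_consume
```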
This gave us three conditions to check:

1. DDP
2. Sharded DDP
3. Sharded DDP (extra syncs)
We found that "DDP" and "Sharded DDP (extra syncs)" were exactly reproducible between runs, and the loss values produced were similar between the two conditions but not exactly identical. The normal "Sharded DDP" was not reproducible between runs, the first few steps were identical in repeat runs and then they would diverge. The loss values produced were also significantly different to both the baseline "DDP" and "Sharded DDP (extra syncs)".
This raises a few questions that I'd like to get some help with:
1. Is the modification I made to add extra syncs correct? If yes, this suggests there is a race condition in at least our usage of Sharded DDP, but I don't think we're doing anything unusual.
2. Is it expected that "Sharded DDP" and "DDP" produce significantly different training dynamics?