Drop/Revisit usage of checkpoints
#9221
I think we need to move it to DVC tbh, it's a bigger discussion that we need to have and agree on. We've discussed briefly with @daavoo today, and agreed that we at least want to discuss one thing that checkpoints give: the ability to recover. For this we need either a replacement plan (outside), a way to simplify the existing logic, or something else. It's worth summarizing all the potential benefits of the existing solution (if it's implemented correctly), again to make sure that we are not missing anything.
I agree that it seems like what we actually need in place of
IMO this is a problem with the fact that
My point was they would be equivalent at the level our UIs currently compare (except for the plots): Studio compare, exp/params diff, etc.
To give you more context, from the product perspective we have been discussing that the cherry-picking use case of checkpoints is not valuable enough, and we would like to think about what could be the shortest path to enable resuming from an interrupted training. So, putting the current implementation aside (at least initially, to not bias the discussion) and focusing on that use case: in a workspace, single-experiment scenario, this already works by using
Maybe we can start from that and think about what the minimum changes needed are to support recovering in other scenarios, like a remote instance that might get shut down.
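As an aside (not part of the original comment), here is a minimal sketch of that workspace, single-experiment recovery pattern. The state file name, the hard-coded `EPOCHS`, and the idea of keeping the file as a persisted DVC output are illustrative assumptions:

```python
import json
import os

CKPT = "state.json"  # illustrative; e.g. a DVC output kept across runs with `persist: true`
EPOCHS = 10          # would normally come from params.yaml

# Recover from an interrupted run if the workspace still holds the last state.
state = {"epoch": 0, "weights": None}
if os.path.exists(CKPT):
    with open(CKPT) as f:
        state = json.load(f)

for epoch in range(state["epoch"] + 1, EPOCHS + 1):
    # ... run one epoch of training here, updating state["weights"] ...
    state["epoch"] = epoch
    with open(CKPT, "w") as f:
        json.dump(state, f)  # overwrite in place: a crash loses at most one epoch
```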
Is there agreement on both of these points?
For 2, I have been meaning to open iterative/dvclive#505 for a while now to discuss a possible solution.
I can only speak for myself, but I agree, and iterative/dvclive#505 looks like the direction I would prefer to go.
Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
Anything that DVCLive does/will do could be replicated by non-Python tools (background `dvc push`, Studio REST API calls).
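To illustrate that point, here is a sketch (not something DVC or DVCLive ships) of the "push the latest checkpoint in the background" behavior using nothing but the `dvc` CLI, so any language able to spawn a process could do the same. The target path, the interval, and the assumption that the checkpoint has already been committed to the DVC cache are all illustrative:

```python
import subprocess
import threading
import time

def push_periodically(target="model.pt", every=300):
    """Push a tracked checkpoint to the DVC remote every `every` seconds."""
    def loop():
        while True:
            time.sleep(every)
            # Assumes `target` is already tracked and committed to the DVC cache;
            # `dvc push <target>` then uploads its cached version to the remote.
            subprocess.run(["dvc", "push", target], check=False)

    threading.Thread(target=loop, daemon=True).start()

push_periodically()
```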
Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in
We also agreed that when I take on the work of integrating the new
@pmrowla See above where @mattseddon is planning to drop checkpoints from the experiments table. We can do the same in the CLI table if it will save us maintenance time, and it might even be useful so that we can see if anyone complains.
Don't know what the use case is yet, but:

```python
# Quoted snippet (see the reference below); `time`, `LOGGER`, `PREFIX`, and the
# hub session object come from the surrounding module. It rate-limits checkpoint
# uploads instead of uploading on every save.
def on_model_save(trainer):
    session = getattr(trainer, 'hub_session', None)
    if session:
        # Upload checkpoints with rate limiting
        is_best = trainer.best_fitness == trainer.fitness
        if time() - session.timers['ckpt'] > session.rate_limits['ckpt']:
            LOGGER.info(f'{PREFIX}Uploading checkpoint {session.model_id}')
            session.upload_model(trainer.epoch, trainer.last, is_best)
            session.timers['ckpt'] = time()  # reset timer
```
Originally posted by @shcheklein in iterative/dvc.org#4415 (comment)
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (and I am still unsure what that use case actually is). People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap that (params, `persist`, run-cache).

My main points about dropping checkpoints are:
- The current solution doesn't provide value, while coming at a significant cost of code/docs maintenance.
- It induces users into incorrect workflows, and the required changes in the code are not properly explained anywhere.
- It introduces ambiguous/unexpected behavior when executing more complex/realistic pipelines (e.g. how are downstream stages after `checkpoint: true` supposed to be executed? What about `foreach`, or a `dvc.yaml` with more than one model?).

As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
- The `state_dict` of the optimizer should also be considered when loading/saving. I would dare to say that using a fixed learning rate would never result in a better model than using any kind of LR scheduler, and the scheduler would also need to be handled when loading/saving (which connects with the issues in the next point). A rough sketch of what such a checkpoint would need to carry is included at the end of this comment.
- When picking a checkpoint and resuming from it, the `epochs` param is now treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training without resuming, where the `epochs` param reflects the total number. For example, resuming with `epochs: 10` from a checkpoint taken after 10 epochs trains for 20 epochs in total. Let's say we have a completed experiment that was using checkpoints:
If I run:
It is not possible to reproduce the experiment with a single command. We would have to run the exact combination of `exp apply` and `exp run`. It is not possible to reproduce the experiment at all if the checkpoints are deleted.
Let's say I have another experiment completed using checkpoints:
And I run:
Persisting this experiment, or the one from the previous point, will result in an equivalent state in the repo regarding `params` and the `step` metric, even though the training that led to the resulting model is completely different.
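To make the optimizer-state point above concrete, here is a minimal sketch (not from the original issue; it uses PyTorch purely as an illustration, and all names are made up) of what a checkpoint needs to carry for resuming to be equivalent to uninterrupted training:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # A resumable checkpoint needs more than the model weights: the optimizer
    # state (e.g. Adam moments) and the LR scheduler state must come along too.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    # Restore everything that was saved and return the epoch to resume from.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1
```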