Drop/Revisit usage of checkpoints #9221

Closed
Tracked by #7093
daavoo opened this issue Mar 21, 2023 · 11 comments · Fixed by #9271
Labels: A: experiments (Related to dvc exp) · discussion (requires active participation to reach a conclusion)

@daavoo
Contributor

daavoo commented Mar 21, 2023

Are we in agreement that we don't support checkpoint anymore? (I'm personally still not convinced. Primarily because I'm not sure we have a decent replacement for this. I think it's needed if we were to remove this).

Originally posted by @shcheklein in iterative/dvc.org#4415 (comment)


I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (and I am still unsure what that use case actually is).

People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap that (params, persist, run-cache).
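For example, here is a minimal sketch of what that could look like (stage and file names are hypothetical, not from this repo): the model is a persisted output, a resume param tells the training code whether to load it, and the run-cache keeps previous results recoverable.

# dvc.yaml (sketch)
stages:
  train:
    cmd: python train.py
    params:
      - lr
      - epochs
      - resume        # read by train.py to decide whether to load model.pt
    deps:
      - train.py
      - data
    outs:
      - model.pt:
          persist: true   # keep the file in the workspace between runs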

My main points about dropping checkpoints are:

  • The current solution doesn't provide enough value, while coming at a significant cost in code/docs maintenance.

  • It nudges users into incorrect workflows, and the code changes required to use it are not properly explained anywhere.

  • It introduces ambiguous/unexpected behavior when executing more complex/realistic pipelines (e.g. how are downstream stages after checkpoint: true supposed to be executed? What about foreach, or a dvc.yaml with more than one model?)


As an example of the second point, here are the things that are "incorrect" in this repo (same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):

  • Optimizer state is not handled.

The optimizer's state_dict should also be considered when loading/saving.

  • No learning rate scheduler.

I would dare to say that using a fixed learning rate will almost never result in a better model than using some kind of LR scheduler.

It would also need to be handled when loading/saving (which connects with the issues in the next point); see the sketch below.
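As a reference for the two points above, here is a minimal sketch of what a complete checkpoint needs to capture, in plain PyTorch (file path and argument names are illustrative; a framework callback would normally do this for you):

import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Persist everything needed to resume, not just the model weights.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    # Restore all three states so resumed training matches uninterrupted training.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"]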

  • Epochs are being handled (arguably) incorrectly.

When picking a checkpoint and resuming from it, the epochs param is effectively treated as epochs + epochs_completed_at_checkpoint (e.g. resuming from checkpoint 5 with epochs: 10 ends up training for 15 epochs in total), which differs from its meaning when training without resuming, where the epochs param reflects the total number.

  • After resuming from a checkpoint, the experiments can't be reproduced easily.

Let's say we have a completed experiment that was using checkpoints:

# EXP_A
lr: 0.003
weight_decay: 0
epochs: 15

If I run:

$ dvc exp apply EXP_A_CHECKPOINT_5
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10

It is not possible to reproduce the experiment with a single command; we would have to run the exact combination of exp apply and exp run.
And it is not possible to reproduce the experiment at all if the checkpoints are deleted.

  • Resumed experiments are not distinguishable after persisting.

Let's say I have another experiment completed using checkpoints:

# EXP_B
lr: 0.1
weight_decay: 10
epochs: 40

And I run:

$ dvc exp apply EXP_B_CHECKPOINT_39
$ dvc exp run -S lr=99 -S weight_decay=0 -S epochs=10

Persisting this experiment or the one from the previous point will result in an equivalent state in the repo regarding params and step metric, even though the training that led to the resulting model is completely different.
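For illustration, after persisting either of the two resumed runs above, the params stored in the repo would be identical (sketch of the resulting values):

# params.yaml in both cases
lr: 99
weight_decay: 0
epochs: 10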

@shcheklein
Member

I think we need to move it to DVC, tbh; it's a bigger discussion that we need to have and agree on. @daavoo and I discussed this briefly today and agreed that we at least want to discuss one thing that checkpoints give us - the ability to recover. For this we need either a replacement plan (outside DVC), or to simplify the existing logic somehow, or something else. It's worth summarizing all the potential benefits of the existing solution (if implemented correctly), again to make sure that we are not missing anything.

@shcheklein shcheklein transferred this issue from iterative/vscode-dvc-demo Mar 21, 2023
@shcheklein shcheklein added the discussion, A: experiments, and product: VSCode labels Mar 21, 2023
@pmrowla
Contributor

pmrowla commented Mar 22, 2023

I agree that checkpoint: true/dvc.api.make_checkpoint() is not particularly useful for users, and that it's better to just provide the mechanism for a user's stage to generate an intermediate commit through the ML frameworks (which we already do with dvclive).

It seems like what we actually need in place of checkpoint: true is proper support for circular dependencies (the existing checkpoint: true/persist: true flags are really just hacks to work around the fact that DVC doesn't support circular dependencies).
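A sketch of that circularity (stage and file names are hypothetical): to resume, the model would conceptually be both a dependency and an output of the same stage, which DVC rejects, so persist: true is used instead of declaring the dep.

# dvc.yaml (sketch)
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      # - model.pt        # can't be declared: it is already an output of this stage
    outs:
      - model.pt:
          persist: true    # workaround: keep the previous model around for resuming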

Resumed experiments are not distinguishable after persisting.

IMO this is a problem with the fact that exp apply; git add .; git commit is the default suggested workflow for persisting an experiment. With exp branch (or an alternative git checkout based workflow) the git commits prior to resuming are available after persisting the experiment.
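For example (experiment and branch names below are hypothetical):

$ dvc exp branch exp-1234 resumed-exp    # persist the experiment as a git branch
$ git log resumed-exp --oneline          # commits made before resuming stay visible in history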

@daavoo
Contributor Author

daavoo commented Mar 22, 2023

With exp branch (or an alternative git checkout based workflow) the git commits prior to resuming are available after persisting the experiment

My point was that they would look equivalent at the level our UIs currently compare (except for the plots): Studio compare, exp/params diff, etc.

It seems like what we actually need in place of checkpoint: true is proper support for circular dependencies (the existing checkpoint: true/persist: true flags are really just hacks to workaround the fact that DVC doesn't support circular dependencies)

To give you more context: from the product perspective, we have been discussing that the cherry-picking use case of checkpoints is not valuable enough, and we would like to figure out the shortest path to enable resuming from an interrupted training run.

So, let's put the current implementation aside (at least initially, to not bias the discussion) and focus on that use case.

In a workspace, single-experiment scenario, this already works by using persist, a resume param, and leaving DVCLive and the framework responsible for handling the rest in the Python code. The idea is that we don't care about all iterations, and we can assume that the current model in the workspace is always the one we want to resume from.

Maybe we can start from that and think about the minimum changes needed to support recovering in other scenarios, like a remote instance that might get shut down.
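A minimal sketch of that workspace scenario, assuming a resume flag in params.yaml, a model.pt output declared with persist: true, and DVCLive's resume option (the model and file names are placeholders, not from any real repo):

# train.py (sketch)
import os

import torch
import yaml
from dvclive import Live

with open("params.yaml") as f:
    params = yaml.safe_load(f)

model = torch.nn.Linear(10, 1)  # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=params["lr"])

start_epoch = 0
if params.get("resume") and os.path.exists("model.pt"):
    # The persisted model in the workspace is the one we resume from.
    ckpt = torch.load("model.pt")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

with Live(resume=bool(params.get("resume"))) as live:
    for epoch in range(start_epoch, params["epochs"]):
        # ... one epoch of training goes here ...
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            "model.pt",
        )
        live.next_step()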

@dberenbaum
Collaborator

Is there agreement on both of these points?

  1. It would be better to rely on the functionality of existing ML frameworks where possible rather than have a separate way to checkpoint models.
  2. If you are training remotely and the machine shuts down, there's often no way to recover the last saved checkpoint on the new remote machine.

For 2, I have been meaning to open iterative/dvclive#505 for a while now to discuss a possible solution.

@daavoo
Contributor Author

daavoo commented Mar 23, 2023

Is there agreement on both of these points?

I can only speak for myself, but I agree, and iterative/dvclive#505 looks like the direction I would prefer to go.

@dberenbaum
Collaborator

Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.

@daavoo
Contributor Author

daavoo commented Mar 23, 2023

Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.

Anything that DVCLive does/will do could be replicated by non-Python tools (background dvc push, Studio REST API calls).
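As a rough sketch of the periodic-upload part from a shell wrapper (the trainer command and paths are hypothetical, and model.pt is assumed to be dvc-tracked on its own rather than as a pipeline output):

#!/bin/sh
./train_my_model --out model.pt &           # any non-Python trainer writing checkpoints
TRAIN_PID=$!
while kill -0 "$TRAIN_PID" 2>/dev/null; do
    sleep 300
    dvc add model.pt && dvc push model.pt   # snapshot and upload the latest checkpoint
done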

@dberenbaum
Collaborator

Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in dvc exp show or in the VS Code views).

@mattseddon
Member

Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in dvc exp show or in the VS Code views).

We also agreed that when I take on the work of integrating the new --json format in #9170, I am going to gamble and not integrate checkpoints (unless they turn out to be necessary for some remaining feature). I'll scope and create a linked issue in vscode-dvc for that work.

daavoo added a commit that referenced this issue Mar 29, 2023
@daavoo daavoo linked a pull request Mar 29, 2023 that will close this issue
daavoo added a commit that referenced this issue Mar 29, 2023
daavoo added a commit that referenced this issue Mar 29, 2023
@dberenbaum
Collaborator

@pmrowla See above where @mattseddon is planning to drop checkpoints from the experiments table. We can do the same in the CLI table if it will save us maintenance time, and it might even be useful so that we can see if anyone complains.

@shcheklein
Member

shcheklein commented Apr 1, 2023

I don't know what the use case is yet, but the YOLO repo uploads every checkpoint to its hub:

# excerpt from the YOLO repo's hub callbacks; time, LOGGER, and PREFIX come from that module's imports
def on_model_save(trainer):
    session = getattr(trainer, 'hub_session', None)
    if session:
        # Upload checkpoints with rate limiting
        is_best = trainer.best_fitness == trainer.fitness
        if time() - session.timers['ckpt'] > session.rate_limits['ckpt']:
            LOGGER.info(f'{PREFIX}Uploading checkpoint {session.model_id}')
            session.upload_model(trainer.epoch, trainer.last, is_best)
            session.timers['ckpt'] = time()  # reset timer

daavoo added a commit that referenced this issue May 16, 2023
@daavoo daavoo removed the product: VSCode label May 17, 2023
daavoo added a commit that referenced this issue May 22, 2023
daavoo added a commit that referenced this issue May 22, 2023
daavoo added a commit that referenced this issue May 22, 2023
daavoo added a commit that referenced this issue May 22, 2023
skshetry pushed a commit that referenced this issue May 23, 2023
exp: Drop `checkpoints`.

Closes #9221