Drop/Revisit usage of checkpoints
#9221
I think we need to move it to DVC tbh, it's a bigger discussion that we need to have and agree on. We've discussed briefly with @daavoo today, and agreed that we at least want to discuss one thing that checkpoints give: the ability to recover. For this we need either a replacement plan (outside), a way to simplify the existing logic, or something else. It's worth summarizing all the potential benefits of the existing solution (if it's implemented correctly), again to make sure that we are not missing anything.
I agree that it seems like what we actually need in place of
IMO this is a problem with the fact that
My point was they would be equivalent at the level our UIs currently compare (except for the plots): Studio compare, exp/params diff, etc.
To give you more context, from the product perspective we have been discussing that the cherry-picking use case of checkpoints is not valuable enough, and we would like to think about what could be the shortest path to enable resuming from an interrupted training. So, putting the current implementation aside (at least initially, to not bias the discussion) and focusing on that use case: in a workspace, single-experiment scenario, this already works by using
Maybe we can start from that and think about what the minimum changes needed are to support recovering in other scenarios, like a remote instance that might get shut down.
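As an aside (not part of the original comment), here is a minimal sketch of that workspace, single-experiment recovery pattern. The state file name, the hard-coded `EPOCHS`, and the idea of keeping the file as a persisted DVC output are illustrative assumptions:

```python
import json
import os

CKPT = "state.json"  # illustrative; e.g. a DVC output kept across runs with `persist: true`
EPOCHS = 10          # would normally come from params.yaml

# Recover from an interrupted run if the workspace still holds the last state.
state = {"epoch": 0, "weights": None}
if os.path.exists(CKPT):
    with open(CKPT) as f:
        state = json.load(f)

for epoch in range(state["epoch"] + 1, EPOCHS + 1):
    # ... run one epoch of training here, updating state["weights"] ...
    state["epoch"] = epoch
    with open(CKPT, "w") as f:
        json.dump(state, f)  # overwrite in place: a crash loses at most one epoch
```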
Is there agreement on both of these points?
For 2, I have been meaning to open iterative/dvclive#505 for a while now to discuss a possible solution.
I can only speak for myself, but I agree, and iterative/dvclive#505 looks like the direction I would prefer to go.
Got a timely new request for auto-recovery with non-Python tools. As discussed there, one way to solve that use case is to save the final outputs of a "failed" experiment so you could resume from that state.
Anything that DVCLive does/will do could be replicated by non-Python tools (background `dvc push`, Studio REST API calls).
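To illustrate that point, here is a sketch (not something DVC or DVCLive ships) of the "push the latest checkpoint in the background" behavior using nothing but the `dvc` CLI, so any language able to spawn a process could do the same. The target path, the interval, and the assumption that the checkpoint has already been committed to the DVC cache are all illustrative:

```python
import subprocess
import threading
import time

def push_periodically(target="model.pt", every=300):
    """Push a tracked checkpoint to the DVC remote every `every` seconds."""
    def loop():
        while True:
            time.sleep(every)
            # Assumes `target` is already tracked and committed to the DVC cache;
            # `dvc push <target>` then uploads its cached version to the remote.
            subprocess.run(["dvc", "push", target], check=False)

    threading.Thread(target=loop, daemon=True).start()

push_periodically()
```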
Spoke with @shcheklein and @mattseddon and agreed that we can start by deprecating it from the UI (don't show it in
We also agreed that when I take on the work of integrating the new
@pmrowla See above where @mattseddon is planning to drop checkpoints from the experiments table. We can do the same in the CLI table if it will save us maintenance time, and it might even be useful so that we can see if anyone complains.
Don't know what the use case is yet, but:

```python
# Quoted snippet (see the reference below); `time`, `LOGGER`, `PREFIX`, and the
# hub session object come from the surrounding module. It rate-limits checkpoint
# uploads instead of uploading on every save.
def on_model_save(trainer):
    session = getattr(trainer, 'hub_session', None)
    if session:
        # Upload checkpoints with rate limiting
        is_best = trainer.best_fitness == trainer.fitness
        if time() - session.timers['ckpt'] > session.rate_limits['ckpt']:
            LOGGER.info(f'{PREFIX}Uploading checkpoint {session.model_id}')
            session.upload_model(trainer.epoch, trainer.last, is_best)
            session.timers['ckpt'] = time()  # reset timer
```
Originally posted by @shcheklein in iterative/dvc.org#4415 (comment)
I don't really think we need a built-in replacement/solution in DVC to handle the checkpoints use case (and I am still unsure what that use case actually is). People should handle interruption and resuming through the ML framework, and DVC already provides convenient tools to wrap that (params, `persist`, run-cache).

My main points about dropping checkpoints are:
- The current solution doesn't provide value, while coming at a significant cost of code/docs maintenance.
- It induces users into incorrect workflows, and the required changes in the code are not properly explained anywhere.
- It introduces ambiguous/unexpected behavior when executing more complex/realistic pipelines (e.g. how are downstream stages after `checkpoint: true` supposed to be executed? What about `foreach`, or a `dvc.yaml` with more than one model?).

As an example of the second point, here are the things that are "incorrect" in this repo (the same applies to the example in https://dvc.org/doc/user-guide/experiment-management/checkpoints):
- The `state_dict` of the optimizer should also be considered when loading/saving. I would dare to say that using a fixed learning rate would never result in a better model than using any kind of LR scheduler, and the scheduler would also need to be handled when loading/saving (which connects with the issues in the next point). A rough sketch of what such a checkpoint would need to carry is included at the end of this comment.
- When picking a checkpoint and resuming from it, the `epochs` param is now treated as `epochs + epochs_completed_at_checkpoint`, which differs from its meaning when training without resuming, where the `epochs` param reflects the total number. For example, resuming with `epochs: 10` from a checkpoint taken after 10 epochs trains for 20 epochs in total. Let's say we have a completed experiment that was using checkpoints:
If I run:
It is not possible to reproduce the experiment with a single command. We would have to run the exact combination of `exp apply` and `exp run`. It is not possible to reproduce the experiment at all if the checkpoints are deleted.
Let's say I have another experiment completed using checkpoints:
And I run:
Persisting this experiment, or the one from the previous point, will result in an equivalent state in the repo regarding `params` and the `step` metric, even though the training that led to the resulting model is completely different.
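To make the optimizer-state point above concrete, here is a minimal sketch (not from the original issue; it uses PyTorch purely as an illustration, and all names are made up) of what a checkpoint needs to carry for resuming to be equivalent to uninterrupted training:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # A resumable checkpoint needs more than the model weights: the optimizer
    # state (e.g. Adam moments) and the LR scheduler state must come along too.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,
        },
        path,
    )

def load_checkpoint(path, model, optimizer, scheduler):
    # Restore everything that was saved and return the epoch to resume from.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"] + 1
```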