ug: add checkpoints tutorial #2373

flippedcoder · 2021-04-13T19:18:45Z

This adds more detailed documentation on using checkpoints when training models.

dmpetrov

@flippedcoder an amazing tutorial ✨
I left a few comments inline.

PS: you can take some ideas from here https://www.youtube.com/watch?v=J8mCr3wVgdA

dmpetrov · 2021-04-14T08:36:41Z

content/docs/start/checkpoints.md

+to be adjusted, and even recover model weights, parameters, and code. You also
+have the ability to resume training from previous checkpoints. When you adjust
+parameters and code, DVC tracks those changes for you.
+


We probably should mention that this section is not a continuation of the previous ones in Getting Started.

dmpetrov · 2021-04-14T08:38:49Z

content/docs/start/checkpoints.md

+        summary: true
+        html: true
+```
+


The alternative method should be mentioned somewhere: dvc stage add --checkpoints model.pt ...

dmpetrov · 2021-04-14T08:42:09Z

content/docs/start/checkpoints.md

+```
+
+Then you'll want to run your code using the `dvc exp run` command. This means
+it'll start from the existing checkpoint outputs. That's it! Now you have


The code instrumentation part is missing. Checkpoints have to be defined at the pipeline level (dvc.yaml) as well as at the code level, with make_checkpoint() or DvcLiveCallback().
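As a rough illustration of the code-level half of that, here is a minimal sketch of a training loop that signals a checkpoint after every epoch. The `train` function, its toy accuracy update, and the `checkpoint` parameter are hypothetical stand-ins; in a DVC 2.x project the hook passed in is assumed to be `dvc.api.make_checkpoint`, and `model.pt` would still need to be marked as a checkpoint output in dvc.yaml.

```python
from typing import Callable, List


def train(epochs: int, checkpoint: Callable[[], None] = lambda: None) -> List[float]:
    """Run a toy training loop and call `checkpoint` once per epoch.

    `checkpoint` is a hypothetical hook: in a DVC 2.x project it is assumed
    to be dvc.api.make_checkpoint, which is not imported here so the sketch
    runs without DVC installed.
    """
    accuracy = 0.5
    history = []
    for _ in range(epochs):
        # Stand-in for a real training step: close 20% of the gap to 1.0.
        accuracy += 0.2 * (1.0 - accuracy)
        history.append(accuracy)
        # A real script would save its weights (e.g. model.pt) before this call.
        checkpoint()
    return history


if __name__ == "__main__":
    calls = []
    print(train(3, checkpoint=lambda: calls.append("checkpoint")))
    print(len(calls))
```

Passing the hook in as a parameter keeps the training code runnable outside of `dvc exp run`, which may be convenient for local debugging.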

dmpetrov · 2021-04-14T08:43:23Z

content/docs/start/checkpoints.md

+it'll start from the existing checkpoint outputs. That's it! Now you have
+checkpoints active in your process. While your script is executing epoch runs,
+you should see output in the terminal similar to this:
+


It would be great to have a reference to the code repo, so users can clone it and easily play with it.

dmpetrov · 2021-04-14T08:45:14Z

content/docs/start/checkpoints.md

+file:///Users/Repos/dvc-checkpoints-mnist/dvclive.html
+Updating lock file 'dvc.lock'
+Checkpoint experiment iteration '9b0bf16'.
+```


We should mention that the run should be terminated. I'd love to see the exp table / all the checkpoints right after the run.

dmpetrov · 2021-04-14T08:46:29Z

content/docs/start/checkpoints.md

+
+When you want to integrate checkpoints into your training epochs so you don't have
+to run `dvc exp run` each time, you'll need to use the `make_checkpoint`
+function.


It is not an alternative. The user is supposed to use both as I mentioned above.

dmpetrov · 2021-04-14T09:17:32Z

content/docs/start/checkpoints.md

+`ini.txt` file.
+
+If you want to start from a specific existing checkpoint, you'll need to run
+`dvc exp run --rev 12345` where `12345` is the checkpoint ID for specific


It seems like --rev works only for queue/temp runs, not for this scenario. Created an issue: iterative/dvc#5814

dmpetrov · 2021-04-14T09:20:44Z

content/docs/start/checkpoints.md

+These experiments have IDs associated with each of the checkpoints so you have a
+reference to each epoch.
+
+## Starting from an existing checkpoint


I'd start with continuing/resuming training: train for a few epochs, terminate, and then resume training. It is supposed to produce a straight line of metrics improvements.
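The continue/resume flow described here can be sketched without DVC at all: a script that looks for an existing state file and picks up from the recorded epoch. Everything below is an assumption made for illustration (the `state.json` file name, the `train` helper, and the toy accuracy update); in the tutorial's setup the persisted file would be the `model.pt` checkpoint output.

```python
import json
import os
import tempfile


def train(state_path: str, epochs: int) -> dict:
    """Train for `epochs` more epochs, resuming from `state_path` if it exists."""
    if os.path.exists(state_path):
        # A previous run left a checkpoint behind: continue from it.
        with open(state_path, encoding="utf-8") as f:
            state = json.load(f)
    else:
        # No checkpoint yet: start from scratch.
        state = {"epoch": 0, "acc": 0.5}
    for _ in range(epochs):
        state["epoch"] += 1
        # Toy metric update standing in for a real training step.
        state["acc"] += 0.1 * (1.0 - state["acc"])
        # Persist after every epoch so an interrupted run can be resumed.
        with open(state_path, "w", encoding="utf-8") as f:
            json.dump(state, f)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        first = train(path, epochs=2)   # initial run, then "terminate"
        second = train(path, epochs=2)  # resumed run continues where it stopped
        print(first["epoch"], second["epoch"])
```

Because the second call starts from the saved state rather than from zero, the metric keeps improving across the two runs instead of restarting.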

dmpetrov · 2021-04-14T09:23:12Z

content/docs/start/checkpoints.md

+│ │ ╟ 9b0bf16   │ 01:20 PM │    2 │ 0.9271 │
+│ │ ╟ 39d1444   │ 01:19 PM │    1 │ 0.9157 │
+│ ├─╨ fcefafb   │ 01:19 PM │    0 │ 0.8572 │
+└───────────────┴──────────┴──────┴────────┘


It seems like the algorithm is too good for showing the value of checkpoints 😄 You are getting 85% accuracy in the 1st epoch and 95% in the 4th. You might consider using a weaker network architecture or a less aggressive learning rate to show a smoother dynamic.

dmpetrov · 2021-04-14T09:25:34Z

content/docs/start/checkpoints.md

+this: `dvc exp run --reset`.
+
+This removes the `model.pt` file and clears the `dvc.lock` file of any existing
+checkpoints.


I'd add an apply scenario: dvc exp apply && modify file && dvc exp run. It might be very useful.

What does modify file do here? Does that remove the model.pt file since there's no checkpoint id in the apply command?

Modify means - make changes in params.yaml or train.py

jorgeorpinel · 2021-04-14T18:57:47Z

Thanks for this contribution @flippedcoder! Are we sure we want to expand the Get Started with this somewhat specific guide (cc @shcheklein)? If not, we can definitely find another place to put this info, and link from https://dvc.org/doc/start/experiments.

Happy to help here once that is decided and @dmitry's comments are addressed.

dberenbaum · 2021-04-14T19:09:03Z

Thanks @flippedcoder! This looks awesome, and it helps address an open issue: #2292.

One overarching comment is that we might need to think about where this fits. Having a get started page for checkpoints is important, but appending it to the existing get started section might be a bit confusing at the moment since all the other pages are iterations of the same scenario in https://github.com/iterative/example-get-started and this is a completely new scenario. I think there is some ongoing work to maybe introduce an MNIST version of all the get started pages, but I'm not sure of the status. What do you think @dmpetrov @shcheklein?

Also, the existing https://github.com/iterative/dvc-checkpoints-mnist repo has different branches for a bunch of different scenarios:

basic: This is the most basic way to use checkpoints, creating a single checkpoint after the script completes. This is not an expected typical workflow, but it's arguably the simplest way to introduce the concept without having to modify the underlying script.
make_checkpoint: This introduces how to generate multiple checkpoints within a script using make_checkpoint().
signal_file: This introduces a language-agnostic way to mimic make_checkpoint for non-Python projects.
live: This integrates make_checkpoint() with https://github.com/iterative/dvclive.
full_pipeline: This adds more general dvc complexity, like multiple stages and plots. It could probably be ignored or deleted unless we develop a use case that discusses integrating those dvc features with this checkpoints example.

We probably need just one of these branches for get started. I have the live branch as the default because it is probably the most natural and fully featured workflow, but make_checkpoint is similar and doesn't have the complication of the dvclive dependency. We could even start with basic and then add make_checkpoint functionality (which looks like kind of what you did here).
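For the signal_file idea, a minimal sketch of the handshake might look like the following, assuming the protocol is "create an empty file, then wait until the experiment runner deletes it". The path passed in, the polling interval, the timeout, and the `fake_runner` helper are hypothetical choices made for this example; the actual location that DVC watches should be checked in the signal_file branch of the repo. It is written in Python only for consistency with the other snippets; the same steps apply in any language with file I/O.

```python
import os
import tempfile
import threading
import time


def signal_checkpoint(signal_file: str, poll: float = 0.05, timeout: float = 5.0) -> bool:
    """Create `signal_file` and block until something else removes it.

    Returns True if the file was removed within `timeout` seconds.
    """
    with open(signal_file, "w", encoding="utf-8"):
        pass  # an empty file is enough to act as the signal
    deadline = time.monotonic() + timeout
    while os.path.exists(signal_file):
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)
    return True


def fake_runner(signal_file: str, delay: float = 0.2) -> None:
    """Hypothetical stand-in for the tool that consumes the signal."""
    time.sleep(delay)
    if os.path.exists(signal_file):
        os.remove(signal_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "CHECKPOINT_SIGNAL")
        runner = threading.Thread(target=fake_runner, args=(path,))
        runner.start()
        print(signal_checkpoint(path))
        runner.join()
```

The timeout is an addition for the sketch so that a script run outside the experiment runner does not block forever.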

dmpetrov · 2021-04-14T19:21:31Z

@jorgeorpinel & @dberenbaum it is not mentioned here but it is a Draft and the place/url for this doc is not defined yet. What would be the best place to put this document? A new Deep Learning Checkpoints section in Use Cases?

dberenbaum · 2021-04-14T19:30:47Z

Right now, that seems like a good spot for it, but we are also working on developing a docs roadmap, so we might want to reconsider once we have that.

dmpetrov · 2021-04-14T19:36:03Z

Right. But even if we find a better use case for checkpoints, it still makes sense to use the MNIST use case as a temporary solution.

jorgeorpinel · 2021-04-14T19:48:22Z

helps address an open issue: #2292

The current content doesn't seem to address that, but maybe there should be a section for it?

the existing https://github.com/iterative/dvc-checkpoints-mnist repo has different branches for a bunch of different scenarios

Agree that it would be great to use that here for consistency but up to you. (I'm working on adapting other existing examples to use it too.) Also agree that this content should focus on the main workflow (make_checkpoint) + mention other possibilities, to keep it short.

What would be the best place to put this document?

This content looks like a guide to me. One option is to help @flippedcoder compact this a bit and put it under https://dvc.org/doc/user-guide/experiment-management (which prob should be a section anyway — separate issue rel #2269 and #2367).

jorgeorpinel · 2021-04-14T19:50:08Z

working on developing a docs roadmap

BTW this is the first PR in https://github.com/iterative/dvc.org/projects/1 😬

dberenbaum · 2021-04-14T19:53:59Z

the one under iterative/

Whoops, that was a mistake. I edited the comment above with the correct link. By the way, I think @flippedcoder is already using that and probably just needs to link to it in the doc.

shcheklein · 2021-04-14T19:56:24Z

This content looks like a guide to me. One option is to help @flippedcoder compact this a bit and put it under https://dvc.org/doc/user-guide/experiment-management

I like this idea! (+ the other suggestions: make it precise, reuse MNIST if possible, probably make it actionable). This place is indeed a better fit for now. In the get started Experiments section we should put a link to the UG section for now.

@flippedcoder congrats on the first PR! Awesome work 🎉

dmpetrov · 2021-04-14T20:07:14Z

it helps address an open issue: #2292.

@dberenbaum I was thinking it is a somewhat separate scenario, not the basic one. Do you think it makes sense to combine these two?

@flippedcoder you can keep working on the doc here and then move it under Use Cases. Or move it now. Up to you 😄

jorgeorpinel · 2021-04-14T20:28:44Z

move it under Use Cases

User Guide right? 🤔 BTW to put it in /user-guide/experiment-management/checkpoints (which I'm suggesting) we'd need other small changes. For now it can be placed directly in /doc/user-guide/checkpoints (below Experiment Management) I think. And I can jump in to finalize the structure before merging.

dberenbaum · 2021-04-14T20:54:08Z

it helps address an open issue: #2292.

@dberenbaum I was thinking it is a somewhat separate scenario, not the basic one. Do you think it makes sense to combine these two?

It's fine if it doesn't address #2292. In my mind, that had become a placeholder for adding more robust checkpoints docs.

What's the purpose of this doc? We have needs for multiple checkpoint docs IMO:

For getting started, agree with @shcheklein that a link is probably enough (for now).
A use case would be great to show how checkpoints could be used in a typical deep learning project workflow and what value they add. This PR could do that, or the use case could be an opportunity for @flippedcoder to come up with a more interesting project to flesh out more.
A user guide would also be great to explain how checkpoints work. This PR could also do that. It seems like one suggestion is to compact it and focus on a single workflow like make_checkpoint. I think we might want to start there but also explain the different scenarios in the branches of the repo in the user guide. If there are ones that aren't helpful, we can exclude them, but the signal_file scenario, for example, seems necessary to me.

By the way, I think this confusion is my own fault for giving this idea to @dmpetrov and not providing enough context to him or @flippedcoder, so sorry about that!

jorgeorpinel · 2021-04-14T22:28:29Z

use case to show how checkpoints could be used in a typical deep learning project workflow

I'm not against this idea. It's just that the drafted content so far looks more like an explanation of the checkpoints feature (a DVC guide). Maybe partially the result of @flippedcoder's learning process? (Great practice to write in those situations BTW.)

dberenbaum · 2021-04-15T13:10:10Z

use case to show how checkpoints could be used in a typical deep learning project workflow

I'm not against this idea. It's just that the drafted content so far looks more like an explanation of the checkpoints feature (a DVC guide). Maybe partially the result of @flippedcoder's learning process? (Great practice to write in those situations BTW.)

Yup, seems like there's consensus to make this a user guide. What about addressing the scenarios in the different repo branches?

flippedcoder · 2021-04-15T20:50:29Z

Hey everyone! Thanks for all the great feedback! I moved it from the Getting Started section to the User Guide section with the updates @dmpetrov mentioned. I'm working on making the algorithm perform worse to show off checkpoints better. 😅 Otherwise, I'm open to any other feedback or suggestions y'all have.

jorgeorpinel · 2021-05-03T18:29:54Z

Do you think https://dvc.org/doc/user-guide/experiment-management#checkpoints-in-source-code needs to be updated to reflect this doc?

It should probably link to this doc, yep! It can be part of a (possible) follow-up copy edit PR if needed, or feel free to include it, @flippedcoder.

dberenbaum · 2021-05-03T18:35:15Z

Sounds good! Let's leave it for the copy edit PR then. There's enough going on in this one already.

jorgeorpinel · 2021-05-03T20:57:15Z

content/docs/user-guide/experiment-management/checkpoints.md

+```bash
+dvc exp run


Please use the dvc highlighter (instead of bash) and $ prompts throughout. Here's one change:

Suggested change

-```bash
-dvc exp run
+```dvc
+$ dvc exp run

jorgeorpinel · 2021-05-03T20:59:15Z

content/docs/user-guide/experiment-management/checkpoints.md

+You'll see output similar to this in your terminal while the training process is
+going on.
+
+```dvc


In its current form (no commands) this one doesn't need highlighting:

Suggested change

-```dvc
+```

jorgeorpinel · 2021-05-03T21:00:26Z

content/docs/user-guide/experiment-management/checkpoints.md

+
+You should see something similar to this in your terminal.
+
+```git


Same for this one and any other that's just terminal output:

Suggested change

-```git
+```

jorgeorpinel

Please ping me if I don't notice when this gets merged and I'll do a copy edit follow up.

flippedcoder · 2021-05-04T19:36:12Z

Alright! I think I've addressed all of the comments. I tried my best not to miss anything. 😅 It should be ready to merge, but let me know if y'all have any other feedback! @dmpetrov @dberenbaum @jorgeorpinel

content/docs/user-guide/experiment-management/checkpoints.md

dberenbaum

LGTM! 🎉

Milecia added 2 commits April 13, 2021 10:04

added checkpoints draft

88070cb

updated checkpoints doc with more details

e274d8a

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 13, 2021 19:19 Inactive

dmpetrov requested changes Apr 14, 2021

jorgeorpinel added the 2.0 release label Apr 14, 2021

dberenbaum marked this pull request as draft April 14, 2021 19:28

cleaned up doc and move to different section

6f06eab

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 15, 2021 20:37 Inactive

Restyled by prettier

6e6e735

restyled-io bot mentioned this pull request Apr 15, 2021

Restyle Add Checkpoints doc to Get Started #2379

Merged

flippedcoder and others added 2 commits April 15, 2021 15:38

cleaned up doc and move to different section

0040bd6

Merge pull request #2379 from iterative/restyled/mm/checkpoints-doc

5efe2c5

Restyle Add Checkpoints doc to Get Started

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 15, 2021 20:46 Inactive

flippedcoder added 2 commits April 15, 2021 16:01

updated link reference from location move

d5242aa

Merge branch 'mm/checkpoints-doc' of https://github.com/iterative/dvc…

3ee2636

….org into mm/checkpoints-doc

restyled-io bot mentioned this pull request May 3, 2021

Restyle Add Checkpoints doc to Get Started #2439

Merged

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 17:44 Inactive

Update content/docs/user-guide/experiment-management/checkpoints.md

7dc252a

Co-authored-by: Dave Berenbaum <[email protected]>

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 17:44 Inactive

shcheklein changed the title ~~Add Checkpoints doc to Get Started~~ ug: add checkpoints tutorial May 3, 2021

Update content/docs/user-guide/experiment-management/checkpoints.md

872e287

Co-authored-by: Dave Berenbaum <[email protected]>

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 19:14 Inactive

Update content/docs/user-guide/experiment-management/checkpoints.md

4d27b65

Co-authored-by: Dave Berenbaum <[email protected]>

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 19:17 Inactive

jorgeorpinel reviewed May 3, 2021

updated more text

542fdd6

jorgeorpinel reviewed May 3, 2021

jorgeorpinel approved these changes May 3, 2021

jorgeorpinel temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 21:03 Inactive

cleaned up code examples

6b7c634

flippedcoder force-pushed the mm/checkpoints-doc branch from 5620133 to 6b7c634 Compare May 4, 2021 19:17

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 19:17 Inactive

formatted code snippets

2e79195

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 19:24 Inactive

dberenbaum reviewed May 4, 2021

content/docs/user-guide/experiment-management/checkpoints.md (outdated, resolved)

Update content/docs/user-guide/experiment-management/checkpoints.md

de35b15

shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 20:38 Inactive

dberenbaum approved these changes May 4, 2021

shcheklein merged commit 7a590a1 into master May 4, 2021

jorgeorpinel assigned flippedcoder May 8, 2021

jorgeorpinel deleted the mm/checkpoints-doc branch May 13, 2021 19:51

