ug: add checkpoints tutorial #2373

Merged
merged 34 commits into from
May 4, 2021

flippedcoder
Contributor

This adds some more detailed documentation around the checkpoints implementation in training models.

@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 13, 2021 19:19 Inactive
Member

@dmpetrov dmpetrov left a comment

@flippedcoder an amazing tutorial ✨
I left a few comments inline.

PS: you can take some ideas from here https://www.youtube.com/watch?v=J8mCr3wVgdA

to be adjusted, and even recover model weights, parameters, and code. You also
have the ability to resume training from previous checkpoints. When you adjust
parameters and code, DVC tracks those changes for you.

Member

We probably should mention that this section is not a continuation of the previous ones in Getting Started.

```
summary: true
html: true
```

Member

the alternative method should be mentioned somewhere - dvc stage add --checkpoints model.pt ...

Then you'll want to run your code using the `dvc exp run` command. This means
it'll start from the existing checkpoint outputs. That's it! Now you have

Member

The code instrumentation part is missing. Checkpoints have to be defined at the pipeline level (dvc.yaml) as well as at the code level, with make_checkpoint() or DvcLiveCallback().
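
To make the two levels concrete, here is a minimal sketch of the code-level half, assuming DVC 2.x, where `dvc.api.make_checkpoint()` was available. The `train` function, the made-up accuracy formula, and the file names are hypothetical stand-ins rather than the tutorial's actual script. The pipeline-level half would be a `dvc.yaml` stage that marks `model.pt` as a checkpoint output. The loop also resumes from an existing `model.pt`, mirroring how a run starts from existing checkpoint outputs.

```python
import json
from pathlib import Path

try:
    # Assumption: DVC 2.x is installed; later releases may not ship this function.
    from dvc.api import make_checkpoint
except ImportError:
    def make_checkpoint():
        """No-op stand-in so the sketch still runs without DVC."""


def train(epochs, model_path="model.pt", metrics_path="metrics.json"):
    """Toy training loop: the 'weights' are just an epoch counter."""
    model_file = Path(model_path)
    # Resume from the checkpoint output if it already exists.
    start = int(model_file.read_text()) if model_file.exists() else 0
    for epoch in range(start + 1, start + epochs + 1):
        acc = 1.0 - 1.0 / (epoch + 1)  # placeholder metric, not real training
        model_file.write_text(str(epoch))
        Path(metrics_path).write_text(json.dumps({"epoch": epoch, "acc": acc}))
        # Ask DVC to record the current outputs as a checkpoint.
        make_checkpoint()
    return start + epochs


if __name__ == "__main__":
    train(epochs=3)
```

Run outside of `dvc exp run`, the call should do nothing, so the script stays usable on its own.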

it'll start from the existing checkpoint outputs. That's it! Now you have
checkpoints active in your process. While your script is executing epoch runs,
you should see output in the terminal similar to this:

Member

It would be great to have a reference to the code repo. So, users can clone and easily play with it.

```
file:///Users/Repos/dvc-checkpoints-mnist/dvclive.html
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '9b0bf16'.
```

Member

We should mention that the run should be terminated. I'd love to see the exp table / all the checkpoints right after the run.


When you want to integrate checkpoints into your training epochs so you don't have
to run `dvc exp run` each time, you'll need to use the `make_checkpoint`
function.

Member

It is not an alternative. The user is supposed to use both as I mentioned above.

`ini.txt` file.

If you want to start from a specific existing checkpoint, you'll need to run
`dvc exp run --rev 12345` where `12345` is the checkpoint ID for specific

Member

It seems like --rev works only for queue/temp runs, not for this scenario. Created an issue: iterative/dvc#5814

These experiments have IDs associated with each of the checkpoints so you have a
reference to each epoch.

## Starting from an existing checkpoint

Member

I'd start with continuing/resuming training: train for a few epochs, terminate, and then continue training. It is supposed to produce a straight line of metric improvements.

│ │ ╟ 9b0bf16 │ 01:20 PM │ 2 │ 0.9271 │
│ │ ╟ 39d1444 │ 01:19 PM │ 1 │ 0.9157 │
│ ├─╨ fcefafb │ 01:19 PM │ 0 │ 0.8572 │
└───────────────┴──────────┴──────┴────────┘

Member

It seems like the algorithm is too good for showing the value of checkpoints 😄 You are getting 85% accuracy in 1st epoch and 95% in 4th. You might consider using a weaker network architecture or a less aggressive learning rate to show a smoother dynamic.

this: `dvc exp run --reset`.

This removes the `model.pt` file and clears the `dvc.lock` file of any existing
checkpoints.

Member

I'd add an apply scenario: dvc exp apply && modify file && dvc exp run. It might be very useful.

Contributor Author

What does modify file do here? Does that remove the model.pt file since there's no checkpoint id in the apply command?

Member

Modify means - make changes in params.yaml or train.py

@jorgeorpinel
Contributor

Thanks for this contribution @flippedcoder! Are we sure we want to expand the Get Started with this somewhat specific guide (cc @shcheklein)? If not, we can definitely find another place to put this info, and link from https://dvc.org/doc/start/experiments.

Happy to help here once that is decided and @dmitry's comments are addressed.

@dberenbaum
Contributor

dberenbaum commented Apr 14, 2021

Thanks @flippedcoder! This looks awesome, and it helps address an open issue: #2292.

One overarching comment is that we might need to think about where this fits. Having a get started page for checkpoints is important, but appending it to the existing get started section might be a bit confusing at the moment since all the other pages are iterations of the same scenario in https://github.com/iterative/example-get-started and this is a completely new scenario. I think there is some ongoing work to maybe introduce an MNIST version of all the get started pages, but I'm not sure of the status. What do you think @dmpetrov @shcheklein?

Also, the existing https://github.com/iterative/dvc-checkpoints-mnist repo has different branches for a bunch of different scenarios:

  • basic: This is the most basic way to use checkpoints, creating a single checkpoint after the script completes. This is not an expected typical workflow, but it's arguably the simplest way to introduce the concept without having to modify the underlying script.
  • make_checkpoint: This introduces how to generate multiple checkpoints within a script using make_checkpoint().
  • signal_file: This introduces a language-agnostic way to mimic make_checkpoint for non-Python projects.
  • live: This integrates make_checkpoint() with https://github.com/iterative/dvclive.
  • full_pipeline: This adds more general dvc complexity, like multiple stages and plots. It could probably be ignored or deleted unless we develop a use case that discusses integrating those dvc features with this checkpoints example.

We probably need just one of these branches for get started. I have the live branch as the default because it is probably the most natural and fully featured workflow, but make_checkpoint is similar and doesn't have the complication of the dvclive dependency. We could even start with basic and then add make_checkpoint functionality (which looks like kind of what you did here).
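
For reference, the idea behind the signal_file branch can be sketched as follows. This is written from memory of DVC 2.x behavior and should be checked against that branch: the environment variable names `DVC_CHECKPOINT` and `DVC_ROOT` and the path `.dvc/tmp/DVC_CHECKPOINT` are assumptions, and `signal_checkpoint` with its `timeout` parameter is a hypothetical helper. It is shown in Python only for readability; a non-Python project would perform the same create-then-wait steps in its own language.

```python
import os
import time
from pathlib import Path


def signal_checkpoint(poll_seconds=0.1, timeout=None):
    """Ask a running experiment to record a checkpoint, then wait.

    Returns True if the signal file was written and later removed,
    False if no experiment is running or the wait timed out.
    """
    # Assumption: DVC sets DVC_CHECKPOINT only during checkpoint experiments.
    if os.getenv("DVC_CHECKPOINT") is None:
        return False

    # Assumption: DVC_ROOT points at the project root.
    root = Path(os.getenv("DVC_ROOT", "."))
    signal = root / ".dvc" / "tmp" / "DVC_CHECKPOINT"
    signal.parent.mkdir(parents=True, exist_ok=True)
    signal.touch()

    deadline = None if timeout is None else time.monotonic() + timeout
    # DVC is assumed to delete the file once the checkpoint has been saved.
    while signal.exists():
        if deadline is not None and time.monotonic() > deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

The `timeout` is an addition for safety in this sketch, so that a script does not hang forever if nothing consumes the signal file.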

@dmpetrov
Member

@jorgeorpinel & @dberenbaum it is not mentioned here but it is a Draft and the place/url for this doc is not defined yet. What would be the best place to put this document? A new Deep Learning Checkpoints section in Use Cases?

@dberenbaum dberenbaum marked this pull request as draft April 14, 2021 19:28
@dberenbaum
Contributor

dberenbaum commented Apr 14, 2021

Right now, that seems like a good spot for it, but we are also working on developing a docs roadmap, so we might want to reconsider once we have that.

@dmpetrov
Member

dmpetrov commented Apr 14, 2021

Right. But even if we find a better use case for checkpoints, it still makes sense to use the MNIST use case as a temporary solution.

@jorgeorpinel
Contributor

jorgeorpinel commented Apr 14, 2021

helps address an open issue: #2292

The current content doesn't seem to address that, but maybe there should be a section for it?

the existing https://github.com/iterative/dvc-checkpoints-mnist repo has different branches for a bunch of different scenarios

Agree that it would be great to use that here for consistency but up to you. (I'm working on adapting other existing examples to use it too.) Also agree that this content should focus on the main workflow (make_checkpoint) + mention other possibilities, to keep it short.

What would be the best place to put this document?

This content looks like a guide to me. One option is to help @flippedcoder compact this a bit and put it under https://dvc.org/doc/user-guide/experiment-management (which prob should be a section anyway — separate issue rel #2269 and #2367).

@jorgeorpinel
Contributor

working on developing a docs roadmap

BTW this is the first PR in https://github.com/iterative/dvc.org/projects/1 😬

@dberenbaum
Contributor

the one under iterative/

Whoops, that was a mistake. I edited the comment above with the correct link. By the way, I think @flippedcoder is already using that and probably just needs to link to it in the doc.

@shcheklein
Member

This content looks like a guide to me. One option is to help @flippedcoder compact this a bit and put it under https://dvc.org/doc/user-guide/experiment-management

I like this idea! (+ other suggestion to make precise, reuse MNIST if possible, probably make it actionable). This place is indeed a better fit for now. In the get started Experiments section we should put a link for now to the UG section.

@flippedcoder congrats on the first PR! Awesome work 🎉

@dmpetrov
Member

dmpetrov commented Apr 14, 2021

it helps address an open issue: #2292.

@dberenbaum I was thinking it is a somewhat separate scenario, not the basic one. Do you think it makes sense to combine these two?

@flippedcoder you can keep working on the doc here and then move it under Use Cases. Or move it now. Up to you 😄

@jorgeorpinel
Contributor

move it under Use Cases

User Guide right? 🤔 BTW to put it in /user-guide/experiment-management/checkpoints (which I'm suggesting) we'd need other small changes. For now it can be placed directly in /doc/user-guide/checkpoints (below Experiment Management) I think. And I can jump in to finalize the structure before merging.

@dberenbaum
Contributor

it helps address an open issue: #2292.

@dberenbaum I was thinking it is a somewhat separate scenario, not the basic one. Do you think it makes sense to combine these two?

It's fine if it doesn't address #2292. In my mind, that had become a placeholder for adding more robust checkpoints docs.


What's the purpose of this doc? We have needs for multiple checkpoint docs IMO:

  • For getting started, agree with @shcheklein that a link is probably enough (for now).
  • A use case would be great to show how checkpoints could be used in a typical deep learning project workflow and what value they add. This PR could do that, or the use case could be an opportunity for @flippedcoder to come up with a more interesting project to flesh out more.
  • A user guide would also be great to explain how checkpoints work. This PR could also do that. It seems like one suggestion is to compact it and focus on a single workflow like make_checkpoint. I think we might want to start there but also explain the different scenarios in the branches of the repo in the user guide. If there are ones that aren't helpful, we can exclude them, but the make_signal scenario for example seems necessary to me.

By the way, I think this confusion is my own fault for giving this idea to @dmpetrov and not providing enough context to him or @flippedcoder, so sorry about that!

@jorgeorpinel
Contributor

use case to show how checkpoints could be used in a typical deep learning project workflow

I'm not against this idea. It's just that the drafted content so far looks more like an explanation of the checkpoints feature (a DVC guide). Maybe partially the result of @flippedcoder's learning process? (Great practice to write in those situations BTW.)

@dberenbaum
Contributor

use case to show how checkpoints could be used in a typical deep learning project workflow

I'm not against this idea. It's just that the drafted content so far looks more like an explanation of the checkpoints feature (a DVC guide). Maybe partially the result of @flippedcoder's learning process? (Great practice to write in those situations BTW.)

Yup, seems like there's consensus to make this a user guide. What about addressing the scenarios in the different repo branches?

@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 15, 2021 20:37 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp April 15, 2021 20:46 Inactive
@flippedcoder
Contributor Author

Hey everyone! Thanks for all the great feedback! I moved it from the Getting Started section to the User Guide section with the updates @dmpetrov mentioned. I'm working on making the algorithm perform worse to show off checkpoints better. 😅 Otherwise, I'm open to any other feedback or suggestions y'all have.

@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 17:44 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 17:44 Inactive
@shcheklein shcheklein changed the title Add Checkpoints doc to Get Started ug: add checkpoints tutorial May 3, 2021
@jorgeorpinel
Contributor

Do you think https://dvc.org/doc/user-guide/experiment-management#checkpoints-in-source-code needs to be updated to reflect this doc?

It should probably link to this doc, yep! It can be part of a (possible) follow up copy edit PR if needed, or feel free to include it, @flippedcoder .

@dberenbaum
Contributor

dberenbaum commented May 3, 2021

Sounds good! Let's leave it for the copy edit PR then. There's enough going on in this one already.

@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 19:14 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 19:17 Inactive
Comment on lines 195 to 196
```bash
dvc exp run
```

Contributor

@jorgeorpinel jorgeorpinel May 3, 2021

Please use the dvc highlighter (instead of bash) and $ prompts throughout. Here's one change:

Suggested change
```diff
-```bash
-dvc exp run
+```dvc
+$ dvc exp run
```

You'll see output similar to this in your terminal while the training process is
going on.

`` ```dvc ``

Contributor

In its current form (no commands) this one doesn't need highlighting:

Suggested change
```diff
-```dvc
+```
```


You should see something similar to this in your terminal.

`` ```git ``

Contributor

Same for this one and any other that's just terminal output:

Suggested change
```diff
-```git
+```
```

Contributor

@jorgeorpinel jorgeorpinel left a comment

Please ping me if I don't notice when this gets merged and I'll do a copy edit follow up.

@jorgeorpinel jorgeorpinel temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 3, 2021 21:03 Inactive
@flippedcoder flippedcoder force-pushed the mm/checkpoints-doc branch from 5620133 to 6b7c634 Compare May 4, 2021 19:17
@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 19:17 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 19:24 Inactive
@flippedcoder
Contributor Author

Alright! I think I've addressed all of the comments. I tried my best not to miss anything. 😅 It should be ready to merge, but let me know if y'all have any other feedback! @dmpetrov @dberenbaum @jorgeorpinel

@shcheklein shcheklein temporarily deployed to dvc-org-mm-checkpoints--lv46zp May 4, 2021 20:38 Inactive
Contributor

@dberenbaum dberenbaum left a comment

LGTM! 🎉

@shcheklein shcheklein merged commit 7a590a1 into master May 4, 2021
@jorgeorpinel jorgeorpinel deleted the mm/checkpoints-doc branch May 13, 2021 19:51
8 participants