ug: add checkpoints tutorial #2373
@flippedcoder an amazing tutorial ✨
I left a few comments inline.
PS: you can take some ideas from here https://www.youtube.com/watch?v=J8mCr3wVgdA
content/docs/start/checkpoints.md
to be adjusted, and even recover model weights, parameters, and code. You also
have the ability to resume training from previous checkpoints. When you adjust
parameters and code, DVC tracks those changes for you.
We probably should mention that this section is not a continuation of the previous ones in Getting Started.
content/docs/start/checkpoints.md
summary: true
html: true
```
The alternative method should be mentioned somewhere: `dvc stage add --checkpoints model.pt ...`
content/docs/start/checkpoints.md
```

Then you'll want to run your code using the `dvc exp run` command. This means
it'll start from the existing checkpoint outputs. That's it! Now you have
The code instrumentation part is missing. Checkpoints have to be defined at the pipeline level (`dvc.yaml`) as well as at the code level, with `make_checkpoint()` or `DvcLiveCallback()`.
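A minimal sketch of the code-level half of that instrumentation, using a toy loop rather than the tutorial's actual training script. The names `train`, `checkpoint_fn` and `model.json` are hypothetical. In a DVC 2.x project the hook passed in would presumably be `dvc.api.make_checkpoint`, and the file written each epoch would be the output marked as a checkpoint in `dvc.yaml`. A no-op hook is used here so the sketch runs without DVC installed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def train(epochs: int, model_path: Path, checkpoint_fn: Callable[[], None]) -> List[float]:
    """Toy training loop: save the checkpoint output, then signal a checkpoint."""
    weight = 0.0
    history = []
    for epoch in range(epochs):
        # Stand-in for a real optimisation step: move halfway towards 1.0.
        weight += 0.5 * (1.0 - weight)
        history.append(weight)
        # Persist the checkpoint output first (plays the role of model.pt) ...
        model_path.write_text(json.dumps({"epoch": epoch, "weight": weight}))
        # ... then signal that a checkpoint should be recorded.
        checkpoint_fn()
    return history


calls = []
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.json"  # hypothetical file name
    # Assumption: in a DVC 2.x project, checkpoint_fn would be
    # dvc.api.make_checkpoint and the script would be started by `dvc exp run`.
    history = train(3, model_file, checkpoint_fn=lambda: calls.append(1))
    saved = json.loads(model_file.read_text())

print(len(calls), saved["epoch"], history)
```

The ordering matters in this sketch: the output file is written before the hook is called, so whatever records the checkpoint sees the state for that epoch.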
content/docs/start/checkpoints.md
it'll start from the existing checkpoint outputs. That's it! Now you have
checkpoints active in your process. While your script is executing epoch runs,
you should see output in the terminal similar to this:
It would be great to reference the code repo so users can clone it and easily play with it.
content/docs/start/checkpoints.md
file:///Users/Repos/dvc-checkpoints-mnist/dvclive.html
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '9b0bf16'.
```
We should mention that the run should be terminated. I'd love to see the exp table / all the checkpoints right after the run.
content/docs/start/checkpoints.md
When you want to integrate checkpoints into your training epochs so you don't have
to run `dvc exp run` each time, you'll need to use the `make_checkpoint`
function.
It is not an alternative. The user is supposed to use both as I mentioned above.
content/docs/start/checkpoints.md
`ini.txt` file.

If you want to start from a specific existing checkpoint, you'll need to run
`dvc exp run --rev 12345` where `12345` is the checkpoint ID for specific
It seems like `--rev` works only for queue/temp runs, not for this scenario. Created an issue: iterative/dvc#5814
content/docs/start/checkpoints.md
These experiments have IDs associated with each of the checkpoints so you have a
reference to each epoch.

## Starting from an existing checkpoint
I'd start with continuing/resuming training: train for a few epochs, terminate, and then continue training. This is supposed to create a straight line of metric improvements.
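To illustrate the train, terminate, and continue flow, here is a small sketch that is not DVC's mechanism, only the resume logic a training script would need. The function `train` and the file names `interrupted.json` and `straight.json` are hypothetical, and the "accuracy" is a toy metric. It shows that a run interrupted after two epochs and resumed for two more yields the same metric history as four uninterrupted epochs.

```python
import json
import tempfile
from pathlib import Path


def train(epochs: int, state_path: Path) -> dict:
    """Run `epochs` more steps, resuming from state_path when it exists."""
    if state_path.exists():
        # Resume: pick up the step counter and metric history saved earlier.
        state = json.loads(state_path.read_text())
    else:
        state = {"step": 0, "acc": []}
    for _ in range(epochs):
        # Toy metric: halve the remaining error at every step.
        prev = state["acc"][-1] if state["acc"] else 0.0
        state["acc"].append(prev + 0.5 * (1.0 - prev))
        state["step"] += 1
        # Save after every step so a terminated run can be continued.
        state_path.write_text(json.dumps(state))
    return state


with tempfile.TemporaryDirectory() as tmp:
    interrupted = Path(tmp) / "interrupted.json"
    train(2, interrupted)             # train for a few epochs, then "terminate"
    resumed = train(2, interrupted)   # continue from the saved state
    straight = train(4, Path(tmp) / "straight.json")  # uninterrupted reference

print(resumed["step"], resumed["acc"])
```

Because the step counter and history are restored before training continues, the resumed run extends the same metric line instead of starting a new one.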
content/docs/start/checkpoints.md
│ │ ╟ 9b0bf16 │ 01:20 PM │ 2 │ 0.9271 │
│ │ ╟ 39d1444 │ 01:19 PM │ 1 │ 0.9157 │
│ ├─╨ fcefafb │ 01:19 PM │ 0 │ 0.8572 │
└───────────────┴──────────┴──────┴────────┘
It seems like the algorithm is too good for showing the value of checkpoints 😄 You are getting 85% accuracy in 1st epoch and 95% in 4th. You might consider using a weaker network architecture or a less aggressive learning rate to show a smoother dynamic.
content/docs/start/checkpoints.md
this: `dvc exp run --reset`.

This removes the `model.pt` file and clears the `dvc.lock` file of any existing
checkpoints.
I'd add an apply scenario: `dvc exp apply && modify file && dvc exp run`. It might be very useful.
What does `modify file` do here? Does that remove the `model.pt` file since there's no checkpoint ID in the apply command?
"Modify" means making changes in `params.yaml` or `train.py`.
Thanks for this contribution @flippedcoder! Are we sure we want to expand the Get Started with this somewhat specific guide (cc @shcheklein)? If not, we can definitely find another place to put this info, and link from https://dvc.org/doc/start/experiments. Happy to help here once that is decided and @dmitry's comments are addressed.
Thanks @flippedcoder! This looks awesome, and it helps address an open issue: #2292. One overarching comment is that we might need to think about where this fits. Having a get started page for checkpoints is important, but appending it to the existing get started section might be a bit confusing at the moment since all the other pages are iterations of the same scenario in https://github.com/iterative/example-get-started and this is a completely new scenario. I think there is some ongoing work to maybe introduce an MNIST version of all the get started pages, but I'm not sure of the status. What do you think @dmpetrov @shcheklein? Also, the existing https://github.com/iterative/dvc-checkpoints-mnist repo has different branches for a bunch of different scenarios:
We probably need just one of these branches for get started. I have the
@jorgeorpinel & @dberenbaum it is not mentioned here but it is a Draft and the place/url for this doc is not defined yet. What would be the best place to put this document? A new Deep Learning Checkpoints section in Use Cases?
Right now, that seems like a good spot for it, but we are also working on developing a docs roadmap, so we might want to reconsider once we have that.
Right. But even if we found a better use case for checkpoints it still makes sense to use the Mnist use case as a temporary solution.
The current content doesn't seem to address that, but maybe there should be a section for it?
Agree that it would be great to use that here for consistency but up to you. (I'm working on adapting other existing examples to use it too.) Also agree that this content should focus on the main workflow (
This content looks like a guide to me. One option is to help @flippedcoder compact this a bit and put it under https://dvc.org/doc/user-guide/experiment-management (which prob should be a section anyway — separate issue rel #2269 and #2367).
BTW this is the first PR in https://github.com/iterative/dvc.org/projects/1 😬
Whoops, that was a mistake. I edited the comment above with the correct link. By the way, I think @flippedcoder is already using that and probably just needs to link to it in the doc.
I like this idea! (+ other suggestion to make precise, reuse MNIST if possible, probably make it actionable). This place is indeed a better fit for now. In the get started Experiments section we should put a link for now to the UG section. @flippedcoder congrats on the first PR! Awesome work 🎉
@dberenbaum I was thinking it is a somewhat separate scenario, not the basic one. Do you think it makes sense to combine these two? @flippedcoder you can keep working on the doc here and then move it under Use Cases. Or move it now. Up to you 😄
User Guide right? 🤔 BTW to put it in /user-guide/experiment-management/checkpoints (which I'm suggesting) we'd need other small changes. For now it can be placed directly in /doc/user-guide/checkpoints (below Experiment Management) I think. And I can jump in to finalize the structure before merging.
It's fine if it doesn't address #2292. In my mind, that had become a placeholder for adding more robust checkpoints docs. What's the purpose of this doc? We have needs for multiple checkpoint docs IMO:
By the way, I think this confusion is my own fault for giving this idea to @dmpetrov and not providing enough context to him or @flippedcoder, so sorry about that!
I'm not against this idea. It's just that the drafted content so far looks more like an explanation of the checkpoints feature (a DVC guide). Maybe partially the result of @flippedcoder's learning process? (Great practice to write in those situations BTW.)
Yup, seems like there's consensus to make this a user guide. What about addressing the scenarios in the different repo branches?
Restyle Add Checkpoints doc to Get Started
Hey everyone! Thanks for all the great feedback! I moved it from the Getting Started section to the User Guide section with the updates @dmpetrov mentioned. I'm working on making the algorithm perform worse to show off checkpoints better. 😅 Otherwise, I'm open to any other feedback or suggestions y'all have.
Co-authored-by: Dave Berenbaum <[email protected]>
It should probably link to this doc, yep! It can be part of a (possible) follow-up copy edit PR if needed, or feel free to include it, @flippedcoder.
Sounds good! Let's leave it for the copy edit PR then. There's enough going on in this one already.
```bash
dvc exp run
Please use the `dvc` highlighter (instead of `bash`) and `$` prompts throughout. Here's one change:
-```bash
-dvc exp run
+```dvc
+$ dvc exp run
You'll see output similar to this in your terminal while the training process is
going on.

```dvc
In its current form (no commands) this one doesn't need highlighting:
-```dvc
+```
You should see something similar to this in your terminal.

```git
Same for this one and any other that's just terminal output:
-```git
+```
Please ping me if I don't notice when this gets merged and I'll do a copy edit follow up.
5620133 to 6b7c634
Alright! I think I've addressed all of the comments. I tried my best not to miss anything. 😅 It should be ready to merge, but let me know if y'all have any other feedback! @dmpetrov @dberenbaum @jorgeorpinel
LGTM! 🎉
This adds some more detailed documentation around the checkpoints implementation in training models.