This example DVC project demonstrates the different ways to employ Checkpoint Experiments with DVC.
This scenario uses DVCLive to generate checkpoints for iterative model training. The model is a simple convolutional neural network (CNN) classifier trained on the MNIST dataset of handwritten digits to predict the digit (0-9) in each image.
🔄 Switch between scenarios
This repo has several branches that show different methods for using checkpoints (using a similar pipeline); to switch, check out the corresponding branch (see the sketch after this list):
- The live scenario (this one) introduces full-featured checkpoint usage by integrating with DVCLive.
- The basic scenario uses single-checkpoint experiments to illustrate how checkpoints work in a simple way.
- The Python-only variation features the `make_checkpoint` function from DVC's Python API.
- In contrast, the signal file scenario shows how to write your own signal files (applicable to any programming language).
- Finally, our full pipeline scenario elaborates on the full-featured usage with a more advanced process.
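This is only a sketch; the branch names are assumed here to match the scenario names, so run `git branch -a` to see the ones that actually exist:

```
$ git branch -a       # list the available scenario branches
$ git checkout basic  # e.g., switch to the basic scenario
```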
To try it out for yourself:

- Fork the repository and clone it to your local workstation.
- Install the prerequisites in `requirements.txt` (if you are using pip, run
  `pip install -r requirements.txt`).
- Start training the model with `dvc exp run`. It will train for an unlimited
  number of epochs, each of which will generate a checkpoint. Use `Ctrl-C` to
  stop at the last checkpoint, and simply run `dvc exp run` again to resume
  (these commands are sketched below).
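Putting those steps together, a minimal command sketch might look like the
following (the clone URL and directory name are placeholders for your fork, and
a pip-based setup is assumed):

```
$ git clone <your-fork-url>        # placeholder: URL of your fork
$ cd <cloned-directory>            # placeholder: the cloned repo directory
$ pip install -r requirements.txt  # install the prerequisites
$ dvc exp run                      # train; press Ctrl-C to stop at a checkpoint
$ dvc exp run                      # run again to resume from the last checkpoint
```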
DVCLive will track performance at each checkpoint. Open `dvclive.html` in your
web browser during training to track performance over time (you will need to
refresh after each epoch completes to see updates). Metrics will also be logged
to `.tsv` files in the `dvclive` directory.
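You can also peek at those files from the command line while training runs.
This is just a sketch; it assumes the accuracy metric is named `acc` (as in the
`dvc exp show` output below), so the exact file names under `dvclive/` may
differ:

```
$ ls dvclive/           # one .tsv file per logged metric
$ tail dvclive/acc.tsv  # assumed name; per-step accuracy values
```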
Once you stop the training script, you can view the results of the experiment with:
```
$ dvc exp show
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ step ┃ acc    ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━┩
│ workspace     │ -        │    9 │ 0.4997 │
│ live          │ 03:43 PM │    - │        │
│ │ ╓ exp-34e55 │ 03:45 PM │    9 │ 0.4997 │
│ │ ╟ 2fe819e   │ 03:45 PM │    8 │ 0.4394 │
│ │ ╟ 3da85f8   │ 03:45 PM │    7 │ 0.4329 │
│ │ ╟ 4f64a8e   │ 03:44 PM │    6 │ 0.4686 │
│ │ ╟ b9bee58   │ 03:44 PM │    5 │ 0.2973 │
│ │ ╟ e2c5e8f   │ 03:44 PM │    4 │ 0.4004 │
│ │ ╟ c202f62   │ 03:44 PM │    3 │ 0.1468 │
│ │ ╟ eb0ecc4   │ 03:43 PM │    2 │  0.188 │
│ │ ╟ 28b170f   │ 03:43 PM │    1 │ 0.0904 │
│ ├─╨ 9c705fc   │ 03:43 PM │    0 │ 0.0894 │
└───────────────┴──────────┴──────┴────────┘
```
You can manage it like any other DVC experiment, including the following
(sketched below):

- Run `dvc exp run` again to continue training from the last checkpoint.
- Run `dvc exp apply [checkpoint_id]` to revert to any of the prior checkpoints
  (which will update the `model.pt` output file and metrics to that point).
- Run `dvc exp run --reset` to drop all the existing checkpoints and start from
  scratch.
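For instance, using one of the checkpoint IDs from the `dvc exp show` output
above (the IDs in your own run will differ):

```
$ dvc exp run            # resume training from the last checkpoint
$ dvc exp apply 4f64a8e  # roll the workspace back to that checkpoint
$ dvc exp run --reset    # drop existing checkpoints and start over
```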
Using `dvclive` to add checkpoints to a DVC project requires a few additional
lines of code. In your training script, use `dvclive.log()` to log metrics and
`dvclive.next_step()` to make a checkpoint with those metrics. If you need the
current epoch number, use `dvclive.get_step()` (e.g. to use a learning rate
schedule or stop training after a fixed number of epochs). See the `train.py`
script for an example:
```python
import itertools

import dvclive
import torch

# Iterate over training epochs, resuming from the last checkpoint's step.
for epoch in itertools.count(dvclive.get_step()):
    train(model, x_train, y_train)
    torch.save(model.state_dict(), "model.pt")
    # Evaluate and checkpoint.
    metrics = evaluate(model, x_test, y_test)
    for metric, value in metrics.items():
        dvclive.log(metric, value)
    dvclive.next_step()
```
Then, in `dvc.yaml`, add the `checkpoint: true` option to your model output and
a `live` section to your stage. See `dvc.yaml` for an example:
```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pt:
          checkpoint: true
    live:
      dvclive:
        summary: true
        html: true
```
If you do not already have a `dvc.yaml` stage, you can use `dvc stage add` to
create it:

```
$ dvc stage add -n train -d train.py -c model.pt --live dvclive python train.py
```
That's it! For users already familiar with logging metrics in DVC, note that
you no longer need a `metrics` section in `dvc.yaml` since `dvclive` is already
logging metrics.