This example DVC project demonstrates the different ways to employ Checkpoint Experiments with DVC.
This scenario uses DVCLive to generate checkpoints for iterative model training. The model is a simple convolutional neural network (CNN) classifier trained on the MNIST dataset of handwritten digits to predict the digit (0-9) in each image.
🔄 Switch between scenarios
This repo has several branches that show different methods for using checkpoints (using a similar pipeline); to switch, check out the corresponding branch (see the sketch after this list):
- The live scenario (this one) introduces full-featured checkpoint usage by integrating with DVCLive.
- The basic scenario uses single-checkpoint experiments to illustrate how checkpoints work in a simple way.
- The Python-only variation features the `make_checkpoint` function from DVC's Python API.
- In contrast, the signal file scenario shows how to write your own signal files (applicable to any programming language).
- Finally, our full pipeline scenario elaborates on the full-featured usage with a more advanced process.
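This is only a sketch; the branch names are assumed here to match the scenario names, so run `git branch -a` to see the ones that actually exist:

```
$ git branch -a       # list the available scenario branches
$ git checkout basic  # e.g., switch to the basic scenario
```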
To try it out for yourself:

- Fork the repository and clone it to your local workstation.
- Install the prerequisites in `requirements.txt` (if you are using pip, run
  `pip install -r requirements.txt`).
- Start training the model with `dvc exp run`. It will train for an unlimited
  number of epochs, each of which will generate a checkpoint. Use `Ctrl-C` to
  stop at the last checkpoint, and simply run `dvc exp run` again to resume
  (these commands are sketched below).
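Putting those steps together, a minimal command sketch might look like the
following (the clone URL and directory name are placeholders for your fork, and
a pip-based setup is assumed):

```
$ git clone <your-fork-url>        # placeholder: URL of your fork
$ cd <cloned-directory>            # placeholder: the cloned repo directory
$ pip install -r requirements.txt  # install the prerequisites
$ dvc exp run                      # train; press Ctrl-C to stop at a checkpoint
$ dvc exp run                      # run again to resume from the last checkpoint
```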
DVCLive will track performance at each checkpoint. Open `dvclive.html` in your
web browser during training to track performance over time (you will need to
refresh after each epoch completes to see updates). Metrics will also be logged
to `.tsv` files in the `dvclive` directory.
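You can also peek at those files from the command line while training runs.
This is just a sketch; it assumes the accuracy metric is named `acc` (as in the
`dvc exp show` output below), so the exact file names under `dvclive/` may
differ:

```
$ ls dvclive/           # one .tsv file per logged metric
$ tail dvclive/acc.tsv  # assumed name; per-step accuracy values
```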
Once you stop the training script, you can view the results of the experiment with:
```
$ dvc exp show
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━┳━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ step ┃ acc    ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━╇━━━━━━━━┩
│ workspace     │ -        │    9 │ 0.4997 │
│ live          │ 03:43 PM │    - │        │
│ │ ╓ exp-34e55 │ 03:45 PM │    9 │ 0.4997 │
│ │ ╟ 2fe819e   │ 03:45 PM │    8 │ 0.4394 │
│ │ ╟ 3da85f8   │ 03:45 PM │    7 │ 0.4329 │
│ │ ╟ 4f64a8e   │ 03:44 PM │    6 │ 0.4686 │
│ │ ╟ b9bee58   │ 03:44 PM │    5 │ 0.2973 │
│ │ ╟ e2c5e8f   │ 03:44 PM │    4 │ 0.4004 │
│ │ ╟ c202f62   │ 03:44 PM │    3 │ 0.1468 │
│ │ ╟ eb0ecc4   │ 03:43 PM │    2 │  0.188 │
│ │ ╟ 28b170f   │ 03:43 PM │    1 │ 0.0904 │
│ ├─╨ 9c705fc   │ 03:43 PM │    0 │ 0.0894 │
└───────────────┴──────────┴──────┴────────┘
```
You can manage it like any other DVC experiment, including the following
(sketched below):

- Run `dvc exp run` again to continue training from the last checkpoint.
- Run `dvc exp apply [checkpoint_id]` to revert to any of the prior checkpoints
  (which will update the `model.pt` output file and metrics to that point).
- Run `dvc exp run --reset` to drop all the existing checkpoints and start from
  scratch.
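For instance, using one of the checkpoint IDs from the `dvc exp show` output
above (the IDs in your own run will differ):

```
$ dvc exp run            # resume training from the last checkpoint
$ dvc exp apply 4f64a8e  # roll the workspace back to that checkpoint
$ dvc exp run --reset    # drop existing checkpoints and start over
```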
Using `dvclive` to add checkpoints to a DVC project requires a few additional
lines of code. In your training script, use `dvclive.log()` to log metrics and
`dvclive.next_step()` to make a checkpoint with those metrics. If you need the
current epoch number, use `dvclive.get_step()` (e.g. to use a learning rate
schedule or stop training after a fixed number of epochs). See the `train.py`
script for an example:
```python
import itertools

import dvclive
import torch

# Iterate over training epochs, resuming from the last checkpoint's step.
for epoch in itertools.count(dvclive.get_step()):
    train(model, x_train, y_train)
    torch.save(model.state_dict(), "model.pt")
    # Evaluate and checkpoint.
    metrics = evaluate(model, x_test, y_test)
    for metric, value in metrics.items():
        dvclive.log(metric, value)
    dvclive.next_step()
```
Then, in `dvc.yaml`, add the `checkpoint: true` option to your model output and
a `live` section to your stage. See `dvc.yaml` for an example:
```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pt:
          checkpoint: true
    live:
      dvclive:
        summary: true
        html: true
```
If you do not already have a `dvc.yaml` stage, you can use `dvc stage add` to
create it:

```
$ dvc stage add -n train -d train.py -c model.pt --live dvclive python train.py
```
That's it! For users already familiar with logging metrics in DVC, note that
you no longer need a `metrics` section in `dvc.yaml` since `dvclive` is already
logging metrics.