use-cases: second iteration of Data Registry case #818

Merged
merged 32 commits into from
Dec 16, 2019
Changes from 9 commits
c31d971
use-cases: address smaller points from review (#795)
jorgeorpinel Nov 20, 2019
6002cba
use-cases: reinforce hypothetical phrasing in data registry intro par…
jorgeorpinel Nov 21, 2019
47ebae5
use-cases: partitioned->split in data registry case
jorgeorpinel Nov 21, 2019
a578c15
use-cases: geatly simplify mention about project inter-dependency in …
jorgeorpinel Nov 21, 2019
d9ad1ab
use-cases: improve intro to example in data registry case
jorgeorpinel Nov 22, 2019
50b772e
use-cases: rephrase much of the data registry example to improve its …
jorgeorpinel Nov 23, 2019
55ab757
review usage of ellipses thoughout docs
jorgeorpinel Nov 24, 2019
d125437
use-cases: remove remark about imports getting messy
jorgeorpinel Nov 25, 2019
283eef5
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Nov 25, 2019
3cba8f8
use-cases: further simplify intro of data registry case
jorgeorpinel Nov 25, 2019
131a27e
use-cases: separate example into 2 sections, expand on them
jorgeorpinel Nov 25, 2019
a7dc465
use-cases: comlpete "Building a data registry" section in data-registry
jorgeorpinel Nov 25, 2019
57d4059
use-cases: provide high level abstract overview of the Git and DVC co…
jorgeorpinel Nov 26, 2019
c49bc0c
use-cases: simplify intro and 2nd section in data-registry
jorgeorpinel Nov 26, 2019
8c300a2
use-cases: fix typo in data-registry
jorgeorpinel Nov 26, 2019
6854a8b
WIP: use-cases: simplofy middle sections per discussion with Ivan, by
jorgeorpinel Nov 28, 2019
e2d93c7
WIP: use-cases: rewrite middle section of data registry without cats-…
jorgeorpinel Nov 28, 2019
faeb057
use-cases: review Construction and workflow section per private revie…
jorgeorpinel Nov 30, 2019
f4997cb
use-cases: more updates to data registry per private discussion
jorgeorpinel Dec 1, 2019
707a507
use-cases: draft of new Usage section in data registry
jorgeorpinel Dec 3, 2019
f30c1e7
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 10, 2019
7954f59
use-cases: add diagram to data registry
jorgeorpinel Dec 10, 2019
51ee72b
use-cases: improve usage section (adding API section) and
jorgeorpinel Dec 11, 2019
485fc49
use-cases: add note about deployment via dvc.api.open to data registr…
jorgeorpinel Dec 11, 2019
6ccc49f
use-cases: Some updates per private discussion with Ivan
jorgeorpinel Dec 11, 2019
b42c9cf
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 12, 2019
de65290
use-cases: more feedback per private chat with Ivan
jorgeorpinel Dec 12, 2019
53ea7c6
use-cases: updated img subscript for data registry
jorgeorpinel Dec 12, 2019
7887ca2
use-cases: address Alex' feedback on data registry 2nd iteration
jorgeorpinel Dec 13, 2019
175b75a
use-cases: addressing more feedback from Ivan
jorgeorpinel Dec 16, 2019
7a395f8
use-cases: address Alex's feedback from
jorgeorpinel Dec 16, 2019
f9c1a74
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 16, 2019
2 changes: 1 addition & 1 deletion static/docs/command-reference/get.md
Expand Up @@ -163,7 +163,7 @@ different names, and not currently tracked by Git:
$ git status
...
Untracked files:
(use "git add <file>..." to include in what will be committed)
(use "git add <file> ..." to include in what will be committed)

model.bigrams.pkl
model.monograms.pkl
Expand Down
7 changes: 3 additions & 4 deletions static/docs/command-reference/install.md
Expand Up @@ -155,7 +155,7 @@ checkout the `6-featurization` tag:
$ git checkout 6-featurization
Note: checking out '6-featurization'.

You are in 'detached HEAD' state. ...
You are in 'detached HEAD' state...

$ dvc status

Expand Down Expand Up @@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference.
$ git checkout 6-featurization
Note: checking out '6-featurization'.

You are in 'detached HEAD' state. ...
You are in 'detached HEAD' state...

HEAD is now at d13ba9a add featurization stage

Expand Down Expand Up @@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the

```dvc
$ dvc repro evaluate.dvc

... much output
...
To track the changes with git run:

git add featurize.dvc train.dvc evaluate.dvc
Expand Down
2 changes: 1 addition & 1 deletion static/docs/tutorials/deep/reproducibility.md
Expand Up @@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
$ dvc repro
```

Tries to reproduce the same pipeline... But there is still nothing to reproduce.
Tries to reproduce the same pipeline, but there is still nothing to reproduce.

## Adding bigrams

Expand Down
133 changes: 65 additions & 68 deletions static/docs/use-cases/data-registry.md
Expand Up @@ -7,30 +7,24 @@ tracking of datasets and any other <abbr>data artifacts</abbr>.

With the aim to enable reusability of these versioned artifacts between
different projects (similar to package management systems, but for data), DVC
also includes the `dvc get`, `dvc import`, and `dvc update` commands. For
example, project A may use a data file to begin its data
[pipeline](/doc/command-reference/pipeline), but project B also requires this
same file; Instead of
[adding it](/doc/command-reference/add#example-single-file) it to both projects,
B can simply import it from A. Furthermore, the version of the data file
imported to B can be an older iteration than what's currently used in A.
also includes the `dvc get`, `dvc import`, and `dvc update` commands. This means
that a project can depend on data from an external <abbr>DVC project</abbr>.

Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
tracking and versioning datasets (or any kind of large files). This way we would
have a repository that has all the metadata and change history for the project's
data. We can see who updated what, and when; use pull requests to update data
the same way you do with code; and we don't need ad-hoc conventions to store
different data versions. Other projects can share the data in the registry by
downloading (`dvc get`) or importing (`dvc import`) them for use in different
data processes.
have a repository with all the metadata and history of changes in the project's
data. We could see who updated what, and when, use pull requests to update data
(the same way we do with code), and avoid ad-hoc conventions to store different
data versions. This is what we call a data registry. Other projects can share
datasets in a registry by downloading (`dvc get`) or importing (`dvc import`)
them for use in different data processes.

The advantages of using a DVC **data registry** project are:
Advantages of using a DVC **data registry** project:

- Data as code: Improve _lifecycle management_ with versioning of simple
directory structures (like Git for your cloud storage), without ad-hoc
conventions. Leverage Git and Git hosting features such as change history,
branching, pull requests, reviews, and even continuous deployment of ML
models.
conventions. Leverage Git and Git hosting features such as commits, branching,
pull requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize _feature stores_ with a simple CLI
(`dvc get` and `dvc import` commands, similar to software package management
systems like `pip`).
Expand All @@ -49,29 +43,30 @@ The advantages of using a DVC **data registry** project are:

## Example

A dataset we use for several of our examples and tutorials is one containing
2800 images of cats and dogs. We partitioned the dataset in two for our
[Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on a
storage server, downloading them with `wget` in our examples. This setup was
then revised to download the dataset with `dvc get` instead, so we created the
[dataset-registry](https://github.com/iterative/dataset-registry)) repository, a
<abbr>DVC project</abbr> hosted on GitHub, to version the dataset (see its
A dataset we commonly use for several of our examples and tutorials contains
2800 images of cats and dogs, which was split in two for our
[Versioning Tutorial](/doc/tutorials/versioning). Originally, the parts were
backed up on a storage server, and downloaded with
[`wget`](https://www.gnu.org/software/wget/). This was then revised in order to
download the parts with `dvc get` instead, so we created the
[dataset-registry](https://github.com/iterative/dataset-registry)
<abbr>project</abbr> to version the dataset (in the
[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver)
directory).

However, there are a few problems with the way this dataset is structured. Most
importantly, this single dataset is tracked by 2 different
[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same
one, which would better reflect the intentions of this dataset... Fortunately,
we have also prepared an improved alternative in the
However, there are a few problems with the way that dataset is versioned. Most
importantly, this split dataset is tracked by 2 different
[DVC-files](/doc/user-guide/dvc-file-format) (one for each part), instead of 2
versions of a single DVC-file. An initial version could have the first part
only, while an update would have the entire, unified dataset. Fortunately, we
have also prepared this improved alternative in the
[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases)
directory of the same <abbr>DVC repository</abbr>.

To create a
[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
To create the
[initial version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
of our dataset, we extracted the first part into the `use-cases/cats-dogs`
directory (illustrated below), and ran `dvc add use-cases/cats-dogs` to
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).
directory, illustrated below:

```dvc
$ tree use-cases/cats-dogs --filelimit 3
Expand All @@ -85,7 +80,10 @@ use-cases/cats-dogs
└── dogs [400 image files]
```

In a local DVC project, we could have obtained this dataset at this point with
Then we ran `dvc add use-cases/cats-dogs` to
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).

At this point, we could have obtained this dataset in another DVC project with
the following command:

```dvc
Expand All @@ -95,15 +93,16 @@ $ dvc import git@github.com:iterative/dataset-registry.git \

> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
> always needs to run from an [initialized](/doc/command-reference/init) DVC
> project.
> project. Remember also that with both commands, the data comes from the source
> project's remote storage, not from the Git repository itself.

<details>

### Expand for actionable command (optional)

The command above is meant for informational purposes only. If you actually run
it in a DVC project, although it should work, it will import the latest version
of `use-cases/cats-dogs` from `dataset-registry`. The following command would
it, although it will work, it will import the latest version of
`use-cases/cats-dogs` from `dataset-registry`. The following command would
actually bring in the version in question:

```dvc
Expand All @@ -117,54 +116,52 @@ See the `dvc import` command reference for more details on the `--rev`

</details>

Importing keeps the connection between the local project and the source data
registry where we are downloading the dataset from. This is achieved by creating
a particular kind of [DVC-file](/doc/user-guide/dvc-file-format) that uses the
`repo` field (a.k.a. _import stage_). (This file can be used for versioning the
import with Git.)
Importing keeps the connection between the local <abbr>project</abbr> and the
data source (registry <abbr>repository</abbr>). This is achieved by creating a
particular kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import
stage_) that includes a `repo` field. (This file can be staged and
committed with Git.)

> For a sample DVC-file resulting from `dvc import`, refer to
> [this example](/doc/command-reference/import#example-data-registry).

Back in our **dataset-registry** project, a
Back in our **dataset-registry** project, the
[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
of our dataset was created by extracting the second part, with 1000 additional
images (500 cats, 500 dogs), into the same directory structure. Then, we simply
ran `dvc add use-cases/cats-dogs` again.
images (500 cats, 500 dogs) on top of the existing directory structure. Then, we
simply ran `dvc add use-cases/cats-dogs` again.

In our local project, all we have to do in order to obtain this latest version
of the dataset is to run:
All we would have to do in order to obtain this latest version in another
project where the first version was previously imported, is to run:

```dvc
$ dvc update cats-dogs.dvc
```

This is possible because of the connection that the import stage saved among
local and source projects, as explained earlier.

<details>

### Expand for actionable command (optional)

As with the previous hidden note, actually trying the commands above should
produced the expected results, but not for obvious reasons. Specifically, the
initial `dvc import` command would have already obtained the latest version of
the dataset (as noted before), so this `dvc update` is unnecessary and won't
have an effect.
As with the previous hidden note, actually trying the command above will produce
the desired results, but not for obvious reasons. The initial `dvc import`
command would have already obtained the latest version of the dataset (as noted
before), so this `dvc update` is unnecessary and won't have any effect.

If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import
stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to
update it, do not use `dvc update`. Instead, re-import the data by using the
original import command (without `--rev`). Refer to
[this example](http://localhost:3000/doc/command-reference/import#example-fixed-revisions-re-importing)
for more information.
And if you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its
import stage (DVC-file) would be
[fixed to that revision](/doc/command-reference/import#example-fixed-revisions-re-importing)
(`cats-dogs-v1` tag), so `dvc update` would also be ineffective. In order to
actually "update" it, re-import the data instead, by now running the initial
import command (the one without `--rev`):

</details>
```dvc
$ dvc import git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
```

This downloads new and changed files in `cats-dogs/` from the source project,
and updates the metadata in the import stage DVC-file.
</details>

As an extra detail, notice that so far our local project is working only with a
local <abbr>cache</abbr>. It has no need to setup a
[remotes](/doc/command-reference/remote) to [pull](/doc/command-reference/pull)
or [push](/doc/command-reference/push) this dataset.
This is possible because of the connection that the import stage saved between
local and source projects, as explained earlier. The update downloads new and
changed files in `cats-dogs/` based on the source project, and updates the
metadata in the import stage DVC-file.
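
The import, update, and re-import steps discussed in this section can be summarized in a small script. The following is only a rough sketch, not part of the reviewed docs: the helper names `dvc_import_cmd` and `dvc_update_cmd` are hypothetical, the HTTPS form of the registry URL is assumed to be interchangeable with the SSH form used above, and the script only builds and prints the command lines rather than running DVC.

```python
import shlex
from typing import List, Optional

# Assumption: HTTPS address of the registry linked in the text above.
REGISTRY_URL = "https://github.com/iterative/dataset-registry"
DATASET_PATH = "use-cases/cats-dogs"


def dvc_import_cmd(url: str, path: str, rev: Optional[str] = None) -> List[str]:
    """Build the arguments for `dvc import` (hypothetical helper).

    When `rev` is given, the import stage is fixed to that Git revision.
    """
    cmd = ["dvc", "import"]
    if rev is not None:
        cmd += ["--rev", rev]
    cmd += [url, path]
    return cmd


def dvc_update_cmd(stage_file: str) -> List[str]:
    """Build the arguments for `dvc update` on an import stage (hypothetical helper)."""
    return ["dvc", "update", stage_file]


def render(cmd: List[str]) -> str:
    """Quote each argument so the printed line can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # 1. Import the initial version, fixed to the cats-dogs-v1 tag.
    print(render(dvc_import_cmd(REGISTRY_URL, DATASET_PATH, rev="cats-dogs-v1")))
    # 2. Update an import that was NOT fixed to a revision.
    print(render(dvc_update_cmd("cats-dogs.dvc")))
    # 3. For a fixed import, re-import without --rev to get the latest version.
    print(render(dvc_import_cmd(REGISTRY_URL, DATASET_PATH)))
```

Keeping the commands as argument lists means they could later be passed to `subprocess.run` unchanged, should someone want to execute them inside an initialized DVC project.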