RFC: DVC Lite #4539

dmpetrov · 2020-09-07T02:05:51Z

Motivation

Right now DVC works only within Git repositories, which forces users to follow the best engineering practices. As we know, some data scientists are not ready to invest resources in a formal code versioning methodology and use Git properly. This cuts them off not only from data versioning but also from another huge DVC benefit: data transfer.

DVC can provide a holistic way to do data codification and data transfer without any connection to Git or other source code versioning systems.

References

The proposal is based on the idea of self-contained DVC-files that we discussed with some DVC core team members.

This approach has some similarities with dvc-metrics without DVC. It is a requirement from CML-like scenarios: #4446 (comment)

Feedback from user discussions. Some DVC users mentioned that their team members are not fluent with Git, which prevents the team from fully moving to DVC. Other users say that the duality of DVC and Git makes the tool a bit too complicated: commands are duplicated (git-hooks is a good workaround but not a complete solution) and the concept of versioning becomes harder to follow.

This is related to our discussions about datasets and storage improvements - #1487. Some of the functionality from the dataset discussion was not implemented because we found that Git has a good and more holistic approach. However, this conclusion is based on the assumption that users are comfortable using Git. We challenge this assumption in the context of some ML teams that are not ready to embrace the best engineering practices.

Experiment logging tools offer simple model versioning through a Python API. While general feedback about this method is not consistent and usage is not high (compared to the metrics logging functionality of these tools), some teams like this simple approach to ML model versioning.

Vision

Does it mean that DVC Lite can replace DVC? Absolutely not. DVC Lite is a lightweight workaround for some pinpoint problems. It should be a good fit for ML teams that are not ready to fully embrace the best engineering practices but still need basic data and model versioning functionality. As ML processes mature, more and more teams will move to the proper Git-based flow for model and data versioning. The proper Git-based flow of data versioning has great benefits and support on the infrastructure side: GitHub/GitLab/BitBucket with all their features.

Ideas about the implementation

The data transfer part can be borrowed completely from core DVC. However, we need to come up with a metafile format that describes all the information needed for data transfer: file names, versions, data cache dir (if one exists), and data remotes.

The codification (metafiles) part can be implemented by DVC-file-like metafiles (ideally compatible with the original DVC files). The absence of Git requires an introduction of some new concepts for tracking and navigating among the versions. Some ideas:

File checksums can still be used for the underlying file names (in data storage).
Semantic versions (1.0, 1.1, ...) can be used instead of the file checksum.
Comments might be needed to describe the changes.
Labels might be needed instead of branches, like the production and staging models in mlflow.
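
To make the ideas above a bit more concrete, here is a minimal Python sketch of what a self-contained metafile could hold. Everything in it is an assumption for illustration only: the names `VersionEntry`, `Metafile`, and `file_checksum`, the field layout, the JSON serialization, and the use of MD5 are not an agreed-upon format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VersionEntry:
    """One tracked version of a file (hypothetical layout)."""
    semver: str        # human-facing version, e.g. "0.1.0"
    checksum: str      # content hash, usable as the object name in storage
    comment: str = ""  # free-text description of the change
    labels: list = field(default_factory=list)  # e.g. ["dev"], used instead of branches


@dataclass
class Metafile:
    """Everything needed to version and transfer one file without Git."""
    path: str                                     # tracked file, e.g. "segm/model.h5"
    versions: list = field(default_factory=list)  # all versions, oldest first
    remotes: dict = field(default_factory=dict)   # remote name -> URL

    def to_json(self) -> str:
        # asdict() recurses into the nested VersionEntry dataclasses
        return json.dumps(asdict(self), indent=2)


def file_checksum(data: bytes) -> str:
    """Hash the file content; MD5 is an assumption made for this sketch."""
    return hashlib.md5(data).hexdigest()


meta = Metafile(path="segm/model.h5")
meta.versions.append(
    VersionEntry(
        semver="0.1.0",
        checksum=file_checksum(b"placeholder model bytes"),
        comment="Fit-predict and nothing more",
        labels=["dev"],
    )
)
meta.remotes["segm_remote"] = "s3://mybucket/segm-model/segm/model.h5"
print(meta.to_json())
```

Keeping every version, comment, label, and remote in one file is what makes the metafile self-contained: copying that single file to another machine would be enough to locate and fetch any version.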

Examples

Regarding the API: it might be a separate set of commands, dvclite, or it might be hidden under the dvc umbrella.

Example 1: Basic model versioning

A user does not use Git but still needs to track file versions and transfer files between machines or storage locations.

$ dvclite add --semver 0.1.0 -c 'Fit-predict and nothing more' --label 'dev' segm/model.h5
Created DVC metafile 'segm/model.h5.dvc' and local cache dir 'segm/model.h5.cache'
model.h5: 0.1.0 assigned

When the file changes:

$ dvclite add -c 'normalize & rm empty images' segm/model.h5
Updated DVC metafile: segm/model.h5.dvc
model.h5: 0.1.1 incremented patch version
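
The automatic patch increment shown above could work roughly as in the following Python sketch. The function names `bump_patch` and `next_version`, the `(semver, checksum)` history layout, and the rule of keeping the version when the content is unchanged are all assumptions, not decided behavior.

```python
def bump_patch(semver: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in semver.split("."))
    return f"{major}.{minor}.{patch + 1}"


def next_version(history, new_checksum, explicit=None):
    """Pick the version for a newly added file state.

    history: list of (semver, checksum) tuples, oldest first (assumed layout).
    explicit: a version passed by the user, which always wins.
    """
    if explicit is not None:
        return explicit
    if not history:
        # Assumption: the first add must name its version explicitly.
        raise ValueError("the first version must be given explicitly")
    last_semver, last_checksum = history[-1]
    if new_checksum == last_checksum:
        return last_semver  # content unchanged, so no new version
    return bump_patch(last_semver)


history = [("0.1.0", "aaa111")]
print(next_version(history, "bbb222"))
```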

The metafile (segm/model.h5.dvc) should contain all the versions. Get the old model:

$ dvclite update --semver 0.1.0 segm/model.h5

Get back the most recent one:

$ dvclite update segm/model.h5
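
Since the metafile holds all versions, choosing what to restore is a lookup: an explicit version if one is given, otherwise the most recent entry. A hedged Python sketch follows. The `resolve` function, the dict layout of each entry, and the label lookup (based on the labels idea from the implementation notes) are hypothetical.

```python
def resolve(versions, semver=None, label=None):
    """Return the checksum to restore.

    versions: list of dicts with 'semver', 'checksum', and 'labels' keys,
    oldest first (an assumed layout, not a defined format).
    """
    if semver is not None:
        for entry in versions:
            if entry["semver"] == semver:
                return entry["checksum"]
        raise KeyError(f"unknown version: {semver}")
    if label is not None:
        # The newest entry carrying the label wins.
        for entry in reversed(versions):
            if label in entry["labels"]:
                return entry["checksum"]
        raise KeyError(f"unknown label: {label}")
    if not versions:
        raise KeyError("metafile has no versions")
    return versions[-1]["checksum"]  # default: the most recent version


versions = [
    {"semver": "0.1.0", "checksum": "aaa111", "labels": ["dev"]},
    {"semver": "0.1.1", "checksum": "bbb222", "labels": []},
]
print(resolve(versions, semver="0.1.0"))
print(resolve(versions))
```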

Remote in metafiles:

$ dvclite remote add segm_remote s3://mybucket/segm-model/segm/model.h5
Remote 'segm_remote' was added to metafile 'segm/model.h5.dvc'

Data transfer:

$ dvclite push -a segm/model.h5.dvc

Example 2: Cache in the cloud (no local cache)

Save all versions in the data remote by default:

$ dvclite add --semver 0.1.0 -c 'Fit-predict and nothing more' --label 'dev' --cache s3://mybucket/segm-model/ segm/model.h5
Created DVC metafile 'segm/model.h5.dvc' and cache dir 's3://mybucket/segm-model/segm/model.h5'
model.h5: 0.1.0 assigned
Data was transferred to the cache

RFC

Please provide your comments and feedback. Any comments and suggestions are welcome.

@iterative/engineering

Suor · 2020-09-07T10:22:09Z

A couple of thoughts:

Is it valuable enough? From what you say, it would be a project with a shrinking user base. Or do you intend to use it as a gateway drug?)
We need the commands to be consistent with dvc, e.g. dvclite checkout instead of dvclite update.
There is no reason to use special remote config handling different from whatever we have in dvc, e.g. dvclite remote commands and no --cache flags.

dmpetrov · 2020-09-07T22:14:34Z

@Suor

Yes, in some sense, it is a "gateway drug" to best practices. I could not say I like this metaphor.
💯 Ideally, it should be inside DVC. But introducing another set of options might not be the best idea. I'd design the API independently of DVC and then think about whether we can fit it in.
Probably. It depends on the set of new commands.

majidaldo · 2020-09-08T15:17:45Z

I can't see how the failure of ML teams to embrace best practices is DVC's issue. You must manage code and data. Period. DVC has literally made this as easy as possible (given their treatment as separate but associated 'things'). I just can't see how it could get simpler.

dmpetrov · 2020-09-09T05:48:41Z

"I just can't see how it could get simpler."

I agree if we are talking about the entire process. But in some specific cases, we can simplify the experience.

dmpetrov · 2020-09-09T05:50:32Z

It might be related to data\remote types feature #4040 since the self-contained dvc-file is kind of its own data\remote type.

dmpetrov added the feature request label Sep 7, 2020

dmpetrov mentioned this issue Oct 28, 2020

Better way to add large directories #4782

Closed

efiop closed this as completed May 3, 2021

iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion.
