This issue was moved to a discussion.

RFC: DVC Lite #4539

Closed
dmpetrov opened this issue Sep 7, 2020 · 5 comments
Labels
feature request Requesting a new feature

Comments

@dmpetrov
Member

dmpetrov commented Sep 7, 2020

Motivation

Right now DVC works only within Git repositories, which forces users to follow the best engineering practices. As we know, some data scientists are not ready to invest resources in a formal code versioning methodology and to use Git properly. This cuts them off not only from data versioning but also from another huge DVC benefit: data transfer.

DVC can provide a holistic way of data codification and data transferring functionality without connection to Git or other source code versioning systems.

References

The proposal is based on the idea of self-contained DVC-files that we discussed with some DVC core team members.

This approach has some similarities with dvc-metrics without DVC. It is a requirement from CML-like scenarios: #4446 (comment)

Feedback from users' discussions. Some DVC users mentioned that their team members are not fluent with Git, which prevents the team from fully moving to DVC. Other users say that the duality of DVC and Git makes the tool a bit too complicated: commands are duplicated (git-hooks is a good workaround but not a complete solution) and the concept of versioning becomes more complicated.

This is related to our discussions about datasets and storage improvements - #1487. Some of the functionality from the dataset discussion was not implemented because we found that Git has a good and more holistic approach. However, this conclusion is based on the assumption that users are comfortable using Git. We challenge this assumption in the context of some ML teams that are not ready to embrace the best engineering practices.

The experiment logging tools offer simple model versioning as a Python API. While general feedback about this method is not consistent and usage is not high (compared to the metrics logging functionality of these tools), some teams like this simple approach to ML model versioning.

Vision

Does it mean that DVC Lite can replace DVC? Absolutely not. DVC Lite is a lightweight workaround for some pinpoint problems. It should be a good fit for ML teams that are not ready to fully embrace the best engineering practices but still need basic data and model versioning functionality. As ML processes mature, more and more teams will move to the proper Git-based flow for model and data versioning. The proper Git-based flow of data versioning has great benefits and support on the infrastructure side: GitHub/GitLab/BitBucket with all their features.

Ideas about the implementation

The data transfer part can be borrowed completely from the core DVC. However, we need to come up with a metafile format that describes all the information needed for data transfer: file names, versions, data cache dir (if it exists), data remotes.

The codification (metafiles) part can be implemented by DVC-file-like metafiles (ideally compatible with the original DVC files). The absence of Git requires an introduction of some new concepts for tracking and navigating among the versions. Some ideas:

  1. File checksums can be still in use for underlying file names (in data storage).
  2. Semantic versions (1.0, 1.1, ...) can be used instead of the file checksum.
  3. Comments might be needed to describe the changes.
  4. Labels might be needed instead of branches. Like production and staging models in mlflow.
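
To make the four ideas above more concrete, here is a minimal Python sketch of what a self-contained metafile could hold. Everything in it is an assumption made for illustration: the names `Metafile`, `VersionEntry` and `bump_patch`, the use of SHA-256 as the checksum, and the rule that a missing semver means "increment the patch number" are not part of DVC or of any agreed design.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def bump_patch(semver: str) -> str:
    """Return the next patch version, e.g. '0.1.0' -> '0.1.1'."""
    major, minor, patch = (int(part) for part in semver.split("."))
    return f"{major}.{minor}.{patch + 1}"


@dataclass
class VersionEntry:
    semver: str        # human-facing version (idea 2)
    checksum: str      # content hash, doubles as the object name in storage (idea 1)
    comment: str = ""  # free-text description of the change (idea 3)


@dataclass
class Metafile:
    """Hypothetical self-contained metafile: one tracked path, all its versions."""
    path: str
    versions: List[VersionEntry] = field(default_factory=list)
    labels: Dict[str, str] = field(default_factory=dict)  # label -> semver (idea 4)

    def add(self, data: bytes, semver: Optional[str] = None,
            comment: str = "", label: Optional[str] = None) -> VersionEntry:
        if semver is None:
            if not self.versions:
                raise ValueError("the first version needs an explicit semver")
            semver = bump_patch(self.versions[-1].semver)
        entry = VersionEntry(semver, hashlib.sha256(data).hexdigest(), comment)
        self.versions.append(entry)
        if label is not None:
            self.labels[label] = semver  # move or create the label
        return entry


# Usage: the byte strings stand in for the real file contents.
meta = Metafile("segm/model.h5")
meta.add(b"weights-v1", semver="0.1.0",
         comment="Fit-predict and nothing more", label="dev")
second = meta.add(b"weights-v2", comment="normalize & rm empty images")
print(second.semver)  # 0.1.1
```

Keeping the checksum and the semantic version side by side means storage can stay content-addressed while users only ever see the readable version numbers.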

Examples

Regarding the API: it might be a separate set of commands, dvclite, or it might be hidden under the dvc umbrella.

Example 1: Basic model versioning

A user just does not use Git but still needs to track file versions and transfer the files between machines or storages.

$ dvclite add --semver 0.1.0 -c 'Fit-predict and nothing more' --label 'dev' segm/model.h5
Created DVC metafile 'segm/model.h5.dvc' and local cache dir 'segm/model.h5.cache'
model.h5: 0.1.0 assigned

When the file changes:

$ dvclite add -c 'normalize & rm empty images' segm/model.h5
Updated DVC metafile: segm/model.h5.dvc
model.h5: 0.1.1 incremented patch version

The metafile (segm/model.h5.dvc) should contain all the versions. Get the old model:

$ dvclite update --semver 0.1.0 segm/model.h5

Get back the most recent one:

$ dvclite update segm/model.h5
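
Since the metafile is supposed to contain all the versions, selecting one for such an update could come down to a lookup like the sketch below. The dictionary layout, the `resolve` function and the checksum values are hypothetical placeholders, not an existing DVC format or API.

```python
from typing import Optional, Tuple

# Hypothetical metafile content; the checksums are made-up placeholders.
metafile = {
    "path": "segm/model.h5",
    "versions": [
        {"semver": "0.1.0", "checksum": "aaa111"},
        {"semver": "0.1.1", "checksum": "bbb222"},
    ],
    "labels": {"dev": "0.1.0"},
}


def semver_key(semver: str) -> Tuple[int, ...]:
    """Turn '0.1.10' into (0, 1, 10) so versions compare numerically."""
    return tuple(int(part) for part in semver.split("."))


def resolve(metafile: dict, semver: Optional[str] = None,
            label: Optional[str] = None) -> dict:
    """Pick a version entry: by label, by exact semver, or the latest one."""
    if semver is not None and label is not None:
        raise ValueError("pass either semver or label, not both")
    if label is not None:
        semver = metafile["labels"][label]  # KeyError if the label is unknown
    versions = metafile["versions"]
    if not versions:
        raise LookupError("metafile has no versions")
    if semver is None:
        return max(versions, key=lambda v: semver_key(v["semver"]))
    for entry in versions:
        if entry["semver"] == semver:
            return entry
    raise LookupError(f"no version {semver!r} in metafile")


print(resolve(metafile)["semver"])                  # 0.1.1
print(resolve(metafile, semver="0.1.0")["checksum"])  # aaa111
```

The returned checksum is what a checkout step would then fetch from the cache or remote and write to `path`.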

Remote in metafiles:

$ dvclite remote add segm_remote s3://mybucket/segm-model/segm/model.h5
Remote 'segm_remote' was added to metafile 'segm/model.h5.dvc'

Data transfer:

$ dvclite push -a segm/model.h5.dvc
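
As a rough illustration of what such a push might do (reading `-a` as "all versions", which is an assumption), the sketch below copies every object named by checksum from a cache directory to a remote directory and skips objects that are already there. The `push` function is hypothetical, and plain local directories stand in for a real remote such as S3.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def push(checksums: Iterable[str], cache_dir: Path, remote_dir: Path) -> List[str]:
    """Copy cached objects that the remote lacks; return the checksums sent."""
    remote_dir.mkdir(parents=True, exist_ok=True)
    sent = []
    for checksum in checksums:
        target = remote_dir / checksum
        if target.exists():
            continue  # content-addressed: same name implies same content
        shutil.copyfile(cache_dir / checksum, target)
        sent.append(checksum)
    return sent


# Usage with throwaway directories and placeholder checksums.
with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp, "cache"), Path(tmp, "remote")
    cache.mkdir()
    (cache / "aaa111").write_bytes(b"weights for 0.1.0")
    (cache / "bbb222").write_bytes(b"weights for 0.1.1")
    print(push(["aaa111", "bbb222"], cache, remote))  # ['aaa111', 'bbb222']
    print(push(["aaa111", "bbb222"], cache, remote))  # []
```

Because objects are named by checksum, a repeated push transfers nothing, which is the same property the core DVC cache relies on.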

Example 2: Cache in the cloud (no local cache)

Save all versions in the data remote by default:

$ dvclite add --semver 0.1.0 -c 'Fit-predict and nothing more' --label 'dev' --cache s3://mybucket/segm-model/ segm/model.h5
Created DVC metafile 'segm/model.h5.dvc' and cache dir 's3://mybucket/segm-model/segm/model.h5'
model.h5: 0.1.0 assigned
Data was transferred to the cache

RFC

Please provide your comments and feedback. Any comments and suggestions are welcome.

@iterative/engineering

@dmpetrov dmpetrov added the feature request Requesting a new feature label Sep 7, 2020
@Suor
Contributor

Suor commented Sep 7, 2020

A couple of thoughts:

  • is it valuable enough? From what you say it would be a project with a shrinking user base. Or do you intend to use it as a gateway drug?)
  • we need the commands to be consistent with dvc, e.g. dvclite checkout instead of dvclite update
  • there is no reason to use some special remote config handling different from whatever we have in dvc, e.g. dvclite remote commands and no --cache flags.

@dmpetrov
Member Author

dmpetrov commented Sep 7, 2020

@Suor

  • yes, in some sense, it is a "gateway drug" to best practices. I can't say I like this metaphor.
  • 💯 Ideally, it should be inside DVC. But introducing another set of options might not be the best idea. I'd design the API independently of DVC and then think about whether we can fit it in.
  • probably. it depends on the set of new commands.

@majidaldo

majidaldo commented Sep 8, 2020

I can't see how the failure of ML teams to embrace best practices is DVC's issue. You must manage code and data. Period. DVC has literally made this as easy as possible (given their treatment as separate but associated 'things'). I just can't see how it could get simpler.

@dmpetrov
Member Author

dmpetrov commented Sep 9, 2020

> I just can't see how it could get simpler.

I agree if we are talking about the entire process. But in some specific cases, we can simplify the experience.

@dmpetrov
Member Author

dmpetrov commented Sep 9, 2020

It might be related to data\remote types feature #4040 since the self-contained dvc-file is kind of its own data\remote type.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021
