This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
RFC: DVC Lite #4539
Comments
A couple of thoughts:
I can't see how the failure of ML teams to embrace best practices is DVC's issue. You must manage code and data. Period. DVC has literally made this as easy as possible (given their treatment as separate but associated 'things'). I just can't see how it could get simpler.
I agree if we are talking about the entire process. But in some specific cases, we can simplify the experience.
It might be related to data\remote types feature #4040 since the self-contained dvc-file is kind of its own data\remote type.
Motivation
Right now DVC works only within Git repositories, which forces users to follow best engineering practices. As we know, some data scientists are not ready to invest resources in a formal code versioning methodology and use Git properly. This cuts them off not only from data versioning but also from another huge DVC benefit - data transferring.
DVC could provide data codification and data transferring functionality in a holistic way, without a connection to Git or other source code versioning systems.
References
The proposal is based on the idea of self-contained DVC-files that we discussed with some DVC core team members.
This approach has some similarities with dvc-metrics without DVC. It is a requirement from CML-like scenarios: #4446 (comment)
Feedback from users' discussions. Some DVC users mentioned that their team members are not fluent with Git and that this prevents the team from fully moving to DVC. Other users say that the duality of DVC and Git makes the tool a bit too complicated: commands are duplicated (git-hooks are a good workaround but not a complete solution) and the concept of versioning becomes harder to follow.
This is related to our discussions about datasets and storage improvements - #1487. Some of the functionality from the dataset discussion was not implemented because we found that Git has a good and more holistic approach. However, this conclusion is based on the assumption that users are comfortable using Git. We challenge this assumption in the context of some ML teams that are not ready to embrace the best engineering practices.
Experiment logging tools offer simple model versioning through a Python API. While general feedback about this method is not consistent and usage is not high (compared to the metrics logging functionality of these tools), some teams like this simple approach to ML model versioning.
Vision
Does it mean that DVC Lite can replace DVC? Absolutely not. DVC Lite is a lightweight workaround for some pinpoint problems. It should be a good fit for ML teams that are not ready to fully embrace the best engineering practices but still need basic data and model versioning functionality. As ML processes mature, more and more teams will move to the proper Git-based flow for model and data versioning. That flow has great benefits and support on the infrastructure side - GitHub/GitLab/BitBucket with all their features.
Ideas about the implementation
The data transferring part can be completely borrowed from the core DVC. However, we need to come up with a format of metafiles that describes all the information needed for data transferring: file names, versions, data cache dir (if it exists), data remotes.
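As a minimal sketch of what such a metafile might hold, the Python below models the four pieces of information listed above. Every name here (`Metafile`, `Version`, the field names, the example remote URL) is hypothetical and not part of DVC. JSON is used only to keep the example within the standard library, while real DVC-files are YAML.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Version:
    """One recorded version of the tracked file (hypothetical layout)."""
    md5: str   # content hash identifying this version
    size: int  # file size in bytes


@dataclass
class Metafile:
    """Everything needed to transfer a file without Git (hypothetical)."""
    path: str                                              # tracked file name
    versions: List[Version] = field(default_factory=list)  # oldest first
    cache_dir: Optional[str] = None                        # None: no local cache
    remotes: List[str] = field(default_factory=list)       # data remote URLs

    def to_json(self) -> str:
        # asdict() also converts the nested Version objects to plain dicts.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Metafile":
        raw = json.loads(text)
        raw["versions"] = [Version(**v) for v in raw["versions"]]
        return cls(**raw)


meta = Metafile(
    path="model.h5",
    versions=[Version(md5="d41d8cd98f00b204e9800998ecf8427e", size=0)],
    remotes=["s3://example-bucket/dvclite"],  # made-up remote URL
)
restored = Metafile.from_json(meta.to_json())
```

Keeping the remotes and cache location inside the metafile is what makes it self-contained: the file alone is enough to locate every version.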
The codification (metafiles) part can be implemented by DVC-file-like metafiles (ideally compatible with the original DVC files). The absence of Git requires an introduction of some new concepts for tracking and navigating among the versions. Some ideas:
production and staging models in mlflow.
Examples
Regarding the API - it might be a separate set of commands, dvclite, or it might be hidden under the dvc umbrella.
Example 1: Basic model versioning
A user just does not use Git but still needs to track file versions and transfer the files between machines or storages.
When the file changes:
The metafile (segm/model.h5.dvc) should contain all the versions. Get the old model:
Get back the recent one:
Remote in metafiles:
Data transferring:
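To make the version handling in Example 1 more concrete, here is one possible sketch of a metafile-side history that records a new version when the file changes and can return either an old version or the most recent one. It assumes simple linear, 1-based version numbers, which is only one of the options this RFC leaves open. `VersionHistory` and its methods are hypothetical names, not DVC commands or API.

```python
from typing import List, Optional


class VersionHistory:
    """Linear, oldest-first list of content hashes for one tracked file."""

    def __init__(self) -> None:
        self._hashes: List[str] = []

    def add(self, content_hash: str) -> int:
        """Record a version and return its 1-based number."""
        if self._hashes and self._hashes[-1] == content_hash:
            # File content is unchanged, so no new version is created.
            return len(self._hashes)
        self._hashes.append(content_hash)
        return len(self._hashes)

    def get(self, number: Optional[int] = None) -> str:
        """Return the hash of a given version, or of the latest if None."""
        if not self._hashes:
            raise LookupError("no versions recorded")
        if number is None:
            return self._hashes[-1]
        if not 1 <= number <= len(self._hashes):
            raise LookupError(f"unknown version {number}")
        return self._hashes[number - 1]


history = VersionHistory()
history.add("aaa")            # first version of the model
history.add("bbb")            # the file changed
old_model = history.get(1)    # get the old model
recent_model = history.get()  # get back the recent one
```

The returned hash would then be used to fetch the matching content from the cache or a data remote.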
Example 2: Cache in the cloud (no local cache)
Save all versions in the data remote by default
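A rough sketch of this "no local cache" mode, assuming every version is stored in the remote under its content hash. A local directory stands in for the remote, and `push_version` is a hypothetical helper, not a DVC function. The two-level md5 layout is loosely modeled on how DVC organizes its cache and should be read as an assumption here.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash the file in chunks so large models need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def push_version(path: Path, remote_root: Path) -> str:
    """Copy the current file content straight into the remote.

    Returns the content hash, which the metafile would record as a
    new version. Content that is already stored is not copied again.
    """
    md5 = file_md5(path)
    target = remote_root / md5[:2] / md5[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    return md5
```

Because objects are keyed by content hash, pushing a changed file adds a new object and leaves older versions in place, so the remote itself holds the full history.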
RFC
Please provide your comments and feedback. Any comments and suggestions are welcome.
@iterative/engineering