Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import: add a mechanism to lock external dependencies #2139

Closed
efiop opened this issue Jun 17, 2019 · 6 comments
Closed

import: add a mechanism to lock external dependencies #2139

efiop opened this issue Jun 17, 2019 · 6 comments
Assignees
Labels
awaiting response we are waiting for your reply, please respond! :) feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@efiop
Copy link
Contributor

efiop commented Jun 17, 2019

From #2012

@efiop efiop added feature request Requesting a new feature p1-important Important, aka current backlog of things to do labels Jun 17, 2019
@efiop efiop mentioned this issue Jun 17, 2019
10 tasks
@efiop
Copy link
Contributor Author

efiop commented Jun 18, 2019

Related #1487

@shcheklein shcheklein changed the title pkg: add pkg-list and pkg-lock files import: add a mechanism to lock external dependencies Jun 21, 2019
@shcheklein
Copy link
Member

Summary of changes

The most recent PR #2160 and a few previous merged PRs have introduced a mechanism to reuse (import) a data artifact (model or data file or directory) from one Github repository into another:

cd myrepo
dvc import https://github.com/iterative/example-get-started model.pkl

creates a DVC-file that has a dependency to the (the repo field below) https://github.com/iterative/example-get-started, and specifically to the model.pkl file inside produced by the DVC-pipeline inside it:

md5: 39c9c563afc97491c86db391afd9cf82
wdir: .
deps:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  repo:
    url: https://github.com/iterative/example-get-started
outs:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33
  path: model.pkl
  cache: true
  metric: false
  persist: false

this command also automatically "executes" this DVC-file, more specifically, it brings the model.pkl file into the workspace.

The existing dvc import command has been moved to the dvc import-url.

What is the "lock" mechanism about?

Now that we have an external dependency in some sense, the question is what should we do when we run dvc status, dvc repro? Pull the latest version as we do with dvc import-url every single time?

After a few discussions we have decided that we need a common semantics for all external dependencies - import and import-url. Semantics should be aligned with the DVC purpose in the first place - ensure the reproducibility.

It means that if we, say, in one moth decide to do git checkout <1-moth-ago> in myrepo, to get that same DVC-file from the above, DVC must guarantee that we are getting the same result (the same model.pkl in the first place) if we run repro. Even if https://github.com/iterative/example-get-started has changed a lot since that.

The same for import-url which can import a file via https, from s3, etc to the workspace. Right now it always checks and updates the data in the workspace when you run dvc repro.

Thus we need to introduce some "lock" mechanism for these "import" DVC-files. They need to capture somewhere all the necessary context to get the same data when it's needed. It does not necessarily mean that we need some new fields, we might end up reusing what we have already and just change behavior.

Implementation options

  1. aka Distibuted - add a special field for dvc import produced DVC-file, more specifically rev-lock that will contain a specific hash of the git commit the repo was fetched initially from:
...
  repo:
    url: https://github.com/iterative/example-get-started
    rev-lock: 123456dfdfgdfgdf
...

So, each DVC-file is self-contained and has all the information to restore that specific version of the repo initial data was taken from.

Cons/pros for this option:

✅No additional files or concepts, easier to implement (at least initially);
✅We can easily work with multiple versions of the same repo;
❌We are going even further into DVC-files being machine-controlled, not human;
❌Consequent dvc import of different data artifacts from the same Github repo may lock onto different versions of the repo. This semantics might be confusing. Even though the repo is installed already, the same dvc import might fail since remote repo has changed.
❌Mass-update becomes a mess. Since we now have multiple files containing the same rev-lock information to update a singe dependency we will have to update all these files. Do we need this? Yes, if we introduce aliases, it would be nice to have something like dvc update requests, instead of dvc update github.com/sdfdsds/requests.
❌Duplication of information means less control. For example, if I have two dependencies - a model and a dict from the same remote repo. How do I make sure that two DVC-files are always in sync. DVC does not guarantee that. It will be easier to update only one.

  1. aka Centralized - introduce a separate file .dvc/repos/lock (and may be .dvc/repos/list in the future to support aliases or work with multiple model/data versions) (place and names TBD). Every time we do dvc import github/something data we update those files automatically, add put installed repo revision hash to the lock file.

✅Repo update becomes easier - a single place to update.
✅No information duplication - less probability of an error.
✅Consequent dvc import commands install data from the same repo. It's easier to predict the outcome.
❌Update affects all stages that depend on it. It's fine if we don't want to depend on multiple versions of the same repo. Sometimes it might lead to mass update of a lot of pipelines we actually do not want even touching.
❌Calculating status, pull, etc becomes more complicates - need to go and check a central lock file. For example, md5 of the DVC-file itself now has to include some information from
❌Harder to implement (at least) initially.
❌Requires introducing aliases to deal with multiple versions of the repo.

Anything else I forgot @efiop @dmpetrov @Suor? @MrOutis @pared @jorgeorpinel @villasv @ei-grad @sotte let's discuss, share you opinions and votes? Let me know if something is not clear - a lot of information to digest :)

@ghost ghost added question I have a question? awaiting response we are waiting for your reply, please respond! :) and removed question I have a question? labels Jun 25, 2019
@ghost
Copy link

ghost commented Jun 25, 2019

Mass-update becomes a mess.

Isn't the same as doing a git pull repo && find -name *.dvc | grep repo | sed s/rev-lock: old-hash/rev-lock: new-hash/ && dvc checkout ?

@shcheklein
Copy link
Member

yep, but it can touch potentially a lot of files. Imagine you would have to update every single python code file that has import requests every time you are updating requests library.

Also, what should I do if some of them are not in sync? Should I update all of them anyway?

@ghost
Copy link

ghost commented Jun 26, 2019

What do you mean by not in sync, @shcheklein, also, why DVC is allowing files to be out-of-sync, is it useful? Is it like a submodule of requests (let's say requests.adapters) in v1 and the other submodule (maybe requests.api) in v2?

@shcheklein
Copy link
Member

yep, you got it right! Since we essentially have a lock per DVC-files nothing prevents us to lock two files to two different versions (revisions) of the same repository.

It's hard to predict how useful is it. But if we go with a decentralized approach (and looks like we do) it means that we have this feature out of the box. The question is how much will we have to pay for this down the road :)

@efiop efiop self-assigned this Jul 2, 2019
efiop added a commit to efiop/dvc that referenced this issue Jul 5, 2019
@efiop efiop closed this as completed in 932b0d7 Jul 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response we are waiting for your reply, please respond! :) feature request Requesting a new feature p1-important Important, aka current backlog of things to do
Projects
None yet
Development

No branches or pull requests

2 participants