New command: dvc verify - check that the pipeline is up to date without having to pull or run it #5369
Comments
Hi @sjawhar ! As always, great suggestion! I think this feature has been requested before in different shapes and forms (can't pinpoint a particular ticket right now, sorry) and is generally related to dvc being able to operate in a "virtual environment" with remote and run-cache in mind, so that missing local cache is not a problem as long as stuff is available on remote. I think the functionality you've requested still makes sense for Maybe this might fit into |
@efiop Although I haven't used status extensively, I agree that it intuitively makes sense as a flag for that command. I also agree with your final analysis that it probably doesn't make sense as part of |
+1 for |
+1 I think this would be a very nice feature for when you are doing CI/CD tests to confirm your branch is ready to merge. I opened this iterative/cml#681 issue because I thought it might be a cml feature but actually I think this would fit the bill. |
Hi, I consider this feature very useful in my workflow as well. |
Any updates on this feature? |
tl;dr It's on our roadmap but not close enough to give an estimate for it yet. A lot of the functionality today is part of |
Hello, do you have any updates on this? :) (It seems to be the 2nd most upvoted issue 👀) |
Sorry, I still don't have an estimate. It remains on our backlog because we have been focusing on overhauling a lot of the data management and haven't been able to put focus on pipelines. To that effect, we at least have |
Hey, any news on this? Is there any way we can help? |
It would be highly beneficial to have such a feature for our CI pipelines. |
Oh, this would be immensely useful! |
Sorry for the repeated delays here. We are finally planning this for the quarter. Thanks to @sjawhar for opening this and listing clear requirements.
Some thoughts on these:
Maybe I'm misunderstanding, but I think this is covered already by
Again, this should already be reported by In other words, taking a closer look here, the issue is covered by Since |
Some nuance on point 3
I don't think that quite covers it. The out of one stage might exist in the remote, as might the dep of the next stage in the dag, but they don't agree with each other. So they both exist simply by checking the remote, but they aren't the same artifact. We would indeed need to do that collision detection that you hint at ("if you did |
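To make the mismatch described above concrete, here is a rough sketch of a check that only reads dvc.lock and never touches the remote. It is not how DVC itself does this. The function name `find_hash_conflicts` is hypothetical, and the sketch assumes each entry in dvc.lock is written as a `- path: ...` line followed later by an `md5: ...` line, and that paths contain no spaces.

```shell
# Hypothetical helper: print every path that is recorded with more than one
# md5 in a dvc.lock-style file (for example as an out of one stage and a dep
# of the next), and return non-zero if any such path is found.
# Assumed layout: "- path: X" followed later by "md5: Y".
find_hash_conflicts() {
    awk '
        # Remember the most recent path entry.
        $1 == "-" && $2 == "path:" { path = $3; next }
        # Pair the next md5 line with that path and compare with earlier records.
        $1 == "md5:" && path != "" {
            if ((path in seen) && seen[path] != $2) {
                print path
                bad = 1
            }
            seen[path] = $2
            path = ""
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage (assumes a dvc.lock in the current directory):
#   find_hash_conflicts dvc.lock || echo "stages disagree about the paths listed above"
```

This only catches disagreement inside the lock file. Whether the recorded hashes actually exist on the remote is a separate check.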
As a user, I want to be able to check whether, e.g., my colleague did in fact run dvc push. Currently I would need to run dvc pull, which takes a long time, and then check with dvc status whether the remote hashes and the pulled ones in dvc.lock are the same. And in a CI pipeline, dvc pull is an unwanted necessity, as the data is not kept anyway. |
Yup, sorry if the previous message wasn't clear. tldr we need to do a status check as if we had pulled the data but without actually pulling it. I think this also addresses the point from @sjawhar, but let me know if I'm missing something. |
We haven't documented the behavior yet, but with the latest release, you should be able to use the For example, run The only part I see missing from the requests above is checking whether all the data exists on the remote. This can already be achieved with |
Hi @dberenbaum, just tried to reproduce what you said. I'm trying to check if all objects referenced in the
EDIT: It comes from the bug referenced in #9541. My
pip install -q 'dvc[s3]'
dvc repro --allow-missing -q
pip install -q 'dvc[s3]==2.57'
if dvc data status --not-in-remote | grep "Not in remote"
then
    exit 1
fi
EDIT: Works with 3.0.0
pip install -q 'dvc[s3]==3.0.0'
dvc repro --allow-missing -q
if dvc data status --not-in-remote | grep "Not in remote"
then
    exit 1
fi |
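The check in the script above could also be wrapped in a small function so the CI log shows what is missing before the job fails. This is only a sketch: the function name `fail_if_not_in_remote` is hypothetical, and it assumes, as the script above does, that `dvc data status --not-in-remote` prints a "Not in remote" section when something has not been pushed.

```shell
# Hypothetical helper: reads `dvc data status --not-in-remote` output on stdin
# and returns non-zero if it contains a "Not in remote" section.
fail_if_not_in_remote() {
    report="$(cat)"
    if printf '%s\n' "$report" | grep -q "Not in remote"; then
        # Echo the full report to stderr so the CI log shows what is missing.
        printf '%s\n' "$report" >&2
        return 1
    fi
    return 0
}

# Usage in CI (assumes dvc is installed and a remote is configured):
#   dvc data status --not-in-remote | fail_if_not_in_remote || exit 1
```

Reading from stdin keeps the helper independent of the DVC version that produced the report.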
Bringing this over from Discord.
What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:
Essentially, this is the same as running dvc pull then dvc repro --dry and checking that "Data and pipelines are up to date", except I don't want to have to run dvc pull. This is especially important if you're working with large datasets, as pulling them every time on a CI machine could be quite costly in time and/or actual dollar bills 💰
I played around a bit to see if there's a workaround with only the current functionality. Here's what I found.
dvc status by itself will tell you if any non-cached deps (e.g. source code) don't match dvc.lock. That'll look like this:
dvc status -c will tell you if any outputs of any stages listed in dvc.lock aren't in remote storage. That'll look like this:
dvc params diff will obviously catch param changes.
I don't think anything covers point 3 above, and even if I stitched these all together, it would be very brittle as it relies on the outputs of all these commands not changing.
As always, I'm happy to contribute the change if you all think it would be valuable. I know I would use it right away in several projects. Or let me know if I'm simply overlooking some existing functionality that would serve the same purpose.
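For illustration, stitching the first two checks together might look roughly like the sketch below. The helper names `check` and `verify_pipeline` are hypothetical. The "Data and pipelines are up to date" message is the one quoted above. The "in sync" pattern for `dvc status -c` is an assumption about its output and should be confirmed against the DVC version in use. It matches on command output, so it has exactly the brittleness described above.

```shell
# check NAME PATTERN CMD...: run CMD and succeed only if its combined output
# contains PATTERN. Hypothetical helper, not part of DVC.
check() {
    name="$1"
    pattern="$2"
    shift 2
    if "$@" 2>&1 | grep -q -- "$pattern"; then
        echo "ok: $name"
    else
        echo "FAILED: $name" >&2
        return 1
    fi
}

# Run every check even if an earlier one fails, so CI reports all problems.
verify_pipeline() {
    rc=0
    check "deps match dvc.lock" "Data and pipelines are up to date" dvc status || rc=1
    # ASSUMPTION: "in sync" is what `dvc status -c` prints when nothing is missing.
    check "outputs are in remote storage" "in sync" dvc status -c || rc=1
    return "$rc"
}

# Usage in CI (assumes dvc is installed and a remote is configured):
#   verify_pipeline || exit 1
```

A `dvc params diff` step could be added the same way once its expected output for the "no changes" case is known. Point 3 is still not covered.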