New command: dvc verify - check that the pipeline is up to date without having to pull or run it #5369

Closed
sjawhar opened this issue Jan 30, 2021 · 18 comments · Fixed by iterative/dvc.org#4585
Labels
A: pipelines Related to the pipelines feature A: status Related to the dvc diff/list/status feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@sjawhar
Contributor

sjawhar commented Jan 30, 2021

Bringing this over from Discord.

What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:

  1. All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
  2. All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
  3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock
  4. All params match what's in dvc.lock

Essentially, this is the same as running dvc pull then dvc repro --dry and checking that "Data and pipelines are up to date", except I don't want to have to run dvc pull. This is especially important if you're working with large datasets, as pulling them every time on a CI machine could be quite costly in time and/or actual dollar bills 💰
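
For reference, the baseline check described above could be sketched as a shell function. This is only a minimal sketch, not an official DVC recipe: the function name `verify_with_pull` is hypothetical, and it assumes `dvc repro --dry` prints the "Data and pipelines are up to date" message quoted above when nothing needs to run.

```shell
# Hypothetical helper: pull everything, then confirm a dry run finds nothing to do.
verify_with_pull() {
    # Download all tracked data first; this is the expensive step.
    dvc pull || return 1
    # A dry run reports what would execute without running it.
    dvc repro --dry | grep -q "Data and pipelines are up to date"
}
```

A CI job could call `verify_with_pull || exit 1`; the cost of the `dvc pull` line is exactly what this issue asks to avoid.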

I played around a bit to see if there's a workaround with only the current functionality. Here's what I found.

dvc status by itself will tell you if any non-cached deps (e.g. source code) don't match dvc.lock. That'll look like this:

train:
        changed deps:
                deleted:            data/processed/data_train.npz
                deleted:            data/processed/data_val.npz
                modified:           src/train <-- This one

dvc status -c will tell you if any outputs of any stages listed in dvc.lock aren't in remote storage. That'll look like this:

        missing:            data/processed/data_test.npz   <-- This one 
        deleted:            data/processed/data_train.npz
        deleted:            data/processed/data_val.npz

dvc params diff will obviously catch param changes. I don't think anything covers point 3 above, and even if I stitched these all together, it would be very brittle as it relies on the outputs of all these commands not changing.
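
To illustrate how brittle the stitched-together approach is, here is a rough sketch covering only the two checks whose output is shown above. The function name `verify_stitched` is hypothetical, and the sketch assumes the `modified:` and `missing:` markers appear exactly as in those samples; a params check would still have to be added separately.

```shell
# Hypothetical helper: stitch together the two checks whose output is shown above.
verify_stitched() {
    rc=0
    # "modified:" marks a dependency whose workspace copy differs from dvc.lock;
    # "deleted:" only means the file is absent locally, so it is ignored here.
    if dvc status | grep "modified:"; then
        rc=1
    fi
    # "missing:" marks data that dvc status -c could not find in remote storage.
    if dvc status -c | grep "missing:"; then
        rc=1
    fi
    return "$rc"
}
```

Any change to the wording of either command's output would silently break this check, which is the main argument for a dedicated command.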

As always, I'm happy to contribute the change if you all think it would be valuable. I know I would use it right away in several projects. Or let me know if I'm simply overlooking some existing functionality that would serve the same purpose.

@efiop
Contributor

efiop commented Feb 1, 2021

Hi @sjawhar !

As always, great suggestion! I think this feature has been requested before in different shapes and forms (can't pinpoint a particular ticket right now, sorry) and is generally related to dvc being able to operate in a "virtual environment" with remote and run-cache in mind, so that missing local cache is not a problem as long as stuff is available on remote. I think the functionality you've requested still makes sense for dvc status (maybe as a flag) instead of a separate verify command, unless that new separate command is more tailored to the pipelines (we've had this idea of something like dvc pipeline status or dvc stage status for a long time).

Maybe this might fit into dvc exp better these days? 🤔 (CC @dberenbaum @pmrowla) E.g. there will be remote executors and stuff, so maybe this global state will somehow fit better into that paradigm. This is just a random thought though, no more than a rant, so never mind.

@efiop efiop added the feature request Requesting a new feature label Feb 1, 2021
@dberenbaum
Collaborator

@efiop Although I haven't used status extensively, I agree that it intuitively makes sense as a flag for that command. I also agree with your final analysis that it probably doesn't make sense as part of dvc exp 😄

@antonpetkoff

+1 for dvc verify! It is very useful for CI tests that prevent you from going forward with a stale pipeline.

@davesteps

+1 I think this would be a very nice feature for when you are doing CI/CD tests to confirm your branch is ready to merge. I opened this iterative/cml#681 issue because I thought it might be a cml feature but actually I think this would fit the bill.

@yummydum

Hi, I consider this feature very useful in my workflow as well.
For the time being, if there is a workaround script available, would anyone share it?

@mraxilus

Any updates on this feature?
I would find it quite useful for my CI pipeline process.

@dberenbaum
Collaborator

tl;dr It's on our roadmap but not close enough to give an estimate for it yet.

A lot of the functionality today is part of dvc status. Our plan is to separate the data management aspects of dvc status from this pipeline-focused dvc verify functionality (like dvc data status and dvc stage status; actual command names TBD) and make each more robust and useful on their own. We are currently focused on addressing gaps in data management generally, including updating dvc status, and those improvements should be coming soon. Once that's complete, we should have more capacity to focus on pipeline improvements, and this feature request will be a high priority.

@courentin
Contributor

courentin commented Sep 16, 2022

hello, do you have any updates on it? :)

(it seems like the 2nd upvoted issue 👀)

@dberenbaum
Collaborator

Sorry, I still don't have an estimate. It remains on our backlog because we have been focusing on overhauling a lot of the data management and haven't been able to focus on pipelines. That said, we at least have dvc data status now, so there is a clear path to starting on dvc stage status once we have the capacity for it.

@courentin
Contributor

Hey, any news on it? Is there any way we can help?

@Otterpatsch

It would be highly beneficial to have such a feature for our CI pipelines.

@ssmatana

ssmatana commented Feb 3, 2023

Oh, this would be immensely useful !

@dberenbaum
Collaborator

Sorry for the repeated delays here. We are finally planning this for the quarter.

Thanks to @sjawhar for opening this and listing clear requirements.

What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:

1. All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock

2. All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock

3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock

4. All params match what's in dvc.lock

Some thoughts on these:

3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock

Maybe I'm misunderstanding, but I think this is covered already by dvc status, since otherwise it should report that either the dependency or output has changed, depending on which matches what's in your workspace. Since another requirement is that the data may be missing from the workspace, we can report on what would happen if you did dvc pull (it would pull the output version of the data).

4. All params match what's in dvc.lock

Again, this should already be reported by dvc status unless I'm misunderstanding.


In other words, taking a closer look here, the issue is covered by dvc status except that the command needs to work without pulling all the data locally, which we are finally ready to do.

Since dvc status is already overloaded and we now have dvc data status, let's stick with the plan to put this in a new dvc stage status command that only reports on the status of the pipeline and has some option (--remote, --allow-missing, or something similar) to check missing outputs against what's in the remote. The command shouldn't need any of the other features of dvc status like comparing against other commits.

@sjawhar
Contributor Author

sjawhar commented Apr 12, 2023

Some nuance on point 3

Maybe I'm misunderstanding, but I think this is covered already by dvc status, since otherwise it should report that either the dependency or output has changed, depending on which matches what's in your workspace. Since another requirement is that the data may be missing from the workspace, we can report on what would happen if you did dvc pull (it would pull the output version of the data).

I don't think that quite covers it. The out of one stage might exist in the remote, as might the dep of the next stage in the DAG, but they don't agree with each other. A simple existence check against the remote would pass for both, yet they aren't the same artifact. We would indeed need to do the collision detection that you hint at ("if you did dvc pull").
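
The disagreement described here could be detected from dvc.lock alone, without touching the remote. The following is a rough sketch under stated assumptions, not DVC functionality: the function name `check_lock_consistency` is hypothetical, and it assumes the lock file records each dep and out as a `- path:` entry followed by an `md5:` line, with paths that contain no spaces.

```shell
# Hypothetical helper: flag paths whose hash as a stage output differs from
# their hash as a dependency elsewhere in the same lock file.
# Assumes md5 hashes and paths without spaces.
check_lock_consistency() {
    awk '
        NF == 1 && $1 == "deps:" { section = "deps"; next }
        NF == 1 && $1 == "outs:" { section = "outs"; next }
        $1 == "cmd:" || $1 == "params:" { section = ""; next }
        $1 == "-" && $2 == "path:" { path = $3; next }
        # Appending "" forces string (not numeric) comparison later on.
        $1 == "md5:" && section == "outs" { out[path] = $2 "" }
        $1 == "md5:" && section == "deps" { n++; dep_path[n] = path; dep_hash[n] = $2 "" }
        END {
            bad = 0
            for (i = 1; i <= n; i++) {
                p = dep_path[i]
                if ((p in out) && out[p] != dep_hash[i]) {
                    printf "%s: out %s != dep %s\n", p, out[p], dep_hash[i]
                    bad = 1
                }
            }
            exit bad
        }
    ' "${1:-dvc.lock}"
}
```

Running `check_lock_consistency dvc.lock` prints one line per mismatching path and returns a non-zero status if any is found. A real implementation would parse the YAML properly rather than rely on line layout.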

@Otterpatsch

As a user, I want to be able to check whether, for example, my colleague did in fact run dvc push. Currently I would need to run dvc pull, which takes a long time, and then check with dvc status whether the pulled data matches the hashes in dvc.lock.
So the idea of verify (as I understand it, and agree with) is to skip the dvc pull step, since it takes a long time.

And in a CI pipeline, dvc pull is an unwanted necessity, as the data is not kept anyway.
So one "just" wants verify to check the hashes in the current dvc.lock against the remote: do those hashes exist there so that they could be pulled, or in other words, were they pushed?

@dberenbaum
Collaborator

So the idea of verify (as I understand it, and agree with) is to skip the dvc pull step, since it takes a long time.

Yup, sorry if the previous message wasn't clear. tldr we need to do a status check as if we had pulled the data but without actually pulling it. I think this also addresses the point from @sjawhar, but let me know if I'm missing something.

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Apr 15, 2023
@dberenbaum
Collaborator

We haven't documented the behavior yet, but with the latest release, you should be able to use the --allow-missing flag from #9437 (thanks @daavoo) to cover most of this functionality.

For example, run dvc [exp run/repro] --allow-missing [--dry] and it will skip any stages that have missing data but are otherwise unchanged. If your pipeline is up to date other than having to pull, the command will succeed without running any of your stages. Otherwise, it will fail.
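
As a rough sketch of how this could be wrapped for CI (not an official recipe): the function name `verify_pipeline` is hypothetical, and the sketch assumes that a dry run announces each stage it would execute with a line containing "Running stage", so it treats either a non-zero exit or such a line as a failure.

```shell
# Hypothetical helper: succeed only if a dry run would execute no stage.
verify_pipeline() {
    # --allow-missing skips stages whose data is absent locally but which are
    # otherwise unchanged; --dry prevents anything from actually running.
    out=$(dvc repro --allow-missing --dry 2>&1) || return 1
    # Assumption: a stage that would run is announced with "Running stage".
    case "$out" in
        *"Running stage"*) printf '%s\n' "$out" >&2; return 1 ;;
    esac
    return 0
}
```

Checking the output as well as the exit status is a defensive choice, in case a dry run reports pending stages without returning an error.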

The only part I see missing from the requests above is checking whether all the data exists on the remote. This can already be achieved with dvc data status --json, but maybe we can think of ways to make this simpler since you would have to parse the output.
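
One way the JSON parsing might look, as a minimal sketch: the function name `verify_remote` is hypothetical, the `--not-in-remote` flag is the one used later in this thread, and the sketch assumes that files absent from the remote are reported under a `not_in_remote` key that is omitted when nothing is missing.

```shell
# Hypothetical helper: fail if any tracked data is absent from the remote.
verify_remote() {
    json=$(dvc data status --not-in-remote --json) || return 1
    # Assumption: missing files are listed under a "not_in_remote" key.
    case "$json" in
        *'"not_in_remote"'*) printf '%s\n' "$json" >&2; return 1 ;;
    esac
    return 0
}
```

A substring match keeps the sketch free of extra dependencies; a JSON-aware tool would be more robust if the output format changes.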

@Guillaume-oso

Guillaume-oso commented Jun 5, 2023

We haven't documented the behavior yet, but with the latest release, you should be able to use the --allow-missing flag from #9437 (thanks @daavoo) to cover most of this functionality.

For example, run dvc [exp run/repro] --allow-missing [--dry] and it will skip any stages that have missing data but are otherwise unchanged. If your pipeline is up to date other than having to pull, the command will succeed without running any of your stages. Otherwise, it will fail.

The only part I see missing from the requests above is checking whether all the data exists on the remote. This can already be achieved with dvc data status --json, but maybe we can think of ways to make this simpler since you would have to parse the output.

Hi @dberenbaum, just tried to reproduce what you said.
dvc repro --allow-missing --dry allows me to check if git tracked objects are up to date in the dvc.lock

I'm trying to check whether all objects referenced in dvc.lock are in the remote cache, but when I try your command, dvc data status --json --not-in-remote, I see no difference in the output between when the data is in the remote and when it is not. So I do not understand how these two commands can cover the dvc verify functionality.

EDIT: This comes from the bug referenced in #9541

My quick-and-dirty dvc verify script:

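# Install the latest DVC so that repro supports --allow-missing
pip install -q 'dvc[s3]'
# Stages with missing data that are otherwise unchanged are skipped
dvc repro --allow-missing -q
# Pin DVC 2.57 before running the remote check
pip install -q 'dvc[s3]==2.57'
# Fail if any tracked data is reported as not in the remote
if dvc data status --not-in-remote | grep "Not in remote"
then
    exit 1
fi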

EDIT: Works with 3.0.0

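# A single DVC version covers both checks
pip install -q 'dvc[s3]==3.0.0'
# Stages with missing data that are otherwise unchanged are skipped
dvc repro --allow-missing -q
# Fail if any tracked data is reported as not in the remote
if dvc data status --not-in-remote | grep "Not in remote"
then
    exit 1
fi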
