Granular pipeline dependency status #9431

Open
johan-sightic opened this issue May 10, 2023 · 5 comments
Labels
A: pipelines Related to the pipelines feature A: status Related to the dvc diff/list/status p2-medium Medium priority, should be done, but less important

Comments

@johan-sightic

johan-sightic commented May 10, 2023

Feature Request

When I run dvc repro, DVC detects which dependencies have changed and therefore which stages need to be reproduced. I would like to access the granular changes of all the dependencies of a stage since it was last reproduced.

Example usage

Change in dependencies for stage preprocess:

$ dvc stage status preprocess --granular --json  # proposed: a new command, or an addition to "dvc status" or "dvc data status"
{
    "new": [
        "path/to/new/dependency/file",
        ...
    ],
    "modified": [
        "path/to/modified/dependency/file",
        ...
    ],
    "deleted": [
        "path/to/deleted/dependency/file",
        ...
    ]
}
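
For illustration, the three categories above could be derived by comparing two snapshots of a stage's dependencies, each a mapping from file path to content hash. This is only a minimal Python sketch under that assumption: `diff_manifests` and the manifest format are hypothetical and are not part of DVC.

```python
def diff_manifests(old, new):
    """Classify paths by comparing two {path: content_hash} mappings.

    `old` is the snapshot from the last reproduction, `new` is the
    current state. Both formats are assumptions made for this sketch.
    """
    return {
        # Present now, absent at the last reproduction.
        "new": sorted(p for p in new if p not in old),
        # Present in both snapshots, but with a different hash.
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
        # Present at the last reproduction, absent now.
        "deleted": sorted(p for p in old if p not in new),
    }
```

Unchanged files appear in none of the three lists, which is what lets a stage skip them.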

Motivation

This feature would be very useful for pipelines which process many independent samples and take a long time to run.

Imagine the following simple data setup where samples get preprocessed and stored in a new folder.

data
├── raw
│   ├── sample_001.jpeg
│   └── sample_002.jpeg
└── preprocessed
    ├── sample_001.jpeg
    └── sample_002.jpeg

And the corresponding simple pipeline:

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw/
    outs:
      - data/preprocessed:
          persist: true

With this feature, the pipeline stage code could check which samples have changed (new/modified/deleted) and process only those. It could also detect that the code has changed and reprocess all samples.
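
A rough sketch of how preprocess.py might use such a report, assuming it receives the JSON structure proposed above as a Python dict. The `plan_work` helper is hypothetical, and the paths are taken from the example pipeline; none of this is an existing DVC API.

```python
from pathlib import PurePosixPath

RAW_DIR = PurePosixPath("data/raw")           # dependency directory from dvc.yaml
OUT_DIR = PurePosixPath("data/preprocessed")  # persisted output directory
CODE = "preprocess.py"                        # code dependency of the stage


def plan_work(report, all_raw_samples):
    """Return (samples_to_process, outputs_to_delete) for one run.

    `report` is the hypothetical granular status dict with the keys
    "new", "modified" and "deleted".
    """
    changed = set(report.get("new", [])) | set(report.get("modified", []))
    deleted = set(report.get("deleted", []))

    if CODE in changed:
        # The code itself changed: every sample has to be reprocessed.
        to_process = sorted(all_raw_samples)
    else:
        # Only new or modified raw samples need work.
        to_process = sorted(
            p for p in changed if PurePosixPath(p).parent == RAW_DIR
        )

    # Outputs whose raw sample was deleted should be removed as well.
    to_delete = sorted(
        str(OUT_DIR / PurePosixPath(p).name)
        for p in deleted
        if PurePosixPath(p).parent == RAW_DIR
    )
    return to_process, to_delete
```

Because data/preprocessed is declared with persist: true, outputs for unchanged samples would remain in place between runs, so only the returned lists would need to be acted on.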

This would save me a lot of time since we have a long and slow pipeline where the raw data gets updated quite often.
Link to extended Discord discussion: https://discord.com/channels/485586884165107732/1093361005754585109
Link to another discussion of the same problem: #5917

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important A: pipelines Related to the pipelines feature A: status Related to the dvc diff/list/status labels May 10, 2023
@dberenbaum
Collaborator

Thanks @johan-sightic!

@skshetry
Member

I think this will be part of #5369. Closing in favor of that.

@dberenbaum dberenbaum reopened this May 11, 2023
@dberenbaum
Collaborator

Hey @skshetry, I think it's possible they are part of the same command, but I don't want to conflate the two since the use cases are pretty different. For example, the changes @daavoo has been working on in dvc repro may be enough that a new status command isn't that important for #5369. I also don't see anything in #5369 asking about granularity.

@johan-sightic
Author

Is there any progress on this or any alternative solutions?

@dberenbaum
Collaborator

The latest is what was discussed in #10042. Nothing further at the moment unfortunately.
