guide: add allow-missing scenarios #4585

Merged
merged 6 commits into from
Jun 13, 2023

Conversation

dberenbaum
Contributor

@dberenbaum dberenbaum commented Jun 1, 2023

@dberenbaum dberenbaum requested a review from daavoo June 1, 2023 19:20
@shcheklein shcheklein temporarily deployed to dvc-org-pipeline-verify-mui19v June 1, 2023 19:24 Inactive
@github-actions
Contributor

github-actions bot commented Jun 1, 2023

Link Check Report

There were no links to check!

Member

@shcheklein shcheklein left a comment

Thanks, Dave! Some minor comments. I was not following the feature implementation, and it took me a bit of time to connect the dots while I was reading this. Not saying we should change much, it's good to go, but maybe add a bit more context in a few places to better explain why it is an issue.

@@ -108,6 +108,160 @@ stages:
always_changed: true
```

## Pulling Data
Member

Since the title is a bit general, should we give a one-sentence intro (pull gives a way to ...) and then explain how allow-missing helps?

Contributor Author

Added more in the intro

ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
```

You can also check that all data exists on the remote. The command below will
Member

Why do we need this check?

Contributor Author

I had to update the command, but it's needed because --allow-missing won't actually check whether the missing data exists on the remote. However, it's probably important to users to ensure all data has been pushed so that the pipeline could be reproduced from scratch if needed. See the initial request for this feature in iterative/dvc#5369.
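
For readers following the thread, the pass/fail rule behind this remote check can be sketched in Python. This is only an illustration: it assumes the status report is JSON in which files missing from the remote are listed under a `not_in_remote` key, and the helper name `all_data_in_remote` and both sample reports are hypothetical, not taken from this PR or from DVC's code.

```python
import json


def all_data_in_remote(status_json: str) -> bool:
    """Return True when the report lists no files missing from the remote."""
    status = json.loads(status_json)
    # Assumption: missing files appear under a "not_in_remote" key. An absent
    # or empty key is treated as "everything has been pushed".
    return not status.get("not_in_remote")


# Hypothetical reports, shaped like the assumed JSON output.
pushed = '{"committed": {}, "uncommitted": {}}'
unpushed = '{"not_in_remote": ["data/pool_data/"], "committed": {}}'

print(all_data_in_remote(pushed))    # True
print(all_data_in_remote(unpushed))  # False
```

A wrapper script could map the boolean to an exit code (0 when everything is pushed, 1 otherwise) so that a CI job fails when data has not been pushed.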

@dberenbaum dberenbaum added the blocked Waiting/Blocked over some dependency label Jun 2, 2023
@dberenbaum
Contributor Author

Blocked by iterative/dvc#9530

@shcheklein shcheklein temporarily deployed to dvc-org-pipeline-verify-mui19v June 2, 2023 13:26 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pipeline-verify-mui19v June 6, 2023 16:04 Inactive
@dberenbaum dberenbaum mentioned this pull request Jun 7, 2023
@dberenbaum dberenbaum self-assigned this Jun 8, 2023
@dberenbaum dberenbaum requested a review from shcheklein June 8, 2023 19:09
@dberenbaum
Contributor Author

@shcheklein This is still blocked, but if you can review so we can merge when ready, I'd appreciate it 🙏

@dberenbaum
Contributor Author

@daavoo @dberenbaum Please review when you have a chance 🙏

`--pull` will download missing dependencies (and will download the cached
outputs of previous runs saved in the [run cache]), so you don't need to pull
all data for your project before running the pipeline. `--allow-missing` will
skip stages with no other changes than missing data. You can combine the
Member

it's hard to understand that "missing data" is a change
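
One way to picture why missing data counts as a change is a toy model of the skip decision. The sketch below is not DVC's implementation. The `Stage` class, the status vocabulary (`unchanged`, `modified`, `missing`), the `needs_run` helper, and the `src/evaluate.py` path are all hypothetical and exist only to illustrate the rule described in the hunk above: with `--allow-missing`, a stage whose only difference is missing data is skipped.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Stage:
    name: str
    # Maps each dependency or output path to a status string.
    # Assumed vocabulary: "unchanged", "modified", "missing".
    statuses: Dict[str, str] = field(default_factory=dict)


def needs_run(stage: Stage, allow_missing: bool = False) -> bool:
    """Decide whether a stage must run under this toy model."""
    for status in stage.statuses.values():
        if status == "unchanged":
            continue
        # Missing data counts as a change unless allow_missing is set.
        if status == "missing" and allow_missing:
            continue
        return True
    return False


# Data for data_split was never pulled; evaluate has edited code.
data_split = Stage("data_split", {"data/pool_data": "missing"})
evaluate = Stage("evaluate", {"src/evaluate.py": "modified"})

print(needs_run(data_split))                      # True
print(needs_run(data_split, allow_missing=True))  # False
print(needs_run(evaluate, allow_missing=True))    # True
```

In this model the flag only relaxes the "missing" case, so a stage with a real modification still runs.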

data/train_data/
```

We can modify the `evaluate` stage (for example, we changed the code to add a
Contributor

Minor.
Why not use --set-param?
Is it to keep the focus on only the 2 flags being explained?

Contributor Author

I didn't think about it. I think I just copied from your example in https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed, which used repro, so I guess it couldn't rely on --set-param. I could update with an example that uses it.

</details>

```cli
$ dvc exp run --allow-missing --dry
Member

tbh, if I were new to DVC it would not have been clear to me why it's not the default behavior ... if I run dry why would it try to pull anything ... does it btw still pull anything at all?

Contributor Author

It doesn't pull anything but it will fail because of the missing data even during --dry since it's considered deleted.

@shcheklein shcheklein temporarily deployed to dvc-org-pipeline-verify-mui19v June 12, 2023 20:05 Inactive
@dberenbaum
Contributor Author

@daavoo @shcheklein Addressed the comments above, so please give another look when you have a chance.

@shcheklein shcheklein temporarily deployed to dvc-org-pipeline-verify-mui19v June 13, 2023 13:46 Inactive
@dberenbaum dberenbaum merged commit b57fb34 into main Jun 13, 2023
@dberenbaum dberenbaum deleted the pipeline-verify branch June 13, 2023 13:47
Labels
blocked Waiting/Blocked over some dependency
Development

Successfully merging this pull request may close these issues.

New command: dvc verify - check that the pipeline is up to date without having to pull or run it
3 participants