-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
guide: add allow-missing scenarios #4585
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Dave! Some minor comments. I was not following the feature implementation, and it took me a bit of time to connect the dots while I was reading this. Not saying we should change much, it's goog to go, but maybe add a bit more context in a few places to better explain why it is an issue.
@@ -108,6 +108,160 @@ stages: | |||
always_changed: true | |||
``` | |||
|
|||
## Pulling Data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since title is a bit general, should we give a one sentence intro- pull gives a way to ... then explain how allow missing helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more in the intro
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data' | ||
``` | ||
|
||
You can also check that all data exists on the remote. The command below will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to update the command, but it's needed because --allow-missing
won't actually check whether the missing data exists on the remote. However, it's probably important to users to ensure all data has been pushed so that the pipeline could be reproduced from scratch if needed. See the initial request for this feature in iterative/dvc#5369.
Blocked by iterative/dvc#9530 |
@shcheklein This is still blocked, but if you can review so we can merge when ready, I'd appreciate it 🙏 |
@daavoo @dberenbaum Please review when you have a chance 🙏 |
`--pull` will download missing dependencies (and will download the cached | ||
outputs of previous runs saved in the [run cache]), so you don't need to pull | ||
all data for your project before running the pipeline. `--allow-missing` will | ||
skip stages with no other changes than missing data. You can combine the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's hard to understand that "missing data" is a change
data/train_data/ | ||
``` | ||
|
||
We can modify the `evaluate` stage (for example, we changed the code to add a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor.
Why not use --set-param
?
Is it to keep the focus on only the 2 flags being explained?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think about it. I think I just copied from your example in https://dvc.org/doc/command-reference/repro#example-only-pull-pipeline-data-as-needed, which used repro
so I guess couldn't rely on --set-param
. I could update with an example that uses it.
</details> | ||
|
||
```cli | ||
$ dvc exp run --allow-missing --dry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, if I were new do DVC it would have not been clear to me why it's not the default behavior ... if I run dry
why would it try to pull anything ... does it btw still pull anything at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't pull anything but it will fail because of the missing data even during --dry
since it's considered deleted.
@daavoo @shcheklein Addressed the comments above, so please give another look when you have a chance. |
Per #4575 (review)
Closes iterative/dvc#5369