Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check link target exists #542

Closed
wants to merge 1 commit into from
Closed

check link target exists #542

wants to merge 1 commit into from

Conversation

dberenbaum
Copy link

Closes iterative/dvc#10500. Open to other ideas for how to solve it, but right now linking to a nonexistent target raises no error and can cause downstream problems like in that issue.

@skshetry
Copy link
Member

skshetry commented Aug 6, 2024

@dberenbaum, can we check if the cache.dir exists or not (preferably in checkout or _checkout)? Would that work?

I don't want to do this on every individual transfer because performance for checkout is already bad enough. In fact, I was hoping to get rid of exists() call above that we are already doing in _remove() IIRC.

@dberenbaum
Copy link
Author

@skshetry I think that's the right idea, but I either don't follow what you mean or don't see how to do it. I don't see a higher-level cache path to check anywhere. The best I can see is to check the first path in the diff.

@skshetry
Copy link
Member

skshetry commented Aug 6, 2024

@dberenbaum
Copy link
Author

The cache itself does exist though, so that will not solve the problem unless I'm still not understanding what you mean.

@shcheklein
Copy link
Member

@dberenbaum any idea behind the reason for this to come to this place (no file exists, but it's trying to checkout it)? can it be reproduced with dvc pull --allow-missing && dvc checkout for example (assuming some files are missing)?

@dberenbaum
Copy link
Author

@dberenbaum any idea behind the reason for this to come to this place (no file exists, but it's trying to checkout it)?

Yes. iterative/dvc#10388 tries to checkout the imported data first (to avoid unnecessary downloads) and falls back to downloading if the checkout fails.

You can reproduce it with the script from iterative/dvc#10500, although you can skip the part that sets up a custom cache dir.

@shcheklein
Copy link
Member

@dberenbaum okay, I see.

Yes. iterative/dvc#10388 tries to checkout the imported data first (to avoid unnecessary downloads) and falls back to downloading if the checkout fails.

will it redownload the whole dataset (even if let's say some parts exist)? just to make sure ..

in this specific PR - can we make it less general by passing something like "strict: true|false" to the Link object and use the strict mode in the context of imports only?

@dberenbaum
Copy link
Author

in this specific PR - can we make it less general by passing something like "strict: true|false" to the Link object and use the strict mode in the context of imports only?

I think that could work, although I'm not sure it's better than iterative/dvc#10501 since we are introducing a new option and additional checks for a single narrow scenario.

@dberenbaum dberenbaum closed this Aug 15, 2024
@skshetry skshetry deleted the link-check-from-path branch August 15, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc==3.53.0 import fails with No such file or directory when cache.dir configured and cache.type symlink
3 participants