Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: fix storage for imports #10213

Merged
merged 1 commit into from
Jan 2, 2024
Merged

index: fix storage for imports #10213

merged 1 commit into from
Jan 2, 2024

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Jan 2, 2024

Fixes #10194

Copy link

codecov bot commented Jan 2, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (987b1c4) 90.50% compared to head (bee5aec) 90.50%.

Files Patch % Lines
dvc/repo/index.py 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #10213   +/-   ##
=======================================
  Coverage   90.50%   90.50%           
=======================================
  Files         499      499           
  Lines       37970    37982   +12     
  Branches     5517     5518    +1     
=======================================
+ Hits        34363    34375   +12     
  Misses       2965     2965           
  Partials      642      642           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@efiop efiop force-pushed the fix-10194 branch 2 times, most recently from 07d34bc to 930cb02 Compare January 2, 2024 09:26
@efiop efiop changed the title debug index: fix storage for imports Jan 2, 2024
@efiop efiop marked this pull request as ready for review January 2, 2024 11:28
@efiop efiop merged commit 041ea17 into iterative:main Jan 2, 2024
24 checks passed
@dberenbaum
Copy link
Collaborator

dberenbaum commented Jan 2, 2024

Thanks @efiop! Just checking that this doesn't break some other edge cases for import-url:

  1. dvc import-url --version-aware should still pull from source (and should not be pushed to remote).
  2. dvc import-url --no-download && dvc pull should be able to download from source even if the data was never pushed (see the discussion around --no-download here).
  3. Both of those should fail if the source data is now missing or modified.

The general logic behind import-url should be to try to pull from dvc remote storage first but be able to fall back to the source storage if the same content still exists there.

@efiop
Copy link
Contributor Author

efiop commented Jan 2, 2024

@dberenbaum Thank you for this, looks like I indeed completely forgot about 1 (confusing how we have 3 different behaviours for import-url and import together). Will adjust.

Second one also needs some attention with cloud versioning context, but it is clear that it is partially broken at the very least. Also taking a look (if I won't send a PR soon I'll create an issue first).

@efiop
Copy link
Contributor Author

efiop commented Jan 3, 2024

@dberenbaum Double checked, we have test for those in

class TestImportURLVersionAware:
, so we should be good for now with all of those points.

dvc import-url --version-aware should still pull from source...

Always using that source fs as a remote storage with version-aware imports ✅

... (and should not be pushed to remote).

and that remote is always marked as read-only ✅

dvc import-url --no-download && dvc pull should be able to download from source even if the data was never pushed (see the discussion around --no-download #8024).

We have a specific test for that (see no_download one in the first link). ✅

Both of those should fail if the source data is now missing or modified.

We request a particular version_id in both, so missing (particular version_id) will get you a FileNotFoundError and modified shouldn't affect it (as that's a different version_id).

@dberenbaum
Copy link
Collaborator

Double checked, we have test for those

That's great (and how it should be obviously).

We have a specific test for that (see no_download one in the first link).

Non-version-aware remotes should also be able to pull from source after no-download. That's tested here:

def test_pull_partial_import(tmp_dir, dvc, local_workspace):
local_workspace.gen("file", "file content")
dst = tmp_dir / "file"
stage = dvc.imp_url("remote://workspace/file", os.fspath(dst), no_download=True)
result = dvc.pull("file")
assert result["fetched"] == 0
assert dst.exists()
assert stage.outs[0].get_hash().value == "d10b4c3ff123b26dc068d43a8bef2d23"

We request a particular version_id in both, so missing (particular version_id) will get you a FileNotFoundError and modified shouldn't affect it (as that's a different version_id).

For non-version-aware remotes, we should also test that it fails if the original source is missing or modified. See this comment. I added tests for those in https://github.com/iterative/dvc/tree/test-pull-partial-import if you want to merge those. test_pull_partial_import_missing passes but test_pull_partial_import_modified doesn't. Not sure when that broke.

@dberenbaum
Copy link
Collaborator

For non-version-aware remotes, we should also test that it fails if the original source is missing or modified. See this comment. I added tests for those in https://github.com/iterative/dvc/tree/test-pull-partial-import if you want to merge those. test_pull_partial_import_missing passes but test_pull_partial_import_modified doesn't. Not sure when that broke.

Should we open an issue for this and move the discussion there?

@efiop
Copy link
Contributor Author

efiop commented Jan 4, 2024

@dberenbaum That makes sense, I don't think we ever supported that in the past, but now we totally can through the same mechanism that we use for cloud versioning. I'll introduce metadata valiadation for non cloud versioned file storage in dvc-data and will update what's necessary in #10220 on top of your tests. Thank you!

@efiop efiop added the bugfix fixes bug label Jan 4, 2024
@dberenbaum
Copy link
Collaborator

@efiop Are you sure it was never supported? Maybe it was specific to storages that have etags? This is what it says in #8024 (comment):

Running dvc pull downloads the partially-imported data, raising an error if the etag changed and also updates the .dvc file with outs and stage metadata.

@efiop
Copy link
Contributor Author

efiop commented Jan 4, 2024

@dberenbaum I'll be sure to double check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug
Projects
None yet
2 participants