Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimport checkout #10388

Merged
merged 3 commits into from
Jul 24, 2024
Merged

Reimport checkout #10388

merged 3 commits into from
Jul 24, 2024

Conversation

dberenbaum
Copy link
Collaborator

@dberenbaum dberenbaum commented Apr 12, 2024

Fixes #10255. This is a partial reversion of #9246. I'm not sure this is the cleanest or best way to do it, but it solves the problem.

It builds the hashfile object and tries to checkout from that, falling back to download if it fails.

Note that the build step is still slow for 2.x imports.

@dberenbaum dberenbaum requested a review from skshetry April 12, 2024 17:51
Copy link

codecov bot commented Apr 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.46%. Comparing base (dd2b515) to head (d510ad6).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10388      +/-   ##
==========================================
- Coverage   90.72%   90.46%   -0.27%     
==========================================
  Files         501      501              
  Lines       38865    38885      +20     
  Branches     5618     5619       +1     
==========================================
- Hits        35262    35176      -86     
- Misses       2961     3038      +77     
- Partials      642      671      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dberenbaum
Copy link
Collaborator Author

Ping @skshetry

@skshetry
Copy link
Member

I'll need to investigate. But we set cache.dir to point to the local cache, which should have made the DVCFileSystem use that cache.

config["cache"]["dir"] = self.repo.cache.local_cache_dir

I am a bit hesitant of this PR as it goes against #9246, and adds another (complex) codepath.

@dberenbaum
Copy link
Collaborator Author

Yes, I looked into it, and the local cache dir is set, but it's not enough because it's doing download/copy and not checkout.

download() is defined here:

dvc/dvc/fs/__init__.py

Lines 47 to 76 in 7df8b0c

def download(
fs: "FileSystem", fs_path: str, to: str, jobs: Optional[int] = None
) -> int:
from dvc.scm import lfs_prefetch
from .callbacks import TqdmCallback
with TqdmCallback(desc=f"Downloading {fs.name(fs_path)}", unit="files") as cb:
# NOTE: We use dvc-objects generic.copy over fs.get since it makes file
# download atomic and avoids fsspec glob/regex path expansion.
if fs.isdir(fs_path):
from_infos = [
path for path in fs.find(fs_path) if not path.endswith(fs.flavour.sep)
]
if not from_infos:
localfs.makedirs(to, exist_ok=True)
return 0
to_infos = [
localfs.join(to, *fs.relparts(info, fs_path)) for info in from_infos
]
else:
from_infos = [fs_path]
to_infos = [to]
if isinstance(fs, DVCFileSystem):
lfs_prefetch(fs, from_infos)
cb.set_size(len(from_infos))
jobs = jobs or fs.jobs
generic.copy(fs, from_infos, localfs, to_infos, callback=cb, batch_size=jobs)
return len(to_infos)

It calls generic.copy() (https://github.com/iterative/dvc-objects/blob/ae8b95d7ea3bbcf46fc5a09eddc7c4f614023d7b/src/dvc_objects/fs/generic.py#L69), which runs fs.get_file() operations on paths and knows nothing about the cache.

I don't think there is a much simpler way to do it than this PR, and I tried to make it as narrow as possible. It uses #9246 to rely on the local cache, building the object and passing it to dvc-data's checkout. We don't have imported_file.dvc, so there is no way to do a higher-level checkout operation AFAICT.

@skshetry
Copy link
Member

skshetry commented Apr 29, 2024

which runs fs.get_file() operations on paths and knows nothing about the cache.

fs is a DVCFileSystem, and knows about the cache.

@dberenbaum
Copy link
Collaborator Author

fs is a DVCFileSystem, and knows about the cache.

Yes, but it is doing a copy from that fs to localfs. It is not linking, and even if I try to enable linking, it fails because it is working across filesystems.

@Honzys
Copy link
Contributor

Honzys commented Jun 5, 2024

Hello, is there any update with this merge request? It would help us a ton if it gets merged.

@dberenbaum
Copy link
Collaborator Author

@skshetry Is there a good reason to keep this blocked? I'm happy to revert if we find time to resolve in a cleaner way later.

@Honzys
Copy link
Contributor

Honzys commented Jul 22, 2024

@skshetry ping

@dberenbaum dberenbaum enabled auto-merge (rebase) July 24, 2024 13:11
@dberenbaum dberenbaum merged commit 248ba3b into main Jul 24, 2024
19 checks passed
@dberenbaum dberenbaum deleted the reimport-checkout branch July 24, 2024 13:37
@skshetry
Copy link
Member

skshetry commented Aug 12, 2024

@dberenbaum, do you have a repro for this? It seems we do use cache and support link_types/cache_types in DataFileSystem.
generic.copy() does call _get and eventually ends up calling get_file for the DataFileSystem.

https://github.com/iterative/dvc-data/blob/c4ed9e8c0cde9be9931136bdd5b9ff80589bcaf7/src/dvc_data/fs.py#L217

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

import: local cache is ignored when importing data that already exist in cache
3 participants