support adding/transfering data straight to cache/remote #4520
And yes, it seems like this feature could be valuable. It sounds like a "lightweight" data registry (see https://dvc.org/doc/use-cases/data-registries) for people to use DVC as an intermediary between data already hosted on cloud storage and other projects. So they can access the data with a unified interface (DVC CLI or API).
Hey @efiop, will this copy/transfer operation be a new command or part of an already existing one? Also, may I request 2 workflow command-group samples (some sort of small, self-contained replication of the whole process so that I can test on my local environment)? One from the current workflow, where people operate on 2 different local file system points, and one from what they will execute once this issue is resolved.
@isidentical I'm not 100% sure whether this might fit into some existing command, would love to hear your thoughts on it. Sure, here are some examples. Please make sure you are familiar with our workflow already (see our get-started guide).
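A hedged sketch of the two workflows being compared (bucket, file names, and the flag name are made up for illustration and were still under discussion at this point in the thread):

```shell
# Current workflow: the data must pass through the local workspace.
aws s3 cp s3://bucket/data.xml data.xml   # download locally first
dvc add data.xml                          # hash + copy into the local cache
dvc push                                  # upload from the cache to the DVC remote

# Desired workflow: transfer straight to cache/remote,
# never materializing the data in the workspace.
dvc add s3://bucket/data.xml --to-remote
```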
As a command-line model, I am thinking of something like this:
I am a bit blocked on the user interface. Even though I could start writing tests and migrate them later, I'd prefer to work on the interface first. Here are the ideas that came and went during meetings:
I see amending to the @efiop said For
Straight-to-remote would be useful both in
The name of the flag looks a bit ugly though 🙂 And I'm not sure how intuitive it will be for users, though those same users find Notice how I didn't use Regarding The cons of
For the record: while discussing with @isidentical, we found that in the straight-to-cache scenario something like
would do what the user expects: it will cache the There is a weird association with
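As a conceptual illustration of what a straight-to-cache transfer amounts to (a toy sketch, not DVC's actual implementation): hash the file in streaming fashion, then move it into a content-addressable layout keyed by the digest, so a second copy of the data is never created on disk:

```python
import hashlib
import os
import shutil


def transfer_to_cache(src_path, cache_dir):
    """Toy sketch of a straight-to-cache transfer: stream-hash the file,
    then MOVE it into a content-addressable layout (first two hex chars
    of the digest as a subdirectory), similar in spirit to DVC's
    .dvc/cache layout. Not DVC's real code."""
    md5 = hashlib.md5()
    with open(src_path, "rb") as f:
        # Read in 1 MiB chunks so huge files never fully load into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    dst = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src_path, dst)  # move, not copy: avoids a second copy on disk
    return digest
```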
After a bit more discussion, we decided that just bare The issue regarding it is the combination of it with
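For context, the bare-flag interface being converged on in this discussion might be sketched like this (a hedged example; the exact flag names and semantics were still being worked out in this thread):

```shell
# Transfer into the (possibly external) cache without a workspace copy.
dvc add s3://bucket/data --to-cache

# Transfer straight to the DVC remote, never touching the local machine.
dvc add s3://bucket/data --to-remote
```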
Hi! Re-discovered this from docs (and mentioned in demo). Some comments and questions:
@efiop Wouldn't #4527 be a requirement for applying this to imports? Also, the
Again, wouldn't #4527 cover that somewhat?
This is only meaningful when you have a custom cache outside the DVC project, right? Does that have any special implementation implications? (Prob not.)
@isidentical those names sound more like a disconnected utility, similar to
Agree on avoiding
I think just
I definitely agree on letting --to-remote take an optional remote name as argument instead.
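That option might look like this (hedged sketch; `myremote` and the paths are made-up names):

```shell
# Transfer to the project's default remote.
dvc import-url s3://bucket/data data --to-remote

# Transfer to an explicitly named remote instead.
dvc import-url s3://bucket/data data --to-remote -r myremote
```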
So I answered some of my questions (
Hey, sorry for the delay.
I like the idea of a new status/message for this, yes. I was thinking more about the
Support for remote status in regular
@isidentical #5343 is the last part we are missing (not counting docs) until we can close this ticket, right? Just checking.
Yes
Sure. I'm not suggesting that. I'm saying that the "change outs" message is misleading. No outputs have changed. But I guess that can happen in other circumstances too, so it's also out of scope anyway. But do you think we should try to address this (separately)?
@jorgeorpinel Sorry, didn't mean to dismiss it like that 🙁 That was just for the record, no bad intention. Sure, we should reconsider the status output, it is quite obsolete. Doesn't seem like there are good simple action points there though, we'll need a general redesign of the output to address multiple complaints that we've collected over the years. Looks like it is better to create an epic to collect all complaints instead of creating yet another ticket. Though we can always repurpose this one for an epic, no problem.
Howdy again! Something came up in https://discord.com/channels/485586884165107732/563406153334128681/806673366416883733 about this recently: what happens if you set up a remote, then set it as external cache, e.g. with
@jorgeorpinel Nothing, it all gets transferred to the remote directly. So those options don't affect
OK, it's just confusing because the result is that the data gets transferred to the remote, which happens to be your (external) cache, so the data gets "cached" (even when you told DVC not to) but without calculating hashes. I wonder if I guess you can just use
@jorgeorpinel
Hi folks, I was the one discussing with @jorgeorpinel on the Discord channel. I used DVC in the past, so I have some experience with it. We use a server to train ML models with many GPUs, hence I configured DVC as a Shared Development Server. The project/workspace is on a small SSD and the DVC cache is on a big NAS HDD. The problem is that the pipeline I created to process the raw dataset (~3 TB) and generate the data used for training (~6 TB) is too much for the SSD, and I need/want to have it inside the project folder. The solution is to have a symlink pointing to the data in the NAS.

**Tracking data directly**

Before using the DVC pipeline I tracked the data with a:

```shell
dvc config cache.type hardlink,symlink
dvc add /mnt/nas/data
dvc move /mnt/nas/data ~/project/data
```
I guess the
**Using a pipeline**

But now, I created a pipeline which downloads (raw data comes from public datasets) and processes the data. In this scenario I have a:

```yaml
stages:
  download-data:
    cmd: wget https://<some-repo> -P /mnt/nas/raw_data
    outs:
      - /mnt/nas/raw_data:
          cache: false
          persist: true
  proc-data:
    cmd: ./proc_data.sh --input /mnt/nas/raw_data --output /mnt/nas/waves
    outs:
      - /mnt/nas/waves
```

This works fine, but it's tracking the
If I do the following:

```yaml
...
  proc-data:
    cmd: ./proc_data.sh --input /mnt/nas/raw_data --output waves
    outs:
      - waves
```

It will track the data where I intended, having a symlink That's why the Having a
Hey Ale, nice chatting with you, and thanks for posting here instead for visibility. My comments:
I think you meant
I think that Or, if you meant setting up a remote with the same path as your external cache (error-prone), and then using
Correct, this issue doesn't affect pipelines at all. But I guess it's a good idea to consider implementing
Yes, DVC doesn't change your code and can't capture/redirect file system write operations, at least for now (we accept feature requests).
Yeah that doc needs an update, thanks for the heads-up! I'm starting to review it in iterative/dvc.org/pull/2154
You are correct.
Not a good idea 🙂 the cache could potentially get corrupted, and it's just confusing! We do sin a little in that we kind of do just that (merge the remote and cache concepts) for external outputs, which is part of the reason why we don't really recommend those anymore.
Yes, my bad. I forgot the
I think there will be, because when checking out, the folder being tracked will change (aka
Thanks for solving my other doubts. I could modify the training script, but it's not a good idea. What I will do is manually symlink the data folder inside the project.
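The manual-symlink workaround described here might look like the following (a sketch only; the paths are taken from earlier in this thread, not a general recommendation):

```shell
# Keep the heavy pipeline output on the NAS, but expose it inside the
# project workspace so stages can refer to it by a relative path.
ln -s /mnt/nas/waves ~/project/waves
```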
Exactly, I wasn't referring to the cache.
Ah I see what you mean. Yes, the external data itself should not be shared by several projects! External outputs are considered part of the extended workspace and project workspaces can't overlap, naturally. That's another reason why we don't recommend external outputs except for very specific uses where absolutely no other option is available.
@alealv there is a workaround for this problem - https://github.com/PeterFogh/dvc_dask_use_case . You could use |
We often see people trying to use `--external` to add some big dataset that they have on an external drive, where they also have an external cache dir. People often do that because they can't/don't want to copy their data into the dvc repo to `dvc add` it normally, e.g. because their HDD/SSD won't be able to physically fit two copies of that dataset. Same thing with s3/gs/etc, where people want to just move data straight to their remote, without having to download/add/push it, because, again, it might not even fit on their local machine.

That's why it would be great to introduce feature(s) to be able to move (or copy) data straight to cache/remote from its original location. Potentially this is not only useful for `dvc add`, but also for `dvc import[-url]`, where you want to use some data (e.g. through streaming with our API) in your project that won't fit on your machine.

Related to #3920
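For the streaming use case mentioned above, DVC's Python API can read a tracked file without materializing the whole dataset locally. A hedged sketch (the repo URL and file path are made up for illustration):

```python
import dvc.api

# Stream a DVC-tracked file directly from its remote storage,
# reading line by line instead of downloading the whole dataset.
# Repo and path below are hypothetical examples.
with dvc.api.open(
    "data/big.csv",
    repo="https://github.com/example/dataset-registry",
) as f:
    n_lines = 0
    for line in f:
        n_lines += 1  # replace with real per-line processing
```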