
support adding/transferring data straight to cache/remote #4520

Closed
Tracked by #5367
efiop opened this issue Sep 2, 2020 · 40 comments

@efiop
Contributor

efiop commented Sep 2, 2020

We often see people trying to use --external to add some big dataset that they have on an external drive, where they also have an external cache dir. People often do that because they can't or don't want to copy their data into the dvc repo to dvc add it normally, e.g. because their HDD/SSD physically can't fit two copies of that dataset.

The same goes for s3/gs/etc., where people want to move data straight to their remote without having to download/add/push it, because, again, it might not even fit on their local machine.

That's why it would be great to introduce a feature (or features) to move (or copy) data straight to the cache/remote from its original location. Potentially this is useful not only for dvc add but also for dvc import[-url], where you want to use some data in your project (e.g. through streaming with our API) that won't fit on your machine.

Related to #3920

@efiop efiop added feature request Requesting a new feature p2-medium Medium priority, should be done, but less important labels Sep 2, 2020
@jorgeorpinel
Contributor

jorgeorpinel commented Nov 3, 2020

And yes, it seems like this feature could be valuable. It sounds like a "lightweight" data registry (see https://dvc.org/doc/use-cases/data-registries): a way for people to use DVC as an intermediary between data already hosted on cloud storage and other projects, so they can access the data with a unified interface (DVC CLI or API).

@isidentical
Contributor

Hey @efiop, will this copy/transfer operation be a new command, or part of an already existing one?

Also, may I request 2 sample command-group workflows (some sort of small, self-contained replication of the whole process, so that I can test it in my local environment)? One showing the current workflow, where people try to do this across 2 different local file system locations, and one showing what they will execute once this issue is resolved.

@efiop
Contributor Author

efiop commented Dec 14, 2020

@isidentical I'm not 100% sure whether this fits into some existing command; would love to hear your thoughts on it.

Sure, here are some examples. Please make sure you are familiar with our workflow already (see our get-started guide).

straight-to-remote scenario:

Imagine having some file/directory on s3 (or another cloud) that you want to add to your dvc repo and push to your remote (say it is an s3 remote too, e.g. dvc remote add -d mys3 s3://bucket/dvc-remote). You would need to:

aws s3 cp s3://bucket/path/to/data data
dvc add data
dvc push

but now what if data is so big that it doesn't fit as a whole on your machine? You won't be able to perform the same set of commands. DVC is able to pull partial data (e.g. if data is a directory: dvc pull data/subdir) or provide streaming access to the data on the remote (e.g. if data is a file, you could use api.open/read), but we need to somehow shove that giant file into the remote first. So we need some way to put that data into the remote without downloading it fully locally.

I have to stress that this new operation, when it is done, should leave the dvc repo in a state indistinguishable from the regular workflow described above, with the exception that the data won't be in your local cache or in your workspace, only in the remote. (You'll have data.dvc though, so if you ever move to another machine that can fit the whole file, dvc pull should be able to bring it in and check it out as normal.)

Regarding the particular command name and semantics - it is up to you to research and suggest the CLI design for it 😉
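
For reference, a minimal sketch of the partial-pull and streaming access mentioned above, once the data only exists on the remote (paths here are hypothetical):

# granular pull: bring down only part of a tracked directory
dvc pull data/subdir
# streaming read of a tracked file via the Python API, without a full download
python -c "
import dvc.api
with dvc.api.open('data/file.csv') as f:
    print(f.readline())
"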

straight-to-cache scenario:

Now imagine you have a giant HDD and a small SSD that you keep your dvc repo (aka workspace) on. Since the SSD is too small, you set up a dvc cache on the HDD (see https://dvc.org/doc/use-cases/shared-development-server) (say /storage/dvc-cache) and use symlinks to link from the cache into your local workspace (see https://dvc.org/doc/user-guide/large-dataset-optimization). Now say you have a giant dataset in /storage/some/path/data that you want to add to your repo, as if you were able to run:

cp /storage/some/path/data data
dvc add data

Your first instinct might be to dvc add /storage/some/path/data, but that file is outside of the dvc repo, so dvc will complain. At this point people often misuse --external and run dvc add /storage/some/path/data --external, but that makes /storage/some/path/data a part of your workspace (dvc will manage it in place) instead of having data in your dvc repo. So it would be nice to have a way to transfer that file into your workspace. The solution for this might be related to the straight-to-remote solution, or it might be a separate command/flag; it is unclear right now. If you wish, we can focus exclusively on either straight-to-remote or straight-to-cache for now, to narrow the scope, but it is useful to keep in mind that these might have a common solution/interface.
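
A minimal sketch of the setup described above (paths are only examples):

dvc cache dir /storage/dvc-cache    # keep the cache on the big HDD
dvc config cache.type symlink       # link from the cache into the SSD workspace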

@isidentical
Contributor

isidentical commented Dec 16, 2020

As a command-line model, I am thinking of something like this:

dvc transfer to-remote [-r] <from> <file>
dvc transfer to-cache <from>
# straight-to-remote example
dvc transfer to-remote s3://bucket/path/to/data data
# straight-to-cache example
dvc transfer to-cache /storage/some/path/data

@isidentical
Contributor

I am a bit blocked on the user interface; even though I could start writing tests and migrate them later, I'd prefer to work on the interface first. Here are the ideas that came and went during meetings:

  • dvc export-to-remote <url> <name>? -r <remote>
  • dvc add <name> --straight-to-remote --from-url <url> -r <remote>
  • dvc import-url <url> <name> --straight-to-remote -r <remote>

I see amending dvc add as the least likely option, since it would be too complicated to adjust it with 2 mandatory flags where one doesn't work without the other (--straight-to-remote, --from-url).

@efiop said import-url tracks the original location, which we would ignore when the straight-to-remote functionality is activated, so it might also not be the ideal option.

For export-to-remote, even though it is a new command, @shcheklein suggested that the semantics are not clear from the name, which I definitely agree with. Maybe we could re-use the import keyword, something like import-to-remote (since, as @efiop already stated, this is what it does), though import is already associated with both track + download operations.

@efiop
Contributor Author

efiop commented Jan 6, 2021

Straight-to-remote would actually be useful both in the dvc add scenario and in dvc import-url, and in the latter it should still track the original location as it does right now. This makes me wonder if it should indeed be a flag for both dvc add and dvc import-url. E.g.

dvc add s3://bucket/path -o mydata --straight-to-remote -r myremote  # would create mydata.dvc or path.dvc if no -o specified
dvc import-url s3://bucket/path mydata --straight-to-remote -r myremote  # would create mydata.dvc or path.dvc if no out is provided

The name of the flag looks a bit ugly though 🙂 And I'm not sure how intuitive it will be for users, though those same users find --external somehow and misuse it all the time 🙂

Notice how I didn't use --from-url in dvc add, because it seems unintuitive, and users already misuse --external with dvc add s3://bucket/path --external, so dvc add s3://bucket/path might be the intuitive way to add something from an external location.

Regarding dvc export-to-remote: it creates an association with uploading dvc-tracked data to an external location, at least for me. Kind of like when you need to deploy a model to s3 from your dvc repo.

The con of --straight-to-remote for add/import-url is that it overloads the add/import-url CLI, which might be too much. There might be a better way, but if not, this one seems acceptable, even if just in experimental mode, until we find a better name.

@efiop
Contributor Author

efiop commented Jan 6, 2021

For the record: while discussing this with @isidentical, we found that in the straight-to-cache scenario something like

dvc add /external/file -o data

would do what the user expects: it will cache the file (aka transfer it to the cache) and make it known as data in the root of the repo. If the user has configured an external cache with, say, symlinks, then data will be a symlink too, so you can add a giant file to your project this way.
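
A sketch of the full flow, assuming -o behaves as described and the external cache/symlink setup from the straight-to-cache scenario above:

dvc cache dir /storage/dvc-cache
dvc config cache.type symlink
dvc add /storage/some/path/data -o data   # transfers the file to the cache and tracks it as ./data
ls -l data                                # data -> /storage/dvc-cache/... (a link, not a second copy)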

There is a weird association with --no-pull for dvc add/import-url when discussing --straight-to-remote, but that's not quite it. Maybe --transfer? Not finding the perfect name/ui for it so far 🙁

@isidentical
Contributor

isidentical commented Jan 6, 2021

There is a weird association with --no-pull for dvc add/import-url when discussing --straight-to-remote, but that's not quite it. Maybe --transfer? Not finding the perfect name/ui for it so far 🙁

After a bit more discussion, we decided that just a bare --to-remote (without straight) might also be an option.

The issue is that combining it with -r looks a bit odd: dvc add s3://bucket/file --to-remote -r my_other_remote.
We could potentially reduce both --to-remote and --remote into just --to-remote <remote (optional)>, but that wouldn't be very consistent with other actions (pull, push, etc.).
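
Side by side, the two shapes under discussion (hypothetical syntax):

dvc add s3://bucket/file --to-remote -r my_other_remote    # flag plus separate -r
dvc add s3://bucket/file --to-remote my_other_remote       # remote name as an optional argument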

@jorgeorpinel
Contributor

jorgeorpinel commented Jan 13, 2021

Hi! Re-discovered this from docs (and mentioned in demo). Some comments and questions:

but also for dvc import[-url]
import-url tracks the original location where we would ignore it when straight-to-remote... something like import-to-remote
it should still track the original location

@efiop Wouldn't #4527 be a requirement for applying this to imports?
Otherwise indeed, it doesn't seem to make sense to enable this for import[-url] (if the result is exactly the same as for dvc add).

Also, the import-to-remote only sounds correct for the import case, what if you don't want to save the original location ("just add" case)? Cc @isidentical Personally, I would focus on add only at first, but up to you.

straight-to-remote scenario... aws s3 cp s3://bucket/path/to/data data; dvc add data; dvc push

Again, wouldn't #4527 cover that somewhat? Maybe you could do import --no-exec with --store somehow to reach the same "straight-to-remote" outcome.

straight-to-cache scenario

This is only meaningful when you have a custom cache outside the DVC project, right? Does that have any special implementation implications? (Prob not.)

dvc transfer
dvc export-to-remote

@isidentical those names sound more like a disconnected utility, similar to dvc get (cc @shcheklein). If this is about tracking data in the process (just skipping placing it in the cache and/or workspace), then I agree (as implemented) that it makes more sense as a flag for dvc add. (No longer sure, because add would include push here, which may be too much...)

add --straight-to-remote --from-url ... too complicated

Agree on avoiding --from-url by implying --external in --to-remote, if I understand this correctly.

users find --external somehow and misuse it all the time

  • DVC gives a hint to try that, I think 🙂 BTW would we be able to give a hint about --to-remote when needed?

dvc add /external/file -o data would do what the user expects

  • Except that you can't check it out to the workspace, right? But def. there are several similar operations, and maybe we should reconsider them to see how they can be consolidated... It probably won't be obvious to most users which one is appropriate.

Maybe --no-pull or --transfer? Not finding the perfect name/ui for it so far

I think just add --to-remote is short and accurate enough. Another idea is to have a flag for each case: --no-cache or --skip-cache [straight to remote] AND --no/skip-local/workspace/wdir [straight to cache] (the former includes the latter).

We could potentially reduce both --to-remote and --remote

I definitely agree on letting --to-remote take an optional remote name as argument instead.

@jorgeorpinel
Contributor

So I answered some of my questions (scratched above) while reviewing the UI in #5198. I have one thing to add though, after realizing that add --to-remote includes push.

  1. Is this too much functionality in add? Maybe a standalone command for this is best after all... (dvc store/backup [--external]?)
  2. What would be the result if pushing fails (does dvc push complete the operation after fixing the remote, or do you have to try again with -f)?
  3. Is there any alternative workflow, e.g. import --no-commit first, and push separately (again, maybe via import: Allow pushing imported files to remote #4527)? I think it would be ideal to understand all the ways we deal with external data and try to consolidate them, because it's getting confusing 🙂

@jorgeorpinel
Contributor

Hey sorry for the delay.

The current behavior says they are not in cache, which is actually true. I guess the only thing we can do is somehow make it explicit? Like not in local cache?

I like the idea of a new status/message for this, yes. I was thinking more about the changed outs: grouping, as they're not "changed". Maybe transferred:? Idk if it complicates things too much.

@efiop
Contributor Author

efiop commented Feb 2, 2021

Support for remote status in regular dvc status is something that we've been talking about for a while (#5369). It will naturally support transferred data once we implement it. Def not part of this issue.

@efiop
Contributor Author

efiop commented Feb 2, 2021

@isidentical #5343 is the last part we are missing (not counting for docs) until we can close this ticket, right? Just checking.

@isidentical
Contributor

@isidentical #5343 is the last part we are missing (not counting for docs) until we can close this ticket, right? Just checking.

Yes

@jorgeorpinel
Contributor

jorgeorpinel commented Feb 2, 2021

Support for remote status in regular dvc status is something that we've been talking about... Def not part of this issue.

Sure, I'm not suggesting that. I'm saying that the "changed outs" message is misleading: no outputs have changed. But I guess that can happen in other circumstances too, so it's out of scope anyway. Still, do you think we should try to address this (separately)?

@efiop
Contributor Author

efiop commented Feb 2, 2021

@jorgeorpinel Sorry, didn't mean to dismiss it like that 🙁 That was just for the record, no bad intention.

Sure, we should reconsider the status output; it is quite obsolete. There don't seem to be good, simple action points there though; we'll need a general redesign of the output to address the multiple complaints we've collected over the years. Looks like it is better to create an epic to collect all the complaints instead of creating yet another ticket. Though we can always repurpose this one as an epic, no problem.

@efiop efiop closed this as completed Feb 3, 2021
@jorgeorpinel
Contributor

Howdy again! Something came up in https://discord.com/channels/485586884165107732/563406153334128681/806673366416883733 about this recently: what happens if you set up a remote, then set it as the external cache, e.g. with dvc config cache.local (or any other kind), and finally add/import something straight --to-remote? I should try it before asking, but I don't have time this instant. Seems like an unintended confusion, or maybe even a potentially problematic circular situation.
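
For concreteness, a sketch of the setup I mean (names and paths are hypothetical):

dvc remote add nas /mnt/nas/dvc-storage    # a "local remote"...
dvc config cache.local nas                 # ...also used as the external cache
dvc add s3://bucket/data --to-remote -r nas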

@efiop
Contributor Author

efiop commented Feb 3, 2021

@jorgeorpinel Nothing, it all gets transferred to the remote directly. So those options don't affect --to-remote at all.

@jorgeorpinel
Contributor

jorgeorpinel commented Feb 4, 2021

OK, it's just confusing because the result is that the data gets transferred to the remote, which happens to be your (external) cache, so the data gets "cached" (even though you told DVC not to) but without calculating hashes. I wonder if dvc pull then skips the download (the data is already "in cache") and no hashes are ever calculated, or something like that.

I guess you can just use dvc checkout? Sorry, I'm just thinking out loud. I should QA this properly... The point is that there may be unintended side effects to keep in mind.

@efiop
Contributor Author

efiop commented Feb 4, 2021

@jorgeorpinel With dvc pull you'd need to download it from the remote first.

@alealv

alealv commented Feb 4, 2021

Hi folks,

I was the one discussing with @jorgeorpinel on the Discord channel. I used DVC in the past, so I have some experience with it.
On this new project, I face the straight-to-cache issue mentioned above.

We use a server to train ML models with many GPUs, hence I configured DVC as a Shared Development Server. The project/workspace is on a small SSD and the DVC cache is on a big NAS HDD.

The problem is that the pipeline I created to process the raw dataset (~3TB) and generate the training data (~6TB) is too much for the SSD, and I need/want to have it inside the project folder. The solution is to have a symlink pointing to the data on the NAS.

Tracking data directly

Before using a DVC pipeline, I tracked the data with a data.dvc file and achieved what I intended in the following way:

dvc config cache.type hardlink,symlink
dvc add /mnt/nas/data
dvc move /mnt/nas/data ~/project/data

I guess the straight-to-cache solution you were proposing would simplify the process.


The rest isn't related to this issue so I hid it.

Using a pipeline

But now I have created a pipeline which downloads (the raw data comes from public datasets) and processes the data. In this scenario I have a dvc.yaml file similar to this one (I don't include the original because it has 158 lines):

stages:
  download-data:
    cmd: wget https://<some-repo> -P /mnt/nas/raw_data
    outs:
      - /mnt/nas/raw_data:
          cache: false
          persist: true

  proc-data:
    cmd: ./proc_data.sh --input /mnt/nas/raw_data --output /mnt/nas/waves
    outs:
      - /mnt/nas/waves

This works fine, but it tracks the /mnt/nas/waves directory, which is outside the project. This brings some problems:

  1. Given that I'm on a Shared Development Server, different users will not be able to be on different commits at the same time.
  2. The training scripts expect the data in a specific folder inside the project. I can solve this by creating a symlink by hand, but the data will still be tracked outside the repository.

If I do the following:

...
  proc-data:
    cmd: ./proc_data.sh --input /mnt/nas/raw_data --output waves
    outs:
      - waves

It will track the data where I intended, with a symlink waves -> /mnt/nas/dvc-cache/0f/<hash>.dir, which is what I want. But not before generating all the wave files on the SSD and then transferring them to the external cache.

That's why the straight-to-cache option would be optimal. But as I understand it, you are considering this for some commands but not for pipelines. Am I wrong?

Having a local remote as cache

I also asked @jorgeorpinel about this thread, and about configuring the cache as a remote, because the documentation says:

cache.local - name of a local remote to use as a custom cache directory. (Refer to dvc remote for more information on "local remotes".) This will overwrite the value provided to dvc config cache.dir or dvc cache dir.
https://dvc.org/doc/command-reference/config

I guess with that and this feature it would behave as I want.

To be honest, the difference between a local remote and the cache confuses me a little. AFAIK, if I have a local remote and a cache on the same disk but in different folders, I would have two copies of the data. Am I right? I haven't considered merging these two, or what the implications of that would be.

Sorry for the long explanation, but I wanted to be very clear.

@jorgeorpinel
Contributor

Hey Ale, nice chatting with you, and thanks for posting here instead, for visibility. My comments:

Before using the DVC pipeline
dvc add /mnt/nas/data
dvc move /mnt/nas/data ~/project/data

I think you meant dvc add --external there. BTW that's an interesting use of add + move! 👍

I guess the straight-to-cache solution you were proposing would simplify the process

I think that import-url by itself (no need for --to-cache) already achieves the same result: it copies the data to the (external) cache first, then links it locally (assuming file links are configured in DVC and supported by the FS).
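
For concreteness, a sketch of that flow (hypothetical paths, assuming an external cache with links configured):

dvc config cache.dir /mnt/nas/dvc-cache
dvc config cache.type symlink
dvc import-url /mnt/nas/data data   # copies into the external cache, then links ./data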

Or maybe you meant setting up a remote with the same path as your external cache (error-prone) and then using add --to-remote. As you can see, you would still need 2+ steps, and things get a little confusing. So not sure... It's def. another interesting command interaction! But probably not the intended use here (hacky).

That's why the straight-to-cache option would be optimal. But as I understand it, you are considering this for some commands but not for pipelines. Am I wrong?

Correct, this issue doesn't affect pipelines at all. But I guess it's a good idea to consider implementing to-cache/to-remote for dvc.yaml outs. Thoughts, @efiop @isidentical?


The rest isn't related to this issue so I hid it.

now, I created a pipeline
tracking the /mnt/nas/waves directory, which is outside the project brings some problems...

  1. Actually, there should be no problem for users to check out different versions from the same cache. That's what the shared cache pattern is for 🙂 (The only issue could be file system errors if users try to write the same file at once, which is unlikely.)
  2. Is there absolutely no way for your scripts to read from /mnt/nas/waves instead? Otherwise, my recs are to add a stage, or an extra command in the training stage (cmd: can contain a list) that creates the link for now.

proc_data.sh --input /mnt/nas/raw_data --output waves will track the data where I intended, but not before generating all the wave files on the SSD

Yes, DVC doesn't change your code and can't capture/redirect file system write operations, at least for now (we accept feature requests).

the documentation says:
cache.local - name of a local remote... this confuses me a little

Yeah that doc needs an update, thanks for the heads-up! I'm starting to review it in iterative/dvc.org/pull/2154

if I have a local remote and a cache on the same disk but different folders, I would have a double copy of the data, Am I right?

You are correct.

merging these two, and what are the implications of that

Not a good idea 🙂 — the cache could potentially get corrupted, and it's just confusing! We do sin a little in that we kind of do just that (merge the remote and cache concepts) for external outputs, which is part of the reason why we don't really recommend that anymore.

@alealv

alealv commented Feb 5, 2021

I think you meant dvc add --external there. BTW that's an interesting use of add + move!

Yes, my bad. I forgot the --external.
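
For the record, the corrected sequence (a sketch based on the discussion above):

dvc config cache.type hardlink,symlink
dvc add --external /mnt/nas/data
dvc move /mnt/nas/data ~/project/data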

Actually, there should be no problem for users to check out different versions from the same cache. That's what the shared cache pattern is for 🙂 (The only issue could be file system errors if users try to write the same file at once, which is unlikely.)

I think there will be, because when checking out, the folder being tracked (aka /mnt/nas/data) will change, and it is shared across all users.

Thanks for solving my other doubts.

I could modify the training script, but it's not a good idea. What I will do instead is manually symlink the data folder inside the project.
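
A one-line sketch of that manual workaround (paths as in my pipeline example above):

ln -s /mnt/nas/waves ~/project/waves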

@jorgeorpinel
Contributor

when checking out, the folder being tracked (aka /mnt/nas/data) will change

dvc checkout doesn't alter the cache at all. It reads .dvc files and dvc.yaml/lock files and creates the appropriate links FROM the cache into your workspace 🙂

@alealv

alealv commented Feb 7, 2021

dvc checkout doesn't alter the cache at all. It reads .dvc files and dvc.yaml/lock files and creates the appropriate links FROM the cache into your workspace 🙂

Exactly, I wasn't referring to the cache. /mnt/nas/data will be changed, and it is shared across all users. If I'm training and somebody checks out a different commit, it will corrupt my training.

@jorgeorpinel
Contributor

Ah, I see what you mean. Yes, the external data itself should not be shared by several projects! External outputs are considered part of the extended workspace, and project workspaces can't overlap, naturally. That's another reason why we don't recommend external outputs except for very specific uses where absolutely no other option is available.

I guess https://dvc.org/doc/user-guide/managing-external-data shouldn't link to https://dvc.org/doc/use-cases/shared-development-server#configure-the-external-shared-cache for instructions on configuring an external cache, as shared external caches are not compatible with external outputs. Moved this to iterative/dvc.org#654 (comment).

@shcheklein
Member

@alealv there is a workaround for this problem: https://github.com/PeterFogh/dvc_dask_use_case. You could use the remote://something notation, and ask each person to have their own --local, --global, or --system config for that remote, defining the exact location personally for them.
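
A minimal sketch of that pattern (names and paths are hypothetical):

dvc remote add nas /mnt/shared/nas                     # default definition, committed in .dvc/config
dvc remote modify --local nas url /mnt/my_nas_mount    # per-user override, kept out of git
# stages/outs can then reference remote://nas/<path>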
