Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import: Allow pushing imported files to remote #4527

Closed
charlesbaynham opened this issue Sep 3, 2020 · 5 comments
Closed

import: Allow pushing imported files to remote #4527

charlesbaynham opened this issue Sep 3, 2020 · 5 comments
Labels
feature request Requesting a new feature help wanted p2-medium Medium priority, should be done, but less important

Comments

@charlesbaynham
Copy link
Contributor

charlesbaynham commented Sep 3, 2020

Currently, files imported by dvc import come into the local cache, but do not get pushed to any remotes. This can be useful in some cases, however at other times it might be desirable to make full backups of imported files, while still retaining the ability to update the local copies via dvc update.

Discussion with @efiop in Discord at https://discordapp.com/channels/485586884165107732/563406153334128681/751199400931491900 and https://discordapp.com/channels/485586884165107732/565699007037571084/751203534351106119.

To provide this option, it should be enough to just alter

dvc/dvc/output/base.py

Lines 463 to 467 in 79e8e4e

if self.stage.is_repo_import:
cache = NamedCache()
(dep,) = self.stage.deps
cache.external[dep.repo_pair].add(dep.def_path)
return cache
to not always skip imports, according to some flag in the dvcfile.

For the CLI, we though dvc import --backup conveys the meaning. For the DVCfile, an addition to outs:

deps: ...
outs:
- md5: 123456
  path: my_file.txt
  backup: True

So for imports with backup: True, dvc knows:

  • On update / import it should store the files in the local cache (like it already does)
  • On push it should transfer them to the remote (this is new, but easy to implement by avoiding the code branch linked above)
  • On fetch / pull treat these like normal files from e.g. dvc add, so do not interact with the import source, only with the DVC remote

The idea is for imports with --backup to be severed from their import source completely unless the user calls dvc update explicitly.

@efiop efiop added feature request Requesting a new feature good first issue help wanted p2-medium Medium priority, should be done, but less important labels Sep 3, 2020
@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 19, 2020

The backup entry in outs should be added for all outs, with the same meaning: that way you can tell just by looking at outs in a dvc file how the file will behave when you push / pull it, without imports having to be treated differently to normal files.

To keep backwards compatibility, dvc files without backup entries should have default values of backup = False for imports and backup = True for everything else.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 19, 2020

So the plan of action is:

  • Add backup option to outs in dvc files with default setting dependant on the type of file
  • Change dvc push logic to remove custom logic for imports, replace with skipping push to remote if backup == False
  • Change fetch logic to only update the import from its source if backup == False. Display error (warning?) if files aren't present in remote or in cache, just like for ordinary files (i.e. follow the normal path for files added via dvc add)

That third point could instead have been "only update the import if backup==False or files aren't present in the cache/remote", but I think it's better as above. This way you know for sure that dvc pull only interacts with your remote whereas dvc update goes and fetches your imports.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 19, 2020

  • Also, somewhere there needs to be a check for incompatible options in outs - if a dvc file has cache: False and backup: True in its outs section, that's impossible and should throw an error.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 21, 2020

The idea is for imports with --backup to be severed from their import source completely unless the user calls dvc update explicitly.

Does update reset backup to false? What about re-importing (running import again but without --backup?

The backup entry in outs should be added for all outs, with the same meaning... without imports having to be treated differently to normal files.
To keep backwards compatibility, dvc files without backup entries should have default values of backup = False for imports and backup = True for everything else.

So if we're still having to differentiate .dvc files based on contents of deps (regular vs. import) for backward compatibility, doesn't that defeat the purpose of adding the field to all outs going fwd for consistency? I'm of the inclination that backup should only have an effect on import .dvc files, TBH as it seems pretty edge and not something mainstream we want to be super visible.

Also, what about outs in dvc.yaml files? would those also always have the backup field? (This would affect docs BTW) — again, feels like a big change for a small use case.

@charlesbaynham
Copy link
Contributor Author

@jorgeorpinel I'll answer in the PR, I'm getting confused between here, the PR and the docs PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature help wanted p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants