Cloud versioning #7995
The versioning cloud cache is more like a workspace than a cache space.

Thanks to great work (e.g. #8106, etc.) by @pmrowla, we now have a working POC.
@dberenbaum Can we have here some high-level examples of how we want the common use cases to look (which commands/flags would be used, how the metadata looks, etc.)? My understanding and doubts regarding the current status/proposals:

Setup: common for all use cases.

Adding data to the project: depends on where the existing data is located.
- Local. Resulting …
- Remote. Not even sure how this works conceptually. Resulting …

Track updates: my understanding is that, no matter the origin of the data, we should have reached an equivalent status for the project. Depends on where the updates were done:
- Local
- Remote
Good questions @daavoo. Here are some minor clarifications:

You only need one or the other of the … Also, for #8826 (or if you do …)

This was more of a rough suggestion. It sounds like the local and remote info may be better kept separate. I'm hoping the team can discuss and offer a better schema.

Yes. For a … For a …

Integration with non-DVC tools. Example: iterating between a data labeling tool and DVC (for model training).

Integration with non-DVC users. Example: making a data registry where consumers don't need to know DVC.

Use pipelines while keeping data in the cloud (#8411). Example: I want to use DVC pipelines, but I don't want to move my data that is already in the cloud or refactor my code that reads/writes from/to the cloud directly (or the data is too big to keep a local copy).
If I understand correctly, @dberenbaum … I'm not sure if we already have a place in the docs for this, but we can move your examples there.
🤔 There are some critical differences:
Thoughts on the different scenarios mentioned above:

Integration with non-DVC tools: for most use cases, it seems like the better path here is to use DVC to track the annotations and not bother tracking the raw data, which is often immutable/append-only. I also don't see that tools like Label Studio currently support cloud version IDs.

Integration with non-DVC users: with or without …

Pipelines: this one might make sense to support in the future with …
Idea
Cloud storages (S3, GCS, Azure Blob) do support file versioning. Why not use cloud versioning at the storage level instead of the DVC cache, while keeping the DVC cache locally?
Motivation: because of DVC's name, many users consider DVC a data management and data versioning tool. They expect DVC to work with their existing datasets (for example, an existing directory in S3). They try DVC and stop using it because DVC requires them to convert the dataset into the DVC cache format.
Solution
DVC should use cloud versioning whenever possible, while keeping the local cache as is.
More details:
Human-readable cloud storage:
```
.
├── dir1
│   └── f2
├── dir2
│   └── f3
└── f1
```

`f1` has 4 versions, `dir1/f2` has 2 versions, and `dir2/f3` has 3 versions. 3 snapshots are versioned by DVC & Git.

Note that some of the file versions are not versioned by DVC but still exist in cloud storage: `ver 3` of `f1` and `ver 2` of `dir2/f3`.
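The snapshot bookkeeping above can be modeled with plain sets. This is a toy illustration, not DVC code; the concrete version labels pinned by each snapshot are assumptions (the diagram only states the totals and that `ver 3` of `f1` and `ver 2` of `dir2/f3` fall outside DVC):

```python
# All versions present in cloud storage, per file (hypothetical labels).
cloud = {
    "f1": {"ver 1", "ver 2", "ver 3", "ver 4"},
    "dir1/f2": {"ver 1", "ver 2"},
    "dir2/f3": {"ver 1", "ver 2", "ver 3"},
}

# Versions pinned by the 3 DVC/Git snapshots (assumed pinning, for illustration).
snapshots = [
    {"f1": "ver 1", "dir1/f2": "ver 1", "dir2/f3": "ver 1"},
    {"f1": "ver 2", "dir1/f2": "ver 2", "dir2/f3": "ver 3"},
    {"f1": "ver 4", "dir1/f2": "ver 2", "dir2/f3": "ver 3"},
]

def unversioned(cloud, snapshots):
    """Cloud versions that no DVC snapshot references."""
    pinned = {(path, snap[path]) for snap in snapshots for path in snap}
    return {(path, v) for path, versions in cloud.items()
            for v in versions if (path, v) not in pinned}

print(sorted(unversioned(cloud, snapshots)))
# [('dir2/f3', 'ver 2'), ('f1', 'ver 3')]
```

The point is that the bucket can legitimately hold more versions than DVC knows about, and DVC only needs its pinned subset to stay intact.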
User workflow
The user workflow should stay the same.

New change: `dvc push` should recognize whether the storage supports versioning (it should be enabled at the bucket level), and if versioning is enabled, DVC should copy the file and create a new version. As a result, a user will see a file `s3://mybucket/dvcstore/data.xml` instead of `s3://mybucket/dvcstore/f3/6e4c9d82b2fd7d8680271716d47406`. If a file is modified and pushed again, the user will see the same `s3://mybucket/dvcstore/data.xml`, but the version will change.

If versioning is not supported (by the bucket, or by the storage type, like NFS), DVC should create a regular cache dir in the cloud.
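The push behavior described above can be sketched as follows. This is a minimal stand-in, not DVC's actual implementation: `FakeRemote`, `push`, and the integer "version id" are all hypothetical; a real versioned bucket returns opaque version IDs.

```python
import hashlib

class FakeRemote:
    """Toy stand-in for a cloud bucket; not a real DVC or cloud-SDK API."""
    def __init__(self, versioning_enabled):
        self.versioning_enabled = versioning_enabled
        self.objects = {}  # key -> list of stored content versions

    def put(self, key, data):
        self.objects.setdefault(key, []).append(data)
        return len(self.objects[key])  # toy "version id"

def push(remote, path, data):
    """Push one file, choosing between version-aware and cache-style layout."""
    if remote.versioning_enabled:
        # Keep the human-readable name; the bucket tracks the versions.
        version_id = remote.put(path, data)
        return path, version_id
    # Fall back to the regular content-addressed cache layout.
    md5 = hashlib.md5(data).hexdigest()
    key = f"dvcstore/{md5[:2]}/{md5[2:]}"
    remote.put(key, data)
    return key, None

versioned = FakeRemote(versioning_enabled=True)
print(push(versioned, "data.xml", b"v1"))  # ('data.xml', 1)
print(push(versioned, "data.xml", b"v2"))  # same key, new version: ('data.xml', 2)

plain = FakeRemote(versioning_enabled=False)
print(push(plain, "data.xml", b"v1"))      # content-addressed key, no version id
```

The branch on `versioning_enabled` is the whole proposal in miniature: same user command, two storage layouts.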
Internals
DVC has to support a mapping from filenames to cloud version IDs in the .dvc/lock files and perform all operations (pull/push/checkout) in the regular way.
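As a rough sketch, a lock-file entry might pair the usual checksum with the cloud version ID. The field names and values below (`remote_path`, `version_id`, the hashes) are hypothetical placeholders, not a finalized schema:

```yaml
outs:
  - path: data.xml
    md5: 22a1a2931c8370d3aeedd7183606fd7f   # placeholder checksum
    size: 14445097
    # hypothetical cloud-versioning fields
    remote_path: s3://mybucket/dvcstore/data.xml
    version_id: "3sL4kqkNjrRBcANEMrpVPLJ"
```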
The source of truth for versioning?

DVC versions a set of files as a single snapshot/commit, not individual files. The Git commit with its .dvc/lock files is the source of truth for versioning.

Some file modifications might happen directly in cloud storage. Such modifications should not break the versioning unless a particular version of a file is removed.

Note that the files in a cloud directory do not necessarily represent the newest version of the dataset/workspace, because someone can overwrite a file (creating a new version), like creating Ver 5 of File 1 in the diagram above without committing it.
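To make the source-of-truth point concrete, here is a toy model (hypothetical names, not DVC code): checkout resolves each file through the version ID pinned in the lock data, so an uncommitted overwrite in the bucket never leaks into the workspace.

```python
# Toy versioned store: path -> {version_id: content}. "v5" is an
# uncommitted overwrite that exists only in the bucket.
store = {
    "f1": {"v1": b"a", "v2": b"b", "v3": b"c", "v4": b"d", "v5": b"uncommitted"},
}

# Lock data for the latest snapshot pins f1 at v4, even though v5 exists.
lock = {"f1": "v4"}

def checkout(store, lock):
    """Materialize exactly the versions pinned by the lock file."""
    return {path: store[path][version] for path, version in lock.items()}

workspace = checkout(store, lock)
print(workspace)  # {'f1': b'd'} -- the pinned v4, not the latest v5
```

Resolving by pinned version rather than by "latest" is what lets direct cloud modifications coexist with reproducible checkouts.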
Out of scope

- Pulling changes made directly in cloud storage back into DVC (e.g. `dvc update-from-s3-head`).

Technical risks

- Cloud versioning support in the `fsspec` library that DVC uses to access storages.

From @pmrowla:
a. We might need to add cloud versions of files to the .dvc/lock file in addition to md5 checksums.

b. Is it possible to handle directories as is? E.g. `dir1/dir2/f.csv` stored at `s3://mybucket/dvcstore/dir1/dir2/f.csv`.

- It looks like .dir files are needed only for the local dvc-cache to reflect the cloud storage structure, so cloud directories work as is. Local/dvc-cache directories should be handled as usual in DVC.
- A shortcut: push the .dir files to the cloud (an okay solution for this proposal that can be improved later).
- A possible later improvement: a special mode for repositories/projects with a small number of files that saves all meta info to .dvc and dvc.lock (similar to the `--run-cache` option).

Related links