Support for dataset level versioning for HDF5 files. #6660
Comments
DVC has long had an open issue to add block-level versioning rather than file-level: #829. Although I can't commit to any timeline on this yet, it remains a high priority and is getting closer to reality since we have recently addressed prerequisites for it. Unfortunately, it's still too far away to recommend you wait for it to be implemented, but it's close enough that I don't think we are likely to adopt some other solution specifically to handle hdf5 with external links. Are you able or willing to put the external links into the same directory as main.h5? If so, you could track the whole directory with DVC, as sketched below. This approach tends to work for users with other partitioned data formats like parquet. Happy to discuss further.
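For reference, a minimal sketch of that workaround, assuming a hypothetical layout where main.h5 and all its externally linked files live under a data/ directory. DVC tracks the directory as a single output but hashes files individually, so an unchanged split file is not re-uploaded on `dvc push`:

```python
import subprocess

# Hypothetical layout: main.h5 and all externally linked files live under data/.
# `dvc add` on a directory creates data.dvc and gitignores the directory itself.
subprocess.run(["dvc", "add", "data"], check=True)
subprocess.run(["git", "add", "data.dvc", ".gitignore"], check=True)
```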
Please have a look at these changes
Nice! So you have this working and are interested in seeing if it's something we would want to merge? It looks like a good way of doing things for your use case, and it would be great to promote it to anyone else who's using HDF5 with external links. TBH I'm still not sure about making it part of the main codebase as-is, since DVC is generally agnostic to file formats for data management; this could open up requests to support other types of metadata files. @efiop what do you think?
Doesn't look like the linked PR is in any way complete; it's more of a draft of how it could work.

@annmary-roy It looks like the main problem you have with dvc is file-level granularity, right? If so, I would maybe wait for #829 to see how well it will work for hdf5 files by treating them just like binaries.

Just to clarify: if we are talking about dvc support for hdf5, we should treat it not as a storage format, but more like a mount point in the filesystem that virtually extends your workspace. In that case, actual underlying data storage should happen in dvc with no hdf5-specific differences, but saving (aka dvc add) would be hdf5-aware.

It is also clear that hdf5 support is not only relevant for data management (dvc add), but also for our pipelines, where one could want to depend on some particular path within the hdf5 file, instead of depending on it as a whole. This case also seems solvable by the "mountpoint" approach I've mentioned before.

Also, it looks like fsspec (the fs backend that we use for our cloud and git operations) could support hdf5 (though there are some details to figure out: fsspec/filesystem_spec#5 (comment)), as it already seems to support similar formats.
Could you share a reference to any work-in-progress code or design for #829?
@suparna-bhattacharya There is none, but we plan on working on it in the future. It will likely be very similar to git packs. |
For the time being we are planning to go ahead with the folder-based approach, where we put all the split files into a single directory and do an explicit dvc add on the folder and on main.h5. An fsspec implementation for hdf5 is something we may consider later.
Converting to discussion since this is not actionable yet |
We are working in the area of AI for science.
HDF5 is one of the file formats used extensively by the scientific community to store data. Data can be organized inside an HDF5 file under different dataset objects, and a single HDF5 file can host multiple datasets.
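For instance, a single file hosting two independent datasets can be created with h5py (file and dataset names here are illustrative):

```python
import h5py
import numpy as np

# One HDF5 file hosting two independent dataset objects.
with h5py.File("main.h5", "w") as f:
    f.create_dataset("ds1", data=np.arange(1000))
    f.create_dataset("ds2", data=np.ones((100, 100)))
```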
While exploring DVC for an AI-for-science use case, we encountered a problem: versioning in DVC happens at file granularity, and since these HDF5 files can be very large, this causes storage and network bottlenecks.
A small change inside any of the datasets of a file would cause the entire file to be duplicated and replicated to the storage remote.
We have a potential solution for how this problem can be addressed for HDF5 files.
HDF5 offers a natural boundary at individual datasets inside the file. It also has an existing feature, external links, by which a dataset residing in a different file can be mounted into the file.
We are planning to split HDF5 files into separate files (using a plugin library) at dataset boundaries and to link these smaller files into the main file as external links. The plugin library can be used alongside the main HDF5 library when the user wants dataset-level versioning.
This will allow versioning at dataset boundaries while still managing the data as a single unit (via the external links inside the main file).
For example: if we previously had a single main.h5 file with datasets ds1 and ds2 residing inside it, we will now have main.h5, which will be a lightweight file, plus main-ds1.split and main-ds2.split. Datasets ds1 and ds2 inside main.h5 will point (through external links) to dataset ds1 in main-ds1.split and ds2 in main-ds2.split, respectively.
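A minimal h5py sketch of that layout (the actual plugin library presumably automates this; names follow the example above):

```python
import h5py
import numpy as np

# Each dataset lives in its own .split file.
with h5py.File("main-ds1.split", "w") as f:
    f.create_dataset("ds1", data=np.arange(1000))
with h5py.File("main-ds2.split", "w") as f:
    f.create_dataset("ds2", data=np.ones((100, 100)))

# main.h5 is now lightweight: it holds only external links to the split files.
with h5py.File("main.h5", "w") as f:
    f["ds1"] = h5py.ExternalLink("main-ds1.split", "/ds1")
    f["ds2"] = h5py.ExternalLink("main-ds2.split", "/ds2")
```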
Any application can access datasets ds1 and ds2 through the main file, so the split is transparent to applications; they need not be aware of it. All major HDF5 operations, like reads and writes, work the same way for applications as if it were one single file.
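Reading through the main file then resolves the links automatically, e.g.:

```python
import h5py

# The application opens only main.h5; HDF5 resolves the external links
# to main-ds1.split and main-ds2.split behind the scenes.
with h5py.File("main.h5", "r") as f:
    ds1 = f["ds1"][...]
    print(ds1.shape)
```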
This way, a change in ds1 will affect only the main-ds1.split file and main.h5 (the main file in which ds1 is linked), leaving main-ds2.split unchanged, so it does not need to be duplicated and replicated again to the storage remote.
To make manageability easier with DVC, we would like to propose a feature: DVC should be able to manage the dependent external-link files, together with the main file, as a single unit. For example, when the main file is added to DVC, it could automatically find the external links mounted in the main file and version the dependent split files associated with it.
One scenario is dvc add: a `dvc add main.h5` should automatically find the external-link files (h5py has APIs for this), create .dvc files for main-ds1.split and main-ds2.split, and prompt a git add of main.dvc, main-ds1.dvc, and main-ds2.dvc.
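A sketch of the discovery step, using h5py's link inspection (`get` with `getlink=True`); the dvc add integration itself is the proposed feature, so the subprocess call below is only illustrative of what DVC could do internally:

```python
import subprocess
import h5py

def external_link_files(group, found=None):
    """Recursively collect filenames referenced by external links."""
    if found is None:
        found = set()
    for name in group:
        link = group.get(name, getlink=True)
        if isinstance(link, h5py.ExternalLink):
            found.add(link.filename)  # e.g. "main-ds1.split"
        elif isinstance(group.get(name), h5py.Group):
            external_link_files(group[name], found)
    return found

with h5py.File("main.h5", "r") as f:
    targets = external_link_files(f)

# Version the main file and every linked split file together.
subprocess.run(["dvc", "add", "main.h5", *sorted(targets)], check=True)
```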
We seek discussion and feedback on this approach and on alternatives.