
Support for dataset level versioning for HDF5 files. #6660

Closed
annmary-roy opened this issue Sep 21, 2021 · 8 comments
Labels: awaiting response (we are waiting for your reply, please respond! :))

Comments

@annmary-roy

We are working in the area of AI for science.
HDF5 is one of the file formats used extensively by the scientific community to store data. Data can be organized inside an HDF5 file under different dataset objects, and a single HDF5 file can host multiple datasets.
While exploring DVC for the AI-for-science use case, we encountered a problem. Versioning in DVC happens at file granularity, and since these HDF5 files can be very large, this causes storage and network bottlenecks.
A small change inside any of the datasets of a file causes the entire file to be duplicated and replicated to the storage remote.
We have a potential solution for how this problem can be addressed for HDF5 files.
HDF5 offers a natural boundary of individual datasets inside the file. It also has an existing feature, external links, by which a dataset residing in a different file can be mounted into the file.
We are planning to split HDF5 files into separate files (using a plugin library) at dataset boundaries and to link these smaller files into the main file as external links. The plugin library can be used along with the main HDF5 library when the user wants dataset-level versioning.

This will allow versioning at a dataset boundary while still managing the data as a single unit (through the external links created inside the main file).
For example, if we previously had a single main.h5 file with datasets ds1 and ds2 residing inside it, we will now have main.h5, which will be a lightweight file, plus main-ds1.split and main-ds2.split. Datasets ds1 and ds2 inside main.h5 will point (through external links) to dataset ds1 in main-ds1.split and dataset ds2 in main-ds2.split respectively.
Any application can access datasets ds1 and ds2 through the main file, so the split is transparent to applications; they need not be aware of the underlying layout. All major operations such as reads and writes to HDF5 happen the same way for applications, as if it were one single file.
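To make the layout concrete, here is a minimal h5py sketch of what the split could look like. The file and dataset names mirror the example above; in practice the plugin library, not the user, would perform the split.

```python
import h5py
import numpy as np

# Each dataset lives in its own ".split" file.
with h5py.File("main-ds1.split", "w") as f:
    f.create_dataset("ds1", data=np.arange(10))
with h5py.File("main-ds2.split", "w") as f:
    f.create_dataset("ds2", data=np.ones((3, 3)))

# The lightweight main file only holds external links to the split files.
with h5py.File("main.h5", "w") as f:
    f["ds1"] = h5py.ExternalLink("main-ds1.split", "ds1")
    f["ds2"] = h5py.ExternalLink("main-ds2.split", "ds2")

# Applications read through main.h5 as if it were a single file.
with h5py.File("main.h5", "r") as f:
    print(f["ds1"][:])  # transparently resolved from main-ds1.split
```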

This way, a change in ds1 will affect only the main-ds1.split file and main.h5 (the main file in which ds1 is linked), leaving main-ds2.split unchanged, so it does not need to be duplicated and replicated to the storage remote again.

To make the manageability easier with DVC, we would like to propose a feature in DVC.
We would like DVC to have the capability to manage these dependent external links inside the main file, along with the main file, as a single unit: for example, by automatically finding the external links mounted in the main file and versioning the dependent split files associated with it when the main file is added to DVC.

One scenario is dvc add: a dvc add main.h5 should automatically find the external link files (h5py has APIs for this; see the sketch below), create .dvc files for main-ds1.split and main-ds2.split, and prompt a git add of main.dvc, main-ds1.dvc and main-ds2.dvc.
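A rough sketch of the link-discovery step, assuming only root-level external links as in the example above. The h5py calls exist today; the helper name and the DVC integration around it are only illustrative.

```python
import h5py

def external_link_targets(filename):
    """Return the files referenced by external links at the root of an HDF5 file."""
    targets = []
    with h5py.File(filename, "r") as f:
        for name in f:  # iterate root members without resolving the links
            link = f.get(name, getlink=True)
            if isinstance(link, h5py.ExternalLink):
                targets.append(link.filename)
    return targets

# For the example above this would return ['main-ds1.split', 'main-ds2.split'];
# dvc add main.h5 could then create a .dvc file for each of them.
print(external_link_targets("main.h5"))
```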

We seek discussion and feedback on this approach and on possible alternatives.

@dberenbaum
Collaborator

DVC has long had an open issue to add block-level versioning rather than file-level: #829. Although I can't commit to any timeline on this yet, it remains a high priority and is getting closer to reality since we have recently addressed prerequisites for it. Unfortunately, it's still too far away to recommend that you wait for it to be implemented, but it's close enough that I don't think we are likely to adopt some other solution specifically to handle HDF5 with external links.

Are you able or willing to put the external links into the same directory as main.h5? If so, you could track the whole directory with DVC. This approach tends to work for users with other partitioned data formats like parquet.

Happy to discuss further.

@annmary-roy
Author

annmary-roy commented Sep 23, 2021

Please have a look at these changes:
annmary-roy#1
The intention is to support dataset-level versioning for HDF5 in DVC with minimal changes/overhead to the DVC code, and to provide better manageability of these split files in DVC.
Versioning at a dataset boundary can be achieved when we track the whole folder, but when these HDF5 files are tracked individually, the user would have to add each split file separately.

@dberenbaum
Collaborator

Nice! So you have this working and are interested in seeing if it's something we would want to merge?

It looks like a good way of doing things for your use case, and it would be great to promote this to anyone else who's using HDF5 with external links.

TBH I'm still not sure about making it part of the main codebase as-is, since DVC is generally agnostic to file formats for data management. This could open up requests to support other types of metadata files. @efiop what do you think?

@efiop
Contributor

efiop commented Sep 23, 2021

Doesn't look like the linked PR is in any way complete and is more of a draft of how it could work with dvc add. 🙂

@annmary-roy It looks like the main problem you have with dvc is file-level granularity, right? If so, I would maybe wait for #829 to see how well it will work for hdf5 files by treating them just like binaries.

Just to clarify: if we are talking about dvc support for hdf5, we should treat it not as a storage format, but more like a mount point in the filesystem that virtually extends your workspace. In that case, the actual underlying data storage would happen in dvc with no hdf5-specific difference, but saving (aka dvc add) and linking (aka checkout) would have to know how to work with hdf5. In that sense, this reminds me of our experimental external outputs feature, which allows one to use clouds like s3/ssh/etc to extend their workspace. Unfortunately, that scenario is due for reconsideration, but local hdf5 could potentially be supported as well; we'll just need to carefully think through the scenario and the internal changes needed in dvc. E.g. from our local repo's perspective, a local hdf5 file is similar to what git submodules are for the git tree.

It is also clear that hdf5 support is not only relevant for data management (dvc add), but also for our pipelines, where one could want to depend on some particular path within the hdf5 file, instead of depending on it as a whole. This case seems to also be solvable by the "mountpoint" approach I've mentioned before.

Also, it looks like fsspec (the fs backend that we use for our cloud and git operations) could support hdf5 (though there are some details to figure out: fsspec/filesystem_spec#5 (comment)), as it already seems to support zarr (I see some hdf5 mentions in https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/reference.html?highlight=hdf5 , but I'm not sure whether it actually supports it or not). So an fsspec fs implementation for hdf5 is likely going to be a prerequisite for any other integration activities. For example, in your PR you are manually iterating through the contents of the hdf5 file, but it is much more convenient and powerful to view it as a fs and be able to use things like fs.glob.
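Something along these lines, just as a rough illustration with plain h5py (not fsspec itself; the helper names, file name, and pattern are made up):

```python
import fnmatch
import h5py

def h5_paths(filename):
    """List all group/dataset paths reachable inside an HDF5 file.

    Note: h5py's visit() follows hard links only, so externally linked
    datasets (as in the split-file layout above) are not traversed here.
    """
    paths = []
    with h5py.File(filename, "r") as f:
        f.visit(lambda name: paths.append("/" + name))
    return paths

def h5_glob(filename, pattern):
    """Glob over object paths, e.g. h5_glob('data.h5', '/*/weights')."""
    return [p for p in h5_paths(filename) if fnmatch.fnmatch(p, pattern)]
```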

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Sep 23, 2021
@suparna-bhattacharya

> Doesn't look like the linked PR is in any way complete and is more of a draft of how it could work with dvc add. 🙂
>
> @annmary-roy It looks like the main problem you have with dvc is file-level granularity, right? If so, I would maybe wait for #829 to see how well it will work for hdf5 files by treating them just like binaries.

Could you share a reference to any work-in-progress code or design for #829?

@efiop
Contributor

efiop commented Sep 28, 2021

@suparna-bhattacharya There is none, but we plan on working on it in the future. It will likely be very similar to git packs.

@annmary-roy
Author

For the time being we are planning to go ahead with the folder-based approach, where we can put all the split files into a single directory and do an explicit dvc add on the folder and main.h5. An fsspec implementation for HDF5 is something we may consider later.

@efiop
Contributor

efiop commented Oct 8, 2021

Converting to discussion since this is not actionable yet

@iterative iterative locked and limited conversation to collaborators Oct 8, 2021
@efiop efiop closed this as completed Oct 8, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
