HDF5 Cloud Storage #634
Comments
Hey! We definitely have interest in cloud-based storage, but so far have been largely eyeing Zarr and possibly Arrow for this. This looks interesting. My initial thoughts compared to the other formats:
Strengths:
Weaknesses:
Needs investigation:
Could be of interest to @ryan-williams, @ilan-gold, @joshua-gould |
@ivirshup I haven't had time to dig into this on the AnnData side yet, but a good starting point here for remote data (without switching to a new hdf5 reader, which is what this issue seems to be about) would be to literally just try passing in a URL to a zarr-backed AnnData store to the So for example, if you have an AnnData store you could do
adata.write_zarr('path/to/local/my_store.zarr')
Then in a shell (since I don't know offhand in
gsutil cp -r path/to/local/my_store.zarr gs://my_bucket/
And then
import anndata as ad
import aiohttp, requests, zarr, fsspec
adata_cloud = ad.read_zarr(my_google_url)
This doesn't seem to work for my examples (I get an empty Just wanted to chime in. The TL;DR is basically that zarr supports remote data in theory, and my inability to get it working here probably has more to do with my inexperience with zarr than anything AnnData-specific. I'll try to look into this more. |
It seems like it's more or less a drop-in replacement. I was a bit confused by the dispatching logic at first, but my naive attempt seems to work: HDFGroup@1a4833f, at least for something like:
I use the "hdf5://" prefix on the filename to indicate that this is meant to be written to the server rather than a local HDF5 file. After this runs I can do:
I haven't tried running any of the benchmarks, but it would likely be significantly slower than writing to local HDF5 files, since there's extra latency in making off-box requests to the server. The benefit is that you can target AWS S3, Azure Blob, or other object storage systems. Since the server (HSDS) mediates access to the storage system, users don't need credentials for a cloud provider, just a username/password with the service. Also, if the client is running outside the cloud provider, there's less data movement, since only the actual read and write selections need to be transferred (rather than entire files). You do need to have the service running; it can be set up on Docker or Kubernetes, and it is fairly easy to install and to scale up or down based on usage requirements. Let me know if this seems interesting to anyone. If so, I'd be happy to flesh out the h5pyd integration. |
Sorry for the long response time here! I would really like to see this functionality integrated upstream. Is the goal here for Getting |
No problem! Thanks for getting back on this. The goal of HSDS is to support the use of HDF in a cloud-native context. This means having a REST-based API, the ability to run in distributed systems (e.g. Kubernetes), to scale dynamically, and to work well with object storage backends (e.g. S3). Since it's based on the HDF5 data model, it's relatively simple for h5pyd to support most of h5py's API. So it's something less than an entirely new backend for anndata - see the PR. We pull in both the h5py and h5pyd packages and then tweak the dispatch logic based on whether we are dealing with HDF5 files or an HSDS server. BTW, I've been thinking about sparse data support in HSDS recently, and judging from the example above, that's important for AnnData. It would be interesting to think about sparse-specific methods in h5pyd/HSDS. (Of course that would make the backend logic more complicated, but I don't think we'll see sparse data methods in h5py anytime soon.) |
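For anyone following along, here is a rough sketch of what that kind of dispatch could look like, keyed on the "hdf5://" prefix mentioned earlier in the thread. This is not the code from the PR: `backend_for` and `open_h5` are hypothetical helper names made up for illustration, and the only library calls assumed are `h5py.File(name, mode)` and `h5pyd.File(domain, mode)`.

```python
HSDS_PREFIX = "hdf5://"  # prefix used in this thread to mark HSDS domains


def backend_for(filename) -> str:
    """Pick a backend name: 'h5pyd' for HSDS domains, 'h5py' for local files.

    Hypothetical helper for illustration; not part of anndata, h5py, or h5pyd.
    """
    return "h5pyd" if str(filename).startswith(HSDS_PREFIX) else "h5py"


def open_h5(filename, mode="r", **kwargs):
    """Open a local HDF5 file or an HSDS domain through the matching package.

    Imports are done lazily so h5pyd is only required when an HSDS
    domain is actually requested.
    """
    if backend_for(filename) == "h5pyd":
        import h5pyd  # talks to the HSDS server over REST

        return h5pyd.File(str(filename), mode, **kwargs)

    import h5py  # regular file on a local (POSIX) filesystem

    return h5py.File(filename, mode, **kwargs)
```

Usage would then be something like `open_h5("hdf5://home/test_user/adata.h5ad", "r")` for a server-side domain (that path is a made-up example) or `open_h5("adata.h5ad", "r")` for a local file. Both return objects with a largely h5py-compatible interface, which is why the rest of the read/write code can stay mostly unchanged.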
Hi, I was wondering if there is any update on this issue. I have a big file on |
@djarecka found the
@Koncopd said in #1322 (comment) that s3fs works well. I think we should
|
Right now, you can do:
import anndata as ad
import h5py
import remfile
ADATA_URI = "https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad"
file_h5 = h5py.File(remfile.File(ADATA_URI), "r")
# Read the whole file
adata = ad.experimental.read_elem(file_h5)
# Read the file like "backed"
# This is specialized to X, but you could put the `SparseDataset` or even `h5py.Dataset` anywhere in the object
def read_w_sparse_dataset(group: "h5py.Group | zarr.Group") -> ad.AnnData:
    return ad.AnnData(
        X=ad.experimental.sparse_dataset(group["X"]),
        **{
            k: ad.experimental.read_elem(group[k]) if k in group else {}
            for k in ["layers", "obs", "var", "obsm", "varm", "uns", "obsp", "varp"]
        }
    )
adata = read_w_sparse_dataset(file_h5)
adata.X
# CSRDataset: backend hdf5, shape (131212, 32285), data_dtype float32
See also |
Was just looking into the |
Is there interest in storing data in the cloud, e.g. using AWS S3? With HDF5 this is problematic, since h5py requires a POSIX-based filesystem. I maintain the h5pyd project (https://github.com/HDFGroup/h5pyd), which gets around this by providing an h5py-compatible API to a sharded data store (similar to zarr). I think it should be possible to have h5ad support either h5py or h5pyd, but I first wanted to gauge interest in this approach.
Thanks!