(feat): custom reopen with read_elem_as_dask for remote h5ad #1665

Draft
wants to merge 2 commits into
base: main

Conversation

ilan-gold
Contributor

@ilan-gold ilan-gold commented Sep 4, 2024

Not sure about the typing on reopen (with Generator[...] as before, it was complaining, but a normal Callable with return felt wrong); however, the following seems to work:

import dask.distributed as dd
import remfile
import h5py
from anndata.experimental import read_elem_as_dask

# optionally log all requests to see roughly that this is lazy https://stackoverflow.com/a/24588289/8060591
import logging
import contextlib
from http.client import HTTPConnection

def debug_all_requests():

    HTTPConnection.debuglevel = 1
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True

def debug_no_requests():

    HTTPConnection.debuglevel = 0
    root_logger = logging.getLogger()
    root_logger.setLevel(logging.WARNING)
    root_logger.handlers = []
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.WARNING)
    requests_log.propagate = False

ADATA_URI = "https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/expression_matrices/WMB-10Xv2/20230630/WMB-10Xv2-TH-log2.h5ad"

cluster = dd.LocalCluster(n_workers=1, threads_per_worker=1)
client = dd.Client(cluster)

file_h5 = h5py.File(remfile.File(ADATA_URI), "r")

def reopen():
    # each call opens a fresh handle to the remote file and yields its "X" element
    yield h5py.File(remfile.File(ADATA_URI), "r")["X"]

elem = read_elem_as_dask(file_h5["X"], reopen=reopen)
elem[:20, :].compute()

The question is whether or not we'd want to internalize this. My guess is "no": I think we should only promise support for core zarr/hdf5. The flip side, then, is how we test this. Good questions to grapple with!

@ilan-gold ilan-gold added this to the 0.12.0 milestone Sep 4, 2024
@ilan-gold ilan-gold changed the title (feat): allow for custom reopen with read_elem_as_dask (feat): custom reopen with read_elem_as_dask for remote h5ad Sep 4, 2024
@ilan-gold
Contributor Author

I should note that I could not figure out how to extract the remfile.RemFile.RemFile object out of the h5py.File object. If I could, maybe no need for a reopen param...


codecov bot commented Sep 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.40%. Comparing base (d7643e9) to head (abc13bd).
Report is 25 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1665      +/-   ##
==========================================
- Coverage   86.87%   84.40%   -2.47%     
==========================================
  Files          39       39              
  Lines        6033     6033              
==========================================
- Hits         5241     5092     -149     
- Misses        792      941     +149     
Files with missing lines Coverage Δ
src/anndata/_io/specs/lazy_methods.py 100.00% <100.00%> (ø)
src/anndata/_io/specs/registry.py 95.45% <100.00%> (-0.57%) ⬇️

... and 7 files with indirect coverage changes

- ) -> Generator[StorageType, None, None]:
+ ) -> Callable[[], Iterator[StorageType]]:
Member

@flying-sheep flying-sheep Sep 5, 2024


The previous type was correct, the new one isn’t.

The way decorators interact with typing is that you normally type the decorated function (e.g. if it contains yield, the return type is Generator). The decorator then transforms the function from whatever it is to whatever the decorator wants.

I.e.

@contextmanager
def foo(*args: Unpack[Args]) -> Generator[Ret, None, None]: ...

is the same as

def _foo(*args: Unpack[Args]) -> Generator[Ret, None, None]: ...

foo: Callable[Args, AbstractContextManager[Ret]] = contextmanager(_foo)

(I’m not 100% sure I got the “unpack” syntax right, but you know what I mean)
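A minimal runnable sketch of the equivalence described above, using a plain `str` argument instead of `Unpack[Args]` to sidestep the syntax question. The names `_open_resource` and `open_resource` are hypothetical and are not part of anndata; the point is only that the undecorated function is typed as a `Generator`, while the decorated name is a callable returning a context manager.

```python
from collections.abc import Callable, Generator
from contextlib import AbstractContextManager, contextmanager


def _open_resource(name: str) -> Generator[str, None, None]:
    # The undecorated function contains `yield`, so it is annotated as a Generator.
    yield f"handle:{name}"


# Applying the decorator by hand: the result is a callable that
# returns a context manager wrapping the yielded value.
open_resource: Callable[[str], AbstractContextManager[str]] = contextmanager(
    _open_resource
)

with open_resource("X") as handle:
    result = handle  # the value produced by `yield`
```

Writing `@contextmanager` above `_open_resource` would bind the same kind of object to the function's name, which is why the annotation on the `def` itself stays `Generator[...]`.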

@@ -67,13 +67,18 @@ def make_dask_chunk(
*,
wrap: Callable[[ArrayStorageType], ArrayStorageType]
| Callable[[H5Group | ZarrGroup], _CSRDataset | _CSCDataset] = lambda g: g,
reopen: None | Callable[[], Iterator[StorageType]] = None,
Member


So the idea is “reopen is a callable that can be transformed into a context manager using contextlib.contextmanager”.

Why not just “reopen is a callable that returns a contextlib.AbstractContextManager[StorageType]”?
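To make the two API shapes in this comment concrete, here is a hedged sketch that uses an in-memory dict as a stand-in for the storage backend. Everything in it (`FAKE_STORE`, `read_with_generator_reopen`, `read_with_cm_reopen`) is hypothetical and only illustrates the difference; it is not the code in this PR.

```python
from collections.abc import Callable, Iterator
from contextlib import AbstractContextManager, contextmanager

FAKE_STORE = {"X": [1, 2, 3, 4]}  # stand-in for an h5py/zarr file
events: list[str] = []  # records open/close calls for illustration


def reopen_generator() -> Iterator[list[int]]:
    events.append("open")
    try:
        yield FAKE_STORE["X"]
    finally:
        events.append("close")


def read_with_generator_reopen(
    reopen: Callable[[], Iterator[list[int]]],
) -> list[int]:
    # Shape A: the caller passes a generator function and the
    # library wraps it with contextlib.contextmanager itself.
    with contextmanager(reopen)() as data:
        return data[:2]


def read_with_cm_reopen(
    reopen: Callable[[], AbstractContextManager[list[int]]],
) -> list[int]:
    # Shape B: the caller passes something that already returns
    # a context manager; the library just enters it.
    with reopen() as data:
        return data[:2]


first = read_with_generator_reopen(reopen_generator)
second = read_with_cm_reopen(contextmanager(reopen_generator))
```

With shape B the caller can still write a generator and decorate it, but can equally pass any other context-manager factory, so the parameter's type says exactly what the library relies on.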

@ivirshup
Member

ivirshup commented Sep 5, 2024

Do you know what the difference is between what remfile and fsspec are doing?

@ilan-gold
Contributor Author

@ivirshup My understanding is that fsspec + kerchunk requires writing a separate JSON index file: https://fsspec.github.io/kerchunk/reference.html#kerchunk.hdf.SingleHdf5ToZarr. remfile reads directly.

@ilan-gold
Contributor Author

@ivirshup any thoughts on this for 0.12?


Benchmark changes

| Change | Before [d7643e9] <0.11.0rc2~16> | After [abc13bd] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 1.55±0.01ms | 1.37±0.1ms | 0.88 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), '0:1000') |

Comparison: https://github.com/scverse/anndata/compare/d7643e966b7cfaf8f5c732f1f020b0674db1def9..abc13bd785ab1049c17443c620fe0fad276fd4f9
Last changed: Thu, 17 Oct 2024 13:59:37 +0000

More details: https://github.com/scverse/anndata/pull/1665/checks?check_run_id=31678498264

3 participants