(performance): speedup for zarr-based sparse indexing #1790

Merged
ilan-gold merged 12 commits into main from ig/zarr_sparse_speedup
Dec 10, 2024

ilan-gold
Contributor

import scipy.sparse as sp
import numpy as np
import anndata as ad
import zarr
from pathlib import Path

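data_path = Path('data/foo.zarr')
if not data_path.exists():
    # One-time setup: write a large random CSR matrix to a zarr store.
    z = zarr.open_group('data/foo.zarr', mode='w')
    arr = sp.random(1_000_000, 10_000, format="csr", random_state=np.random.default_rng())
    ad.io.write_elem(z, "X", arr)
z = zarr.open_group('data/foo.zarr', mode='r')

# Wrap the on-disk matrix lazily, then index 1024 sorted random rows.
adata = ad.AnnData(X=ad.io.sparse_dataset(z["X"]))
idx = np.random.randint(0, adata.shape[0], 1024)
idx.sort()
# %time is an IPython magic, so run this in IPython or Jupyter.
%time adata[idx].X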

On my computer, this is a 3-5X speedup.

  • Closes #
  • Tests added
  • Release note added (or unnecessary)

scverse-benchmark bot commented Dec 8, 2024

Benchmark changes

| Change | Before [d055963] | After [e45a05f] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 1.8306233062330624 | 1.193271083444494 | 0.65 | readwrite.H5ADReadSuite.track_read_full_memratio('pbmc3k') |
| - | 511±10ms | 16.8±0.7ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), ':9000:-1') |
| - | 2.47±0.01s | 78.1±3ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), '::-2') |
| - | 521±3ms | 48.5±4ms | 0.09 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), 'alternating') |
| - | 507±10ms | 17.7±0.9ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), 'arange') |
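The Ratio column in the benchmark rows above appears to be After divided by Before, rounded to two decimals, so values below 1 mean the new commit is faster (or uses less memory). The sketch below checks that reading against the reported numbers; the helper name `ratio` is hypothetical, and the `2.47 s` entry is converted to milliseconds by hand.

```python
def ratio(before: float, after: float) -> float:
    """After/Before, rounded to two decimals (assumed meaning of the Ratio column)."""
    return round(after / before, 2)


# (before, after) central values copied from the benchmark rows, in ms
# except for the unitless memory ratio in the first entry.
reported = {
    "track_read_full_memratio('pbmc3k')": (1.8306233062330624, 1.193271083444494),
    "time_getitem ':9000:-1'": (511.0, 16.8),
    "time_getitem '::-2'": (2470.0, 78.1),  # 2.47 s expressed in ms
    "time_getitem 'alternating'": (521.0, 48.5),
    "time_getitem 'arange'": (507.0, 17.7),
}

computed = {name: ratio(before, after) for name, (before, after) in reported.items()}
for name, value in computed.items():
    print(f"{name}: {value}")
```

Under that reading, a ratio of 0.03 corresponds to roughly a 30x reduction in time for those slice patterns.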

Comparison: https://github.com/scverse/anndata/compare/d055963dc70915ad921965a03d2b7342a098dd6b..e45a05f785aa361068c06b92a3070190fd5ed25f
Last changed: Tue, 10 Dec 2024 14:35:46 +0000

More details: https://github.com/scverse/anndata/pull/1790/checks?check_run_id=34197556781

@ilan-gold
Contributor Author

Very good :)

@ilan-gold changed the title from "(feat): speedup for zarr-based sparse indexing" to "(performance): speedup for zarr-based sparse indexing" on Dec 8, 2024

codecov bot commented Dec 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.57%. Comparing base (d055963) to head (e45a05f).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1790      +/-   ##
==========================================
- Coverage   87.04%   84.57%   -2.47%     
==========================================
  Files          40       40              
  Lines        6089     6095       +6     
==========================================
- Hits         5300     5155     -145     
- Misses        789      940     +151     
Files with missing lines Coverage Δ
src/anndata/_core/sparse_dataset.py 93.63% <100.00%> (+0.12%) ⬆️

... and 8 files with indirect coverage changes

Member

@flying-sheep flying-sheep left a comment

Great! For clarity, wherever possible you should unpack each entry of indptr_list into start, end instead of using s[0] and s[1].

You could either do indptr_limits = [(x.indptr[s.start], x.indptr[s.stop + 1]) for s in slices] or so, or just define indptr_slices as above.
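To illustrate the suggestion, here is a minimal sketch of the unpacking style on an in-memory SciPy CSR matrix. It is not the code from this PR: the function name `rows_from_slices` is hypothetical, and it assumes half-open row slices, so the end offset is `x.indptr[s.stop]` (the `s.stop + 1` in the comment would apply if `stop` were inclusive).

```python
import numpy as np
import scipy.sparse as sp


def rows_from_slices(x: sp.csr_matrix, slices: list[slice]) -> sp.csr_matrix:
    """Gather the rows covered by half-open row slices from a CSR matrix."""
    # One (start, end) pair of offsets into x.data / x.indices per row slice.
    indptr_limits = [(x.indptr[s.start], x.indptr[s.stop]) for s in slices]
    # Unpack each pair by name instead of indexing it with [0] and [1].
    data = np.concatenate([x.data[start:end] for start, end in indptr_limits])
    indices = np.concatenate([x.indices[start:end] for start, end in indptr_limits])
    # Rebuild indptr from the number of stored values in each selected row.
    row_nnz = np.concatenate(
        [np.diff(x.indptr[s.start : s.stop + 1]) for s in slices]
    )
    indptr = np.concatenate([[0], np.cumsum(row_nnz)])
    return sp.csr_matrix((data, indices, indptr), shape=(len(row_nnz), x.shape[1]))


# Small demo: three contiguous runs of rows from a random matrix.
x = sp.random(20, 8, density=0.3, format="csr", random_state=np.random.default_rng(0))
slices = [slice(2, 5), slice(9, 10), slice(15, 18)]
result = rows_from_slices(x, slices)
expected = x[np.r_[2:5, 9:10, 15:18]]
print(np.array_equal(result.toarray(), expected.toarray()))
```

Naming the two offsets at the point of use makes it obvious which one bounds the read, which is the readability gain the review comment asks for.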

@ilan-gold enabled auto-merge (squash) December 10, 2024 14:35
@ilan-gold merged commit df213f6 into main Dec 10, 2024
16 checks passed
@ilan-gold deleted the ig/zarr_sparse_speedup branch December 10, 2024 14:53
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Dec 10, 2024
flying-sheep pushed a commit that referenced this pull request Dec 10, 2024
…-based sparse indexing) (#1801)

Co-authored-by: Ilan Gold <[email protected]>
2 participants