(performance): speedup for `zarr`-based sparse indexing #1790

ilan-gold · 2024-12-08T13:15:10Z

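import scipy.sparse as sp
import numpy as np
import anndata as ad
import zarr
from pathlib import Path

data_path = Path('data/foo.zarr')
# Write the test matrix once; later runs reuse the store on disk.
if not data_path.exists():
    z = zarr.open_group('data/foo.zarr', mode='w')
    # 1,000,000 x 10,000 random CSR matrix (scipy's default density is 0.01)
    arr = sp.random(1_000_000, 10_000, format="csr", random_state=np.random.default_rng())
    ad.io.write_elem(z, "X", arr)
z = zarr.open_group('data/foo.zarr', mode='r')

# Wrap the on-disk matrix lazily, then index 1024 random (sorted) rows.
adata = ad.AnnData(X=ad.io.sparse_dataset(z["X"]))
idx = np.random.randint(0, adata.shape[0], 1024)
idx.sort()
# `%time` is an IPython/Jupyter magic; it times a single run of the expression.
%time adata[idx].X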

On my computer, this is a 3-5X speedup.
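
The `%time` line in the snippet above only works inside IPython or Jupyter. Outside a notebook, a similar measurement can be sketched with the standard library. The helper name `time_call` is hypothetical (it is not part of anndata or this PR), and it reports the best of several runs rather than the single run that `%time` measures.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# With the objects from the snippet above, the call would look like:
#     elapsed = time_call(lambda: adata[idx].X)
# A stand-in workload keeps this sketch runnable on its own.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed * 1e3:.2f} ms")
```

Taking the minimum over a few repeats reduces noise from caching and other processes, which matters when comparing timings before and after a change.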

Closes #
Tests added
Release note added (or unnecessary)

scverse-benchmark · 2024-12-08T13:26:50Z

Benchmark changes

| Change | Before [`d055963`] | After [`e45a05f`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 1.8306233062330624 | 1.193271083444494 | 0.65 | readwrite.H5ADReadSuite.track_read_full_memratio('pbmc3k') |
| - | 511±10ms | 16.8±0.7ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), ':9000:-1') |
| - | 2.47±0.01s | 78.1±3ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), '::-2') |
| - | 521±3ms | 48.5±4ms | 0.09 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), 'alternating') |
| - | 507±10ms | 17.7±0.9ms | 0.03 | sparse_dataset.SparseCSRContiguousSlice.time_getitem((10000, 10000), 'arange') |

Comparison: https://github.com/scverse/anndata/compare/d055963dc70915ad921965a03d2b7342a098dd6b..e45a05f785aa361068c06b92a3070190fd5ed25f
Last changed: Tue, 10 Dec 2024 14:35:46 +0000

More details: https://github.com/scverse/anndata/pull/1790/checks?check_run_id=34197556781
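
For readers who want to check the table, the Ratio column appears to be the "after" value divided by the "before" value, so a ratio of 0.03 corresponds to roughly a 30x speedup. The sketch below recomputes this from the central timings reported for the `time_getitem` benchmarks (error margins ignored, and `2.47s` converted to milliseconds). The dictionary keys are just the benchmark parameter labels.

```python
# (before, after) timings in milliseconds, copied from the benchmark table.
timings_ms = {
    "':9000:-1'": (511.0, 16.8),
    "'::-2'": (2470.0, 78.1),
    "'alternating'": (521.0, 48.5),
    "'arange'": (507.0, 17.7),
}

# Ratio as reported in the table (after / before) and the equivalent speedup factor.
ratios = {name: after / before for name, (before, after) in timings_ms.items()}
speedups = {name: before / after for name, (before, after) in timings_ms.items()}

for name in timings_ms:
    print(f"{name:>14}: ratio {ratios[name]:.2f}, about {speedups[name]:.0f}x faster")
```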

ilan-gold · 2024-12-08T13:41:51Z

Very good :)

.github/workflows/test-gpu.yml

codecov · 2024-12-08T13:57:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.57%. Comparing base (d055963) to head (e45a05f).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1790      +/-   ##
==========================================
- Coverage   87.04%   84.57%   -2.47%     
==========================================
  Files          40       40              
  Lines        6089     6095       +6     
==========================================
- Hits         5300     5155     -145     
- Misses        789      940     +151

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/anndata/_core/sparse_dataset.py | `93.63% <100.00%> (+0.12%)` | ⬆️ |

... and 8 files with indirect coverage changes

flying-sheep

Great! I think for clarity, wherever possible, you should unpack an entry of `indptr_list` into `start, end` instead of using `s[0]` and `s[1]`.

You could either do `indptr_limits = [(x.indptr[s.start], x.indptr[s.stop + 1]) for s in slices]` or so, or just define `indptr_slices` as above.
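
The suggestion above can be illustrated with a minimal sketch on a toy CSR layout stored in plain lists. The data are made up, and this is not the code from `sparse_dataset.py`. It assumes half-open row slices, in which case the end offset for a slice is `indptr[s.stop]`; the exact offset depends on how `slices` is defined in the PR.

```python
# Toy CSR matrix with 4 rows, stored as plain lists (illustrative data only).
# Row i owns data[indptr[i]:indptr[i + 1]].
indptr = [0, 2, 3, 3, 6]
data = [10, 11, 20, 40, 41, 42]

# Row ranges to read, as half-open slices: rows 0-1 and row 3.
slices = [slice(0, 2), slice(3, 4)]

# One (start, end) pair of offsets into `data` per row slice.
indptr_limits = [(indptr[s.start], indptr[s.stop]) for s in slices]

# Unpacking into named variables reads more clearly than limits[0] / limits[1].
chunks = [data[start:end] for start, end in indptr_limits]

print(indptr_limits)  # [(0, 3), (3, 6)]
print(chunks)         # [[10, 11, 20], [40, 41, 42]]
```

Keeping the limits in their own structure means the loop body only ever sees `start` and `end`, which is the readability gain the review comment asks for.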

src/anndata/_core/sparse_dataset.py


ilan-gold added 2 commits December 8, 2024 13:59

(feat): speedup for zarr-based sparse indexing

5ead0b4

(chore): release note

b3be61a

ilan-gold added this to the 0.11.2 milestone Dec 8, 2024

ilan-gold added performance 🐌 backend: zarr type: sparse 🫥 run-gpu-ci benchmark labels Dec 8, 2024

(fix): no zarr mod

fb93662

github-actions bot removed the run-gpu-ci label Dec 8, 2024

(fix): constraints for gpu

ce9651e

ilan-gold commented Dec 8, 2024

.github/workflows/test-gpu.yml (outdated, resolved)

ilan-gold requested a review from flying-sheep December 8, 2024 13:42

ilan-gold added the run-gpu-ci label Dec 8, 2024

ilan-gold changed the title ~~(feat): speedup for zarr-based sparse indexing~~ (performance): speedup for zarr-based sparse indexing Dec 8, 2024

github-actions bot removed the run-gpu-ci label Dec 8, 2024

(fix): not one line

158a0f6

ilan-gold added the run-gpu-ci label Dec 8, 2024

github-actions bot removed the run-gpu-ci label Dec 8, 2024

flying-sheep requested changes Dec 10, 2024

src/anndata/_core/sparse_dataset.py (outdated, resolved)

src/anndata/_core/sparse_dataset.py (outdated, resolved)

ilan-gold added 2 commits December 10, 2024 13:13

(chore): indptr_list -> indptr_indices

881975d

(refactor): use second data structure for limits

fcd8a17

ilan-gold added the skip-gpu-ci label Dec 10, 2024

Merge branch 'main' into ig/zarr_sparse_speedup

75bc81e

ilan-gold requested a review from flying-sheep December 10, 2024 13:39

ilan-gold added 2 commits December 10, 2024 14:40

(fix): remove copy wierdness

5e97b5e

(fix): other erroneous diff

7eba01c

fmt

d252557

flying-sheep reviewed Dec 10, 2024

src/anndata/_core/sparse_dataset.py (outdated, resolved)

flying-sheep approved these changes Dec 10, 2024

(chore): clean up offsets creation

e45a05f

flying-sheep approved these changes Dec 10, 2024

ilan-gold enabled auto-merge (squash) December 10, 2024 14:35

ilan-gold merged commit df213f6 into main Dec 10, 2024
16 checks passed

ilan-gold deleted the ig/zarr_sparse_speedup branch December 10, 2024 14:53

meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Dec 10, 2024

Backport PR scverse#1790: (performance): speedup for zarr-based spa…

2ab87c4

…rse indexing

meeseeksmachine mentioned this pull request Dec 10, 2024

Backport PR #1790 on branch 0.11.x ((performance): speedup for zarr-based sparse indexing) #1801

Merged

flying-sheep pushed a commit that referenced this pull request Dec 10, 2024

Backport PR #1790 on branch 0.11.x ((performance): speedup for zarr…

bd93c32

…-based sparse indexing) (#1801) Co-authored-by: Ilan Gold <[email protected]>

flying-sheep assigned ilan-gold Dec 12, 2024
