-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(feat): [WIP] support physical quantities via Pint #1481
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1481 +/- ##
==========================================
- Coverage 84.22% 84.19% -0.04%
==========================================
Files 35 35
Lines 5685 5699 +14
==========================================
+ Hits 4788 4798 +10
- Misses 897 901 +4
|
Hmm, generally units are a great idea. Two questions:
|
Thanks!
What I'd expect as an The utility of having multiple I referred to the suggestions in https://pint.readthedocs.io/en/stable/getting/pint-in-your-projects.html#having-a-shared-registry when considering how to handle needing an instantiated
The
Absolutely! The test code I implemented adds one scalar I believe this would even allow In [1]: import numpy as np
In [2]: from pint import UnitRegistry
In [3]: reg = UnitRegistry()
In [4]: coords_1 = np.array([[1, 2], [2, 1]]) * reg.mm
In [5]: coords_1
Out[5]:
array([[1, 2],
[2, 1]]) <Unit('millimeter')>
In [6]: coords_2 = np.array([[0, 3], [4, 0]]) * reg.um
In [7]: coords_2
Out[7]:
array([[0, 3],
[4, 0]]) <Unit('micrometer')>
In [8]: np.vstack([coords_1, coords_2])
Out[8]:
array([[1. , 2. ],
[2. , 1. ],
[0. , 0.003],
[0.004, 0. ]]) <Unit('millimeter')> (Whether or not it's useful to mix spatial coordinates between two different experiments, it's still an improvement to have that concatenation be unit-aware.) |
@flying-sheep After writing my comment and re-reading yours, I could be persuaded that My initial vision for this functionality was to make it easy on users to deal with physical quantities in I would be relatively annoyed if I couldn't compare those quantities easily as an
and I would probably be left with the impression that unit handling is an obstacle rather than an enhancement. |
Yeah, those are the considerations. Basically the big question is “shall the registry live in each AnnData object or be global” There’s also mixes possible, like having a context manager that overrides the (otherwise global) registry. @ivirshup @ilan-gold what do you think about this? |
Two main things:
|
Works for me. I'll make an issue when I get a chance, unless someone beats me to it.
As noted on https://spatialdata.scverse.org/en/latest/, SpatialData is under active development and doesn't yet claim to offer a stable API. We want to distribute the results of HuBMAP computational pipelines in a stable (enough) format, and 'plain' H5AD is how we're already publishing sc/snRNA-seq processing results. Our use case is sharing VIsium data in Using a library for handling physical units eliminates a whole class of bugs, and using Pint has simplified a lot of our code, including in handling of physical pixel sizes read from OME-XML metadata. I think there's value in allowing users to use |
@@ -47,6 +47,7 @@ dependencies = [ | |||
"exceptiongroup; python_version<'3.11'", | |||
"natsort", | |||
"packaging>=20.0", | |||
"pint", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might lean towards putting this in its own optional dependency section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flying-sheep thoughts?
_writer.write_elem(g, "magnitude", v.magnitude, dataset_kwargs=dataset_kwargs) | ||
_writer.write_elem(g, "units", str(v.units), dataset_kwargs=dataset_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the units should be written into the attrs
instead of being given their own array. It is just a short string, right? This would also make it easier for interoperability purposes. Arrays with this extra metadata wouldn't break existing readers.
uns={"size": 100 * ad.units["um"]}, | ||
obsm={"X_spatial": X_arr * ad.units.mm}, | ||
) | ||
write(tmp_path / name, adata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd follow the pattern from above, unless I'm missing a good reason not to i.e.,
read = lambda pth: getattr(ad, f"read_{diskfmt}")(pth)
write = lambda adata, pth: getattr(adata, f"write_{diskfmt}")(pth)
and the corresponding fixtures for the diskfmt
Hello!
The HuBMAP Consortium makes extensive use of AnnData, writing gene expression count matrices (and more) as
.h5ad
files for public download.As we're continuing to extend our computational pipelines to handle sequencing spatial transcriptomics with associated imaging data (e.g., Visium), with spatial coordinates registered to pixel coordinates, it's become clear that it would be very useful to store physical units directly in the AnnData/
.h5ad
structures. We've had a very pleasant experience using the Pint package in other contexts, and this package interoperates naturally with NumPy and similar (though not as nicely with SciPy sparse matrices, unfortunately).This draft PR implements serialization of
pint.Quantity
objects to H5AD and Zarr, separately writing the magnitude and unit of eachQuantity
, and then instantiating a newQuantity
object when reading from disk.There are a few caveats to this -- the
pint
package allows instantiating multipleUnitRegistry
objects, and quantities associated with one registry cannot be compared to quantities from another registry. As such, theanndata
package instantiates aUnitRegistry
on import, accessible atanndata.units
and used to instantiateQuantity
objects when loading data. Users should almost definitely use thatanndata.units
registry when creating their ownQuantity
objects, to interoperate withQuantity
objects created by the package.This PR adds
pint
as an unconditional requirement foranndata
.I added a test method which demonstrates serialization and loading of
pint.Quantity
-wrapped arrays and scalars, which works as expected. I ran the full test suite on my machine and encountered no new failures (10 tests failed on currentmain
and on my branch, on macOS 14.4.1, and 29 failed on Ubuntu 22.04).This PR does not add documentation; I'm happy to do so but wanted to check whether this approach looked reasonable to everyone before expending that effort.