Unification of X and layers #244

flying-sheep · 2019-11-06T10:00:55Z

@gtca’s proposal in #706:

This is an implementation proposal draft for unifying .X and layers as mentioned e.g. in #244. Expanding on previous issues, this one tracks concrete implementation details for achieving the end goal.

At this stage, feedback on the topic is appreciated — either here or in #244.

.X attribute as a reference

The proposal is to use adata.X as a reference to one of the layers. In the text of the proposal below, *.X or *adata.X will denote the data in the layer that adata.X points to (akin to pointer dereferencing).

Benefits

Some benefits include:

Reduce the implementation complexity as .X and .layers items store conceptually identical entities.

Reduce memory footprint in common workflows. E.g. when having layers with spliced and unspliced counts, adata.X will just point to one of them instead of containing its copy.

Make it easier to implement idempotent APIs (to be expanded in other issues).

Options

1. Explicit layer name

.X links to an explicit layer. This reference can be changed by a function that is run on an AnnData object. E.g. a normalisation function can create a new layer and update the reference instead of modifying the values in-place. In-place APIs will continue to modify the values in the *.X.

When *.X is deleted, .X can be either None or, less favourably, point to the layer that was defined previously in case there's a memory of which layers were linked to in .X before.

2. Last layer

.X can link to adata.layers[list(adata.layers.keys())[-1]]. This is easier to reason about but might be an issue when there's no explicit order in layers (e.g. raw counts -> normalised counts -> scaled counts) and new layers are added (e.g. unspliced counts).

Additional methods

There should be a way to explicitly change which layer .X points to.

Alternatives

An alternative, as mentioned in #244, would be to have .X as a specific layer. While this is simpler in the short-term, that might not provide some of the benefits listed above.

Effect for the end user

Taking the current scanpy API as an example where the matrix is modified in-place, nothing changes for the end user:
sc.pp.scale(adata)
# modifies values stored in the layer
# that .X links to
On disk

For HDF5, a non-breaking change should be achievable with the HDF5 references functionality. It is to be checked that R (rhdf5, hdf5r) and Julia HDF5 libraries are able to work with references.

For Zarr, feedback is needed if this is possible. It is probably not an issue if it's not as Zarr storage for AnnData is less mature/common.

Potential

This is to be expanded in other issues to come.

Idempotent analysis API

In practice, the scaling example above typically looks like this together with its context:
adata.layers["lognorm"] = adata.X.copy()
sc.pp.scale(adata)
This feature would allow to implement idempotent APIs, i.e.
sc.pp.scale(adata, in_layer="lognorm", out_layer="scaled")
# .X now links to .layers["scaled"]
that would not modify counts, and, importantly, will not affect the end user API provided that the appropriate default arguments for in_layer and out_layer are chosen.

prior considerations by me:

X should become a layer in AnnData.layers. AnnData.X should just be a shortcut for that. Answering the questions in #242 (comment):

The key needs to be determined. I think None fits the idea “default” best, and I’ll use it as a strawman, but "X" and other options can be discussed.

Scanpy functions allowing to specify the layers they operate on should probably have the signature layers: Union[Literal['all'], Sequence[str]] = (None,) or ... = 'all', indicating operation on X or all layers.

Iteration over layers.items() should yield X as the first key-value pair (Accordingly for keys/values of course)

Disk representation: X should probably be stored in the layers group, if we can stomach the backwards compat code

Raw should get layers, see layers for .raw #96. I think sharing a base class between Raw and AnnData makes a lot of sense.

@ivirshup said “obs is to obsm as X is to layers”, but I don’t think so: obsm will e.g. also contain dimension reductions like X_pca, so I don’t think obsm[None] should be obs.

The text was updated successfully, but these errors were encountered:

flying-sheep · 2023-07-11T15:00:56Z

this might be a breaking change, depending on our approach

flying-sheep mentioned this issue Nov 6, 2019

v0.7 #171

Closed

8 tasks

ivirshup mentioned this issue Jan 11, 2022

concat when some values of X are None #678

Open

gtca mentioned this issue Feb 11, 2022

[RFC] Make .X a layer #706

Closed

gtca mentioned this issue Jun 15, 2022

Add counts layer automatically scverse/scanpy#2261

Open

1 task

github-actions bot added the stale label Jul 5, 2023

ivirshup added the enhancement label Jul 5, 2023

github-actions bot removed the stale label Jul 6, 2023

flying-sheep added the topic: api label Jul 10, 2023

scverse deleted a comment from github-actions bot Jul 10, 2023

flying-sheep added the breaking change ‼️ A breaking change that needs a major version update label Jul 11, 2023

flying-sheep added this to 2023-08-03 Scanpy tasks for Core Hackathon Jul 28, 2023

flying-sheep moved this to Todo in 2023-08-03 Scanpy tasks for Core Hackathon Jul 28, 2023

flying-sheep removed this from 2023-08-03 Scanpy tasks for Core Hackathon Jul 28, 2023

flying-sheep mentioned this issue Oct 12, 2023

Missing layers after conversion from SCE theislab/anndata2ri#136

Closed

grst mentioned this issue Nov 22, 2023

Support for Awkward Arrays in AnnData's X #1235

Open

This was referenced Oct 9, 2024

Unify X and layers #1707

Draft

Evaluate if suitable field exists in AnnData or if alternative is needed theislab/ehrdata#35

Closed

Support ellipsis indexing #1696

Closed

Master issue: store time series data theislab/ehrdata#24

Open

eroell mentioned this issue Oct 10, 2024

3D array: choose obs and var, show 3rd dimension as lineplot/scatterplot vitessce/vitessce#1954

Open

ilan-gold self-assigned this Oct 17, 2024

flying-sheep self-assigned this Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unification of X and layers #244

Unification of X and layers #244

flying-sheep commented Nov 6, 2019 •

edited

Loading

`.X` attribute as a reference

Benefits

Options

1. Explicit layer name

2. Last layer

Additional methods

Alternatives

Effect for the end user

On disk

Potential

Idempotent analysis API

flying-sheep commented Jul 11, 2023

Unification of X and layers #244

Unification of X and layers #244

Comments

flying-sheep commented Nov 6, 2019 • edited Loading

.X attribute as a reference

Benefits

Options

1. Explicit layer name

2. Last layer

Additional methods

Alternatives

Effect for the end user

On disk

Potential

Idempotent analysis API

flying-sheep commented Jul 11, 2023

flying-sheep commented Nov 6, 2019 •

edited

Loading

`.X` attribute as a reference