Update prose description of on-disk representation
* Added as a section to `docs/index.rst` so it gets rendered
* Fixed some formatting
* Commented out bash examples
* Editing for brevity and clarity
ivirshup committed Dec 9, 2019
1 parent 40d1d3b commit 87707a2
Showing 2 changed files with 92 additions and 101 deletions.
192 changes: 91 additions & 101 deletions docs/fileformat-prose.rst
@@ -1,19 +1,18 @@
Disk representation
-------------------
On-disk format
--------------

.. note::
these docs are written for anndata 0.7.
These docs are written for anndata 0.7.
Files written before this version may differ in some conventions, but will still be read by newer versions of the library.

AnnData objects are saved on disk to hierarchical array stores like `HDF5` and `zarr`.
AnnData objects are saved on disk to hierarchical array stores like `HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_ (via `h5py <http://docs.h5py.org/en/stable/>`_) and `Zarr <https://zarr.readthedocs.io/en/stable/>`_.
This allows us to have very similar structures on disk and in memory.

AnnData objects can hold three kinds of objects in its dimensioned mappings (i.e. `X`, `obsm`, `layers` etc.).
These are (1) dense arrays, (2) sparse arrays, and (3) data frames.
In general, `AnnData` objects can hold three kinds of values: (1) dense arrays, (2) sparse arrays, and (3) DataFrames.
As an example we’ll look into a typical `.h5ad` object that’s been through an analysis.
This structure should be largely equivalent to the Zarr structure, though there are a few minor differences.

.. note::
I’ve started using h5py since I couldn’t figure out a nice way to print attributes from bash.
.. I’ve started using h5py since I couldn’t figure out a nice way to print attributes from bash.
.. code:: python
@@ -22,48 +21,47 @@ As an example we’ll look into a typical `.h5ad` object that’s been through a
>>> list(f.keys())
['X', 'layers', 'obs', 'obsm', 'uns', 'var', 'varm']
.. code:: bash
.. .. code:: bash
$ h5ls 02_processed.h5ad
X Group
layers Group
obs Group
obsm Group
uns Group
var Group
varm Group
.. $ h5ls 02_processed.h5ad
.. X Group
.. layers Group
.. obs Group
.. obsm Group
.. uns Group
.. var Group
.. varm Group
Dense arrays
~~~~~~~~~~~~

Dense arrays have the simplest representation on disk, as they have clear equivalents in hdf5 and zarr.
Any dense array will be written to the file as a `dataset`.
We can see an example of this for principal components stored in the `obsm` group:
Dense arrays have the simplest representation on disk, as they have native equivalents in `HDF5 Datasets <http://docs.h5py.org/en/stable/high/dataset.html#>`_ and `Zarr Arrays <https://zarr.readthedocs.io/en/stable/tutorial.html#creating-an-array>`_.
We can see an example of this with dimensionality reductions stored in the `obsm` group:

.. code:: python
>>> f["obsm"].visititems(lambda k, v: print(f"{k}: {v}"))
X_pca: <HDF5 dataset "X_pca": shape (38410, 50), type "<f4">
X_umap: <HDF5 dataset "X_umap": shape (38410, 2), type "<f4">
>>> f["obsm"].visititems(print)
X_pca <HDF5 dataset "X_pca": shape (38410, 50), type "<f4">
X_umap <HDF5 dataset "X_umap": shape (38410, 2), type "<f4">
.. code:: bash
.. .. code:: bash
$ h5ls 02_processed.h5ad/obsm
X_pca Dataset {38410, 50}
X_umap Dataset {38410, 2}
.. $ h5ls 02_processed.h5ad/obsm
.. X_pca Dataset {38410, 50}
.. X_umap Dataset {38410, 2}
Sparse arrays
~~~~~~~~~~~~~

Sparse arrays don’t have a native representation in hdf5 or zarr, so we use a representation as close to the in memory datasets as we can.
Currently two sparse data formats are supported by `AnnData` objects.
These are CSC and CSR formats, where a two dimensional sparse array is represented by three one dimensional arrays, `indptr`, `indices`, and `data`.
A full description of this format is out of scope for this document, but is widely available (`wikipedia description`_)
Sparse arrays don’t have a native representation in HDF5 or Zarr, so we've defined our own based on their in-memory structure.
Currently two sparse data formats are supported by `AnnData` objects, CSC and CSR (corresponding to :class:`scipy.sparse.csc_matrix` and :class:`scipy.sparse.csr_matrix` respectively).
These formats represent a two-dimensional sparse array with three one-dimensional arrays, `indptr`, `indices`, and `data`.

.. note::

.. _wikipedia description: https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)
A full description of these formats is out of scope for this document, but one is `easy to find <https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)>`_.

On disk we represent a sparse array as a group.
The kind and shape of the sparse array can be identified by its attributes:
We represent a sparse array as a `Group` on-disk, where the kind and shape of the sparse array are defined in the `Group`'s attributes:

.. code:: python
@@ -76,105 +74,97 @@ Inside the group are the three constituent arrays:

.. code:: python
>>> f["X"].visititems(lambda k, v: print(f"{k}: {v}"))
data: <HDF5 dataset "data": shape (41459314,), type "<f4">
indices: <HDF5 dataset "indices": shape (41459314,), type "<i4">
indptr: <HDF5 dataset "indptr": shape (38411,), type "<i4">
>>> f["X"].visititems(print)
data <HDF5 dataset "data": shape (41459314,), type "<f4">
indices <HDF5 dataset "indices": shape (41459314,), type "<i4">
indptr <HDF5 dataset "indptr": shape (38411,), type "<i4">
.. code:: bash
.. .. code:: bash
$ h5ls 02_processed.h5ad/X
data Dataset {41459314/Inf}
indices Dataset {41459314/Inf}
indptr Dataset {38411/Inf}
.. $ h5ls 02_processed.h5ad/X
.. data Dataset {41459314/Inf}
.. indices Dataset {41459314/Inf}
.. indptr Dataset {38411/Inf}
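
To make the role of the three arrays concrete, here is a minimal pure-Python sketch of how row-compressed (CSR) arrays expand into a dense matrix.
The helper name `csr_to_dense` and the toy values are illustrative assumptions, not part of anndata; in practice one would more likely pass the same three arrays to :class:`scipy.sparse.csr_matrix`.

.. code:: python

    def csr_to_dense(data, indices, indptr, n_cols):
        """Expand CSR arrays into a list of dense rows (illustrative helper)."""
        n_rows = len(indptr) - 1  # indptr holds one boundary more than there are rows
        dense = [[0] * n_cols for _ in range(n_rows)]
        for row in range(n_rows):
            # The non-zero entries of this row live in data[start:stop],
            # and their column positions in indices[start:stop].
            start, stop = indptr[row], indptr[row + 1]
            for col, value in zip(indices[start:stop], data[start:stop]):
                dense[row][col] = value
        return dense

    # Toy 3 x 4 matrix with an empty middle row (made-up values).
    data = [1.0, 2.0, 3.0]
    indices = [0, 3, 1]
    indptr = [0, 2, 2, 3]
    dense = csr_to_dense(data, indices, indptr, n_cols=4)

Because `indptr` always has one more entry than the number of compressed rows, an `indptr` of length 38411 is what one would expect for 38410 observations if `X` is stored as CSR.
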
DataFrames
~~~~~~~~~~

Data frames are saved as a columnar format in a group, so each column of a dataframe gets its own dataset.
To maintain the efficiency of categorical values they are stored as their numeric codes with their values saved in a reserved subgroup `__categories`.
DataFrames are saved as a columnar format in a group, so each column of a DataFrame gets its own dataset.
To maintain efficiency with categorical values, only the numeric codes are stored for each row, while category values are saved in a reserved subgroup `__categories`.

DataFrames can be distinguished from other groups by their attributes:

.. code:: python
>>> dict(f["obs"].attrs)
{'_index': 'Cell',
'column-order': array(['sample', 'cell_type', 'n_genes_by_counts',
'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes',
'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes',
'total_counts_mito', 'log1p_total_counts_mito', 'pct_counts_mito',
'label_by_score'], dtype=object),
'encoding-type': 'dataframe',
'encoding-version': '0.1.0'}
>>> dict(f["obs"].attrs)
{'_index': 'Cell',
'column-order': array(['sample', 'cell_type', 'n_genes_by_counts',
'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts',
'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes',
'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes',
'total_counts_mito', 'log1p_total_counts_mito', 'pct_counts_mito',
'label_by_score'], dtype=object),
'encoding-type': 'dataframe',
'encoding-version': '0.1.0'}
These attributes identify the column used as an index, the order of the original columns, and some type information.

.. code:: python
>>> f["obs"].visititems(lambda k, v: print(f"{k}: {v}"))
Cell: <HDF5 dataset "Cell": shape (38410,), type "|O">
__categories: <HDF5 group "/obs/__categories" (3 members)>
__categories/cell_type: <HDF5 dataset "cell_type": shape (22,), type "|O">
__categories/label_by_score: <HDF5 dataset "label_by_score": shape (16,), type "|O">
__categories/sample: <HDF5 dataset "sample": shape (41,), type "|O">
cell_type: <HDF5 dataset "cell_type": shape (38410,), type "|i1">
label_by_score: <HDF5 dataset "label_by_score": shape (38410,), type "|i1">
log1p_n_genes_by_counts: <HDF5 dataset "log1p_n_genes_by_counts": shape (38410,), type "<f8">
log1p_total_counts: <HDF5 dataset "log1p_total_counts": shape (38410,), type "<f4">
log1p_total_counts_mito: <HDF5 dataset "log1p_total_counts_mito": shape (38410,), type "<f4">
n_genes_by_counts: <HDF5 dataset "n_genes_by_counts": shape (38410,), type "<i4">
pct_counts_in_top_100_genes: <HDF5 dataset "pct_counts_in_top_100_genes": shape (38410,), type "<f8">
pct_counts_in_top_200_genes: <HDF5 dataset "pct_counts_in_top_200_genes": shape (38410,), type "<f8">
pct_counts_in_top_500_genes: <HDF5 dataset "pct_counts_in_top_500_genes": shape (38410,), type "<f8">
pct_counts_in_top_50_genes: <HDF5 dataset "pct_counts_in_top_50_genes": shape (38410,), type "<f8">
pct_counts_mito: <HDF5 dataset "pct_counts_mito": shape (38410,), type "<f4">
sample: <HDF5 dataset "sample": shape (38410,), type "|i1">
total_counts: <HDF5 dataset "total_counts": shape (38410,), type "<f4">
total_counts_mito: <HDF5 dataset "total_counts_mito": shape (38410,), type "<f4">
Categorical series can be identified by the presence of the attribute `"categories"`, which contains a pointer to their categorical values:

*Note:* as `zarr` does not have reference objects, in zarr files the `categories` attribute is an absolute path to the category values.
>>> f["obs"].visititems(print)
Cell <HDF5 dataset "Cell": shape (38410,), type "|O">
__categories <HDF5 group "/obs/__categories" (3 members)>
__categories/cell_type <HDF5 dataset "cell_type": shape (22,), type "|O">
__categories/label_by_score <HDF5 dataset "label_by_score": shape (16,), type "|O">
__categories/sample <HDF5 dataset "sample": shape (41,), type "|O">
cell_type <HDF5 dataset "cell_type": shape (38410,), type "|i1">
label_by_score <HDF5 dataset "label_by_score": shape (38410,), type "|i1">
log1p_n_genes_by_counts <HDF5 dataset "log1p_n_genes_by_counts": shape (38410,), type "<f8">
...
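
The listing above appears to be sorted by name, so the original column order has to come from the group's attributes.
As a rough sketch of how a reader might use `_index` and `column-order`, the following uses plain dictionaries in place of the HDF5 group; the function name `ordered_columns` and the sample values are assumptions made for illustration only.

.. code:: python

    def ordered_columns(attrs, datasets):
        """Return the index and the columns in their original order (illustrative)."""
        index = datasets[attrs["_index"]]  # '_index' names the dataset holding the index
        # 'column-order' lists the remaining columns in the order they had in memory.
        columns = {name: datasets[name] for name in attrs["column-order"]}
        return index, columns

    # Stand-ins for f["obs"].attrs and the datasets inside f["obs"] (made-up values).
    attrs = {"_index": "Cell", "column-order": ["sample", "total_counts"]}
    datasets = {
        "Cell": ["AAAC", "AAAG"],
        "total_counts": [120.0, 85.0],
        "sample": [0, 1],
    }
    index, columns = ordered_columns(attrs, datasets)
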
Categorical Series can be identified by the presence of the attribute `"categories"`, which contains a pointer to the categories' values:

.. code:: python
>>> dict(f["obs/cell_type"].attrs)
{'categories': <HDF5 object reference>}
>>> dict(f["obs/cell_type"].attrs)
{'categories': <HDF5 object reference>}
Other values:
-------------
.. note::

In `zarr`, as there are no reference objects, the `categories` attribute is an absolute path to the category values.
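
As an illustration of the codes-plus-categories encoding, the sketch below maps stored integer codes back to their labels in pure Python.
The category labels are invented, and treating a code of `-1` as a missing value is an assumption borrowed from the pandas convention rather than something this document specifies.

.. code:: python

    def decode_categorical(codes, categories):
        """Map integer codes back to category values (illustrative helper)."""
        # Assumption: negative codes mark missing values, as in pandas.
        return [categories[code] if code >= 0 else None for code in codes]

    categories = ["B cell", "T cell", "monocyte"]  # hypothetical category values
    codes = [1, 1, 0, -1, 2]                       # one small integer per row
    values = decode_categorical(codes, categories)

Storing one small integer per row plus a short list of labels is what keeps a column like `cell_type` compact: the listing above shows 38410 one-byte codes against only 22 category values.
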

Other values
~~~~~~~~~~~~

Mappings
~~~~~~~~
^^^^^^^^

Mappings are stored as native groups in an `h5ad` file.
These can be identified as being separate from dataframes and sparse arrays since they don’t have any special attributes.
These are used for any `Mapping` in the AnnData object, including the default `obsm`, `varm`, `layers`, and `uns`.
This definition is used recursively within `uns`:
Mappings are simply stored as `Group` s on disk.
These are distinct from DataFrames and sparse arrays since they don’t have any special attributes.
A `Group` is created for any `Mapping` in the AnnData object, including the standard `obsm`, `varm`, `layers`, and `uns`.
Notably, this definition is used recursively within `uns`:

.. code:: python
>>> f["uns"].visititems(print)
...
pca <HDF5 group "/uns/pca" (2 members)>
pca/variance <HDF5 dataset "variance": shape (50,), type "<f4">
pca/variance_ratio <HDF5 dataset "variance_ratio": shape (50,), type "<f4">
...
>>> f["uns"].visititems(print)
...
pca <HDF5 group "/uns/pca" (2 members)>
pca/variance <HDF5 dataset "variance": shape (50,), type "<f4">
pca/variance_ratio <HDF5 dataset "variance_ratio": shape (50,), type "<f4">
...
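
The recursive rule can be sketched without any file at all: every nested mapping becomes a group, and every other value becomes a dataset.
The generator below is a hypothetical helper, not anndata's writer; it only lists the paths such a nested dictionary would produce, in the same style as the `visititems` output above.

.. code:: python

    def walk(mapping, prefix=""):
        """Yield (path, kind) pairs for a nested mapping (illustrative helper)."""
        for key in sorted(mapping):
            path = f"{prefix}{key}"
            value = mapping[key]
            if isinstance(value, dict):
                # A nested mapping is stored as a group, then visited recursively.
                yield path, "group"
                yield from walk(value, prefix=f"{path}/")
            else:
                yield path, "dataset"

    # Made-up stand-in for part of an AnnData object's `uns`.
    uns = {"pca": {"variance": [0.5, 0.25], "variance_ratio": [0.1, 0.05]}}
    paths = list(walk(uns))
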
Scalars
~~~~~~~
^^^^^^^

Zero dimensional arrays are used for scalar values (i.e. single values like strings, numbers or booleans).
These should only occur inside of `uns`, and are common inside of saved parameters:

.. code:: python
>>> f["uns/neighbors/params"].visititems(print)
method <HDF5 dataset "method": shape (), type "|O">
metric <HDF5 dataset "metric": shape (), type "|O">
n_neighbors <HDF5 dataset "n_neighbors": shape (), type "<i8">
>>> f["uns/neighbors/params"].visititems(print)
method <HDF5 dataset "method": shape (), type "|O">
metric <HDF5 dataset "metric": shape (), type "|O">
n_neighbors <HDF5 dataset "n_neighbors": shape (), type "<i8">
>>> f["uns/neighbors/params/metric"][()]
'euclidean'
>>> f["uns/neighbors/params/metric"][()]
'euclidean'
1 change: 1 addition & 0 deletions docs/index.rst
@@ -18,5 +18,6 @@ of data and learned annotations. It was initially built for
:hidden:

api
fileformat-prose
benchmarks
references
