5, maybe 10 minutes to AnnData

Last modified: November 15, 2021.

I regularly use Scanpy to analyze single-cell genomics data. Scanpy's functionality heavily depends on the data being stored in an AnnData object, which provides Scanpy a systematic way of storing and retrieving intermediate analysis results, like principal components scores, UMAP embeddings, cluster labels, etc. While the Scanpy documentation site has tutorials covering common use cases, there remains a lack of tutorials for AnnData specifically. Here I cover the basics.

Why AnnData?

AnnData ("Annotated Data") is specifically designed for tabular data. By this I mean that we have \(N\) observations (typically cells), each of which can be represented as a \(d\)-dimensional vector, where each dimension corresponds to a variable or feature (typically a gene). Both the rows and columns of this \(N \times d\) matrix are special in the sense that they are indexed. In scRNA-seq, each row corresponds to a cell with a barcode, and each column corresponds to one gene. Furthermore, for each cell and each gene we might have additional metadata, like (1) donor information for each cell, or (2) alternative gene symbols for each gene. Finally, we might have other unstructured metadata, like the color palettes we are using for plotting. Without going into every fancy Python-based data structure, you'll have to take my word that no real alternative exists that:

  1. Handles sparsity

  2. Handles unstructured data

  3. Handles observation- and feature-level metadata

  4. Is user-friendly

Initializing AnnData

Let's start by building a basic AnnData object with some sparse count information, perhaps representing gene expression counts.

import anndata
import numpy as np
from scipy.sparse import csr_matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)))
adata = anndata.AnnData(counts)
adata
AnnData object with n_obs × n_vars = 100 × 2000

We see that AnnData provides a representation with summary statistics of the data. The initial data we passed are accessible as a sparse matrix using adata.X.

adata.X
<100x2000 sparse matrix of type ''
	with 126055 stored elements in Compressed Sparse Row format>

Next, we set the index of the obs and var axes using .obs_names and .var_names, respectively.

adata.obs_names = [f"Cell_{i}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])
Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
       'Cell_7', 'Cell_8', 'Cell_9'],
      dtype='object')

Subsetting AnnData

These index values can be used to subset the AnnData, which returns a view of the AnnData object. This is useful for restricting the AnnData to particular cell types or gene modules of interest. The rules for subsetting AnnData are quite similar to those of a Pandas DataFrame. You can use values from .obs_names/.var_names, boolean masks, or integer positions.

adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]
View of AnnData object with n_obs × n_vars = 2 × 2

Adding aligned metadata

Observation/Variable level

So we have the core of our object, and now we'd like to add metadata at both the observation and variable levels. This is pretty simple with AnnData because both adata.obs and adata.var are Pandas DataFrames.

ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = ct
adata.obs
        cell_type
Cell_0   Monocyte
Cell_1          B
Cell_2          T
Cell_3   Monocyte
Cell_4          T
...           ...
Cell_95         T
Cell_96         T
Cell_97         T
Cell_98         B
Cell_99         B

[100 rows x 1 columns]

We can also see now that the AnnData representation has been updated:

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'

Subsetting using metadata

We can also subset the AnnData using these randomly generated cell types:

bdata = adata[adata.obs.cell_type == "B"]
bdata
View of AnnData object with n_obs × n_vars = 35 × 2000
    obs: 'cell_type'

Observation/Variable level matrices

We might also have metadata at either level that has many dimensions to it, such as a UMAP embedding of the data. For this type of metadata, AnnData has the .obsm/.varm attributes. We use keys to identify the different matrices we insert. The restriction on .obsm/.varm is that the first dimension of every .obsm matrix must equal the number of observations, .n_obs, and the first dimension of every .varm matrix must equal .n_vars. Each matrix can independently have a different number of columns.

Let's start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we'd like to store, as well as some random gene-level metadata:

adata.obsm["X_umap"] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.normal(0, 1, size=(adata.n_vars, 5))
adata.obsm
AxisArrays with keys: X_umap

Again, the AnnData representation is updated.

adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    obsm: 'X_umap'
    varm: 'gene_stuff'

A few more notes about .obsm/.varm

  1. The "array-like" metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.

  2. When using Scanpy, individual columns of an .obsm/.varm matrix are not easily plotted, whereas columns of .obs are easily plotted on, e.g., UMAP plots.

Unstructured metadata

AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.

adata.uns["random"] = [1, 2, 3]
adata.uns
OverloadedDict, wrapping:
	OrderedDict([('random', [1, 2, 3])])
With overloaded keys:
	['neighbors'].

Layers

Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let's log transform the original data and store it in a layer:

adata.layers["log_transformed"] = np.log1p(adata.X)
adata
AnnData object with n_obs × n_vars = 100 × 2000
    obs: 'cell_type'
    uns: 'random'
    obsm: 'X_umap'
    varm: 'gene_stuff'
    layers: 'log_transformed'

Outputting DataFrames

We can also ask AnnData to return us a DataFrame from one of the layers:

adata.to_df(layer="log_transformed")
           Gene_0    Gene_1    Gene_2  ...  Gene_1997  Gene_1998  Gene_1999
Cell_0   0.693147  0.000000  1.386294  ...   0.693147   1.098612   0.693147
Cell_1   0.000000  0.000000  0.693147  ...   1.386294   1.098612   1.386294
Cell_2   0.000000  1.098612  0.000000  ...   0.693147   0.693147   0.000000
Cell_3   0.693147  0.000000  1.098612  ...   1.386294   0.000000   0.000000
Cell_4   1.098612  1.098612  1.609438  ...   0.000000   0.000000   0.693147
...           ...       ...       ...  ...        ...        ...        ...
Cell_95  0.693147  0.693147  1.098612  ...   0.693147   0.693147   1.098612
Cell_96  0.000000  0.693147  0.000000  ...   1.098612   0.693147   0.693147
Cell_97  0.000000  1.098612  0.000000  ...   0.000000   1.609438   1.098612
Cell_98  0.693147  1.098612  0.000000  ...   0.693147   0.693147   0.000000
Cell_99  1.098612  0.000000  0.000000  ...   0.693147   0.000000   1.098612

[100 rows x 2000 columns]

We see that the .obs_names/.var_names are used in the creation of this Pandas object.

Wrapping up

AnnData has become the standard for single-cell analysis in Python, and for good reason: it's straightforward to use and facilitates more reproducible analyses with its key-based storage. It's even becoming easier to convert to the popular R-based formats for single-cell analysis. There is still a lot that I don't cover here, so I encourage looking through the AnnData API docs for more useful properties and methods.