# 5, maybe 10 minutes to AnnData

November 13, 2021

I regularly use Scanpy to analyze single-cell genomics data. Scanpy’s functionality heavily depends on the data being stored in an AnnData object, which gives Scanpy a systematic way of storing and retrieving intermediate analysis results, like principal component scores, UMAP embeddings, cluster labels, etc. While the Scanpy documentation site has tutorials covering common use cases, there remains a lack of tutorials for AnnData specifically. Here I cover the basics.

## Why AnnData?

AnnData (“Annotated Data”) is specifically designed for tabular data. By this I mean that we have $$N$$ observations (typically cells), each of which can be represented as a $$d$$-dimensional vector, where each dimension corresponds to a variable or feature (typically a gene). Both the rows and columns of this $$N \times d$$ matrix are special in the sense that they are indexed. In scRNA-seq, each row corresponds to a cell with a barcode, and each column corresponds to one gene. Furthermore, for each cell and each gene we might have additional metadata, like (1) donor information for each cell, or (2) alternative gene symbols for each gene. Finally, we might have other unstructured metadata, like the color palettes we are using for plotting. Without going into every fancy Python-based data structure, you’ll have to take my word that no real alternative exists that:

1. Handles sparsity

2. Handles unstructured data

3. Handles observation- and feature-level metadata

4. Is user-friendly

## Initializing AnnData

Let’s start by building a basic AnnData object with some sparse count information, perhaps representing gene expression counts.

import anndata
import numpy as np
from scipy.sparse import csr_matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)))
adata = anndata.AnnData(counts)
adata

/tmp/ipykernel_1729/4088771229.py:5: FutureWarning: X.dtype being converted to np.float32 from int64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behavour.

AnnData object with n_obs × n_vars = 100 × 2000


We see that AnnData provides a representation with summary statistics of the data. The initial data we passed are accessible as a sparse matrix using adata.X.

adata.X

<100x2000 sparse matrix of type '<class 'numpy.float32'>'
with 126251 stored elements in Compressed Sparse Row format>


Now we index both the obs and var axes by setting .obs_names and .var_names.

adata.obs_names = [f"Cell_{i}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])

Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
'Cell_7', 'Cell_8', 'Cell_9'],
dtype='object')


### Subsetting AnnData

These index values can be used to subset the AnnData, which provides a view of the AnnData object. We can imagine this to be useful to subset the AnnData to particular cell types or gene modules of interest. The rules for subsetting AnnData are quite similar to those of a Pandas DataFrame. You can use values in the obs/var_names, boolean masks, or cell index integers.

adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]

View of AnnData object with n_obs × n_vars = 2 × 2


### Observation/Variable level

So we have the core of our object and now we’d like to add metadata at both the observation and variable levels. This is pretty simple with AnnData, since both adata.obs and adata.var are Pandas DataFrames.

ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = ct
adata.obs

cell_type
Cell_0 Monocyte
Cell_1 Monocyte
Cell_2 Monocyte
Cell_3 B
Cell_4 B
... ...
Cell_95 Monocyte
Cell_96 Monocyte
Cell_97 T
Cell_98 T
Cell_99 Monocyte

100 rows × 1 columns

We can also see now that the AnnData representation has been updated:

adata

AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'


We can also subset the AnnData using these randomly generated cell types:

bdata = adata[adata.obs.cell_type == "B"]
bdata

View of AnnData object with n_obs × n_vars = 35 × 2000
obs: 'cell_type'


## Observation/Variable level matrices

We might also have metadata at either level that has many dimensions to it, such as a UMAP embedding of the data. For this type of metadata, AnnData has the .obsm/.varm attributes. We use keys to identify the different matrices we insert. The restriction on .obsm/.varm is that every .obsm matrix must have a first dimension of length .n_obs, and every .varm matrix a first dimension of length .n_vars. Beyond that, each matrix can have its own number of columns.

Let’s start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we’d like to store, as well as some random gene-level metadata:

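adata.obsm["X_umap"] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.normal(0, 1, size=(adata.n_vars, 5))  # 5 columns, chosen arbitrarily
adata.obsm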

AxisArrays with keys: X_umap


Again, the AnnData representation is updated.

adata

AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'
obsm: 'X_umap'
varm: 'gene_stuff'


A few more notes about .obsm/.varm:

1. The “array-like” metadata can originate from a Pandas DataFrame, scipy sparse matrix, or numpy dense array.

2. When using Scanpy, values (columns) stored in .obsm/.varm are not easily plotted, whereas items from .obs are easily plotted on, e.g., UMAP plots.

## Unstructured metadata

AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.

adata.uns["random"] = [1, 2, 3]
adata.uns

OverloadedDict, wrapping:
OrderedDict([('random', [1, 2, 3])])
With overloaded keys:
['neighbors'].


## Layers

Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let’s log-transform the original data and store it in a layer:

adata.layers["log_transformed"] = np.log1p(adata.X)
adata

AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'
uns: 'random'
obsm: 'X_umap'
varm: 'gene_stuff'
layers: 'log_transformed'


### Outputting DataFrames

We can also ask AnnData to return a DataFrame built from one of the layers:

adata.to_df(layer="log_transformed")

Gene_0 Gene_1 Gene_2 Gene_3 Gene_4 Gene_5 Gene_6 Gene_7 Gene_8 Gene_9 ... Gene_1990 Gene_1991 Gene_1992 Gene_1993 Gene_1994 Gene_1995 Gene_1996 Gene_1997 Gene_1998 Gene_1999
Cell_0 1.386294 0.000000 0.000000 0.000000 0.000000 0.693147 0.000000 0.693147 0.693147 1.098612 ... 1.386294 0.000000 0.000000 0.000000 0.693147 1.098612 0.000000 0.000000 0.693147 0.000000
Cell_1 0.693147 1.098612 0.000000 0.000000 0.000000 0.693147 0.693147 0.693147 0.000000 0.693147 ... 1.098612 0.000000 0.693147 1.386294 1.098612 0.000000 0.000000 0.000000 1.098612 0.693147
Cell_2 1.098612 1.098612 0.000000 0.693147 0.693147 0.693147 1.098612 0.000000 0.693147 0.000000 ... 0.000000 0.000000 0.693147 1.098612 0.000000 0.693147 0.693147 1.098612 1.386294 0.000000
Cell_3 0.693147 0.000000 1.098612 0.693147 0.000000 1.098612 1.098612 0.693147 0.000000 0.693147 ... 1.386294 1.098612 1.098612 1.098612 1.098612 1.098612 1.098612 0.000000 1.098612 1.098612
Cell_4 0.000000 1.098612 1.098612 0.000000 0.000000 0.000000 0.693147 1.098612 0.693147 0.693147 ... 0.693147 1.098612 0.000000 0.693147 0.693147 0.000000 0.000000 0.693147 1.098612 0.693147
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Cell_95 0.693147 1.098612 1.386294 0.693147 0.693147 1.386294 0.693147 0.693147 0.693147 0.693147 ... 0.693147 0.000000 0.693147 0.000000 0.000000 0.000000 0.000000 0.000000 1.098612 1.386294
Cell_96 0.693147 1.098612 0.693147 0.693147 0.693147 0.000000 0.693147 0.693147 1.098612 0.693147 ... 0.693147 0.693147 0.693147 1.386294 0.000000 0.000000 0.000000 0.693147 0.693147 0.693147
Cell_97 0.693147 0.693147 0.000000 0.693147 0.000000 0.000000 1.098612 0.000000 0.693147 0.693147 ... 0.693147 0.000000 1.098612 1.386294 0.000000 0.693147 0.000000 0.000000 0.693147 0.693147
Cell_98 0.693147 0.693147 0.000000 0.693147 0.693147 0.693147 0.000000 0.000000 1.098612 1.609438 ... 0.693147 0.000000 0.693147 0.000000 0.693147 0.000000 0.000000 1.098612 0.693147 0.693147
Cell_99 0.693147 1.098612 0.693147 0.000000 0.000000 0.693147 0.693147 1.098612 0.000000 0.693147 ... 0.000000 0.000000 1.098612 1.098612 0.000000 1.386294 0.000000 0.000000 0.000000 0.693147

100 rows × 2000 columns

We see that the .obs_names/.var_names are used in the creation of this Pandas object.

## Wrapping up

AnnData has become the standard for single-cell analysis in Python, and for good reason: it’s straightforward to use and facilitates more reproducible analyses with its key-based storage. It’s even becoming easier to convert to the popular R-based formats for single-cell analysis. There is still a lot that I don’t cover here. I do encourage looking through the AnnData API docs for more useful properties/methods.