5, maybe 10 minutes to AnnData¶
November 13, 2021
I regularly use Scanpy to analyze single-cell genomics data. Scanpy’s functionality depends heavily on the data being stored in an AnnData object, which gives Scanpy a systematic way of storing and retrieving intermediate analysis results, like principal component scores, UMAP embeddings, cluster labels, etc. While the Scanpy documentation site has tutorials covering common use cases, there remains a lack of tutorials for AnnData specifically. Here I cover the basics.
Why AnnData?¶
AnnData (“Annotated Data”) is specifically designed for tabular data. By this I mean that we have \(N\) observations (typically cells), each of which can be represented as a \(d\)-dimensional vector, where each dimension corresponds to a variable or feature (typically a gene). Both the rows and columns of this \(N \times d\) matrix are special in the sense that they are indexed. In scRNA-seq, each row corresponds to a cell with a barcode, and each column corresponds to one gene. Furthermore, for each cell and each gene we might have additional metadata, like (1) donor information for each cell, or (2) alternative gene symbols for each gene. Finally, we might have other unstructured metadata, like the color palettes we are using for plotting. Without going into every fancy Python-based data structure, you’ll have to take my word that no other alternative really exists that:
Handles sparsity
Handles unstructured data
Handles observation- and feature-level metadata
Is user-friendly
Initializing AnnData¶
Let’s start by building a basic AnnData object with some sparse count information, perhaps representing gene expression counts.
import anndata
import numpy as np
from scipy.sparse import csr_matrix
counts = csr_matrix(np.random.poisson(1, size=(100, 2000)))
adata = anndata.AnnData(counts)
adata
AnnData object with n_obs × n_vars = 100 × 2000
We see that AnnData provides a representation with summary statistics of the data. The initial data we passed are accessible as a sparse matrix using adata.X.
adata.X
<100x2000 sparse matrix of type '<class 'numpy.int64'>'
with 126611 stored elements in Compressed Sparse Row format>
Now, we provide the index to both the obs and var axes using .obs_names (resp. .var_names).
adata.obs_names = [f"Cell_{i}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])
Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
'Cell_7', 'Cell_8', 'Cell_9'],
dtype='object')
Subsetting AnnData¶
These index values can be used to subset the AnnData, which provides a view of the AnnData object. We can imagine this to be useful to subset the AnnData to particular cell types or gene modules of interest. The rules for subsetting AnnData are quite similar to those of a Pandas DataFrame. You can use values in the obs/var_names, boolean masks, or integer positions.
adata[["Cell_1", "Cell_10"], ["Gene_5", "Gene_1900"]]
View of AnnData object with n_obs × n_vars = 2 × 2
Adding aligned metadata¶
Observation/Variable level¶
So we have the core of our object and now we’d like to add metadata at both the observation and variable levels. This is pretty simple with AnnData, because both adata.obs and adata.var are Pandas DataFrames.
ct = np.random.choice(["B", "T", "Monocyte"], size=(adata.n_obs,))
adata.obs["cell_type"] = ct
adata.obs
cell_type | |
---|---|
Cell_0 | T |
Cell_1 | Monocyte |
Cell_2 | Monocyte |
Cell_3 | Monocyte |
Cell_4 | T |
... | ... |
Cell_95 | B |
Cell_96 | B |
Cell_97 | Monocyte |
Cell_98 | Monocyte |
Cell_99 | Monocyte |
100 rows × 1 columns
We can also see now that the AnnData representation has been updated:
adata
AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'
Subsetting using metadata¶
We can also subset the AnnData using these randomly generated cell types:
bdata = adata[adata.obs.cell_type == "B"]
bdata
View of AnnData object with n_obs × n_vars = 31 × 2000
obs: 'cell_type'
Observation/Variable level matrices¶
We might also have metadata at either level that has many dimensions to it, such as a UMAP embedding of the data. For this type of metadata, AnnData has the .obsm/.varm attributes. We use keys to identify the different matrices we insert. The restriction of .obsm/.varm is that the first dimension of every .obsm matrix must have length .n_obs, and the first dimension of every .varm matrix must have length .n_vars. The number of remaining dimensions (columns) can differ from one matrix to the next.
Let’s start with a randomly generated matrix that we can interpret as a UMAP embedding of the data we’d like to store, as well as some random gene-level metadata:
adata.obsm["X_umap"] = np.random.normal(0, 1, size=(adata.n_obs, 2))
adata.varm["gene_stuff"] = np.random.normal(0, 1, size=(adata.n_vars, 5))
adata.obsm
AxisArrays with keys: X_umap
Again, the AnnData representation is updated.
adata
AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'
obsm: 'X_umap'
varm: 'gene_stuff'
A few more notes about .obsm/.varm:
The “array-like” metadata can originate from a Pandas DataFrame, a scipy sparse matrix, or a dense numpy array.
When using Scanpy, individual columns of these matrices are not easily plotted, whereas columns of .obs can be plotted directly on, e.g., UMAP plots.
Unstructured metadata¶
AnnData has .uns, which allows for any unstructured metadata. This can be anything, like a list or a dictionary with some general information that was useful in the analysis of our data.
adata.uns["random"] = [1, 2, 3]
adata.uns
OrderedDict([('random', [1, 2, 3])])
Layers¶
Finally, we may have different forms of our original core data, perhaps one that is normalized and one that is not. These can be stored in different layers in AnnData. For example, let’s log transform the original data and store it in a layer:
adata.layers["log_transformed"] = np.log1p(adata.X)
adata
AnnData object with n_obs × n_vars = 100 × 2000
obs: 'cell_type'
uns: 'random'
obsm: 'X_umap'
varm: 'gene_stuff'
layers: 'log_transformed'
Outputting DataFrames¶
We can also ask AnnData to return us a DataFrame from one of the layers:
adata.to_df(layer="log_transformed").iloc[:, :5]
Gene_0 | Gene_1 | Gene_2 | Gene_3 | Gene_4 | |
---|---|---|---|---|---|
Cell_0 | 0.693147 | 1.386294 | 0.000000 | 0.693147 | 1.386294 |
Cell_1 | 0.693147 | 1.098612 | 0.693147 | 1.098612 | 1.098612 |
Cell_2 | 1.098612 | 0.000000 | 0.000000 | 0.693147 | 1.098612 |
Cell_3 | 0.693147 | 1.098612 | 0.000000 | 0.693147 | 0.000000 |
Cell_4 | 1.098612 | 0.693147 | 0.693147 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... |
Cell_95 | 1.098612 | 0.000000 | 0.693147 | 1.098612 | 0.693147 |
Cell_96 | 0.693147 | 0.000000 | 1.098612 | 0.000000 | 0.000000 |
Cell_97 | 0.693147 | 0.693147 | 0.000000 | 0.693147 | 0.000000 |
Cell_98 | 0.000000 | 1.098612 | 1.098612 | 0.693147 | 0.693147 |
Cell_99 | 0.000000 | 0.693147 | 1.098612 | 0.000000 | 1.609438 |
100 rows × 5 columns
We see that the .obs_names/.var_names are used in the creation of this Pandas object.
Wrapping up¶
AnnData has become the standard for single-cell analysis in Python, and for good reason: it’s straightforward to use and facilitates more reproducible analyses with its key-based storage. It’s even becoming easier to convert to the popular R-based formats for single-cell analysis. There is still a lot that I don’t cover here, so I encourage looking through the AnnData API docs for more useful properties and methods.