Getting started with single-cell genomics
Last modified: November 15, 2021.
I started working with single-cell RNA-sequencing (scRNA-seq) data in 2016 during my Master's. As the field was growing rapidly at the time, there was a lack of resources for quickly learning how single-cell data are generated and what the best practices are for analyzing them. As single-cell genomics has become more routine, there are now so many resources that it can be hard to know where to start. Here I provide some of my favorite resources that I regularly share with undergraduates and new PhD students. This intentionally short list is meant to quickly acclimate those with computational backgrounds; however, the resources are all of general interest. I will continue to update this post as new material becomes available. For a more comprehensive set of resources, I recommend Ming Tang's analysis notes.
These videos provide a high-level overview of many popular biological questions that can be answered with single-cell data, as well as some description of the underlying technologies.
This last video is from the Stat 115 course at Harvard, which is fully available online, and goes into more depth on single-cell technologies.
Wagner, A., Regev, A., & Yosef, N. (2016). Revealing the vectors of cellular identity with single-cell genomics. Nature biotechnology, 34(11), 1145-1160. [Paper]
Regev, A., Teichmann, S. A., Lander, E. S., Amit, I., Benoist, C., Birney, E., ... & Yosef, N. (2017). Science forum: the human cell atlas. Elife, 6, e27041. [Paper]
The development of computational methods for single-cell genomics data has become very popular in the computational biology/applied stats/applied machine learning fields. In fact, there are about two new scRNA-seq computational methods for every three studies (with new data) published (source). These methods overwhelmingly target a core set of tasks, and most are designed to handle the technical characteristics of these data.
Luecken, M. D., & Theis, F. J. (2019). Current best practices in single‐cell RNA‐seq analysis: a tutorial. Molecular systems biology, 15(6), e8746. [Paper]
Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., ... & Schönhuth, A. (2020). Eleven grand challenges in single-cell data science. Genome biology, 21(1), 1-35. [Paper]
Amezquita, R. A., Lun, A. T., Becht, E., Carey, V. J., Carpp, L. N., Geistlinger, L., ... & Hicks, S. C. (2020). Orchestrating single-cell analysis with Bioconductor. Nature methods, 17(2), 137-145. [Paper] [Online book] (caveat: I would first focus on the motivation sections and not the code sections of the online book.)
A lot of high-quality open-source software has been developed specifically for single-cell data analysis. I am biased as I mostly use Python and therefore only present Python packages here. I assume that R is the more popular language for single-cell data analysis due to Seurat; however, given the power of Python-based machine learning frameworks like PyTorch, JAX, and TensorFlow, along with datasets that regularly exceed 100,000 cells, I believe Python has a brighter future in the field. That said, for now, many common analyses can be done with either language, though certain tasks may require language-specific packages. Therefore, it's also beneficial to understand popular data structures in both languages and how to convert between them (perhaps described in another tutorial). Also, in Python, all packages use the cells by genes orientation for the data, while in R it's genes by cells.
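To make the orientation difference concrete, here is a minimal sketch using only NumPy. The count matrix and the cell and gene names are made up for illustration; real data would usually live in a dedicated container rather than a bare array. The point is simply that moving between the two conventions is a transpose, and that any per-cell or per-gene summary has to use the matching axis.

```python
import numpy as np

# Hypothetical toy counts: 3 cells (rows) x 4 genes (columns),
# i.e. the cells-by-genes layout used by Python packages.
cell_names = ["cell_0", "cell_1", "cell_2"]
gene_names = ["GeneA", "GeneB", "GeneC", "GeneD"]
counts_cells_by_genes = np.array([
    [0, 3, 1, 0],
    [2, 0, 0, 5],
    [1, 1, 4, 0],
])

# Transposing gives the genes-by-cells layout typically used in R.
counts_genes_by_cells = counts_cells_by_genes.T

# Total counts per cell: sum over the gene axis, which differs by layout.
totals_cells_by_genes = counts_cells_by_genes.sum(axis=1)
totals_genes_by_cells = counts_genes_by_cells.sum(axis=0)

print(counts_cells_by_genes.shape)  # (n_cells, n_genes)
print(counts_genes_by_cells.shape)  # (n_genes, n_cells)
print(dict(zip(cell_names, totals_cells_by_genes.tolist())))
```

Summing over the wrong axis silently returns per-gene totals instead of per-cell totals, so it is worth checking the shape of a matrix whenever data cross between the two languages.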
Scanpy and its basic tutorial. Scanpy underlies most Python-based single-cell analyses, and for good reason. It stores data in the AnnData data structure, which has become the most widely accepted input to other Python-based methods. It also provides nice plotting functions for exploratory analysis. Familiarity with the Scanpy workflow is essential.
scvi-tools, which I help develop, provides access to many popular probabilistic models for single-cell genomics as well as an interface to build novel probabilistic/deep learning models using PyTorch, PyTorch Lightning, and Pyro. I present a Scanpy/scvi-tools tutorial here, with corresponding video.
cellxgene for interactive visualization.
It's ok to be confused
There's not always a good reason for why things are done the way they are (and this can make a good research direction!)
While these resources represent a starting point, it's important to read publications that apply these computational techniques and technologies. At first, the papers will contain jargon that does not make sense, and that's ok.