From PCA to VAE

Principal component analysis (PCA) is a dimensionality reduction technique that aims to preserve as much of the variability in the data as possible. A Variational Autoencoder (VAE) (Kingma & Welling, 2014) is a deep learning approach that can, among other things, generate novel images of handwritten digits and celebrity faces. So how are these two models related?

Exploration of PCA

To begin, let's explore PCA. We assume we have a dataset \mathcal{D} = \{x_i\}_{i=1}^N, where x_i \in \mathbb{R}^d. In the case of scRNA-seq data, each dimension would correspond to one gene. For simplicity, we assume our data have been mean-centered, i.e., \mathbb{E}[x_i] = 0. PCA learns a sequence of unit vectors such that:

  1. The vectors are mutually orthogonal

  2. The sum of squared distances from the data points to the lines spanned by these vectors is minimized (see the sketch after this list)
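
Both properties are easy to check numerically. Below is a minimal sketch, assuming NumPy and scikit-learn are available; the data matrix is a made-up toy example standing in for a real expression matrix. It mean-centers the data, fits PCA, verifies that the learned components are mutually orthogonal unit vectors, and computes the average squared distance from each point to its projection (the quantity PCA minimizes).

import numpy as np
from sklearn.decomposition import PCA

# Toy data: N = 200 observations, d = 5 "genes" (made-up numbers, purely for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Mean-center so that each dimension has mean zero
X_centered = X - X.mean(axis=0)

# Fit PCA; the rows of pca.components_ are the learned unit vectors
pca = PCA(n_components=3).fit(X_centered)
U = pca.components_

# Property 1: the vectors are mutually orthogonal unit vectors, so U U^T = I
print(np.allclose(U @ U.T, np.eye(3)))  # True

# Property 2: average squared distance from each point to its projection onto the span of U
residual = X_centered - X_centered @ U.T @ U
print((residual ** 2).sum(axis=1).mean())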

Let us unpack these two conditions. To find the first principal component, we solve the following optimization problem.

\min_{u: \|u\|_2 = 1} \frac{1}{N}\sum_{i=1}^N \|x_i - (u^\top x_i)u\|_2^2
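
A standard fact about this problem is that its solution is the leading eigenvector of the sample covariance matrix. Continuing with X_centered, rng, and pca from the sketch above, here is a small sketch that computes the first principal component this way, checks that it agrees with scikit-learn's first component up to sign, and confirms it achieves a lower objective value than an arbitrary unit vector:

import numpy as np

def reconstruction_error(X, u):
    # Average squared distance from each row of X to the line spanned by the unit vector u
    proj = np.outer(X @ u, u)  # (u^T x_i) u for every i
    return ((X - proj) ** 2).sum(axis=1).mean()

# Sample covariance matrix of the mean-centered data
cov = X_centered.T @ X_centered / X_centered.shape[0]

# The first principal component is the eigenvector with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]

# It matches scikit-learn's first component up to sign
print(np.isclose(abs(u1 @ pca.components_[0]), 1.0))  # True

# An arbitrary unit vector for comparison
v = rng.normal(size=u1.shape)
v /= np.linalg.norm(v)

print(reconstruction_error(X_centered, u1) <= reconstruction_error(X_centered, v))  # True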

Another interpretation of PCA is that it projects the data onto lines (components) such that the variance of the resulting projected points is maximized. To see this, note that the Pythagorean theorem gives \|x_i\|_2^2 = \|x_i - (u^\top x_i)u\|_2^2 + (u^\top x_i)^2 for any unit vector u. The left-hand side does not depend on u, so minimizing the average squared residual above is equivalent to maximizing \frac{1}{N}\sum_{i=1}^N (u^\top x_i)^2, which is exactly the variance of the projected points because the data are mean-centered.

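This equivalence is also easy to verify numerically. Reusing X_centered, u1, v, and reconstruction_error from the sketches above, the sketch below checks that the average squared norm of the points splits into reconstruction error plus projected variance, and that the first principal component maximizes the latter:

import numpy as np

def projected_variance(X, u):
    # Variance of the projections u^T x_i (the data are mean-centered, so no mean to subtract)
    return ((X @ u) ** 2).mean()

# Average squared norm of the data points; this does not depend on u
total = (X_centered ** 2).sum(axis=1).mean()

# Pythagorean identity: total = reconstruction error + projected variance, for any unit vector u
for u in (u1, v):
    print(np.isclose(total, reconstruction_error(X_centered, u) + projected_variance(X_centered, u)))  # True

# Hence minimizing the reconstruction error is the same as maximizing the projected variance
print(projected_variance(X_centered, u1) >= projected_variance(X_centered, v))  # True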

References

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014.