UMAP for Flow Cytometry - Part 1
A bit of embedding theory on t-SNE and UMAP
Flow cytometry is a powerful technique for the phenotypic analysis of cells and cell populations. One main challenge in flow cytometry analysis is to visualise the resulting high-dimensional data in order to understand them at the single-cell level.
This is where dimensionality reduction techniques come into play, in particular with the advent of spectral and mass cytometry. Most biological data are non-linear, so algorithms based on principal component analysis (the famous PCA) will lose local distributions even if they maintain global ones. The most used embedding algorithm (i.e. one that transforms high-dimensional data into low-dimensional data) is currently t-SNE, which stands for t-distributed Stochastic Neighbour Embedding. I will tell you here why UMAP killed t-SNE, as nicely said in this article.
A little bit about t-SNE before killing it
Let’s start by explaining briefly how t-SNE works:
Developed by L.J.P. van der Maaten, the algorithm aims to minimise the Kullback-Leibler (KL) divergence between the low-dimensional embedding and the high-dimensional data. The KL divergence can be defined as a measure of how one probability distribution P diverges from a second one Q: KL(P||Q) = Σ p(x) log(p(x) / q(x)), summed over all x. (;¬_¬)
Ok, let's plot the KL divergence to better understand it, using `scipy.stats.norm` and `numpy` for the distributions, and `seaborn` and `matplotlib.pyplot` for plotting.
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
```
Let's create a little function that computes two normal distributions and the pointwise KL divergence between them:
```python
def kl_divergence(x):
    # P ~ N(0, 2) and Q ~ N(3, 2): two Gaussians with the same width
    p = norm.pdf(x, 0, 2)
    q = norm.pdf(x, 3, 2)
    # Pointwise contribution p(x) * log(p(x) / q(x));
    # summing it over x approximates KL(P||Q)
    kl = p * np.log(p / q)
    return p, q, kl

x = np.arange(-10, 10, 0.1)
p, q, kl = kl_divergence(x)

sns.set('paper', 'whitegrid')
plt.plot(x, p, c='b')
plt.plot(x, q, c='r')
plt.plot(x, kl, lw=0.1, c='b')
plt.fill_between(x, kl, alpha=0.2, color='b')
plt.title('KL(P||Q)')
plt.ylim(-0.1, 0.35)
plt.show()
```
So the idea is to end up with the smallest possible KL divergence between the high-dimensional data and the low-dimensional embedding.
Steps
- In high dimension, t-SNE tries to determine the probability of similarity between each pair of data points. To do so, t-SNE:
- measures the ('Euclidean', i.e. straight-line) distance from one point to all the other points, pairwise.
- computes a similarity score using a Gaussian distribution kernel 🌰: the probability that a point x would have another point as a neighbour if this neighbour were picked from a normal distribution centred on the point x.
- These similarity scores are then normalised to account for cluster density. The expected density around each point, which determines the width of the kernel, is controlled by the perplexity parameter.
- To account for unbalanced densities, and thus disagreeing scores between two points, the similarities are symmetrised by averaging them.
This gives you a matrix of similarity scores for the high-dimensional data.
- Following initialisation of the dimension reduction with PCA or at random, the data are embedded in low dimension. The process described above is then performed again, but this time, to avoid overcrowded clusters (all data points ending up in the same area), a heavy-tailed Student's t-distribution is used instead of a Gaussian, leaving room in the tails for far-apart values.
- The KL divergence cost function is minimised by gradient descent, i.e. by small steps.
The use of the KL divergence prevents t-SNE from conserving global distances in low dimension: placing close neighbours far apart is heavily penalised, but placing far-apart points close together is barely penalised, so only the local structure is faithfully preserved.
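To make the first steps concrete, here is a minimal sketch of the high-dimensional similarity computation. It is a deliberate simplification: I use one fixed kernel width `sigma` for all points, whereas real t-SNE tunes it per point by binary search so that the local entropy matches the chosen perplexity.

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    # Squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel similarity; a point is not its own neighbour
    p = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(p, 0)
    # Normalise each row into conditional probabilities p_(j|i)
    p = p / p.sum(axis=1, keepdims=True)
    # Symmetrise by averaging p_(j|i) and p_(i|j)
    return (p + p.T) / (2 * len(X))

X = np.random.default_rng(0).normal(size=(5, 10))  # 5 points in 10 dimensions
P = gaussian_similarities(X)  # the matrix of similarity scores
```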
Parameters
t-SNE has two important hyperparameters to tune:
- `perplexity`: the number of nearest neighbours whose geometric distances are preserved in the embedding, i.e. the minimum number of points that t-SNE will consider to form a cluster. Typical values range from 5 to 50, but the larger the dataset, the larger the perplexity.
- maximum number of iterations: ideally the more the better, but it is limited by computation time. The default value of 1000 is good for exploration.
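On the practical side, here is how these hyperparameters map onto scikit-learn's TSNE, sketched on random stand-in data rather than real cytometry values. The number of iterations is left at its default of 1000 (the argument is called n_iter or max_iter depending on your scikit-learn version).

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 20))  # stand-in for cytometry data

tsne = TSNE(
    n_components=2,
    perplexity=30,    # roughly the expected minimum cluster size
    init='pca',       # PCA initialisation is more stable than random
    random_state=42,
)
embedding = tsne.fit_transform(X)  # shape (500, 2)
```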
Now, let's see how UMAP kills it
Steps
- UMAP builds a graph of the high-dimensional data
Like t-SNE, UMAP relies on building a graph of the high-dimensional data, i.e. a network of nodes (points) connected by edges (lines). The edges are weighted by the probability that the two points are connected. A connection between two points is made when the radius around one point overlaps the radius of another point.
How to choose the radius? UMAP picks a radius for each point from the distance between the point and its k-th nearest neighbour. It then computes the likelihood of connection as decreasing as the radius grows, making the graph 'fuzzy'. This last point helps against the curse of dimensionality, where distances between points become more and more similar as the number of dimensions grows.
To avoid ending up with isolated points and to maintain local structure, UMAP ensures that each point is connected to at least its closest neighbour. As in t-SNE, UMAP can generate connections with incompatible weights; however, instead of just averaging the weights, UMAP takes the union of the two 'fuzzy simplicial sets' (⊙_☉). A toy sketch of this graph construction follows after these steps.
Example of fuzzy radii and weighted connections, adapted from the [page](https://pair-code.github.io/understanding-umap/) of A. Coenen and A. Pearce
- UMAP optimises the low-dimensional embedding
As in t-SNE, the point is to optimise the low-dimensional data (which, unlike the high-dimensional space above, has plain Euclidean distances rather than varying metrics) so that it matches the 'fuzzy' topology defined above as closely as possible. However, instead of the KL divergence, UMAP uses cross-entropy as its loss function. This function is not too far from the KL divergence but, by applying both attractive and repulsive forces, it maintains global structure better. Moreover, unlike t-SNE, UMAP does not use random initialisation for the embedding but spectral embedding, so UMAP is more reproducible from run to run.
Finally, UMAP uses stochastic (mini-batch) gradient descent instead of full gradient descent, so it computes faster.
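Here is a toy sketch of the graph construction described above, with loud simplifications: the per-point radius is taken as the mean distance to the k nearest neighbours, whereas real UMAP finds a smoothed radius by binary search, and everything is done with dense matrices. The last line is the fuzzy union of the two directed weights, a + b - a*b, instead of t-SNE's averaging.

```python
import numpy as np

def fuzzy_graph(X, k=5):
    # Pairwise Euclidean distances; a point is not its own neighbour
    d = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(d, np.inf)
    knn = np.sort(d, axis=1)[:, :k]  # distances to the k nearest neighbours
    rho = knn[:, 0]                  # distance to the closest neighbour
    sigma = knn.mean(axis=1)         # crude stand-in for UMAP's smoothed radius
    # Weights decay with distance beyond rho; the closest neighbour always
    # gets weight 1, so no point is left isolated
    w = np.exp(-np.maximum(d - rho[:, None], 0) / sigma[:, None])
    # Fuzzy union of the directed weights: a OR b = a + b - a*b
    return w + w.T - w * w.T

X = np.random.default_rng(1).normal(size=(20, 4))  # 20 points in 4 dimensions
W = fuzzy_graph(X)  # the weighted, symmetrised 'fuzzy' graph
```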
Parameters
- `n_neighbors`: the number of neighbours used to build the high-dimensional graph. A low value preserves local structure, while a high value focuses on global structure and loses fine details.
- `min_dist`: the minimum distance between points in low dimension, i.e. how tightly the points are clustered. Large values result in less packed clusters and shift the focus to the broad, global structure.
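In code, both parameters go straight into the umap-learn package (installed with pip install umap-learn); a minimal sketch on random stand-in data:

```python
import numpy as np
import umap

X = np.random.default_rng(0).normal(size=(500, 20))  # stand-in for cytometry data

reducer = umap.UMAP(
    n_neighbors=15,   # balance between local and global structure
    min_dist=0.1,     # how tightly points may be packed in the embedding
    random_state=42,
)
embedding = reducer.fit_transform(X)  # shape (500, 2)
```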
To go further
For a really playful, clear, and detailed explanation, check out this website.