Flow cytometry is a powerful technique for phenotypic analysis of cells and cell populations. One of the main challenges in flow cytometry analysis is visualising the resulting high-dimensional data so it can be understood at the single-cell level.

This is where dimensionality reduction techniques come into play, in particular with the advent of spectral and mass cytometry. Most biological data are non-linear, so algorithms based on principal component analysis (the famous PCA) will lose local structure even if they maintain the global one. The most widely used embedding algorithm (i.e. one that transforms high-dimensional data into a low-dimensional representation) is currently t-SNE, which stands for t-distributed Stochastic Neighbour Embedding. I will tell you here why UMAP killed t-SNE, as nicely put in this article.

A little bit about t-SNE before killing it

Let’s start by explaining briefly how t-SNE works:
Developed by L.J.P. van der Maaten, the algorithm aims to minimise the Kullback-Leibler (KL) divergence between the low-dimensional embedding and the high-dimensional data. The KL divergence can be defined as a measure of how much one probability distribution diverges from a second one. (;¬_¬)
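
Concretely, for discrete distributions P and Q it is KL(P || Q) = Σ_x P(x) · log(P(x) / Q(x)), i.e. the average log-ratio of the two probabilities weighted by P. The pointwise term P(x) · log(P(x) / Q(x)) is exactly what we will plot below.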

A quick note on KL divergence

Ok, let's plot the KL divergence to understand it better, using scipy.stats.norm and numpy for the distributions and seaborn and matplotlib.pyplot for plotting:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

Let's create a little function that computes two normal distributions and the pointwise KL contribution:

def kl_divergence(x):
    p = norm.pdf(x, 0, 2)        # P: normal distribution centred on 0, sd 2
    q = norm.pdf(x, 3, 2)        # Q: normal distribution centred on 3, sd 2
    kl = p * np.log(p / q)       # pointwise KL contribution at each x
    return p, q, kl

x = np.arange(-10, 10, 0.1)
p, q, kl = kl_divergence(x)
sns.set('paper', 'whitegrid')
plt.plot(x, p, c='b')                           # P in blue
plt.plot(x, q, c='r')                           # Q in red
plt.plot(x, kl, lw=0.1, c='b')
plt.fill_between(x, kl, alpha=0.2, color='b')   # shade the pointwise KL contribution
plt.title('KL(P||Q)')
plt.ylim(-0.1, 0.35)
plt.show()
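
The total divergence is the area under the shaded curve. As a quick sanity check, scipy.stats.entropy computes the discrete KL divergence directly from the p and q arrays above (it renormalises them to sum to 1), which here comes out close to the analytical value for two equal-variance normals:

from scipy.stats import entropy

# discrete KL divergence: sum of p_i * log(p_i / q_i) after renormalising p and q
print(entropy(p, q))   # ≈ 1.125, i.e. (0 - 3)**2 / (2 * 2**2) for these two normals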

Running this shows the two distributions and, shaded, the pointwise KL(P||Q) contribution. This divergence is what t-SNE tries to minimise between the high- and low-dimensional similarity distributions.

Steps

  1. In high dimension, t-SNE tries to determine the probability of similarity between each pair of data points. To do so, t-SNE:
  • measures the (‘Euclidean’, i.e. straight-line) distance between each point and all the other points, pairwise.
  • computes a similarity score using a Gaussian distribution kernel 🌰: the probability that a point x would have another point as neighbour if this neighbour were picked from a normal distribution centred on the point x.
  • These similarity scores are then normalised to account for cluster density.
    The expected density around each point, which determines the width of the kernel, is controlled by the perplexity parameter.
  • To account for unbalanced densities, and thus disagreeing scores between two points, the similarities are symmetrised by averaging them.
    This gives you a matrix of similarity scores for the high-dimensional data (a minimal sketch of this step follows the list below).
  2. Following initialisation of the embedding with PCA or at random, the data are embedded in low dimension. The process described above is performed again, but this time, to avoid overcrowding the clusters (all data points ending up in the same area), a t-distribution rather than a Gaussian is used, leaving heavy tails for far-apart values.
  3. The KL divergence cost function is minimised by gradient descent, i.e. by small steps.

    The use of the KL divergence prevents t-SNE from conserving global distances in low dimension.
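
To make step 1 more concrete, here is a minimal sketch of the high-dimensional similarity matrix in numpy. It is only illustrative: it uses a single fixed bandwidth sigma instead of the per-point bandwidth that t-SNE derives from the perplexity, and toy random data in place of a cells x markers matrix.

import numpy as np

def high_dim_affinities(X, sigma=1.0):
    # pairwise squared Euclidean distances between all points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel: similarity decays with distance
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)               # a point is not its own neighbour
    P /= P.sum(axis=1, keepdims=True)      # normalise per point (accounts for density)
    return (P + P.T) / (2 * P.shape[0])    # symmetrise by averaging

X = np.random.rand(100, 10)   # toy data: 100 cells x 10 markers
P = high_dim_affinities(X)    # matrix of similarity scores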

Parameters

t-SNE has two important hyperparameters to tune:

  • perplexity: roughly the number of nearest neighbours whose geometric distances are preserved in the embedding, i.e. the minimum number of points you expect to find in a cluster and that t-SNE will take into account.

    Typical values range from 5 to 50, but the larger the dataset, the larger the perplexity should be.

  • maximum number of iterations: ideally the more the better, but it is limited by computational time. The default value of 1000 is good for exploration.

Note: t-SNE has a cost function that is not convex, i.e. it can get stuck in a local minimum, and with different initialisations we can get different results. The sketch below fixes the seed and uses PCA initialisation for that reason.
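
As an illustration, here is a minimal sketch of running t-SNE with scikit-learn on toy data (the array X, the marker count and the parameter values are placeholders, not recommendations):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 12)   # toy data: 500 cells x 12 markers
tsne = TSNE(
    n_components=2,    # embed in 2D for plotting
    perplexity=30,     # roughly the expected number of close neighbours
    init='pca',        # PCA initialisation, more reproducible than 'random'
    random_state=0,    # fix the seed since the cost function is not convex
)
embedding = tsne.fit_transform(X)   # shape (500, 2), ready to scatter-plot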

Pros and cons

Advantages:

  • preserves local structure and clusters

Disadvantages:

  • loss of global structure
  • slow computation
  • suffers from overcrowding in large datasets

To go further

For a simple and visual explanation, I would recommend the StatQuest video from Josh Starmer and this popular tutorial.

UMAP

Uniform Manifold Approximation and Projection, or UMAP, was developed in 2018 by McInnes. UMAP has already been integrated into proprietary flow analysis software, democratising its use, but its Python library offers a wider range of possibilities.

Steps

  1. UMAP builds a graph of the high-dimensional data

Like t-SNE, UMAP relies on building a graph of the high-dimensional data, i.e. a network of nodes (points) connected by edges (lines). The edges are weighted by the probability that the two points are connected. A connection between two points is made when the radius around one point overlaps the radius of another point.

How to choose the radius? UMAP chooses a radius for each point using the distance between the point and its k-th nearest neighbour. It then makes the likelihood of connection decrease as the radius grows, making the graph 'fuzzy'. This last point is there to mitigate the curse of dimensionality, where the distances between points become more and more similar as the number of dimensions increases.

To avoid ending up with isolated points and to maintain local structure, UMAP ensures that each point is connected to at least its closest neighbour. As in t-SNE, UMAP can generate connections with incompatible weights; however, instead of just averaging the weights, UMAP takes the union of the two 'fuzzy simplicial sets' (⊙_☉). A simplified sketch follows the figure below.

Example of fuzzy radii and weighted connections, adapted from the [page](https://pair-code.github.io/understanding-umap/) of A. Coenen and A. Pearce.
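
For intuition, here is a very simplified sketch of this graph construction in numpy. It is only an approximation of the idea: it uses a single fixed sigma instead of UMAP's per-point calibrated bandwidth, random toy data, and the 'a + b - a*b' line stands in for the fuzzy union used to merge disagreeing weights.

import numpy as np

def fuzzy_graph(X, n_neighbors=5, sigma=1.0):
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)
    # rho: distance to the closest neighbour, so every point keeps at least one connection
    rho = d.min(axis=1, keepdims=True)
    # membership decays beyond the closest neighbour, giving the 'fuzzy' radius
    w = np.exp(-np.maximum(d - rho, 0.0) / sigma)
    # keep only the n_neighbors strongest edges per point
    far = np.argsort(d, axis=1)[:, n_neighbors:]
    np.put_along_axis(w, far, 0.0, axis=1)
    # fuzzy union to merge the two directed weights: a + b - a*b
    return w + w.T - w * w.T

X = np.random.rand(50, 10)   # toy data: 50 cells x 10 markers
graph = fuzzy_graph(X)       # symmetric weighted adjacency matrix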

  2. UMAP optimises the low-dimensional embedding

As in t-SNE, the point is to optimise the low-dimensional data (which use plain Euclidean distances, not the varying metric of the high-dimensional graph above) so that they match the 'fuzzy' topology defined above as closely as possible. However, instead of using the KL divergence, UMAP uses cross-entropy as a loss function. This function is not too far from the KL divergence but, by using both attractive and repulsive forces, it maintains global structure better. Moreover, unlike t-SNE, UMAP does not use random initialisation for the embedding but spectral embedding, so UMAP is more reproducible from run to run.

Note: it is still stochastic, so results do change a little from run to run.

Finally, UMAP uses stochastic gradient descent (mini-batches) instead of full gradient descent, so it computes faster.

Parameters

  • n_neighbors: number of neighbours used to build the high-dimensional graph. A low value will favour local structure, while a high value will focus on global structure and lose fine detail.
  • min_dist: minimum distance between points in the low dimension, i.e. how tightly the points are packed. Large values result in looser clusters and a focus on the broad global structure rather than fine detail (see the sketch below).
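
Putting these two parameters together, here is a minimal sketch using the umap-learn library on toy data (the array X and the parameter values are placeholders):

import numpy as np
import umap

X = np.random.rand(500, 12)    # toy data: 500 cells x 12 markers
reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighbourhood used to build the graph
    min_dist=0.1,      # how tightly points may be packed in the embedding
    random_state=42,   # fix the seed; UMAP is still stochastic
)
embedding = reducer.fit_transform(X)   # shape (500, 2), ready to scatter-plot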

Pros and cons

Advantages:

  • preserves local structure and clusters
  • better preservation of global structure
  • better reproducibility
  • fast runtime

Disadvantages:

  • hyperparameter choice is crucial and depends on the aims

To go further

For a really playful and clear detailed explanation, check out this website.