UMAP for Flow Cytometry - Part 2
an application of UMAP on real flow cytometry data for embedding and visualisation
UMAP for flow cytometry data
UMAP can be used for flow analysis with several aims as described in details in the doc from UMAP, e.g. to plot high-dimensional data or to reduce dimension before clustering.
Here, we will explore how UMAP can help in the visualisation of cardiac immune cell response during chronic kidney disease by looking at changes in cardiac leukocyte subtypes.
A real cool tool available for python is Flowkit. You can use this tool to extract data that have been gated with GatingML 2.0 or FlowJo 10. I willl focus here on a FlowJo workspace as this is were I did my gating.
To work with FlowJo workspace you need to first call a Session
to get your FCS and import your workspace in it. Then you need to apply the gating to the sample group (the group from your FlowJo workspace structure).
fks_fj = fk.Session('../data/raw_data')
fks_fj.import_flowjo_workspace('../data/raw_data/FJ_workspace.wsp')
analysis = fks_fj.analyze_samples('group')
To get your labelled gated sampled all you need to do is:
1.Get a dataframe of the gating raw results where in the column events you will have an array of booleans for each gates
gates = pd.DataFrame(fks_fj.get_gating_results('group', 'sample')._raw_results).loc['events'] #where sample is your sample's name ◔‿◔
gates = pd.DataFrame(gates.droplevel(1)).transpose()
2.Explode the arrays into single rows
gated = pd.DataFrame()
for col in gates.columns:
gated[col] = gates[col].explode(ignore_index=True)
3.Merge it back with the fluorescence values of each single cell
Here you can also select only a subpopulation if you know what is your population of interest (for example not the singlets)
features = gated.merge(fkj.get_sample('sample').as_dataframe('raw'), left_index=True, right_index=True)
fks_fj.get_gate_hierarchy
is a useful call to check your gating strategy if you have forgotten it
Now that you have extracted your data, there is just one last crucial step.
Preprocessing: arcsinh
Data from high intensity values are spread on a log scale but the problem with log-scale is that it does not do well on the 'low-end' of the intensity scale, especially since some fluorescence values are zero or sometimes negative (following baseline correction). To solve this problem, bi-exponential transformation have been applied to flow data to approach log for high values and linearity for data around zero.
The ArcSinh transform is close to the classic bi-exponential, with a linear representation near zero and a log-like scale beyond a certain threshold, but it is equation-based. The threshold or co-factor determines the compression of the low-end values, the higher the more data are compressed close to zero. Here, data were transformed using arcsinh with a co-factor of 150 as suggested by Bendall.
Arcsinh transformation adapted from [Bendall et al.](https://science.sciencemag.org/content/332/6030/687)
umap_data = np.arcsinh(features.values / 150.0).astype(np.float32, order='C')
Adapted from Epelman et al. 2014
Let's explore what UMAP do. The library provides a really detailed documentation and also nice functions to explore UMAP.
UMAP is used in the style of most sklearn models: instantiate and fit or fit/transform.
The library also offers utility functions:
plot.points
: plot a scatter plot.
plot.connectivity
: to plot the connection between each points.
plot.diagnostic
: to diagnose the embedding.
Reminder of parameters:
n_neighbors
: number of neighbours for the high dimensional graph.min_dist
: minimum distance between points in low dimension,i.e. how tight the points are clustered.
import umap.plot
import umap
#common usage
mapper = umap.UMAP()
embedding = mapper.fit(umap_data)
Let's test what happen when we modify the hyperparameters:
def draw_umap(n_neighbors=15, min_dist=0.1, title=''):
mapper = umap.UMAP(
n_neighbors=n_neighbors,
min_dist=min_dist,
random_state=42
).fit(umap_data)
umap.plot.points(mapper, labels=labels, theme='blue')
plt.title(title, fontsize=15)
plt.show()
On the following plots, the colours represent gated cells. You can truly see that the identified populations are nicely separated by UMAP.
umap.plot.connectivity(mapper, show_points=True, width=400, height=400)
plt.show()
umap.plot.diagnostic(mapper, diagnostic_type='pca', width=400, height=400)
plt.show()
Parametric UMAP
Parametric UMAP is really powerful as it relies on a neural network to tune the hyperparameters. You can tune the network if needed but I found that it performed really well out of the box.
Loss is binary cross-entropy so we need to scale the data in range 0-1 with min max scaler:
from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler()
scaled_umap_data = scale.fit_transform(umap_data)
embedder = ParametricUMAP()
embedding = embedder.fit_transform(scaled_umap_data)
embedder.encoder.summary()
Conclusion
Parametric UMAP was similar to non-parametric and showed clusters in a truly consistent embedding. The downside is that it offers less tuning possibilities as compared to non-parametric UMAP. Most flow analysis software offers easy access to UMAP analysis but the original package offers wider potentialities that are also accessible.
UMAP is a powerful algorithm. I showed its use here in a simple flow cytometry analysis but it has numerous biological applications from flow to RNA sequencing.