UMAP for flow cytometry data

UMAP can be used for flow analysis with several aims as described in details in the doc from UMAP, e.g. to plot high-dimensional data or to reduce dimension before clustering.
Here, we will explore how UMAP can help in the visualisation of cardiac immune cell response during chronic kidney disease by looking at changes in cardiac leukocyte subtypes.

Getting the data ready for it

From FlowJo to gated data

A real cool tool available for python is Flowkit. You can use this tool to extract data that have been gated with GatingML 2.0 or FlowJo 10. I willl focus here on a FlowJo workspace as this is were I did my gating.

To work with FlowJo workspace you need to first call a Session to get your FCS and import your workspace in it. Then you need to apply the gating to the sample group (the group from your FlowJo workspace structure).

fks_fj = fk.Session('../data/raw_data')
fks_fj.import_flowjo_workspace('../data/raw_data/FJ_workspace.wsp')
analysis = fks_fj.analyze_samples('group')

To get your labelled gated sampled all you need to do is:

1.Get a dataframe of the gating raw results where in the column events you will have an array of booleans for each gates

gates = pd.DataFrame(fks_fj.get_gating_results('group', 'sample')._raw_results).loc['events'] #where sample is your sample's name ◔‿◔
gates = pd.DataFrame(gates.droplevel(1)).transpose()

2.Explode the arrays into single rows

gated = pd.DataFrame()
for col in gates.columns:
    gated[col] = gates[col].explode(ignore_index=True)

3.Merge it back with the fluorescence values of each single cell

Here you can also select only a subpopulation if you know what is your population of interest (for example not the singlets)

features  = gated.merge(fkj.get_sample('sample').as_dataframe('raw'), left_index=True, right_index=True)

Tip: fks_fj.get_gate_hierarchy is a useful call to check your gating strategy if you have forgotten it

Now that you have extracted your data, there is just one last crucial step.

Preprocessing: arcsinh

Data from high intensity values are spread on a log scale but the problem with log-scale is that it does not do well on the 'low-end' of the intensity scale, especially since some fluorescence values are zero or sometimes negative (following baseline correction). To solve this problem, bi-exponential transformation have been applied to flow data to approach log for high values and linearity for data around zero.
The ArcSinh transform is close to the classic bi-exponential, with a linear representation near zero and a log-like scale beyond a certain threshold, but it is equation-based. The threshold or co-factor determines the compression of the low-end values, the higher the more data are compressed close to zero. Here, data were transformed using arcsinh with a co-factor of 150 as suggested by Bendall.

Arcsinh transformation adapted from [Bendall et al.](https://science.sciencemag.org/content/332/6030/687)

umap_data = np.arcsinh(features.values / 150.0).astype(np.float32, order='C')

Note: It is good practice to validate the cofactor value and the transformation by visualising each channel to avoid under/over compression.

Our data

The data use in this case study come from a flow analysis of immune cells in healthy heart vs chronic kidney disease heart. The panel uses 10 antibodies to identify 7 populations of immune cells.

Adapted from Epelman et al. 2014

Non-parametric UMAP

Let's explore what UMAP do. The library provides a really detailed documentation and also nice functions to explore UMAP.

UMAP is used in the style of most sklearn models: instantiate and fit or fit/transform.

The library also offers utility functions:

plot.points: plot a scatter plot.
plot.connectivity: to plot the connection between each points.
plot.diagnostic: to diagnose the embedding.

Reminder of parameters:n_neighbors: number of neighbours for the high dimensional graph. min_dist: minimum distance between points in low dimension,i.e. how tight the points are clustered.

import umap.plot
import umap

#common usage
mapper = umap.UMAP()
embedding = mapper.fit(umap_data)

Neighbours and minimum distance

Let's test what happen when we modify the hyperparameters:

def draw_umap(n_neighbors=15, min_dist=0.1, title=''):
   
    mapper = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=42
    ).fit(umap_data)
    
    umap.plot.points(mapper, labels=labels, theme='blue')
    plt.title(title, fontsize=15)
    plt.show()

On the following plots, the colours represent gated cells. You can truly see that the identified populations are nicely separated by UMAP.

Conclusion

A low number of neighbours conserves local structure while a high value focuses more on global structure and loses fine details.
Larger distance results in less packed clusters and loses focus on global structure.

The point is to find an ideal balance for your problematic.

Diagnostic and connectivity

umap.plot.connectivity(mapper, show_points=True, width=400, height=400)
plt.show()

umap.plot.diagnostic(mapper, diagnostic_type='pca', width=400, height=400)
plt.show()

Parametric UMAP

Parametric UMAP is really powerful as it relies on a neural network to tune the hyperparameters. You can tune the network if needed but I found that it performed really well out of the box.

Loss is binary cross-entropy so we need to scale the data in range 0-1 with min max scaler:

from sklearn.preprocessing import MinMaxScaler

scale = MinMaxScaler()
scaled_umap_data = scale.fit_transform(umap_data)

embedder = ParametricUMAP()
embedding = embedder.fit_transform(scaled_umap_data)

embedder.encoder.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_11 (Flatten)         (None, 10)                0         
_________________________________________________________________
dense_33 (Dense)             (None, 100)               1100      
_________________________________________________________________
dense_34 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_35 (Dense)             (None, 100)               10100     
_________________________________________________________________
z (Dense)                    (None, 2)                 202       
=================================================================
Total params: 21,502
Trainable params: 21,502
Non-trainable params: 0
_________________________________________________________________

Conclusion

Parametric UMAP was similar to non-parametric and showed clusters in a truly consistent embedding. The downside is that it offers less tuning possibilities as compared to non-parametric UMAP. Most flow analysis software offers easy access to UMAP analysis but the original package offers wider potentialities that are also accessible.

UMAP is a powerful algorithm. I showed its use here in a simple flow cytometry analysis but it has numerous biological applications from flow to RNA sequencing.