Applied ML
Clustering for Gene Expression
Graph-based clustering on KNN graphs (Louvain, Leiden) is the default for single-cell expression. UMAP is for visualization, not clustering. Batch correction belongs upstream of the graph.
Why This Matters
The first analytical question on a single-cell or bulk gene-expression dataset is: which samples group together? The answer drives cell-type discovery, marker-gene selection, trajectory inference, and almost every downstream comparison. Yet the clustering itself is shockingly sensitive to upstream choices: normalization, feature selection, batch correction, and the resolution parameter. Two analysts running "the same pipeline" often disagree on cluster count by a factor of two.
The dominant pattern is graph-based clustering on a K-nearest-neighbor graph in a low-dimensional embedding (PCA components, scVI latents). Modularity-maximizing algorithms (Louvain, Leiden) partition the graph; UMAP visualizes it. This pipeline scales to millions of cells and respects the manifold structure of cell-state space better than centroid-based methods like k-means clustering.
The traps are mostly methodological. Resolution is a free parameter; batch effects survive PCA and contaminate clusters; UMAP coordinates are routinely treated as a faithful embedding when they are not. Reading a clustering plot well is half the discipline.
Core Ideas
KNN graph plus modularity. Compute pairwise distances in latent space, build the directed KNN graph (typically $k$ between 10 and 30), and symmetrize it. Define modularity for a partition $c$ as $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, where $A$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the edge count, and $\gamma$ the resolution parameter. Louvain (Blondel et al. 2008) greedily maximizes $Q$ by moving nodes between communities. Leiden (Traag, Waltman, van Eck 2019, Sci. Rep. 9) fixes a known Louvain failure where communities can become disconnected during refinement; Leiden guarantees well-connected partitions and converges to higher modularity.
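The two building blocks can be sketched numerically with NumPy. The toy blob data, $k=5$, and the blob separation are illustrative choices, not pipeline defaults; a real analysis would build the graph on PCA or scVI latents:

```python
import numpy as np

def knn_graph(X, k):
    """Directed KNN adjacency from pairwise Euclidean distances, then symmetrized."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-edges
    A = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]    # k nearest per node (directed)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)             # symmetrize: union of directed edges

def modularity(A, labels, gamma=1.0):
    """Q = (1/2m) * sum_ij [A_ij - gamma * k_i k_j / (2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = labels[:, None] == labels[None, :]
    return ((A - gamma * np.outer(k, k) / two_m) * delta).sum() / two_m

# two well-separated blobs in a 5-dimensional "latent space"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
A = knn_graph(X, k=5)
labels = np.array([0] * 20 + [1] * 20)
print(modularity(A, labels))  # close to 0.5, the maximum for two equal blocks
```

Louvain and Leiden search over partitions to maximize this same $Q$; here the true split is handed in directly to show it scores near the theoretical maximum.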
Resolution parameter governs cluster count. Higher $\gamma$ yields more, smaller clusters. There is no principled choice; common practice is to scan a range of $\gamma$ values, examine cluster stability across runs, and pick a level matching the biological grain of interest. Tools like clustree visualize how cells split as $\gamma$ increases.
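The effect of $\gamma$ is visible even without running Leiden: on a toy barbell graph (two triangles joined by one edge), which partition maximizes $Q$ flips as $\gamma$ crosses a threshold. A minimal sketch:

```python
import numpy as np

def modularity(A, labels, gamma):
    """Q = (1/2m) * sum_ij [A_ij - gamma * k_i k_j / (2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = labels[:, None] == labels[None, :]
    return ((A - gamma * np.outer(k, k) / two_m) * delta).sum() / two_m

# barbell: triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

coarse = np.zeros(6, dtype=int)          # everything in one community
fine = np.array([0, 0, 0, 1, 1, 1])      # one community per triangle

for gamma in (0.2, 1.0):
    qc, qf = modularity(A, coarse, gamma), modularity(A, fine, gamma)
    print(f"gamma={gamma}: coarse Q={qc:.3f}, fine Q={qf:.3f}")
# at gamma=0.2 the single coarse community scores higher;
# at gamma=1.0 the two-triangle split wins
```

This is exactly why a resolution scan is necessary: the objective itself changes with $\gamma$, so there is no single "correct" partition to converge to.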
UMAP for visualization vs. embedding. UMAP (McInnes, Healy, Melville 2018, arXiv:1802.03426) optimizes a cross-entropy objective between fuzzy simplicial sets in high and low dimensions. It produces compelling 2D plots but distorts global distances and density. Cluster on the high-dimensional latent (PCA, scVI), not on UMAP coordinates. Boundaries that look crisp in 2D often correspond to gradients in the underlying space; cluster splits that look small can hide biologically distinct subpopulations. See t-SNE and UMAP for the dimensionality-reduction side.
Batch correction before the graph. If samples come from different patients, technologies, or processing batches, naive PCA mixes biology with batch. Harmony (Korsunsky et al. 2019, Nature Methods 16) iterates between soft clustering and per-batch linear correction in PCA space, removing batch-specific cluster centroids while preserving cell-type structure. Run Harmony on the PCA embedding, build the KNN graph in the corrected space, then cluster. Other options: scVI/scANVI (latent-space integration), BBKNN (batch-balanced KNN), MNN (mutual nearest neighbors). The Luecken et al. 2022 benchmark in Nature Methods compares them on cell-type preservation vs. batch removal.
Comparison to k-means and hierarchical. K-means assumes spherical clusters of similar variance, which fails for cell types of vastly different abundance and shape in latent space. Hierarchical clustering scales as $O(n^2)$ in memory for the distance matrix and is impractical for large $n$. Graph-based methods scale well and impose no centroid assumption. K-means and hierarchical remain useful for small datasets, gene clustering (where the number of genes is typically a few thousand after feature selection), and bulk RNA-seq sample clustering.
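The $O(n^2)$ memory wall is easy to make concrete: a dense float32 pairwise-distance matrix at atlas scale does not fit on a single machine.

```python
# memory needed for a dense float32 pairwise-distance matrix (n * n * 4 bytes)
for n in (10_000, 100_000, 1_000_000):
    gb = n * n * 4 / 1e9
    print(f"n = {n:>9,}: {gb:,.1f} GB")
# 10k cells: 0.4 GB; 100k cells: 40 GB; 1M cells: 4,000 GB
```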
Common Confusions
Cluster count is not data-determined
There is no algorithm that returns the correct number of cell types from expression data alone. Resolution, the KNN parameter $k$, and integration choices all change cluster count. Treat clusters as an analytical lens, not ground truth. Validate with known marker genes and biological consistency across replicates.
Running Harmony after clustering does not undo batch effects
Batch correction must happen before the KNN graph is built. If clusters are constructed in uncorrected space and then re-labeled with Harmony embeddings, the batch-driven structure persists in the cluster membership. The order is: normalize, select features, embed, integrate, build graph, cluster.
References
Canonical:
- Blondel, Guillaume, Lambiotte, Lefebvre. "Fast unfolding of communities in large networks (Louvain)." J. Stat. Mech. 2008, P10008. https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008
- Traag, Waltman, van Eck. "From Louvain to Leiden: guaranteeing well-connected communities." Scientific Reports 9, 2019. https://www.nature.com/articles/s41598-019-41695-z
Current:
- McInnes, Healy, Melville. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426, 2018. https://arxiv.org/abs/1802.03426
- Korsunsky et al. "Fast, sensitive and accurate integration of single-cell data with Harmony." Nature Methods 16, 2019. https://www.nature.com/articles/s41592-019-0619-0
- Luecken et al. "Benchmarking atlas-level data integration in single-cell genomics." Nature Methods 19, 2022. https://www.nature.com/articles/s41592-021-01336-8
- Luecken, Theis. "Current best practices in single-cell RNA-seq analysis: a tutorial." Mol. Syst. Biol. 15, 2019. https://www.embopress.org/doi/full/10.15252/msb.20188746
Related Topics
Required prerequisites:
- K-Means Clustering (layer 1 · tier 1)
- Spectral Clustering (layer 2 · tier 2)
Derived topics:
- Graph Neural Networks (layer 3 · tier 2)

Last reviewed: April 18, 2026