Applied ML
Clustering for Gene Expression
Graph-based clustering on KNN graphs (Louvain, Leiden) is the default for single-cell expression. UMAP is for visualization, not clustering. Batch correction belongs upstream of the graph.
Why This Matters
The first analytical question on a single-cell or bulk gene-expression dataset is: which samples group together? The answer drives cell-type discovery, marker-gene selection, trajectory inference, and almost every downstream comparison. Yet the clustering itself is shockingly sensitive to upstream choices: normalization, feature selection, batch correction, and the resolution parameter. Two analysts running "the same pipeline" often disagree on cluster count by a factor of two.
The dominant pattern is graph-based clustering on a K-nearest-neighbor graph in a low-dimensional embedding (PCA components, scVI latents). Modularity-maximizing algorithms (Louvain, Leiden) partition the graph; UMAP visualizes it. This pipeline scales to millions of cells and respects the manifold structure of cell-state space better than centroid-based methods like k-means clustering.
The traps are mostly methodological. Resolution is a free parameter; batch effects survive PCA and contaminate clusters; UMAP coordinates are routinely treated as a faithful embedding when they are not. Reading a clustering plot well is half the discipline.
Core Ideas
KNN graph plus modularity. Compute pairwise distances in latent space, build the directed KNN graph (typically $k$ between 10 and 30), and symmetrize it. Define modularity for a partition $c$ as $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, where $A$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the edge count, and $\gamma$ the resolution parameter. Louvain (Blondel et al. 2008) greedily maximizes $Q$ by moving nodes between communities. Leiden (Traag, Waltman, van Eck 2019, Sci. Rep. 9) fixes a known Louvain failure where communities can become disconnected during refinement; Leiden guarantees well-connected partitions and converges to higher modularity.
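The two building blocks can be sketched numerically with NumPy. The toy blob data, $k=5$, and the blob separation are illustrative choices, not pipeline defaults; a real analysis would build the graph on PCA or scVI latents:

```python
import numpy as np

def knn_graph(X, k):
    """Directed KNN adjacency from pairwise Euclidean distances, then symmetrized."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-edges
    A = np.zeros_like(d)
    idx = np.argsort(d, axis=1)[:, :k]    # k nearest per node (directed)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)             # symmetrize: union of directed edges

def modularity(A, labels, gamma=1.0):
    """Q = (1/2m) * sum_ij [A_ij - gamma * k_i k_j / (2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = labels[:, None] == labels[None, :]
    return ((A - gamma * np.outer(k, k) / two_m) * delta).sum() / two_m

# two well-separated blobs in a 5-dimensional "latent space"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
A = knn_graph(X, k=5)
labels = np.array([0] * 20 + [1] * 20)
print(modularity(A, labels))  # close to 0.5, the maximum for two equal blocks
```

Louvain and Leiden search over partitions to maximize this same $Q$; here the true split is handed in directly to show it scores near the theoretical maximum.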
Resolution parameter governs cluster count. Higher $\gamma$ yields more, smaller clusters. There is no principled choice; common practice is to scan a range of $\gamma$ values, examine cluster stability across runs, and pick a level matching the biological grain of interest. Tools like clustree visualize how cells split as $\gamma$ increases.
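The effect of $\gamma$ is visible even without running Leiden: on a toy barbell graph (two triangles joined by one edge), which partition maximizes $Q$ flips as $\gamma$ crosses a threshold. A minimal sketch:

```python
import numpy as np

def modularity(A, labels, gamma):
    """Q = (1/2m) * sum_ij [A_ij - gamma * k_i k_j / (2m)] * delta(c_i, c_j)."""
    k = A.sum(axis=1)
    two_m = A.sum()
    delta = labels[:, None] == labels[None, :]
    return ((A - gamma * np.outer(k, k) / two_m) * delta).sum() / two_m

# barbell: triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

coarse = np.zeros(6, dtype=int)          # everything in one community
fine = np.array([0, 0, 0, 1, 1, 1])      # one community per triangle

for gamma in (0.2, 1.0):
    qc, qf = modularity(A, coarse, gamma), modularity(A, fine, gamma)
    print(f"gamma={gamma}: coarse Q={qc:.3f}, fine Q={qf:.3f}")
# at gamma=0.2 the single coarse community scores higher;
# at gamma=1.0 the two-triangle split wins
```

This is exactly why a resolution scan is necessary: the objective itself changes with $\gamma$, so there is no single "correct" partition to converge to.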
UMAP for visualization vs. embedding. UMAP (McInnes, Healy, Melville 2018, arXiv:1802.03426) optimizes a cross-entropy objective between fuzzy simplicial sets in high and low dimensions. It produces compelling 2D plots but distorts global distances and density. Cluster on the high-dimensional latent (PCA, scVI), not on UMAP coordinates. Boundaries that look crisp in 2D often correspond to gradients in the underlying space; cluster splits that look small can hide biologically distinct subpopulations. See t-SNE and UMAP for the dimensionality-reduction side.
Batch correction before the graph. If samples come from different patients, technologies, or processing batches, naive PCA mixes biology with batch. Harmony (Korsunsky et al. 2019, Nature Methods 16) iterates between soft clustering and per-batch linear correction in PCA space, removing batch-specific cluster centroids while preserving cell-type structure. Run Harmony on the PCA embedding, build the KNN graph in the corrected space, then cluster. Other options: scVI/scANVI (latent-space integration), BBKNN (batch-balanced KNN), MNN (mutual nearest neighbors). The Luecken et al. 2022 benchmark in Nature Methods compares them on cell-type preservation vs. batch removal.
Comparison to k-means and hierarchical. K-means assumes spherical clusters of similar variance, which fails for cell types of vastly different abundance and shape in latent space. Hierarchical clustering scales as $O(n^2)$ in memory for the distance matrix and is impractical for large $n$. Graph-based methods scale well and impose no centroid assumption. K-means and hierarchical remain useful for small datasets, gene clustering (where the number of genes is typically a few thousand after feature selection), and bulk RNA-seq sample clustering.
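The $O(n^2)$ memory wall is easy to make concrete: a dense float32 pairwise-distance matrix at atlas scale does not fit on a single machine.

```python
# memory needed for a dense float32 pairwise-distance matrix (n * n * 4 bytes)
for n in (10_000, 100_000, 1_000_000):
    gb = n * n * 4 / 1e9
    print(f"n = {n:>9,}: {gb:,.1f} GB")
# 10k cells: 0.4 GB; 100k cells: 40 GB; 1M cells: 4,000 GB
```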
Common Confusions
Cluster count is not data-determined
There is no algorithm that returns the correct number of cell types from expression data alone. Resolution, the KNN parameter $k$, and integration choices all change cluster count. Treat clusters as an analytical lens, not ground truth. Validate with known marker genes and biological consistency across replicates.
Running Harmony after clustering does not undo batch effects
Batch correction must happen before the KNN graph is built. If clusters are constructed in uncorrected space and then re-labeled with Harmony embeddings, the batch-driven structure persists in the cluster membership. The order is: normalize, select features, embed, integrate, build graph, cluster.
References
Canonical:
- Blondel, Guillaume, Lambiotte, Lefebvre. "Fast unfolding of communities in large networks (Louvain)." J. Stat. Mech. 2008, P10008. https://iopscience.iop.org/article/10.1088/1742-5468/2008/10/P10008
- Traag, Waltman, van Eck. "From Louvain to Leiden: guaranteeing well-connected communities." Scientific Reports 9, 2019. https://www.nature.com/articles/s41598-019-41695-z
Current:
- McInnes, Healy, Melville. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426, 2018. https://arxiv.org/abs/1802.03426
- Korsunsky et al. "Fast, sensitive and accurate integration of single-cell data with Harmony." Nature Methods 16, 2019. https://www.nature.com/articles/s41592-019-0619-0
- Luecken et al. "Benchmarking atlas-level data integration in single-cell genomics." Nature Methods 19, 2022. https://www.nature.com/articles/s41592-021-01336-8
- Luecken, Theis. "Current best practices in single-cell RNA-seq analysis: a tutorial." Mol. Syst. Biol. 15, 2019. https://www.embopress.org/doi/full/10.15252/msb.20188746
Related Topics
Required prerequisites:
- K-Means Clustering (layer 1 · tier 1)
- Spectral Clustering (layer 2 · tier 2)
Derived topics:
- Graph Neural Networks (layer 3 · tier 2)

Last reviewed: April 18, 2026