ML Methods
Contrastive Learning
Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.
Prerequisites
Why This Matters
Contrastive learning is the dominant paradigm for self-supervised representation learning. It produces representations competitive with supervised pretraining while using no labels. The key models in this space (CLIP, SimCLR, MoCo) all use contrastive objectives. Understanding the theory behind contrastive learning explains why augmentation choices matter, why large batch sizes help, and what the loss function actually optimizes.
The Contrastive Setup
Given an input $x$, create two views $x_1, x_2$ (a positive pair) by applying random augmentations. Sample $K$ other inputs as negatives. Learn an encoder $f$ such that $f(x_1)$ and $f(x_2)$ are close while $f(x_1)$ and the negative embeddings are far apart.
Positive Pair
Two views form a positive pair if and only if they are derived from the same underlying data point (e.g., two augmentations of the same image). The contrastive objective trains the encoder to map positive pairs to nearby points in representation space.
InfoNCE Loss
For anchor $z_i$, positive $z_j$, and negatives $z_k$ with $k \neq i, j$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\exp(\mathrm{sim}(z_i, z_j)/\tau) + \sum_{k \neq i,j} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

Here $\mathrm{sim}(u, v) = u^\top v / (\|u\|\,\|v\|)$ is cosine similarity and $\tau$ is a temperature parameter. This is a softmax cross-entropy over $K + 1$ options where the correct answer is the positive pair.
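A minimal NumPy sketch of this loss for a single anchor (the function name, shapes, and the toy data are illustrative, not from any paper's code):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor.

    anchor:    (d,) embedding of the anchor view
    positive:  (d,) embedding of the positive view
    negatives: (K, d) embeddings of K negatives
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Similarities of the anchor to the positive and to each negative.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    # Softmax cross-entropy with the positive (index 0) as the correct class.
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=8)
# A near-duplicate positive and 16 random negatives give a small loss.
loss = info_nce(z, z + 0.01 * rng.normal(size=8), rng.normal(size=(16, 8)))
```

Because the positive is nearly identical to the anchor while the negatives are random, the softmax puts most of its mass on index 0 and the loss is close to zero.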
Core Theory
InfoNCE as Mutual Information Lower Bound
Statement
The InfoNCE loss with $K$ negative samples satisfies:

$$I(x_1; x_2) \geq \log(K + 1) - \mathcal{L}_{\text{InfoNCE}}$$

where $I(x_1; x_2)$ is the mutual information between the anchor and positive views. Minimizing InfoNCE maximizes a lower bound on mutual information.
Intuition
InfoNCE is an $N$-way classification problem: identify the positive among $N = K + 1$ candidates. The quantity $\log N - \mathcal{L}_{\text{InfoNCE}}$ is a lower bound on $I(x_1; x_2)$, and perfect classification (loss near $0$) certifies at least $\log N$ nats of shared information. The representation itself can encode strictly more MI than $\log N$; the bound just cannot witness it without more negatives.
Proof Sketch
Write the InfoNCE objective as a density-ratio estimation problem. The optimal critic satisfies $f^*(x_1, x_2) \propto p(x_2 \mid x_1)/p(x_2)$. Apply the variational bound on mutual information from Barber and Agakov (2003). The $\log N$ term appears because the bound saturates: you cannot extract more than $\log N$ nats from an $N$-way classification.
Why It Matters
This explains two empirical observations. First, larger batch sizes (more negatives) give a tighter MI lower bound: $\log N$ caps how much MI the InfoNCE estimator can certify, not how much MI the encoder can actually capture. So a small $N$ both loosens the bound and weakens the gradient signal at high MI. Second, the quality of augmentations matters because they determine what information is shared between positive pairs and thus what $I(x_1; x_2)$ contains.
Failure Mode
The bound becomes loose when the true MI is much larger than $\log N$. With $N$ candidates, the InfoNCE estimator can only certify up to $\log N$ nats, even though the encoder may carry far more shared information. If the task requires recovering more MI than the bound can witness, increasing $N$ tightens the bound and sharpens the gradient. Also, the bound says nothing about which information is captured; bad augmentations can make the model learn shortcuts.
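The $\log N$ ceiling can be checked numerically. The sketch below (illustrative code, not from any paper) scores maximally dependent pairs, where the positive view equals the anchor so the true MI is effectively unbounded, with a perfect scaled-dot-product critic, and computes the InfoNCE MI estimate $\log N - \mathcal{L}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def infonce_mi_estimate(n_candidates, dim=16, n_batches=50):
    """Average InfoNCE MI estimate log(N) - loss, using a perfect critic
    on pairs whose positive view equals the anchor (unbounded true MI)."""
    vals = []
    for _ in range(n_batches):
        x = rng.normal(size=(n_candidates, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        logits = x @ x.T / 0.1              # candidate j scored against anchor i
        logits -= logits.max(axis=1, keepdims=True)
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # The positive for anchor i is candidate i, i.e. the diagonal.
        vals.append(np.log(n_candidates) + np.diag(log_p).mean())
    return float(np.mean(vals))

# True MI is unbounded, yet the estimate never exceeds log N;
# it grows as more candidates are added.
estimates = {n: infonce_mi_estimate(n) for n in (8, 64, 512)}
```

Even with a perfect critic and infinitely dependent views, the certified MI stays strictly below $\log N$ for every candidate count, and grows only by adding candidates.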
Alignment and Uniformity Decomposition
Statement
The contrastive loss decomposes into two competing objectives:

Alignment: $\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, y) \sim p_{\text{pos}}}\left[\|f(x) - f(y)\|_2^2\right]$

Uniformity: $\mathcal{L}_{\text{uniform}} = \log \mathbb{E}_{x, y \sim p_{\text{data}}}\left[e^{-t\,\|f(x) - f(y)\|_2^2}\right]$

Good contrastive representations minimize both losses: low alignment loss means positive pairs are close, and low uniformity loss means representations spread evenly on the unit hypersphere.
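A NumPy sketch of the two terms, following Wang & Isola's formulation with the Gaussian-potential scale $t = 2$ (variable names and the toy data are illustrative):

```python
import numpy as np

def alignment_loss(z1, z2):
    """Mean squared distance between positive pairs (matching rows of z1, z2)."""
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    """Log of the average Gaussian potential over all distinct pairs;
    minimized when points spread uniformly on the hypersphere."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(z), dtype=bool)   # exclude self-pairs
    return np.log(np.mean(np.exp(-t * sq_dists[mask])))

rng = np.random.default_rng(0)
spread = rng.normal(size=(128, 8))
spread /= np.linalg.norm(spread, axis=1, keepdims=True)  # on the unit sphere
collapsed = np.tile(spread[0], (128, 1))                 # everything at one point
```

Collapsed representations achieve perfect alignment but the worst possible uniformity (all pairwise distances are zero, so the loss is $\log 1 = 0$), while spread points drive the uniformity loss negative.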
Intuition
Alignment alone leads to collapse (map everything to one point). Uniformity alone leads to random representations. The contrastive loss balances both: pull positives together, spread everything else uniformly.
Proof Sketch
Decompose the InfoNCE gradient into a term pulling positive pairs together (alignment) and a term pushing random pairs apart (uniformity). On the unit sphere, the uniform distribution maximizes entropy, and the repulsive term in InfoNCE pushes toward this maximum entropy distribution.
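The gradient split described above can be verified numerically. With a dot-product critic, the InfoNCE gradient with respect to the anchor is $-c_0/\tau + \sum_k p_k c_k/\tau$, an attractive pull toward the positive plus a repulsive push from the softmax-weighted candidates. The sketch below (illustrative names, random toy data) checks this against finite differences:

```python
import numpy as np

def infonce_grad_terms(anchor, candidates, tau=0.1):
    """Gradient of InfoNCE w.r.t. the anchor (dot-product critic), split into
    an attractive term (toward the positive, candidates[0]) and a repulsive
    term (the softmax-weighted mean of all candidates)."""
    logits = candidates @ anchor / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    attract = -candidates[0] / tau       # alignment: pull toward the positive
    repel = (p @ candidates) / tau       # uniformity: push apart on average
    return attract, repel

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
cands = rng.normal(size=(5, 8))          # candidate 0 is the positive
attract, repel = infonce_grad_terms(anchor, cands)
grad = attract + repel

def loss(a):
    l = cands @ a / 0.1
    l = l - l.max()
    return -(l[0] - np.log(np.exp(l).sum()))

# Central finite differences of the full loss, coordinate by coordinate.
eps = 1e-5
fd = np.array([(loss(anchor + eps * e) - loss(anchor - eps * e)) / (2 * eps)
               for e in np.eye(8)])
```

The two analytic terms sum to the numerical gradient, confirming that every InfoNCE step is literally "pull the positive, repel the weighted field of candidates".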
Why It Matters
This decomposition explains why contrastive methods avoid representation collapse without explicit mechanisms like stop-gradients (used in BYOL/SimSiam). The negative samples provide the uniformity pressure that prevents collapse.
Failure Mode
If the number of negatives is too small, uniformity pressure is weak and representations can partially collapse (cluster into a few modes instead of spreading uniformly). Temperature also mediates this tradeoff: a very small $\tau$ overweights hard negatives at the expense of uniformity.
Key Architectures
SimCLR
SimCLR (Chen et al., 2020) applies two random augmentations to each image in a batch of size $N$, producing $2N$ views. For each anchor, every other augmented view in the batch ($2N - 2$ of them) serves as a negative. The loss uses cosine similarity with temperature $\tau$ and a 2-layer MLP projection head on top of the encoder.
Key finding: the projection head is critical. Representations taken before the projection head transfer better than those after it: the head discards augmentation-sensitive information (such as color) that is still useful for downstream tasks.
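SimCLR's batch loss (often called NT-Xent) can be sketched in a few lines of NumPy. This is a simplified illustration under the convention that rows $i$ and $i + N$ are the two views of the same image; it omits the encoder and projection head:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent (SimCLR-style) loss for 2N L2-normalized views, where rows i
    and i+N come from the same image."""
    n2 = len(z)
    n = n2 // 2
    sim = z @ z.T / tau                  # cosine similarities (rows unit-norm)
    np.fill_diagonal(sim, -np.inf)       # an anchor never scores against itself
    pos = np.concatenate([np.arange(n, n2), np.arange(n)])   # i <-> i+N
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_p[np.arange(n2), pos].mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
# Two "augmented" views per image: the original plus a slightly perturbed copy.
views = np.concatenate([x, x + 0.01 * rng.normal(size=x.shape)])
views /= np.linalg.norm(views, axis=1, keepdims=True)
loss_matched = nt_xent(views)
```

Setting the anchor's own similarity to $-\infty$ removes it from the softmax denominator, so each of the $2N$ anchors competes its positive against exactly $2N - 2$ negatives.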
MoCo
MoCo (He et al., 2020) decouples the number of negatives from the batch size using a momentum-updated encoder and a queue of past representations. The query encoder is updated by gradient descent; the key encoder tracks it via an exponential moving average: $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$ with $m = 0.999$. This allows a large, consistent negative set (65,536 negatives in the paper) with normal batch sizes.
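The two mechanisms, the EMA update and the FIFO queue, can be sketched in isolation. The class below is a toy illustration (the "encoders" are plain weight matrices, and all names are invented for this example):

```python
import numpy as np

class MomentumQueue:
    """Sketch of MoCo's bookkeeping: an EMA key encoder tracking a trained
    query encoder, plus a FIFO queue of past keys used as negatives."""

    def __init__(self, dim=8, queue_size=1024, m=0.999, seed=0):
        rng = np.random.default_rng(seed)
        self.w_q = rng.normal(size=(dim, dim))   # query encoder (gradient-trained)
        self.w_k = self.w_q.copy()               # key encoder (EMA copy)
        self.m = m
        self.queue = rng.normal(size=(queue_size, dim))  # stored negative keys

    def momentum_update(self):
        # theta_k <- m * theta_k + (1 - m) * theta_q
        self.w_k = self.m * self.w_k + (1 - self.m) * self.w_q

    def enqueue(self, keys):
        # Newest keys replace the oldest entries; queue size stays fixed.
        self.queue = np.concatenate([keys, self.queue[:-len(keys)]])
```

Because $m$ is close to 1, the key encoder drifts slowly, so keys computed many steps apart remain mutually consistent, which is what makes a large queue of stale negatives usable.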
CLIP: Image-Text Contrastive
CLIP (Radford et al., 2021) applies contrastive learning across modalities. Given a batch of $N$ (image, text) pairs, pull the $N$ matching pairs together and push the $N^2 - N$ non-matching pairs apart. The loss is symmetric InfoNCE applied in both the image-to-text and text-to-image directions. CLIP learns representations that enable zero-shot transfer via natural language prompts.
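A sketch of the symmetric objective over the $N \times N$ similarity matrix (an illustration of the idea, not CLIP's actual code; the temperature value is assumed):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over the N x N image-text similarity matrix;
    image i matches text i (the diagonal)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau           # rows: images, columns: texts

    def xent(l):
        # Softmax cross-entropy with the diagonal as the correct class.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_p).mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Transposing the logits swaps which modality plays the role of anchor, so one similarity matrix yields both classification directions.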
Common Confusions
log N caps the InfoNCE bound, not the representation
The ceiling lives in the bound $I(x_1; x_2) \geq \log N - \mathcal{L}$, not in the representation itself. The encoder can encode more MI than $\log N$; InfoNCE just cannot certify it with $N$ candidates. In practice, beyond a certain $N$ the bound is tight enough that further gains diminish, while compute cost grows quadratically (all pairwise similarities). For vision tasks, diminishing returns typically set in at batch sizes in the low thousands.
Contrastive learning does not maximize all mutual information
It maximizes MI between views as filtered by the augmentation distribution. If augmentations destroy color information, the learned representation will not encode color. The augmentation policy implicitly defines what information is task-relevant.
The projection head is not the final representation
SimCLR trains with a projection head, but you discard it at evaluation time. The representation before the projection head generalizes better because the head absorbs augmentation-specific information that hurts transfer.
Summary
- Contrastive learning optimizes a lower bound on mutual information between views; the bound itself saturates at $\log N$ for $N$ candidates, but the encoder can carry more MI than the bound certifies
- The loss balances alignment (positive pairs close) and uniformity (representations spread on the hypersphere)
- Augmentation choice determines what information is preserved in the learned representation
- Large negative sets improve the bound; MoCo achieves this with a momentum encoder and queue
- CLIP extends the paradigm to cross-modal (image-text) contrastive learning
Exercises
Problem
You train SimCLR with batch size $N$, producing $2N$ augmented views. How many negative pairs does each anchor have? What is the upper bound, in terms of $N$, on the mutual information recoverable from the InfoNCE loss?
Problem
Explain why the projection head in SimCLR improves downstream performance even though it is discarded. Specifically: if color jitter is used as an augmentation, what information does the projection head learn to discard, and why is that harmful for downstream tasks?
Related Comparisons
References
Canonical:
- Oord et al., "Representation Learning with Contrastive Predictive Coding" (2018), Section 2
- Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations" (SimCLR, 2020)
Current:
- He et al., "Momentum Contrast for Unsupervised Visual Representation Learning" (MoCo, 2020)
- Wang & Isola, "Understanding Contrastive Representation Learning through Alignment and Uniformity" (ICML 2020)
- Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (CLIP, 2021)
Next Topics
- Data augmentation theory: why augmentation choices determine representation quality
- Diffusion models: an alternative generative paradigm that displaced contrastive pretraining in some domains
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
Derived topics
- CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining (layer 4 · tier 1)
- Diffusion Models (layer 4 · tier 1)
- Data Augmentation Theory (layer 2 · tier 2)
- Representation Learning in Cosmology (layer 4 · tier 3)