Beyond LLMs
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA, V-JEPA, V-JEPA 2, and the connection to contrastive learning and world models.
Why This Matters
Large language models learn by predicting the next token. That objective works well for discrete text, but it is a poor fit for high-dimensional continuous signals such as images and video. Predicting every pixel of the next frame forces the model to spend capacity on lighting, texture noise, and other details that may not matter for recognition or planning.
JEPA (Joint Embedding Predictive Architecture) is Yann LeCun's proposal for learning predictive representations of the physical world: predict in the space of abstract representations, not directly in raw sensory space. The bet is that representation-space prediction can keep predictable, task-relevant structure while discarding unpredictable detail.
Mental Model
Imagine two views of the world: a context view and a target view. Instead of reconstructing the target view pixel by pixel from the context, JEPA encodes both views into abstract representations and predicts the target representation from the context representation. The model never needs to generate pixels; it only needs to match the target representation.
This is like summarizing a scene rather than painting it. The model can encode that a ball is falling without also predicting the exact shadow on every frame.
Formal Setup
Joint Embedding Architecture
A joint embedding architecture maps two inputs $x$ and $y$ to representations $s_x = f_\theta(x)$ and $s_y = f_\theta(y)$ (or $s_y = g_\phi(y)$ with different encoders). An energy function $E(x, y)$ measures compatibility: low energy means $x$ and $y$ are related, high energy means they are not.
The key distinction from generative models: there is no decoder that reconstructs $y$ from $s_x$. The loss operates entirely in representation space.
The explicit energy form used throughout this page is $E(x, y, z) = D\big(P_\phi(s_x, z),\, s_y\big)$, where $P_\phi$ is the predictor network, $z$ is an auxiliary latent or conditioning variable specifying what to predict (e.g., the position of a masked region, or an action), and $D$ is a distance in representation space (typically $\ell_2$ or cosine). The predictor lets the context representation $s_x$ route to many possible targets under different choices of $z$, and $D$ measures how far the prediction falls from the actual target representation $s_y$.
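A minimal sketch of this energy, assuming the predictor is available as a callable and using squared $\ell_2$ or cosine distance; the function and variable names are illustrative, not taken from any JEPA codebase.

```python
import torch
import torch.nn.functional as F

def jepa_energy(s_x, z, s_y, predictor, distance="l2"):
    """Energy E(x, y, z) = D(P(s_x, z), s_y), computed in representation space.

    s_x, s_y : context and target representations, shape (batch, dim)
    z        : conditioning info (e.g., target-position embedding or action)
    predictor: callable mapping (s_x, z) -> predicted target representation
    """
    s_y_hat = predictor(s_x, z)
    if distance == "l2":
        return ((s_y_hat - s_y) ** 2).sum(dim=-1)            # squared L2 per example
    return 1.0 - F.cosine_similarity(s_y_hat, s_y, dim=-1)   # cosine distance per example
```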
Joint Embedding Predictive Architecture (JEPA)
JEPA adds a predictor between the context encoder and the target encoder. Given context input $x$ and target input $y$:
- Encode context: $s_x = f_\theta(x)$
- Encode target: $s_y = \bar{f}_{\bar\theta}(y)$ (target encoder, often an EMA of $f_\theta$)
- Predict target representation: $\hat{s}_y = P_\phi(s_x, z)$, where $z$ specifies what to predict (e.g., which spatial region)
- Minimize prediction error in representation space: $\mathcal{L} = D(\hat{s}_y, s_y)$
The predictor operates in abstract space. It never sees or generates pixels.
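To make the four steps concrete, here is a hedged PyTorch-style sketch of one training step. `context_encoder`, `target_encoder`, and `predictor` are assumed to be arbitrary modules with matching output dimensions; this is a simplified illustration of the objective, not the I-JEPA reference implementation.

```python
import torch
import torch.nn.functional as F

def jepa_training_step(context_encoder, target_encoder, predictor,
                       x_context, x_target, z, optimizer):
    """One JEPA update: predict the target representation from the context.

    Gradients flow only through the context encoder and predictor;
    the target encoder is held fixed here (its EMA update appears below).
    """
    s_x = context_encoder(x_context)                 # context representation
    with torch.no_grad():                            # stop-gradient on the target branch
        s_y = target_encoder(x_target)

    s_y_hat = predictor(s_x, z)                      # prediction in representation space
    loss = F.mse_loss(s_y_hat, s_y)                  # no pixels are ever generated

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```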
Collapse
Complete (representation) collapse occurs when the encoders map all inputs to the same representation, trivially achieving zero prediction error. If $f_\theta(x) = c$ for all $x$, then the prediction error is zero regardless of input.
Dimensional (informational) collapse is a weaker, subtler failure mode: the embeddings do not collapse to a single point, but instead span a low-rank subspace of representation space. The covariance of the learned embeddings has many vanishing eigenvalues, so whole directions carry no information even though the embeddings are not literally constant. Jing, Vincent, LeCun, and Tian (2022, arXiv:2110.09348) document this in contrastive self-supervised learning and link it to the projector architecture and the interaction of the contrastive loss with data augmentation. Non-contrastive methods like BYOL, VICReg, and Barlow Twins are also susceptible to dimensional collapse, and VICReg's variance and covariance regularizers are designed specifically to counter it.
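A quick way to check for dimensional collapse in practice is to inspect the eigenvalue spectrum of the embedding covariance, as in Jing et al. (2022). The sketch below is a generic diagnostic under the assumption that `embeddings` is a batch of representation vectors; the entropy-based effective-rank measure is an illustrative choice, not part of the paper.

```python
import torch

def embedding_spectrum(embeddings, eps=1e-12):
    """Return sorted covariance eigenvalues and an effective-rank estimate.

    embeddings: (num_samples, dim) tensor of representations.
    Complete collapse    -> all eigenvalues near zero.
    Dimensional collapse -> many (but not all) eigenvalues near zero.
    """
    z = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov).flip(0)          # descending order
    p = eigvals.clamp(min=eps) / eigvals.clamp(min=eps).sum()
    effective_rank = torch.exp(-(p * p.log()).sum())      # entropy-based effective rank
    return eigvals, effective_rank.item()
```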
Preventing both forms of collapse is the central technical challenge of joint embedding methods.
Main Theorems
JEPA as Energy-Based Learning
Statement
The JEPA training objective can be written as an energy-based model:
$$E(x, y, z) = D\big(P_\phi(f_\theta(x), z),\ \mathrm{sg}[\bar{f}_{\bar\theta}(y)]\big)$$
where $D$ is a distance metric (e.g., $\ell_2$ or cosine distance), $\mathrm{sg}[\cdot]$ denotes stop-gradient (no gradients flow through the target encoder), and the target parameters $\bar\theta$ are updated via exponential moving average: $\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$, where $\tau$ is the retention coefficient. Values of $\tau$ close to $1$ mean the target encoder changes slowly (matching BYOL's $\tau = 0.996$); $\tau = 0$ would set $\bar\theta = \theta$ each step, giving no stable anchor. The symbol $\tau$ refers to this EMA retention coefficient throughout the page (distinct from the InfoNCE temperature $\tau$ used in the contrastive section below, where the symbol is reused by convention; context disambiguates).
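The EMA update itself is one line per parameter. A minimal sketch, assuming both encoders share the same architecture so parameters can be paired positionally; the default $\tau$ is borrowed from BYOL and is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta.

    tau close to 1: the target encoder drifts slowly behind the context encoder.
    tau = 0: the target encoder is overwritten every step (no stable anchor).
    """
    for p_target, p_context in zip(target_encoder.parameters(),
                                   context_encoder.parameters()):
        p_target.mul_(tau).add_(p_context, alpha=1.0 - tau)
```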
The energy landscape has three failure modes:
- Complete collapse: energy is uniformly low because all representations are identical
- Dimensional collapse: energy is low on average but embeddings span only a low-rank subspace, so some directions carry no information (Jing et al. 2022)
- Fragmentation: energy is uniformly high (representations carry no useful information about the data)
A well-trained JEPA has low energy for compatible pairs and high energy for incompatible pairs, with the energy surface shaped by the architecture and training dynamics rather than by an explicit contrastive loss.
Intuition
The JEPA objective pulls the predicted representation toward the actual target representation , but only through the predictor and context encoder. The target encoder is a slowly moving anchor. This asymmetry prevents the trivial solution where both encoders collapse to a constant. The EMA update ensures the target representations change slowly, giving the predictor a stable learning signal.
Why It Matters
Unlike contrastive learning, JEPA-style objectives do not require explicit negative pairs. Instead, they rely on target-network asymmetry, masking, predictor structure, and sometimes variance or covariance regularization to avoid collapse. The energy-based framing connects JEPA to LeCun's broader argument that future world models should predict abstract states rather than raw observations.
Collapse Prevention via Asymmetry (Empirical Observation)
This is an architectural principle supported by empirical evidence and partial theoretical analysis, not a theorem. Consider the loss $\mathcal{L} = \big\|P_\phi(f_\theta(x), z) - \mathrm{sg}[\bar{f}_{\bar\theta}(y)]\big\|_2^2$, i.e., the energy above with squared $\ell_2$ distance. The collapsed solution $f_\theta(x) = c$ for all $x$ is a fixed point of the gradient dynamics, but it is empirically unstable when:
- The predictor is sufficiently expressive but not trivially so
- The EMA retention $\tau$ is close to $1$ (slow target updates)
- The data distribution has sufficient diversity
Tian, Chen, and Ganguli (2021, "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs", arXiv:2102.06810) analyze this in a linearized BYOL setting. They show that under a linear predictor the encoder eigenspectrum is pulled toward the top principal components of the data covariance, and the fully collapsed equilibrium is dynamically unstable when the predictor and EMA are configured appropriately. The analysis depends on linearity and specific balance conditions; the nonlinear deep-network case remains an empirical claim rather than a proved theorem.
If the target encoder moves slowly (high $\tau$), it provides a stable target that changes only gradually. The predictor and context encoder must do real work to match these targets. If the context encoder collapses, the predictor cannot distinguish between different targets, so the loss remains high. BYOL and DINO demonstrated that collapse can be avoided without negative pairs. JEPA inherits these techniques and adds structured prediction in representation space as a further source of non-trivial learning signal.
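The linearized setting of Tian et al. (2021) can be explored numerically. Below is a hedged toy experiment, not their exact setup: a linear encoder $W$, a linear predictor $W_p$, an EMA target $W_a$, Gaussian data with anisotropic covariance, and gradient steps on the representation-prediction loss. Tracking the singular values of $W$ shows whether the encoder retains rank or drifts toward collapse for a given learning rate and $\tau$; the hyperparameters here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8                                     # input dim, representation dim
sigma = np.diag(np.linspace(2.0, 0.1, n))        # anisotropic data covariance
W = 0.1 * rng.standard_normal((d, n))            # online (context) encoder
Wp = 0.1 * rng.standard_normal((d, d))           # linear predictor
Wa = W.copy()                                    # EMA target encoder
lr, tau = 0.05, 0.99

for step in range(2000):
    # Loss L = E_x || Wp W x - Wa x ||^2 with x ~ N(0, Sigma); gradients in closed form.
    diff = Wp @ W - Wa                           # (d, n) residual operator
    grad_Wp = 2 * diff @ sigma @ W.T             # dL/dWp
    grad_W = 2 * Wp.T @ diff @ sigma             # dL/dW (no gradient through Wa: stop-grad)
    Wp -= lr * grad_Wp
    W -= lr * grad_W
    Wa = tau * Wa + (1 - tau) * W                # EMA target update
    if step % 500 == 0:
        print(step, np.linalg.svd(W, compute_uv=False).round(3))
```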
I-JEPA: Images
I-JEPA (Image JEPA) applies the JEPA framework to images using Vision Transformers. The key design choices:
- Context: a set of visible patches from an image
- Target: representations of masked patches (the patches the model must predict)
- Masking strategy: large, semantically meaningful blocks are masked (not random individual patches)
- Prediction: the predictor takes context patch representations plus positional information about the target patches, and outputs predictions in representation space
Critical difference from Masked Autoencoders (MAE): MAE reconstructs pixels of masked patches. I-JEPA predicts abstract representations of masked patches. This means I-JEPA does not need a pixel decoder and focuses on high-level structure rather than texture details.
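The difference is easiest to see in the losses. A hedged, highly simplified sketch: both models see the same visible patches, but MAE decodes pixels for the masked patches while an I-JEPA-style model predicts the target encoder's representations of them. `encoder`, `pixel_decoder`, `predictor`, and `target_encoder` are placeholder modules; real implementations add positional mask tokens, normalization, and multi-block masking.

```python
import torch
import torch.nn.functional as F

def mae_style_loss(encoder, pixel_decoder, visible_patches, masked_pixels, mask_info):
    # MAE: reconstruct the raw pixels of the masked patches.
    latent = encoder(visible_patches)
    pred_pixels = pixel_decoder(latent, mask_info)      # must model texture, lighting, noise
    return F.mse_loss(pred_pixels, masked_pixels)

def ijepa_style_loss(encoder, target_encoder, predictor, visible_patches,
                     masked_patches, mask_info):
    # I-JEPA: predict abstract representations of the masked patches.
    latent = encoder(visible_patches)
    with torch.no_grad():
        target_repr = target_encoder(masked_patches)    # EMA encoder, stop-gradient
    pred_repr = predictor(latent, mask_info)            # conditioned on target positions
    return F.mse_loss(pred_repr, target_repr)           # no pixel decoder anywhere
```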
I-JEPA reports strong transfer to downstream tasks such as classification and detection without the hand-crafted augmentations used by many contrastive and self-distillation methods. That is the key experimental claim: good visual features can be learned from masked representation prediction rather than from pixel reconstruction or augmentation invariance alone.
V-JEPA: Video
V-JEPA extends the framework to video by masking spatiotemporal blocks. Given a video clip, a large fraction of space-time patches are masked, and the model predicts their representations from the visible patches.
The temporal dimension is what makes V-JEPA relevant to world models. Predicting future or missing video representations forces the encoder to preserve temporally predictive structure: object persistence, motion, contact, and scene change. That is not the same as proving physical understanding, but it is closer to planning-relevant representation learning than single-image invariance.
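As a small illustration of the spatiotemporal masking idea, the sketch below samples one contiguous space-time block to mask on a patch grid of shape (frames, height, width). V-JEPA's actual multi-block masking strategy and ratios differ, so treat the sizes here as placeholders.

```python
import numpy as np

def sample_spacetime_block_mask(t=16, h=14, w=14, block=(8, 7, 7), rng=None):
    """Return a boolean (t, h, w) array: True = masked patch to be predicted."""
    rng = rng or np.random.default_rng()
    bt, bh, bw = block
    t0 = rng.integers(0, t - bt + 1)
    h0 = rng.integers(0, h - bh + 1)
    w0 = rng.integers(0, w - bw + 1)
    mask = np.zeros((t, h, w), dtype=bool)
    mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask
```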
V-JEPA 2 and Action-Conditioned Planning
V-JEPA 2 extends the line from video representation learning toward planning. The base model is trained on large-scale video using self-supervised feature prediction. A separate action-conditioned variant, V-JEPA 2-AC, is then post-trained on a smaller amount of robot video to support goal-conditioned planning in latent space.
The important distinction is architectural: V-JEPA 2 does not need to render future pixels to evaluate candidate actions. It predicts and compares latent representations of future states. This makes the approach different from pixel-generating video simulators such as diffusion world models; it is a latent predictive model aimed at understanding, prediction, and planning.
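A hedged sketch of what latent planning looks like with an action-conditioned predictor: sample candidate action sequences, roll each out entirely in representation space, and pick the sequence whose predicted end state is closest to the encoded goal. This is a generic random-shooting planner, not the V-JEPA 2-AC procedure, and every module name is a placeholder.

```python
import torch

@torch.no_grad()
def plan_in_latent_space(encoder, predictor, obs, goal_image,
                         num_candidates=256, horizon=5, action_dim=4):
    """Pick the action sequence whose predicted latent end-state is nearest the goal."""
    s = encoder(obs).expand(num_candidates, -1)          # current latent, copied per candidate
    s_goal = encoder(goal_image)                         # goal latent, shape (1, dim)

    actions = torch.randn(num_candidates, horizon, action_dim)   # candidate action sequences
    for t in range(horizon):
        s = predictor(s, actions[:, t])                  # action-conditioned latent rollout
    cost = ((s - s_goal) ** 2).sum(dim=-1)               # distance to goal, never pixels
    best = cost.argmin()
    return actions[best, 0]                              # execute the first action (MPC-style)
```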
V-JEPA 2.1 continues the same line by emphasizing denser visual features while retaining global video understanding. For TheoremPath, the research question is not whether JEPA has "beaten" autoregressive or diffusion models. The better question is narrower: when does latent prediction preserve exactly the information needed for control, and when does it discard something the planner later needs?
JEPA vs. Other Self-Supervised Methods
| Method | Prediction space | Collapse prevention | Requires augmentations | Requires negatives |
|---|---|---|---|---|
| MAE | Pixel space | Not applicable (reconstruction) | No | No |
| SimCLR | Representation space | Contrastive loss (negatives) | Yes (heavy) | Yes |
| BYOL | Representation space | EMA + predictor | Yes (heavy) | No |
| DINO | Representation space | EMA + centering | Yes (heavy) | No |
| I-JEPA | Representation space | EMA + predictor + masking | No | No |
The JEPA design removes both augmentation dependence and the need for negative pairs. This is significant because hand-crafted augmentations (random crops, color jitter) inject domain-specific inductive biases that may not transfer across modalities.
Connection to Contrastive Learning
Contrastive learning (SimCLR, MoCo) can be seen as a special case of joint embedding where collapse is prevented by explicitly pushing apart negative pairs. The InfoNCE loss:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(s_x, s_{y^+})/\tau\big)}{\exp\!\big(\mathrm{sim}(s_x, s_{y^+})/\tau\big) + \sum_{y^- \in \mathcal{N}} \exp\!\big(\mathrm{sim}(s_x, s_{y^-})/\tau\big)}$$
requires a set of negative examples $\mathcal{N}$. JEPA aims for the same broad outcome, informative representations that do not collapse, without using explicit negatives. The mechanism is not identical: JEPA-style methods use target-network asymmetry, masking, predictors, and sometimes regularization terms rather than directly pushing sampled negative pairs apart.
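For comparison, here is a standard InfoNCE sketch with in-batch negatives, cosine similarity, and temperature $\tau$; details such as symmetrization and projection heads vary across SimCLR and MoCo, so this is illustrative rather than a reproduction of either method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(s_x, s_y_pos, temperature=0.1):
    """s_x, s_y_pos: (batch, dim) embeddings of positive pairs.

    Each row's positive is the matching row of s_y_pos; every other row
    in the batch serves as a negative example.
    """
    s_x = F.normalize(s_x, dim=-1)
    s_y = F.normalize(s_y_pos, dim=-1)
    logits = s_x @ s_y.T / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(s_x.shape[0], device=s_x.device)
    return F.cross_entropy(logits, labels)               # -log softmax at the positive index
```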
Scope. LeCun's 2022 position paper "A Path Towards Autonomous Machine Intelligence" frames JEPA as a unifying energy-based view under which contrastive methods correspond to one particular form of energy shaping (pushing negatives up, pulling positives down). This is a framing argument, not a proved reduction: there is no theorem stating that any InfoNCE-trained model can be rewritten as a JEPA, or that JEPA's implicit regularization is equivalent to a contrastive loss with specific negatives. Treat "JEPA subsumes contrastive learning" as LeCun's interpretive claim in that paper, not an established theorem.
Common Confusions
JEPA is not a replacement for LLMs
JEPA is a self-supervised learning framework for continuous data such as images, video, and audio. It does not generate text, answer questions, or reason verbally by itself. It is a research direction for building predictive world representations that could complement language models, not replace them. The common mistake is to compare JEPA directly to a chatbot instead of to other representation-learning and world-modeling objectives.
Predicting representations is not easier than predicting pixels
Predicting in representation space avoids the generation bottleneck (no need to model irrelevant details) but introduces a new challenge: the representation space is learned simultaneously with the predictor. If the encoder learns trivial representations, prediction is easy but useless. The difficulty shifts from modeling pixels to preventing collapse and ensuring the representations capture meaningful structure.
JEPA is not just BYOL with masking
BYOL predicts the representation of the same input under a different augmentation. I-JEPA predicts the representation of a different part of the input (masked patches). This is a structural prediction task (what is in the missing region?) rather than an invariance task (the representation should not change under crops and color jitter). The masking formulation generalizes naturally to video and spatiotemporal prediction.
Summary
- JEPA predicts in representation space, not pixel space, avoiding the generation bottleneck
- The architecture: context encoder, target encoder (EMA), predictor
- Collapse prevention via EMA, stop-gradient, and the predictor bottleneck
- I-JEPA: masking image patches, predicting their representations
- V-JEPA and V-JEPA 2: masking spatiotemporal video patches and moving toward action-conditioned latent planning
- I-JEPA and V-JEPA avoid hand-crafted augmentations and explicit negative pairs, a departure from contrastive methods
- JEPA is a framework for building world models, not a replacement for LLMs
Exercises
Problem
Explain why Masked Autoencoders (MAE) must model low-level details like texture while I-JEPA does not. What architectural difference causes this?
Problem
Why is the EMA momentum parameter $\tau$ critical for preventing collapse in JEPA? What happens if $\tau = 0$ (no momentum, the target encoder equals the context encoder at every step) and what happens if $\tau = 1$ (the target encoder never updates)?
Problem
JEPA predicts representations of masked patches using positional information about which patches to predict. Could you extend this to predict representations of future video frames given past frames and a proposed action? Describe the architecture and explain how this relates to world models for planning.
References
Canonical:
- LeCun, "A Path Towards Autonomous Machine Intelligence" (2022). OpenReview. The position paper connecting JEPA-style latent prediction to world models, actor modules, cost modules, and planning.
- Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (CVPR 2023, arXiv:2301.08243). I-JEPA. Reports 77.3% top-1 ImageNet linear probe with a ViT-H/14 encoder, trained without hand-crafted augmentations.
Current:
- Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (Meta AI, 2024, arXiv:2404.08471). V-JEPA.
- Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (2025, arXiv:2506.09985). V-JEPA 2 and V-JEPA 2-AC robot-planning experiments.
- Assran et al., "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning" (2026, arXiv:2603.14482). Dense visual features with global video understanding.
- Grill et al., "Bootstrap Your Own Latent" (NeurIPS 2020, arXiv:2006.07733). BYOL, direct precursor to JEPA-style non-contrastive methods; uses an EMA target with $\tau = 0.996$.
- Chen and He, "Exploring Simple Siamese Representation Learning" (CVPR 2021, arXiv:2011.10566). SimSiam. Analysis of collapse prevention via stop-gradient and predictor.
- Tian, Chen, Ganguli, "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs" (ICML 2021, arXiv:2102.06810). Theoretical analysis of BYOL-style dynamics and the instability of the collapsed equilibrium in the linear regime.
- Jing, Vincent, LeCun, Tian, "Understanding Dimensional Collapse in Contrastive Self-Supervised Learning" (ICLR 2022, arXiv:2110.09348). Distinguishes dimensional collapse from complete collapse; diagnoses low-rank embedding subspaces.
- Bardes, Ponce, LeCun, "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning" (ICLR 2022, arXiv:2105.04906). Explicit variance and covariance regularizers to prevent dimensional collapse.
- Zbontar, Jing, Misra, LeCun, Deny, "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" (ICML 2021, arXiv:2103.03230). Cross-correlation objective against dimensional collapse.
- He, Chen, Xie, Li, Dollar, Girshick, "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022, arXiv:2111.06377). MAE baseline contrasted with I-JEPA; reconstructs pixels rather than representations.
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (Meta AI, 2023, arXiv:2304.07193). Large-scale self-distillation with EMA teacher; related collapse-prevention mechanics.
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021, arXiv:2104.14294). DINO; EMA teacher plus centering.
Next Topics
The natural next steps from JEPA:
- World models and planning: using learned representations for planning in imagination
- Vision transformer lineage: the ViT architectures that serve as encoders in I-JEPA and V-JEPA
Last reviewed: April 22, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Variational Autoencoders (layer 3 · tier 1)
- Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM (layer 4 · tier 1)
- Autoencoders (layer 2 · tier 2)
- Self-Supervised Vision (layer 4 · tier 2)
- World Models and Planning (layer 4 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.