Beyond LLMs
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA, V-JEPA, V-JEPA 2, and the connection to contrastive learning and world models.
Why This Matters
Large language models learn by predicting the next token. That objective works well for discrete text, but it is a poor fit for high-dimensional continuous signals such as images and video. Predicting every pixel of the next frame forces the model to spend capacity on lighting, texture noise, and other details that may not matter for recognition or planning.
JEPA (Joint Embedding Predictive Architecture) is Yann LeCun's proposal for learning predictive representations of the physical world: predict in the space of abstract representations, not directly in raw sensory space. The bet is that representation-space prediction can keep predictable, task-relevant structure while discarding unpredictable detail.
Mental Model
Imagine two views of the world: a context view and a target view. Instead of reconstructing the target view pixel by pixel from the context, JEPA encodes both views into abstract representations and predicts the target representation from the context representation. The model never needs to generate pixels; it only needs to match the target representation.
This is like summarizing a scene rather than painting it. The model can encode that a ball is falling without also predicting the exact shadow on every frame.
Formal Setup
Joint Embedding Architecture
A joint embedding architecture maps two inputs $x$ and $y$ to representations $s_x = f_\theta(x)$ and $s_y = f_\theta(y)$ (or $s_y = g_\phi(y)$ with different encoders). An energy function $E(x, y)$ measures compatibility: low energy means $x$ and $y$ are related, high energy means they are not.
The key distinction from generative models: there is no decoder that reconstructs $y$ from $s_x$. The loss operates entirely in representation space.
The explicit energy form used throughout this page is $E(x, y, z) = D\big(P_\phi(s_x, z),\, s_y\big)$, where $P_\phi$ is the predictor network, $z$ is an auxiliary latent or conditioning variable specifying what to predict (e.g., the position of a masked region, or an action), and $D$ is a distance in representation space (typically $\ell_2$ or cosine). The predictor lets the context representation $s_x$ route to many possible targets under different choices of $z$, and $D$ measures how far the prediction falls from the actual target representation $s_y$.
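A minimal sketch of this energy, assuming the predictor is available as a callable and using squared $\ell_2$ or cosine distance; the function and variable names are illustrative, not taken from any JEPA codebase.

```python
import torch
import torch.nn.functional as F

def jepa_energy(s_x, z, s_y, predictor, distance="l2"):
    """Energy E(x, y, z) = D(P(s_x, z), s_y), computed in representation space.

    s_x, s_y : context and target representations, shape (batch, dim)
    z        : conditioning info (e.g., target-position embedding or action)
    predictor: callable mapping (s_x, z) -> predicted target representation
    """
    s_y_hat = predictor(s_x, z)
    if distance == "l2":
        return ((s_y_hat - s_y) ** 2).sum(dim=-1)            # squared L2 per example
    return 1.0 - F.cosine_similarity(s_y_hat, s_y, dim=-1)   # cosine distance per example
```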
Joint Embedding Predictive Architecture (JEPA)
JEPA adds a predictor between the context encoder and the target encoder. Given context input $x$ and target input $y$:
- Encode context: $s_x = f_\theta(x)$
- Encode target: $s_y = \bar{f}_{\bar\theta}(y)$ (target encoder, often an EMA of $f_\theta$)
- Predict target representation: $\hat{s}_y = P_\phi(s_x, z)$, where $z$ specifies what to predict (e.g., which spatial region)
- Minimize prediction error in representation space: $\mathcal{L} = D(\hat{s}_y, s_y)$
The predictor operates in abstract space. It never sees or generates pixels.
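To make the four steps concrete, here is a hedged PyTorch-style sketch of one training step. `context_encoder`, `target_encoder`, and `predictor` are assumed to be arbitrary modules with matching output dimensions; this is a simplified illustration of the objective, not the I-JEPA reference implementation.

```python
import torch
import torch.nn.functional as F

def jepa_training_step(context_encoder, target_encoder, predictor,
                       x_context, x_target, z, optimizer):
    """One JEPA update: predict the target representation from the context.

    Gradients flow only through the context encoder and predictor;
    the target encoder is held fixed here (its EMA update appears below).
    """
    s_x = context_encoder(x_context)                 # context representation
    with torch.no_grad():                            # stop-gradient on the target branch
        s_y = target_encoder(x_target)

    s_y_hat = predictor(s_x, z)                      # prediction in representation space
    loss = F.mse_loss(s_y_hat, s_y)                  # no pixels are ever generated

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```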
Collapse
Complete (representation) collapse occurs when the encoders map all inputs to the same representation, trivially achieving zero prediction error. If $f_\theta(x) = c$ for all $x$, then the prediction error is zero regardless of input.
Dimensional (informational) collapse is a weaker, subtler failure mode: the embeddings do not collapse to a single point, but instead span a low-rank subspace of representation space. The covariance of the learned embeddings has many vanishing eigenvalues, so whole directions carry no information even though the embeddings are not literally constant. Jing, Vincent, LeCun, and Tian (2022, arXiv:2110.09348) document this in contrastive self-supervised learning and link it to the projector architecture and the interaction of the contrastive loss with data augmentation. Non-contrastive methods like BYOL, VICReg, and Barlow Twins are also susceptible to dimensional collapse, and VICReg's variance and covariance regularizers are designed specifically to counter it.
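A quick way to check for dimensional collapse in practice is to inspect the eigenvalue spectrum of the embedding covariance, as in Jing et al. (2022). The sketch below is a generic diagnostic under the assumption that `embeddings` is a batch of representation vectors; the entropy-based effective-rank measure is an illustrative choice, not part of the paper.

```python
import torch

def embedding_spectrum(embeddings, eps=1e-12):
    """Return sorted covariance eigenvalues and an effective-rank estimate.

    embeddings: (num_samples, dim) tensor of representations.
    Complete collapse    -> all eigenvalues near zero.
    Dimensional collapse -> many (but not all) eigenvalues near zero.
    """
    z = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov).flip(0)          # descending order
    p = eigvals.clamp(min=eps) / eigvals.clamp(min=eps).sum()
    effective_rank = torch.exp(-(p * p.log()).sum())      # entropy-based effective rank
    return eigvals, effective_rank.item()
```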
Preventing both forms of collapse is the central technical challenge of joint embedding methods.
Main Theorems
JEPA as Energy-Based Learning
Statement
The JEPA training objective can be written as an energy-based model:
$$E(x, y, z) = D\big(P_\phi(f_\theta(x), z),\ \mathrm{sg}[\bar{f}_{\bar\theta}(y)]\big)$$
where $D$ is a distance metric (e.g., $\ell_2$ or cosine distance), $\mathrm{sg}[\cdot]$ denotes stop-gradient (no gradients flow through the target encoder), and the target parameters $\bar\theta$ are updated via exponential moving average: $\bar\theta \leftarrow \tau\,\bar\theta + (1 - \tau)\,\theta$, where $\tau$ is the retention coefficient. Values of $\tau$ close to $1$ mean the target encoder changes slowly (matching BYOL's $\tau = 0.996$); $\tau = 0$ would set $\bar\theta = \theta$ each step, giving no stable anchor. The symbol $\tau$ refers to this EMA retention coefficient throughout the page (distinct from the InfoNCE temperature $\tau$ used in the contrastive section below, where the symbol is reused by convention; context disambiguates).
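The EMA update itself is one line per parameter. A minimal sketch, assuming both encoders share the same architecture so parameters can be paired positionally; the default $\tau$ is borrowed from BYOL and is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, tau=0.996):
    """theta_bar <- tau * theta_bar + (1 - tau) * theta.

    tau close to 1: the target encoder drifts slowly behind the context encoder.
    tau = 0: the target encoder is overwritten every step (no stable anchor).
    """
    for p_target, p_context in zip(target_encoder.parameters(),
                                   context_encoder.parameters()):
        p_target.mul_(tau).add_(p_context, alpha=1.0 - tau)
```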
The energy landscape has three failure modes:
- Complete collapse: energy is uniformly low because all representations are identical
- Dimensional collapse: energy is low on average but embeddings span only a low-rank subspace, so some directions carry no information (Jing et al. 2022)
- Fragmentation: energy is uniformly high (representations carry no useful information about the data)
A well-trained JEPA has low energy for compatible pairs and high energy for incompatible pairs, with the energy surface shaped by the architecture and training dynamics rather than by an explicit contrastive loss.
Intuition
The JEPA objective pulls the predicted representation toward the actual target representation , but only through the predictor and context encoder. The target encoder is a slowly moving anchor. This asymmetry prevents the trivial solution where both encoders collapse to a constant. The EMA update ensures the target representations change slowly, giving the predictor a stable learning signal.
Why It Matters
Unlike contrastive learning, JEPA-style objectives do not require explicit negative pairs. Instead, they rely on target-network asymmetry, masking, predictor structure, and sometimes variance or covariance regularization to avoid collapse. The energy-based framing connects JEPA to LeCun's broader argument that future world models should predict abstract states rather than raw observations.
Collapse Prevention via Asymmetry (Empirical Observation)
This is an architectural principle supported by empirical evidence and partial theoretical analysis, not a theorem. Consider the loss $\mathcal{L} = \big\|P_\phi(f_\theta(x), z) - \mathrm{sg}[\bar{f}_{\bar\theta}(y)]\big\|_2^2$, i.e., the energy above with squared $\ell_2$ distance. The collapsed solution $f_\theta(x) = c$ for all $x$ is a fixed point of the gradient dynamics, but it is empirically unstable when:
- The predictor is sufficiently expressive but not trivially so
- The EMA retention $\tau$ is close to $1$ (slow target updates)
- The data distribution has sufficient diversity
Tian, Chen, and Ganguli (2021, "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs", arXiv:2102.06810) analyze this in a linearized BYOL setting. They show that under a linear predictor the encoder eigenspectrum is pulled toward the top principal components of the data covariance, and the fully collapsed equilibrium is dynamically unstable when the predictor and EMA are configured appropriately. The analysis depends on linearity and specific balance conditions; the nonlinear deep-network case remains an empirical claim rather than a proved theorem.
If the target encoder moves slowly (high $\tau$), it provides a stable target that changes only gradually. The predictor and context encoder must do real work to match these targets. If the context encoder collapses, the predictor cannot distinguish between different targets, so the loss remains high. BYOL and DINO demonstrated that collapse can be avoided without negative pairs. JEPA inherits these techniques and adds structured prediction in representation space as a further source of non-trivial learning signal.
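The linearized setting of Tian et al. (2021) can be explored numerically. Below is a hedged toy experiment, not their exact setup: a linear encoder $W$, a linear predictor $W_p$, an EMA target $W_a$, Gaussian data with anisotropic covariance, and gradient steps on the representation-prediction loss. Tracking the singular values of $W$ shows whether the encoder retains rank or drifts toward collapse for a given learning rate and $\tau$; the hyperparameters here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8                                     # input dim, representation dim
sigma = np.diag(np.linspace(2.0, 0.1, n))        # anisotropic data covariance
W = 0.1 * rng.standard_normal((d, n))            # online (context) encoder
Wp = 0.1 * rng.standard_normal((d, d))           # linear predictor
Wa = W.copy()                                    # EMA target encoder
lr, tau = 0.05, 0.99

for step in range(2000):
    # Loss L = E_x || Wp W x - Wa x ||^2 with x ~ N(0, Sigma); gradients in closed form.
    diff = Wp @ W - Wa                           # (d, n) residual operator
    grad_Wp = 2 * diff @ sigma @ W.T             # dL/dWp
    grad_W = 2 * Wp.T @ diff @ sigma             # dL/dW (no gradient through Wa: stop-grad)
    Wp -= lr * grad_Wp
    W -= lr * grad_W
    Wa = tau * Wa + (1 - tau) * W                # EMA target update
    if step % 500 == 0:
        print(step, np.linalg.svd(W, compute_uv=False).round(3))
```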
I-JEPA: Images
I-JEPA (Image JEPA) applies the JEPA framework to images using Vision Transformers. The key design choices:
- Context: a set of visible patches from an image
- Target: representations of masked patches (the patches the model must predict)
- Masking strategy: large, semantically meaningful blocks are masked (not random individual patches)
- Prediction: the predictor takes context patch representations plus positional information about the target patches, and outputs predictions in representation space
Critical difference from Masked Autoencoders (MAE): MAE reconstructs pixels of masked patches. I-JEPA predicts abstract representations of masked patches. This means I-JEPA does not need a pixel decoder and focuses on high-level structure rather than texture details.
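The difference is easiest to see in the losses. A hedged, highly simplified sketch: both models see the same visible patches, but MAE decodes pixels for the masked patches while an I-JEPA-style model predicts the target encoder's representations of them. `encoder`, `pixel_decoder`, `predictor`, and `target_encoder` are placeholder modules; real implementations add positional mask tokens, normalization, and multi-block masking.

```python
import torch
import torch.nn.functional as F

def mae_style_loss(encoder, pixel_decoder, visible_patches, masked_pixels, mask_info):
    # MAE: reconstruct the raw pixels of the masked patches.
    latent = encoder(visible_patches)
    pred_pixels = pixel_decoder(latent, mask_info)      # must model texture, lighting, noise
    return F.mse_loss(pred_pixels, masked_pixels)

def ijepa_style_loss(encoder, target_encoder, predictor, visible_patches,
                     masked_patches, mask_info):
    # I-JEPA: predict abstract representations of the masked patches.
    latent = encoder(visible_patches)
    with torch.no_grad():
        target_repr = target_encoder(masked_patches)    # EMA encoder, stop-gradient
    pred_repr = predictor(latent, mask_info)            # conditioned on target positions
    return F.mse_loss(pred_repr, target_repr)           # no pixel decoder anywhere
```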
I-JEPA reports strong transfer to downstream tasks such as classification and detection without the hand-crafted augmentations used by many contrastive and self-distillation methods. That is the key experimental claim: good visual features can be learned from masked representation prediction rather than from pixel reconstruction or augmentation invariance alone.
V-JEPA: Video
V-JEPA extends the framework to video by masking spatiotemporal blocks. Given a video clip, a large fraction of space-time patches are masked, and the model predicts their representations from the visible patches.
The temporal dimension is what makes V-JEPA relevant to world models. Predicting future or missing video representations forces the encoder to preserve temporally predictive structure: object persistence, motion, contact, and scene change. That is not the same as proving physical understanding, but it is closer to planning-relevant representation learning than single-image invariance.
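As a small illustration of the spatiotemporal masking idea, the sketch below samples one contiguous space-time block to mask on a patch grid of shape (frames, height, width). V-JEPA's actual multi-block masking strategy and ratios differ, so treat the sizes here as placeholders.

```python
import numpy as np

def sample_spacetime_block_mask(t=16, h=14, w=14, block=(8, 7, 7), rng=None):
    """Return a boolean (t, h, w) array: True = masked patch to be predicted."""
    rng = rng or np.random.default_rng()
    bt, bh, bw = block
    t0 = rng.integers(0, t - bt + 1)
    h0 = rng.integers(0, h - bh + 1)
    w0 = rng.integers(0, w - bw + 1)
    mask = np.zeros((t, h, w), dtype=bool)
    mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask
```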
V-JEPA 2 and Action-Conditioned Planning
V-JEPA 2 extends the line from video representation learning toward planning. The base model is trained on large-scale video using self-supervised feature prediction. A separate action-conditioned variant, V-JEPA 2-AC, is then post-trained on a smaller amount of robot video to support goal-conditioned planning in latent space.
The important distinction is architectural: V-JEPA 2 does not need to render future pixels to evaluate candidate actions. It predicts and compares latent representations of future states. This makes the approach different from pixel-generating video simulators such as diffusion world models; it is a latent predictive model aimed at understanding, prediction, and planning.
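A hedged sketch of what latent planning looks like with an action-conditioned predictor: sample candidate action sequences, roll each out entirely in representation space, and pick the sequence whose predicted end state is closest to the encoded goal. This is a generic random-shooting planner, not the V-JEPA 2-AC procedure, and every module name is a placeholder.

```python
import torch

@torch.no_grad()
def plan_in_latent_space(encoder, predictor, obs, goal_image,
                         num_candidates=256, horizon=5, action_dim=4):
    """Pick the action sequence whose predicted latent end-state is nearest the goal."""
    s = encoder(obs).expand(num_candidates, -1)          # current latent, copied per candidate
    s_goal = encoder(goal_image)                         # goal latent, shape (1, dim)

    actions = torch.randn(num_candidates, horizon, action_dim)   # candidate action sequences
    for t in range(horizon):
        s = predictor(s, actions[:, t])                  # action-conditioned latent rollout
    cost = ((s - s_goal) ** 2).sum(dim=-1)               # distance to goal, never pixels
    best = cost.argmin()
    return actions[best, 0]                              # execute the first action (MPC-style)
```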
V-JEPA 2.1 continues the same line by emphasizing denser visual features while retaining global video understanding. For TheoremPath, the research question is not whether JEPA has "beaten" autoregressive or diffusion models. The better question is narrower: when does latent prediction preserve exactly the information needed for control, and when does it discard something the planner later needs?
JEPA vs. Other Self-Supervised Methods
| Method | Prediction space | Collapse prevention | Requires augmentations | Requires negatives |
|---|---|---|---|---|
| MAE | Pixel space | Not applicable (reconstruction) | No | No |
| SimCLR | Representation space | Contrastive loss (negatives) | Yes (heavy) | Yes |
| BYOL | Representation space | EMA + predictor | Yes (heavy) | No |
| DINO | Representation space | EMA + centering | Yes (heavy) | No |
| I-JEPA | Representation space | EMA + predictor + masking | No | No |
The JEPA design removes both augmentation dependence and the need for negative pairs. This is significant because hand-crafted augmentations (random crops, color jitter) inject domain-specific inductive biases that may not transfer across modalities.
Connection to Contrastive Learning
Contrastive learning (SimCLR, MoCo) can be seen as a special case of joint embedding where collapse is prevented by explicitly pushing apart negative pairs. The InfoNCE loss:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\!\big(\mathrm{sim}(s_x, s_{y^+})/\tau\big)}{\exp\!\big(\mathrm{sim}(s_x, s_{y^+})/\tau\big) + \sum_{y^- \in \mathcal{N}} \exp\!\big(\mathrm{sim}(s_x, s_{y^-})/\tau\big)}$$
requires a set of negative examples $\mathcal{N}$. JEPA aims for the same broad outcome, informative representations that do not collapse, without using explicit negatives. The mechanism is not identical: JEPA-style methods use target-network asymmetry, masking, predictors, and sometimes regularization terms rather than directly pushing sampled negative pairs apart.
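For comparison, here is a standard InfoNCE sketch with in-batch negatives, cosine similarity, and temperature $\tau$; details such as symmetrization and projection heads vary across SimCLR and MoCo, so this is illustrative rather than a reproduction of either method.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(s_x, s_y_pos, temperature=0.1):
    """s_x, s_y_pos: (batch, dim) embeddings of positive pairs.

    Each row's positive is the matching row of s_y_pos; every other row
    in the batch serves as a negative example.
    """
    s_x = F.normalize(s_x, dim=-1)
    s_y = F.normalize(s_y_pos, dim=-1)
    logits = s_x @ s_y.T / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(s_x.shape[0], device=s_x.device)
    return F.cross_entropy(logits, labels)               # -log softmax at the positive index
```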
Scope. LeCun's 2022 position paper "A Path Towards Autonomous Machine Intelligence" frames JEPA as a unifying energy-based view under which contrastive methods correspond to one particular form of energy shaping (pushing negatives up, pulling positives down). This is a framing argument, not a proved reduction: there is no theorem stating that any InfoNCE-trained model can be rewritten as a JEPA, or that JEPA's implicit regularization is equivalent to a contrastive loss with specific negatives. Treat "JEPA subsumes contrastive learning" as LeCun's interpretive claim in that paper, not an established theorem.
Common Confusions
JEPA is not a replacement for LLMs
JEPA is a self-supervised learning framework for continuous data such as images, video, and audio. It does not generate text, answer questions, or reason verbally by itself. It is a research direction for building predictive world representations that could complement language models, not replace them. The common mistake is to compare JEPA directly to a chatbot instead of to other representation-learning and world-modeling objectives.
Predicting representations is not easier than predicting pixels
Predicting in representation space avoids the generation bottleneck (no need to model irrelevant details) but introduces a new challenge: the representation space is learned simultaneously with the predictor. If the encoder learns trivial representations, prediction is easy but useless. The difficulty shifts from modeling pixels to preventing collapse and ensuring the representations capture meaningful structure.
JEPA is not just BYOL with masking
BYOL predicts the representation of the same input under a different augmentation. I-JEPA predicts the representation of a different part of the input (masked patches). This is a structural prediction task (what is in the missing region?) rather than an invariance task (the representation should not change under crops and color jitter). The masking formulation generalizes naturally to video and spatiotemporal prediction.
Summary
- JEPA predicts in representation space, not pixel space, avoiding the generation bottleneck
- The architecture: context encoder, target encoder (EMA), predictor
- Collapse prevention via EMA, stop-gradient, and the predictor bottleneck
- I-JEPA: masking image patches, predicting their representations
- V-JEPA and V-JEPA 2: masking spatiotemporal video patches and moving toward action-conditioned latent planning
- I-JEPA and V-JEPA avoid hand-crafted augmentations and explicit negative pairs, a departure from contrastive methods
- JEPA is a framework for building world models, not a replacement for LLMs
Exercises
Problem
Explain why Masked Autoencoders (MAE) must model low-level details like texture while I-JEPA does not. What architectural difference causes this?
Problem
Why is the EMA momentum parameter $\tau$ critical for preventing collapse in JEPA? What happens if $\tau = 0$ (no momentum, the target encoder equals the context encoder at every step) and what happens if $\tau = 1$ (the target encoder never updates)?
Problem
JEPA predicts representations of masked patches using positional information about which patches to predict. Could you extend this to predict representations of future video frames given past frames and a proposed action? Describe the architecture and explain how this relates to world models for planning.
References
Canonical:
- LeCun, "A Path Towards Autonomous Machine Intelligence" (2022). OpenReview. The position paper connecting JEPA-style latent prediction to world models, actor modules, cost modules, and planning.
- Assran et al., "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture" (CVPR 2023, arXiv:2301.08243). I-JEPA. Reports 77.3% top-1 ImageNet linear probe with a ViT-H/14 encoder, trained without hand-crafted augmentations.
Current:
- Bardes et al., "Revisiting Feature Prediction for Learning Visual Representations from Video" (Meta AI, 2024, arXiv:2404.08471). V-JEPA.
- Assran et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning" (2025, arXiv:2506.09985). V-JEPA 2 and V-JEPA 2-AC robot-planning experiments.
- Assran et al., "V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning" (2026, arXiv:2603.14482). Dense visual features with global video understanding.
- Grill et al., "Bootstrap Your Own Latent" (NeurIPS 2020, arXiv:2006.07733). BYOL, direct precursor to JEPA-style non-contrastive methods; uses an EMA target with $\tau = 0.996$.
- Chen and He, "Exploring Simple Siamese Representation Learning" (CVPR 2021, arXiv:2011.10566). SimSiam. Analysis of collapse prevention via stop-gradient and predictor.
- Tian, Chen, Ganguli, "Understanding Self-Supervised Learning Dynamics without Contrastive Pairs" (ICML 2021, arXiv:2102.06810). Theoretical analysis of BYOL-style dynamics and the instability of the collapsed equilibrium in the linear regime.
- Jing, Vincent, LeCun, Tian, "Understanding Dimensional Collapse in Contrastive Self-Supervised Learning" (ICLR 2022, arXiv:2110.09348). Distinguishes dimensional collapse from complete collapse; diagnoses low-rank embedding subspaces.
- Bardes, Ponce, LeCun, "VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning" (ICLR 2022, arXiv:2105.04906). Explicit variance and covariance regularizers to prevent dimensional collapse.
- Zbontar, Jing, Misra, LeCun, Deny, "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" (ICML 2021, arXiv:2103.03230). Cross-correlation objective against dimensional collapse.
- He, Chen, Xie, Li, Dollar, Girshick, "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022, arXiv:2111.06377). MAE baseline contrasted with I-JEPA; reconstructs pixels rather than representations.
- Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (Meta AI, 2023, arXiv:2304.07193). Large-scale self-distillation with EMA teacher; related collapse-prevention mechanics.
- Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021, arXiv:2104.14294). DINO; EMA teacher plus centering.
Next Topics
The natural next steps from JEPA:
- World models and planning: using learned representations for planning in imagination
- Vision transformer lineage: the ViT architectures that serve as encoders in I-JEPA and V-JEPA
Last reviewed: April 22, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Variational Autoencoders (layer 3 · tier 1)
- Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM (layer 4 · tier 1)
- Autoencoders (layer 2 · tier 2)
- Self-Supervised Vision (layer 4 · tier 2)
- World Models and Planning (layer 4 · tier 2)
Derived topics
No published topic currently declares this as a prerequisite.