

Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM

Dosovitskiy et al. 2021 (ViT) showed that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) traded off design choices along three axes: how much locality bias to keep, what supervision to train against, and whether tokens are spatial-uniform or content-adaptive. This page covers the patch-embedding complexity, Swin's shifted-window argument, the MAE 75% masking ratio, DINO self-distillation with centering, and the 2024 register-token finding that explains and fixes the long-standing artifact-token problem in CLIP- and DINO-family ViTs.


Why This Matters

Figure: an input image is split into a 4x4 grid of 16 patches; each patch is flattened and linearly embedded, a [CLS] token is prepended, positional embeddings are added, and the token sequence passes through a standard transformer (L layers of self-attention + FFN), with the [CLS] output mapped to class logits. The only change from a language transformer: patch embedding replaces token embedding.

For a decade, convolutional neural networks dominated computer vision. ResNet (2015) was the default backbone for classification, detection, and segmentation. Then the Vision Transformer (ViT, Dosovitskiy et al. 2021) showed that a plain transformer, the same architecture used for language, could match or exceed CNNs on image tasks once given sufficient pretraining data. The page that follows traces the design space the field has explored since: how much locality bias to keep (ViT → Swin → ConvNeXt), what supervision to train against (supervised → contrastive → masked → self-distilled), and whether tokens are spatial-uniform or content-adaptive (NaViT, register tokens, Perceiver IO).

The lineage matters because these models are the visual backbone of nearly every modern multimodal system: CLIP-style contrastive encoders condition Stable Diffusion 1.x / 2.x / SDXL / SD3 and serve as the vision towers of LLaVA and BLIP-2, DINOv2 features serve as frozen spatial representations for robotics and 3D, and SAM provides promptable segmentation on top of an MAE-pretrained ViT-H/16 image encoder. Choosing among them is no longer "ViT vs. CNN" but a question of which inductive bias and which supervision signal you want inherited downstream.

The four threads to follow:

  1. The patch-embedding primitive and its O(N^2 d) cost (ViT, NaViT).
  2. The locality / hierarchy axis (Swin's windows, ConvNeXt's modernized convolutions, MaxViT's hybrids).
  3. The supervision axis (DeiT distillation, MAE masked reconstruction, DINO self-distillation, EVA's CLIP-feature targets).
  4. The 2024-2025 artifact / native-resolution / scale corrections (register tokens, NaViT, ViT-22B, AM-RADIO distillation, SigLIP-2, Pixtral).

Mental Model

A CNN processes an image by sliding small filters across the spatial grid, building features from local to global through successive layers. A ViT chops the image into a uniform grid of patches, projects each patch to a d-dimensional vector, and lets self-attention figure out how patches relate. Every patch can attend to every other patch from layer one. There is no built-in locality bias, no built-in scale hierarchy, no built-in translation equivariance. The architecture trades all of those CNN priors for raw flexibility, and it pays the bill in data: ViT loses to ResNet on ImageNet-1k from scratch, and only wins after pretraining on ImageNet-21k or JFT-300M.

The lineage is a series of decisions about how much of the CNN prior to re-introduce. Swin reintroduces locality and hierarchy, MAE reintroduces data efficiency through masked self-supervision, DINOv2 produces features that transfer broadly because the objective is not tied to any label set, and register tokens fix a 2024-discovered failure mode where ViTs were silently using a small set of low-information tokens as scratchpad memory.

The CNN Foundation

Definition

CNN Design Principles

The structural priors that made CNNs the default vision architecture from 2012 to 2020:

  1. Local connectivity. Each unit in a convolutional layer connects only to a small spatial neighborhood of the previous layer.
  2. Weight sharing. The same kernel is applied at every spatial position, giving translation equivariance: f(\text{shift}(x)) = \text{shift}(f(x)) up to boundary effects (checked numerically in the sketch after this list).
  3. Spatial hierarchy. Stacking layers with strided convolutions or pooling produces a multi-scale feature pyramid: early layers detect edges, middle layers detect textures and parts, late layers detect whole objects.
  4. Downsampling. Resolution decreases while channel depth increases along the network, concentrating compute where receptive fields are wide.
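A quick numerical check of principle 2 makes the equivariance claim concrete. This is a minimal sketch, assuming a single stride-1 convolution with circular padding so boundary effects drop out; the channel counts and shift amounts are arbitrary.

```python
# Check f(shift(x)) == shift(f(x)) for a conv layer (CNN prior 2).
# Circular padding removes boundary effects, so the equality is exact up to float error.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

def shift(t):
    return torch.roll(t, shifts=(5, -3), dims=(-2, -1))   # arbitrary spatial shift

print(torch.allclose(shift(conv(x)), conv(shift(x)), atol=1e-5))   # True
```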

Reference architectures: AlexNet (Krizhevsky 2012, 8 layers, ReLU + dropout + GPU); VGG (Simonyan 2014, 19 layers of 3 \times 3 convolutions); ResNet (He 2015, skip connections enable 50-152 layers); EfficientNet (Tan 2019, compound scaling of depth/width/resolution).

ResNet's skip connections solved the degradation problem (deeper networks training to higher loss than shallower ones, even on the training set) and made 100-layer networks practical. From 2015 to 2020, ResNet variants were the default vision backbone. The ViT paper deliberately frames its baseline against ResNet to make the comparison legible.

ViT: Pure Transformer for Images

Proposition

ViT Patch Embedding and Resolution Scaling

Statement

Given an image of resolution H \times W with C channels, ViT creates N = HW/P^2 non-overlapping patches of size P \times P \times C. Each patch is linearly projected to a d-dimensional embedding:

\mathbf{z}_i^0 = \mathbf{E} \cdot \text{flatten}(\text{patch}_i) + \mathbf{e}_i^{\text{pos}}, \quad i = 1, \ldots, N

where \mathbf{E} \in \mathbb{R}^{d \times (P^2 C)} is the patch embedding matrix and \mathbf{e}_i^{\text{pos}} is a learnable positional embedding. A learnable [CLS] token is prepended, giving N+1 tokens.

Self-attention complexity per layer is

O(N^2 d) = O\!\left(\frac{H^2 W^2 d}{P^4}\right),

quadratic in the number of patches and therefore quartic in image side length for fixed P. Doubling resolution multiplies attention cost by 16\times.
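A minimal sketch of the patch-embedding step as stated above, assuming the common implementation trick of expressing \mathbf{E} as a convolution with kernel size and stride P; the ViT-B/16 shapes (P = 16, d = 768) are used for concreteness.

```python
# ViT patch embedding: project P x P x C patches to d dims, prepend [CLS], add positions.
import torch
import torch.nn as nn

H = W = 224; C = 3; P = 16; d = 768
N = (H // P) * (W // P)                               # 196 patches

proj = nn.Conv2d(C, d, kernel_size=P, stride=P)       # equivalent to E @ flatten(patch_i)
cls_token = nn.Parameter(torch.zeros(1, 1, d))
pos_embed = nn.Parameter(torch.zeros(1, N + 1, d))    # learnable positional embeddings

img = torch.randn(2, C, H, W)                         # batch of 2 images
tokens = proj(img).flatten(2).transpose(1, 2)         # (2, N, d)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1) + pos_embed
print(tokens.shape)                                   # torch.Size([2, 197, 768])
```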

Intuition

ViT's design choice is radical simplicity: use the standard transformer architecture without modification for images, and treat the patch-embed step as the only vision-specific component. The bet is that with enough pretraining data the transformer's flexibility outperforms any hand-coded CNN prior. The cost structure is the binding constraint: at 224 \times 224 with P = 16 you have N = 196 patches and attention is cheap, but at 1024 \times 1024 with the same patch size N = 4096 and a single attention layer is \sim 430\times more expensive. The whole subsequent lineage (Swin, NaViT, FlexiViT) can be read as ways to escape this quartic-in-resolution wall.

Proof Sketch

Patch counts: tiling H \times W with P \times P tiles gives \lceil H/P \rceil \cdot \lceil W/P \rceil patches; ViT requires H, W divisible by P, so N = HW/P^2 exactly. Each patch flattens to a vector in \mathbb{R}^{P^2 C} and the projection \mathbf{E} has d \cdot P^2 C parameters. Self-attention forms an N \times N similarity matrix and multiplies it by an N \times d value matrix, giving O(N^2 d) FLOPs per head. Substituting N = HW/P^2 yields the claimed quartic scaling in spatial dimension for fixed patch size.

Why It Matters

ViT showed the transformer architecture is not specific to language; it is a generic sequence processor. With sufficient pretraining (ImageNet-21k or JFT-300M) ViT-Large matches or exceeds ResNet-152 on ImageNet classification. The architectural unification opened the door to one backbone for vision and language, which led to the vision towers of GPT-4V, Gemini, and Claude 3.5 Sonnet, and to text-conditioned diffusion via the CLIP text encoder.

Failure Mode

The N^2 dependence is not avoidable in plain ViT. At 1024^2 resolution the attention matrix alone is 4096^2 \approx 1.7 \times 10^7 entries per head per layer; at 4096^2 it is 65536^2 \approx 4.3 \times 10^9. This is why every modern ViT variant either (a) uses local windows (Swin, MaxViT), (b) uses a fixed number of latents (Perceiver, Q-Former), or (c) uses content-adaptive tokens (NaViT, FlexiViT, register tokens). A pure ViT at native HD is not a real architecture in deployment.
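A few lines of arithmetic reproduce the numbers above; the patch size P = 16 is assumed throughout.

```python
# Patch count N and attention-matrix entries N^2 per head per layer at several resolutions.
P = 16
for side in (224, 1024, 4096):
    N = (side // P) ** 2
    print(f"{side}x{side}: N = {N:>6}, attention matrix entries N^2 = {N**2:.2e}")
# 224x224:   N =    196, attention matrix entries N^2 = 3.84e+04
# 1024x1024: N =   4096, attention matrix entries N^2 = 1.68e+07
# 4096x4096: N =  65536, attention matrix entries N^2 = 4.29e+09
```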

ViT properties to keep in mind:

  • Global receptive field from layer 1: every patch can attend to every other patch, unlike CNNs where the receptive field grows gradually with depth.
  • No built-in translation equivariance: positional embeddings must learn spatial structure from data; this is part of why ViT needs more pretraining data than ResNet.
  • Data-hungry: ViT underperforms ResNet when trained on ImageNet-1k alone. DeiT (Touvron et al. 2021, arXiv:2012.12877) closed this gap with distillation from a strong CNN teacher and aggressive augmentation, showing the data-hunger was about training recipe as much as inductive bias.
  • Resolution flexibility: at inference time the patch grid can be resized by interpolating positional embeddings (a sketch of this interpolation follows the list); FlexiViT (Beyer et al. 2023) trains with mixed patch sizes so a single checkpoint serves many resolutions.
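The positional-embedding interpolation mentioned in the last bullet reduces to a 2D resize of the patch-position grid; the 14x14-to-32x32 grid sizes and bicubic mode below are illustrative choices, not prescribed by the ViT paper.

```python
# Interpolate a pretrained 14x14 positional-embedding grid (224px, P=16) to 32x32 (512px).
# The [CLS] position is assumed to be the first token and is left untouched.
import torch
import torch.nn.functional as F

d = 768
pos_embed = torch.randn(1, 1 + 14 * 14, d)             # pretrained: [CLS] + 196 patch positions

cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
patch_pos = patch_pos.reshape(1, 14, 14, d).permute(0, 3, 1, 2)            # (1, d, 14, 14)
patch_pos = F.interpolate(patch_pos, size=(32, 32), mode="bicubic", align_corners=False)
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, 32 * 32, d)

new_pos_embed = torch.cat([cls_pos, patch_pos], dim=1)
print(new_pos_embed.shape)                              # torch.Size([1, 1025, 768])
```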

Swin Transformer: Bringing Back Hierarchy

Proposition

Swin Shifted Window Attention

Statement

Swin computes self-attention within local windows of M \times M patches. Per-layer complexity is

\text{Complexity}(\text{W-MSA}) = O(M^2 N d) \quad \text{vs.} \quad \text{Complexity}(\text{global MSA}) = O(N^2 d),

where N is the total number of patches. With M fixed the windowed variant is linear in N rather than quadratic.

To enable cross-window communication, alternating layers use shifted windows: the partition is shifted by (\lfloor M/2 \rfloor, \lfloor M/2 \rfloor) patches, so patches that sat on a window boundary in the previous layer now share a window with their neighbors from across that boundary.

Swin builds a hierarchical feature pyramid by patch merging across stages:

  • Stage 1: H/4 \times W/4 patches with C channels.
  • Stage 2: H/8 \times W/8 patches with 2C channels.
  • Stage 3: H/16 \times W/16 patches with 4C channels.
  • Stage 4: H/32 \times W/32 patches with 8C channels.

Intuition

ViT's quadratic attention is wasteful for dense-prediction tasks (detection, segmentation) where you need high-resolution feature maps. Swin reintroduces three CNN-like priors (local attention, hierarchy, downsampling) while keeping the attention mechanism. Shifted windows are the key trick: they provide cross-window connections without paying the global-attention cost, similar to how convolutions provide local connectivity with translational structure but without explicit windowing.

Proof Sketch

For a single window with M^2 patches the attention cost is O(M^2 \cdot M^2 \cdot d) = O(M^4 d). There are N/M^2 windows per layer, so total per-layer cost is O((N/M^2) \cdot M^4 d) = O(M^2 N d), linear in N for fixed M. The shifted-window construction does not change asymptotic cost: it only re-partitions the same set of tokens. The hierarchy reduces N by 4\times between stages while doubling C, keeping per-stage compute roughly balanced (a budget-allocation pattern borrowed from EfficientNet).
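A small sketch of this FLOP bookkeeping, with stage-1-like values (P = 4, M = 7, d = 96) chosen for illustration:

```python
# Windowed vs. global attention cost: the ratio is M^2 / N, so it shrinks as resolution grows.
def attn_cost(N, d, M=None):
    # attention FLOPs per layer, up to constants; M=None means global attention
    return (M * M if M else N) * N * d

d, M, P = 96, 7, 4
for side in (224, 1024):
    N = (side // P) ** 2                                 # stage-1 token count
    ratio = attn_cost(N, d, M) / attn_cost(N, d)
    print(f"{side}px: N = {N}, windowed/global cost = {ratio:.5f}")
# 224px:  N = 3136,  windowed/global cost = 0.01563  (= 49 / 3136)
# 1024px: N = 65536, windowed/global cost = 0.00075  (= 49 / 65536)
```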

Why It Matters

Swin became the dominant vision backbone for dense prediction in 2021-2023: Cascade Mask R-CNN with Swin-L topped COCO detection, UPerNet + Swin became the de facto baseline for ADE20K semantic segmentation, and the hierarchical features dropped into existing FPN-based pipelines as a direct CNN replacement. Swin reframed the "ViT vs. CNN" debate as a false dichotomy: the best vision design combines attention with hierarchical spatial structure, not one or the other in isolation.

Failure Mode

The shifted-window mechanism reintroduces irregular boundary cases that are awkward to implement efficiently on accelerators; the original Swin code uses a cyclic-shift trick with attention masking that is non-trivial to port. Swin-V2 (Liu et al. 2022) addressed several scaling issues (attention saturation at depth, position-encoding instability with resolution change) but could not match the simplicity of plain ViT, which is one reason later self-supervised work (MAE, DINOv2) returned to plain ViT despite its weaker priors.
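A minimal sketch of the cyclic-shift bookkeeping described above, assuming square feature maps divisible by the window size; the real Swin implementation additionally builds an attention mask so tokens wrapped around by the roll cannot attend to their new, non-adjacent window-mates.

```python
# Regular vs. shifted window partition: the shift is implemented as a cyclic roll
# so that both layers can reuse the same regular M x M partition.
import torch

def window_partition(x, M):
    # x: (B, H, W, d) -> (num_windows * B, M*M, d)
    B, H, W, d = x.shape
    x = x.view(B, H // M, M, W // M, M, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, d)

B, H, W, d, M = 1, 56, 56, 96, 7
x = torch.randn(B, H, W, d)

windows = window_partition(x, M)                                        # regular layer
shifted = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))      # shifted layer
shifted_windows = window_partition(shifted, M)
print(windows.shape, shifted_windows.shape)    # torch.Size([64, 49, 96]) twice
```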

MAE: Masked Image Modeling at Scale

Proposition

MAE Asymmetric Encoder-Decoder with 75% Masking

Statement

He et al. (MAE, 2022) train a Vision Transformer to reconstruct randomly masked image patches with two design choices that diverge sharply from BERT-style masking:

  1. High masking ratio. With p = 0.75 (vs. 15% in BERT), only 25\% of patches are visible to the encoder.
  2. Asymmetric encoder-decoder. The (large) encoder f_\text{enc} processes only the visible patches; mask tokens are appended only inside a (small) decoder f_\text{dec} that reconstructs pixel values:

\mathcal{L}_\text{MAE} = \frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \| f_\text{dec}(f_\text{enc}(x_{\bar{\mathcal{M}}}))_i - x_i \|_2^2,

where \mathcal{M} is the masked-patch index set, \bar{\mathcal{M}} its complement, and the loss is computed on masked patches only in normalized pixel space.

The asymmetry yields a 3\text{-}4\times encoder speedup and lower memory at fixed model size, because the encoder sees only 1 - p = 25\% of the tokens.
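A minimal sketch of the masking and loss bookkeeping stated above. The encoder and decoder are stand-ins (the predictions are placeholder tensors); the point is that the encoder input shrinks to 25% of the patches and the loss is gathered over masked indices only.

```python
# MAE-style random masking and masked-patch MSE loss.
import torch

def random_masking(tokens, mask_ratio=0.75):
    B, N, d = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    shuffle = torch.rand(B, N).argsort(dim=1)             # random permutation per image
    keep_idx, mask_idx = shuffle[:, :n_keep], shuffle[:, n_keep:]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep_idx, mask_idx

B, N, d, patch_dim = 2, 196, 768, 16 * 16 * 3
patch_tokens = torch.randn(B, N, d)                       # embedded patches (no [CLS])
target_pixels = torch.randn(B, N, patch_dim)              # per-patch normalized pixel targets

visible, keep_idx, mask_idx = random_masking(patch_tokens)
print(visible.shape)                                      # (2, 49, 768): encoder input only

# a real decoder would reinsert mask tokens and predict pixels for all N patches;
# pred here is a placeholder with the right shape
pred = torch.randn(B, N, patch_dim)
gather = lambda t: torch.gather(t, 1, mask_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
loss = ((gather(pred) - gather(target_pixels)) ** 2).mean()   # L_MAE on masked patches only
```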

Intuition

Image patches are far more redundant than language tokens: a single patch is recoverable from neighbors via local statistics, so 15% masking would reduce to a pixel-inpainting copy task. Pushing the mask ratio to 75% breaks this: the model must reason about object-level structure to fill in three-quarters of a scene, and the resulting features encode global scene understanding rather than local texture statistics. The asymmetric encoder-decoder makes the high mask ratio cheap, because the encoder cost scales with the number of visible patches.

Proof Sketch

Compute: encoder attention is O((N(1-p))^2 d_\text{enc}) instead of O(N^2 d_\text{enc}), a factor of (1-p)^2 = 0.0625 for p = 0.75. The decoder operates on all N tokens but is much smaller (typically d_\text{dec} = 512 vs. d_\text{enc} = 1024 and 8 vs. 24 layers), so it costs a small fraction of one forward pass through the encoder. Total training cost per image is dominated by the encoder, which is \approx 3\text{-}4\times cheaper than processing all tokens.

Why high mask ratio works: information-theoretically, completion of a masked patch is easy when the conditional entropy H(x_i \mid \text{visible}) is low, which holds for low mask ratios in images. Pushing p up forces H(x_i \mid \text{visible}) high, so the representation must encode high-level structure to make the loss trainable. The 75% threshold is empirical (see He et al. Fig. 5) and varies by domain.

Why It Matters

MAE is the dominant pretraining recipe for ViT-based vision encoders that need to ship without paired text data: it produces features that fine-tune well across detection, segmentation, and depth estimation, and it scales cleanly to ViT-H/14 and beyond. Almost every "vision foundation model" trained without web text (EVA, AIMv2, I-JEPA) sits on some descendant of the MAE objective. The asymmetric encoder-decoder trick has also been ported to video (VideoMAE) and 3D (Point-MAE).

Failure Mode

MAE features underperform CLIP and DINOv2 on linear probing with frozen features: the reconstruction objective rewards low-level pixel fidelity, which leaves higher-level semantic structure under-developed in the encoder representation when used as-is. Fine-tuning closes the gap (MAE-pretrained ViT-L beats supervised ViT-L on ImageNet) but linear probe is poor. This is the practical reason DINOv2 displaced MAE as the default frozen-feature backbone in 2023-2024.

DINO and DINOv2: Self-Distillation

Proposition

DINO Self-Distillation Objective

Statement

DINO (Caron et al. 2021) trains a Vision Transformer via self-distillation: a student network f_{\theta_s} is trained to match the output distribution of a teacher network f_{\theta_t}, where the teacher is an exponential moving average (EMA) of the student's own parameters. There are no labels.

For an image, generate two global crops x_1, x_2 and several local crops x_3, \ldots, x_K. The loss matches student and teacher output distributions across crops:

\mathcal{L} = \sum_{x \in \{x_1, x_2\}} \sum_{\substack{x' \in \text{all crops} \\ x' \neq x}} H\!\left(P_t(x), \, P_s(x')\right),

where P_t(x) = \text{softmax}((f_{\theta_t}(x) - c) / \tau_t) and P_s(x') = \text{softmax}(f_{\theta_s}(x') / \tau_s) with a sharper teacher (\tau_t < \tau_s), and H is cross-entropy. The teacher centering vector c is the running mean of teacher outputs.

The EMA update with momentum m \in [0, 1) gives \theta_t \leftarrow m\,\theta_t + (1-m)\,\theta_s. Centering and sharpening jointly prevent two distinct collapse modes: centering blocks collapse to a peaked distribution, sharpening blocks collapse to a uniform distribution.
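A minimal sketch of one DINO update as stated above. The student and teacher are stand-in linear heads rather than ViTs, the prototype count is scaled down, and the temperature, momentum, and centering-rate values follow typical settings from the paper but are otherwise illustrative.

```python
# DINO-style self-distillation step: centered + sharpened teacher targets,
# cross-entropy over all (teacher global crop, student other crop) pairs, EMA teacher update.
import torch
import torch.nn.functional as F

d_in, K = 384, 4096                                   # projection input dim, prototype count (reduced)
student = torch.nn.Linear(d_in, K)
teacher = torch.nn.Linear(d_in, K)
teacher.load_state_dict(student.state_dict())         # teacher starts as a copy of the student
center = torch.zeros(K)
tau_s, tau_t, m = 0.1, 0.04, 0.996

global_crops = [torch.randn(8, d_in) for _ in range(2)]   # stand-in crop embeddings, batch 8
local_crops = [torch.randn(8, d_in) for _ in range(6)]

with torch.no_grad():
    t_out = [teacher(x) for x in global_crops]
    center = 0.9 * center + 0.1 * torch.cat(t_out).mean(dim=0)    # running center update

loss, n_terms = 0.0, 0
for ti, t in enumerate(t_out):
    p_t = F.softmax((t - center) / tau_t, dim=-1)                 # centered, sharpened teacher
    for si, s_x in enumerate(global_crops + local_crops):
        if si == ti:
            continue                                              # x' != x: skip the same view
        log_p_s = F.log_softmax(student(s_x) / tau_s, dim=-1)
        loss = loss + (-(p_t * log_p_s).sum(dim=-1)).mean()
        n_terms += 1
loss = loss / n_terms

# after the optimizer step on the student, the teacher follows by EMA
with torch.no_grad():
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        p_teacher.mul_(m).add_(p_student, alpha=1 - m)
```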

Intuition

DINO's recipe forces the student to produce consistent representations across different crops of the same image: a small local crop must predict the same softmax distribution as the teacher's view of the whole image. This is a strong inductive bias toward part-whole consistency, and it is exactly the bias that produces the famous "DINO attention maps" where the [CLS] token attends to object foreground without any segmentation supervision. The EMA teacher provides a slowly moving target: gradients do not flow into it, so it acts like an "old self" that the student is always catching up to.

Proof Sketch

Why centering + sharpening prevents collapse: a constant output (uniform distribution) gives zero loss only if the teacher is also uniform; the sharpening temperature \tau_t < \tau_s pushes the teacher toward peakedness. A peaked output (one-hot distribution) gives zero loss only if the teacher's peak matches the student's input-conditional peak; the centering subtracts the running mean of teacher outputs, removing the shared peak that would arise from collapse to a single class. Caron et al. show ablating either centering or sharpening alone causes collapse within a few epochs.

Why It Matters

DINOv2 (Oquab et al. 2023) scales the recipe with curated data, a richer loss (DINO + iBOT patch-level objective + KoLeo regularization), and distillation from a larger teacher. The result: features that match or exceed supervised pretraining on classification, segmentation, depth, correspondence, and retrieval without fine-tuning. DINOv2-ViT-L is the default frozen visual backbone in many 2024-2025 systems where a text-conditioned encoder (CLIP) is not desired, including VGGT (3D geometry), several robotics policies, and large-scale image-retrieval indices.

Failure Mode

DINOv2 features are language-free by construction, which is an advantage for spatial tasks and a disadvantage for any pipeline that needs the encoder to understand text-described concepts (zero-shot classification, text-based retrieval). A practical pattern is to fuse DINOv2 spatial features with a CLIP/SigLIP semantic head, which AM-RADIO (NVIDIA 2024, arXiv:2310.16764) does explicitly via multi-teacher distillation.

Register Tokens: A 2024 Correction

Proposition

Register Tokens Eliminate ViT Attention Artifacts

Statement

Darcet et al. (ICLR 2024, arXiv:2309.16588) observed that supervised and self-supervised ViTs at scale develop a small set of high-norm outlier tokens on low-information image patches. These tokens carry global information unrelated to their spatial location and degrade dense prediction (segmentation, depth) and unsupervised object discovery (LOST, TokenCut on DINO).

The fix: prepend K learnable register tokens r_1, \ldots, r_K to the patch sequence, never tied to any spatial position, never used as output targets. After retraining (or fine-tuning), the artifact pattern migrates from spatial patches into the register tokens, and the spatial patch tokens recover clean attention maps and improved dense-task performance.

Empirically K = 4 suffices for ViT-L/14, and the same fix applies across DINOv2, OpenCLIP, and DeiT architectures, suggesting the artifact is a property of the attention mechanism interacting with global aggregation rather than of any specific training recipe.
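A minimal sketch of the register-token plumbing, using a stand-in transformer encoder; the only points being illustrated are where the K tokens are inserted and that they are sliced off before any dense head.

```python
# Register tokens: extra learnable, position-free tokens carried through the encoder
# and discarded at the output.
import torch
import torch.nn as nn

B, N, d, K = 2, 256, 1024, 4                          # ViT-L/14-ish at 224px: 16x16 = 256 patches
patch_tokens = torch.randn(B, N, d)

cls_token = nn.Parameter(torch.zeros(1, 1, d))
registers = nn.Parameter(torch.zeros(1, K, d))        # never tied to a spatial position

x = torch.cat([cls_token.expand(B, -1, -1),
               registers.expand(B, -1, -1),
               patch_tokens], dim=1)                   # (B, 1 + K + N, d)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=16, batch_first=True), num_layers=2)   # stand-in
x = encoder(x)

cls_out = x[:, 0]                                      # global features for classification
dense_out = x[:, 1 + K:]                               # spatial tokens only; registers discarded
print(cls_out.shape, dense_out.shape)                  # (2, 1024) (2, 256, 1024)
```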

Intuition

The model needs a place to write global summary information, and the softmax attention mechanism makes it cheap to repurpose a few low-information tokens for that role: a token sitting on a uniform sky patch can absorb a high-norm "scratchpad" vector without much loss, and attention from other patches can route through it. The architectural fix is to provide explicit scratchpad slots so the model does not have to hijack image-content tokens.

The relationship to language-side attention sinks (Xiao et al. 2023, arXiv:2309.17453, "Efficient Streaming Language Models with Attention Sinks") is direct: both phenomena are softmax-induced "I have to put my mass somewhere" effects, both are fixed by giving the model a dedicated sink token, and the parallel discovery in 2023-2024 across vision and language indicates a shared structural cause.

Proof Sketch

Argument by ablation rather than closed-form proof. Darcet et al. measure patch-token norms in pretrained DINOv2 and observe a heavy-tailed distribution where \sim 2\% of tokens have norm \geq 100\times the median. After adding register tokens and continuing training, the high-norm pattern moves cleanly into the register slots, leaving spatial tokens with a tight norm distribution. Downstream evaluation: improved segmentation mIoU on ADE20K, improved object-discovery accuracy on LOST/TokenCut, no degradation on classification.
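The norm-based diagnostic behind this ablation is easy to reproduce on any frozen ViT; a sketch with a stand-in feature tensor and an illustrative outlier threshold:

```python
# Flag patch tokens whose L2 norm is far above the per-image median norm.
import torch

def outlier_fraction(patch_tokens, factor=3.0):
    # patch_tokens: (B, N, d) last-layer features, [CLS] and register tokens excluded
    norms = patch_tokens.norm(dim=-1)                        # (B, N)
    median = norms.median(dim=1, keepdim=True).values
    return (norms > factor * median).float().mean().item()

feats = torch.randn(4, 256, 1024)                            # stand-in features
print(f"outlier token fraction: {outlier_fraction(feats):.4f}")
```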

Why It Matters

Register tokens are now standard in new ViT releases (DINOv2 with registers, OpenCLIP-ViT-L/14 with registers, EVA-CLIP-L with registers). The result reframes how we read ViT attention maps: attention from a spatial patch to a high-norm "sink" patch in pre-2024 models was not spatial reasoning, it was scratchpad routing, and visualization techniques that read attention as semantic correspondence were partly measuring this artifact. Modern reproducibility studies of older attention-map results need to control for whether the underlying ViT had register tokens.

Failure Mode

Adding register tokens does not change the underlying training objective, so it does not fix problems caused by the objective itself: a CLIP-trained ViT with registers still struggles with compositional reasoning, and a DINOv2-with-registers backbone still lacks language-paired semantics. The fix is architectural plumbing for a specific failure mode (spatial artifacts), not a general capability boost.

CLIP and Friends: The Cross-Modal Branch

The full treatment lives at CLIP, OpenCLIP, and SigLIP. Three points to keep in mind here:

  1. CLIP-style training uses a ViT image encoder identical in shape to a plain ViT-B/16 or ViT-L/14, with the supervised classification head replaced by a contrastive InfoNCE objective against a paired text encoder (a minimal sketch of this loss follows the list). The architectural lineage is identical; only the loss differs.
  2. CLIP's text-paired supervision produces features that are semantically rich but spatially weaker than DINOv2; this is why modern multi-encoder VLMs (LLaVA-1.6, Cambrian-1) often combine a CLIP/SigLIP encoder for semantics with a DINOv2 or MAE encoder for spatial structure.
  3. SigLIP-2 (Tschannen et al. 2025, arXiv:2502.14786) extends SigLIP with multilingual coverage, locality-aware decoding objectives, and improved dense-feature performance, narrowing the spatial-vs-semantic gap that motivated the dual-encoder pattern.
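A minimal sketch of the CLIP-style InfoNCE objective named in point 1, alongside the SigLIP sigmoid variant it is usually compared against; embeddings are stand-ins, and the temperature and sigmoid-bias values are illustrative initializations rather than tuned settings.

```python
# Contrastive image-text losses on a batch similarity matrix.
import torch
import torch.nn.functional as F

B, d = 8, 512
img = F.normalize(torch.randn(B, d), dim=-1)
txt = F.normalize(torch.randn(B, d), dim=-1)
logits = img @ txt.t() / 0.07                               # (B, B) similarities / temperature

# CLIP: symmetric softmax cross-entropy, positives on the diagonal
targets = torch.arange(B)
clip_loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# SigLIP: every (i, j) pair is an independent binary classification
labels = 2 * torch.eye(B) - 1                               # +1 on the diagonal, -1 elsewhere
bias = -10.0                                                # illustrative learnable-bias init
siglip_loss = -F.logsigmoid(labels * (logits + bias)).mean()
print(clip_loss.item(), siglip_loss.item())
```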

Native-Resolution and Frontier-Scale Variants

Three frontier directions worth naming:

NaViT and Pixtral. Dehghani et al. (2023, arXiv:2307.06304, "Patch n' Pack") introduced a packing scheme that processes images at native aspect ratio and resolution by concatenating tokens from multiple images into one sequence and using attention masking to prevent cross-image leakage. Pixtral (Agrawal et al. 2024, arXiv:2410.07073) inherits the same idea for VLM training, removing the standard "resize to 224^2" step that destroys aspect-ratio information and high-resolution detail.
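The packing idea reduces to index bookkeeping plus a block-diagonal attention mask; a minimal sketch with illustrative sequence lengths:

```python
# Pack token sequences from images of different native resolutions into one sequence,
# with a block-diagonal mask that forbids attention across image boundaries.
import torch

d = 64
seq_lens = [196, 330, 88]                                # tokens per image at native resolution
tokens = torch.cat([torch.randn(n, d) for n in seq_lens], dim=0)     # packed (sum(n), d)

image_id = torch.cat([torch.full((n,), i) for i, n in enumerate(seq_lens)])
attn_mask = image_id[:, None] == image_id[None, :]       # True where attention is allowed
print(tokens.shape, attn_mask.shape, attn_mask.float().mean())
# the boolean mask can be passed as attn_mask to scaled_dot_product_attention on the
# packed q/k/v, so no padding to the longest image is needed
```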

ViT-22B. Dehghani et al. (2023, arXiv:2302.05442) scaled a single densely-trained ViT to 22 billion parameters. The notable findings are not about benchmark numbers but about architectural changes needed to keep training stable at scale: parallel attention/FFN layers, query/key normalization, removal of biases. ViT-22B is also the largest published ViT to date and serves as a probe of how much further the dense-ViT recipe can scale before encountering qualitatively new failure modes.

Distillation-based agglomerative models. AM-RADIO (Ranzinger et al. NVIDIA 2024, arXiv:2310.16764) and Theia (Shang et al. 2024) train a single ViT student to match the features of multiple teacher backbones simultaneously (CLIP + DINOv2 + SAM, in the AM-RADIO case). The result is one frozen encoder that exposes CLIP-style semantics, DINOv2-style spatial structure, and SAM-style segmentation features at a single forward-pass cost. This is the practical 2024-2025 answer to "which encoder do I use," and it has displaced the single-teacher pattern in several production VLM training runs.
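A minimal sketch of a multi-teacher feature-matching loss in the AM-RADIO spirit; the teacher names, dimensions, and cosine objective here are stand-ins, and the real recipe also matches spatial feature maps and balances the per-teacher terms.

```python
# One student feature vector matched to several frozen teachers via per-teacher heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student = 768
teacher_dims = {"clip": 1024, "dinov2": 1024, "sam": 1280}            # illustrative dims
heads = nn.ModuleDict({k: nn.Linear(d_student, v) for k, v in teacher_dims.items()})

student_feat = torch.randn(4, d_student)                               # student summary features
teacher_feats = {k: torch.randn(4, v) for k, v in teacher_dims.items()}   # frozen teacher outputs

loss = sum(1.0 - F.cosine_similarity(heads[k](student_feat), teacher_feats[k], dim=-1).mean()
           for k in teacher_dims)
print(loss.item())
```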

SAM: Promptable Segmentation on a ViT Backbone

Segment Anything (Kirillov et al. 2023, arXiv:2304.02643) is a ViT-H/16 image encoder plus a small prompt encoder (\sim 4M parameters) plus a mask decoder (\sim 4M parameters), trained on the SA-1B dataset (1 billion masks across 11 million images). Its place in the lineage:

  • The image encoder is a plain ViT pretrained with MAE then fine-tuned on the segmentation objective. SAM is a use case of MAE-pretrained ViT features, not a new backbone.
  • The new pieces are the data engine (model-in-the-loop annotation produced SA-1B), the prompt-conditioning interface (points, boxes, text), and the multi-output mask decoder (returns three plausible masks per prompt to handle ambiguity).
  • SAM-2 (Ravi et al. 2024, arXiv:2408.00714) extends to video by adding a memory bank and a streaming mask propagation module, again on the same ViT image-encoder substrate.

The lesson for the lineage: the image-encoder design space converged on "plain ViT with strong pretraining" by 2023, and innovation has shifted upward to data, prompting interface, and downstream task heads.

The Lineage at a Glance

| Model | Year | Key Innovation | Supervision | Primary Use |
| --- | --- | --- | --- | --- |
| AlexNet | 2012 | CNN + ReLU + dropout + GPU | Supervised | Classification |
| ResNet | 2015 | Skip connections, 100+ layers | Supervised | Backbone for all vision tasks |
| ViT | 2020 | Patches as tokens, plain attention | Supervised (large-scale) | Classification |
| DeiT | 2021 | Distillation + augmentation for ViT | Supervised (ImageNet only) | ImageNet-1k-only ViT training |
| Swin | 2021 | Shifted windows, hierarchical features | Supervised | Detection, segmentation |
| CLIP | 2021 | Contrastive image-text pretraining | Self-supervised (text) | Zero-shot, multimodal backbone |
| DINO | 2021 | Self-distillation, EMA teacher | Self-supervised | Frozen-feature extraction |
| Perceiver / Perceiver IO | 2021 | Cross-attention to a latent bottleneck | Supervised | Modality-agnostic encoder |
| MAE | 2021 | Asymmetric encoder + 75% mask | Self-supervised | ViT pretraining at scale |
| ConvNeXt | 2022 | ResNet modernized with ViT-era choices | Supervised | Strong CNN baseline |
| EVA / EVA-CLIP | 2022-23 | Masked image modeling with CLIP-feature targets | Self-supervised | Large vision backbone |
| ViT-22B | 2023 | Dense ViT scaled to 22B parameters | Supervised | Frontier scaling probe |
| DINOv2 | 2023 | Curated data + iBOT + KoLeo + distillation | Self-supervised | Universal vision features |
| SigLIP | 2023 | Sigmoid pairwise loss replaces softmax | Self-supervised (text) | Image-text pretraining at large batch |
| SAM | 2023 | Promptable segmentation, SA-1B | Supervised (prompt) | Zero-shot segmentation |
| NaViT | 2023 | Patch n' Pack: native aspect / resolution | Supervised | Variable-resolution ViT |
| Register tokens | 2024 | Dedicated scratchpad slots fix attention artifacts | Architectural fix | Cleaner dense features |
| AM-RADIO | 2024 | Multi-teacher distillation (CLIP + DINOv2 + SAM) | Self-supervised | One backbone, three feature spaces |
| SAM-2 | 2024 | Memory bank for video segmentation | Supervised (prompt) | Video segmentation |
| Pixtral | 2024 | NaViT-style native-resolution VLM tower | Self-supervised (text) | Multimodal LLM backbone |
| SigLIP-2 | 2025 | Multilingual + locality + dense features | Self-supervised (text) | Improved CLIP-class encoder |

Why ViT Works (and Where It Doesn't)

The standard explanation: attention can model global dependencies that convolutions miss. A convolutional layer with a 3 \times 3 kernel only sees a 3 \times 3 neighborhood; global information requires stacking many layers. A single attention layer can relate any patch to any other.

The deeper explanation is a bias-variance tradeoff at the architectural level. CNNs hard-code translation equivariance, locality, and hierarchy. ViTs hard-code none of these. With limited data the CNN's priors are correct on average and act as a regularizer; with abundant data the priors become a constraint and ViT's flexibility wins. The empirical crossover happens around ImageNet-21k scale (\sim 14M labeled images); above that scale the priors hurt, below it they help.

The 2024-2025 frontier modifies this picture in three ways. First, MAE and DINOv2 partially close the data gap with self-supervised pretraining, making ViT competitive on small downstream tasks if upstream pretraining is large. Second, register tokens reveal that "pure ViT" was silently using a small set of tokens as scratchpad memory, which means the architectural prior was less neutral than advertised. Third, AM-RADIO-style multi-teacher distillation suggests the right question is not "which backbone" but "which combination of feature spaces," with the underlying architecture being a generic ViT in nearly every case.

Common Confusions

Watch Out

ViT is not inherently better than CNNs

ViT's advantage appears primarily with large-scale pretraining. On ImageNet-1k from scratch a well-tuned ResNet-50 or ConvNeXt-T can match or exceed ViT-B. The empirical finding is not that attention beats convolution but that attention scales better with data and compute. For small-data and resource-constrained settings, modernized CNNs (ConvNeXt, EfficientNetV2) and hybrids (MaxViT, MobileViT) remain competitive or superior.

Watch Out

Swin is more than ViT with smaller windows

Swin adds hierarchical downsampling via patch merging, producing a multi-scale feature pyramid that plain ViT does not. The hierarchy is what makes Swin a drop-in replacement in detection and segmentation pipelines that expect FPN-style features at 1/4, 1/8, 1/16, 1/32 resolution. ViT produces single-scale features from the last layer, which detection heads must adapt with extra modules (e.g. ViTDet's simple feature pyramid, He et al. 2022, arXiv:2203.16527).

Watch Out

DINOv2 features are not a strict upgrade of supervised ImageNet features

DINOv2 features capture visual structure that supervised ImageNet training misses. Supervised models overfit to the 1000 ImageNet classes: their features are optimized to discriminate dog breeds, not to encode spatial layout, depth, or material properties. DINOv2's self-supervised objective produces features that transfer broadly because they are not tied to any specific label set, which is the same reason language-pretrained transformers transfer better than task-specific classifiers. This does not mean DINOv2 features are universally better; on tasks where the ImageNet label structure happens to match the downstream task, supervised pretraining still wins.

Watch Out

MAE and DINOv2 are not interchangeable

Both are self-supervised ViT pretraining recipes, but the resulting features differ. MAE optimizes pixel reconstruction, which produces features that fine-tune well but probe poorly: a linear classifier on frozen MAE features lags a linear classifier on frozen DINOv2 features by 10+ ImageNet-1k accuracy points. DINOv2 optimizes consistency across augmentations, which produces features that are immediately usable as frozen embeddings. The right choice depends on the downstream pipeline: fine-tune → MAE; freeze and probe → DINOv2; both work for VLM backbones but DINOv2 is more common in 2024-2025 production runs.

Watch Out

Attention maps from pre-2024 ViTs include scratchpad artifacts

Visualizations of "what the [CLS] token attends to" in DINOv2, OpenCLIP, and DeiT before register tokens were partly measuring the high-norm-token scratchpad effect (Darcet et al. 2024). Spatial correspondences inferred from these maps were correct enough on average to look semantic but were corrupted by attention routing through artifact tokens on low-information patches. Reproducing those results on register-token variants gives cleaner maps and slightly different spatial correspondences.

Watch Out

SAM is a use case of ViT pretraining, not a new backbone

The SAM image encoder is a plain ViT-H/16 pretrained with MAE and then fine-tuned on the segmentation objective. The genuinely new components are the data engine that produced SA-1B, the prompt-conditioning interface, and the multi-output mask decoder. Quoting "SAM" as a backbone in the same breath as "ViT" or "DINOv2" elides the distinction between architecture and task.

Exercises

ExerciseCore

Problem

For a 224 \times 224 image with 3 color channels and a ViT with patch size P = 16 and embedding dimension d = 768, compute (a) the number of patches N, (b) the size of the patch embedding matrix \mathbf{E}, (c) the self-attention complexity per layer, and (d) the factor by which (c) increases if the image resolution doubles to 448 \times 448 at the same patch size.

ExerciseAdvanced

Problem

Swin with window size M = 7 has per-layer attention complexity O(M^2 N d) instead of ViT's O(N^2 d). For a 224 \times 224 image with P = 4 (Swin's patch size at stage 1), compute the ratio of Swin's per-layer attention cost to plain ViT's per-layer attention cost at the same patch size. Then compute the same ratio at 1024 \times 1024 and explain why Swin's advantage grows with resolution.

ExerciseAdvanced

Problem

MAE uses mask ratio p = 0.75 and an asymmetric encoder-decoder, where the encoder processes only visible patches. For a ViT-L encoder at 224^2 with P = 16, compute the per-image encoder attention cost (a) without masking, (b) with p = 0.75 masking. Then explain why pushing the mask ratio higher than language-modeling conventions (\sim 15\%) is necessary for images.

ExerciseAdvanced

Problem

Darcet et al. (2024) report that adding K = 4 register tokens to a DINOv2 ViT-L/14 sequence eliminates the high-norm spatial-token artifact and improves dense prediction. (a) What fraction of the input sequence length do the register tokens add? (b) Why is K = 4 sufficient when there might be hundreds of low-information patches in a typical image? (c) Predict what would happen if you used K = 0 but post-hoc re-normalized the patch tokens to unit norm.

ExerciseResearch

Problem

DINOv2 produces strong visual features without labels; CLIP produces strong features with text supervision; AM-RADIO distills from CLIP + DINOv2 + SAM into a single ViT student. (a) For a downstream VLM (text generation conditioned on an image), which of the three feature sources do you expect to dominate gradient flow into the LLM, and why? (b) Under what conditions would multi-teacher distillation degrade downstream performance compared to using a single teacher?


References

Canonical:

  • Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT, ICLR 2021), arXiv:2010.11929. The original paper. Section 3.1 covers the patch-embedding construction; Section 4.2 reports the supervised JFT-300M scaling result.
  • Touvron, Cord, Douze, Massa, Sablayrolles, Jegou, Training data-efficient image transformers and distillation through attention (DeiT, ICML 2021), arXiv:2012.12877. Closes the ImageNet-1k-from-scratch gap with distillation from a strong CNN teacher.
  • Liu, Lin, Cao, Hu, Wei, Zhang, Lin, Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021), arXiv:2103.14030. Window attention + shifted windows + patch merging.
  • Caron, Touvron, Misra, Jegou, Mairal, Bojanowski, Joulin, Emerging Properties in Self-Supervised Vision Transformers (DINO, ICCV 2021), arXiv:2104.14294. Self-distillation with EMA teacher; first observation of unsupervised attention-based segmentation.
  • He, Chen, Xie, Li, Dollar, Girshick, Masked Autoencoders Are Scalable Vision Learners (MAE, CVPR 2022), arXiv:2111.06377. Asymmetric encoder-decoder with 75\% masking; Section 4.1 has the masking-ratio ablation.
  • Jaegle, Borgeaud, Alayrac, Doersch, Ionescu, Ding, Koppula, Zoran, Brock, Shelhamer, Henaff, Botvinick, Zisserman, Vinyals, Carreira, Perceiver IO: A General Architecture for Structured Inputs and Outputs (ICLR 2022), arXiv:2107.14795. Latent-bottleneck attention removes input quadratic cost.
  • Liu, Mao, Wu, Feichtenhofer, Darrell, Xie, A ConvNet for the 2020s (ConvNeXt, CVPR 2022), arXiv:2201.03545. Modernizes ResNet with ViT-era design choices, restores parity on classification.
  • Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever, Learning Transferable Visual Models from Natural Language Supervision (CLIP, ICML 2021), arXiv:2103.00020. The contrastive image-text branch of the lineage; full treatment at CLIP, OpenCLIP, and SigLIP.

Current:

  • Oquab, Darcet, Moutakanni, Vo, Szafraniec, Khalidov, Fernandez, Haziza, Massa, El-Nouby, Assran, Ballas, Galuba, Howes, Huang, Li, Misra, Rabbat, Sharma, Synnaeve, Xu, Jegou, Mairal, Labatut, Joulin, Bojanowski, DINOv2: Learning Robust Visual Features without Supervision (TMLR 2024), arXiv:2304.07193. Curated data + iBOT + KoLeo + distillation; the default frozen-feature backbone for non-text-conditioned vision in 2024-2025.
  • Fang, Wang, Xie, Sun, Wu, Wang, Huang, Wang, Cao, EVA: Exploring the Limits of Masked Visual Representation Learning at Scale (CVPR 2023), arXiv:2211.07636. Masked image modeling with CLIP-feature targets; bridges MAE and CLIP.
  • Zhai, Mustafa, Kolesnikov, Beyer, Sigmoid Loss for Language Image Pre-Training (SigLIP, ICCV 2023), arXiv:2303.15343. Pairwise sigmoid replaces softmax NCE; dominant choice for new contrastive runs.
  • Dehghani, Djolonga, Mustafa, Padlewski, Heek, Gilmer, Steiner, Caron, Geirhos, Alabdulmohsin, Jenatton, Beyer, Tschannen, Arnab, Wang, Riquelme, Minderer, Puigcerver, Evci, Kumar, van Steenkiste, Elsayed, Mahendran, Yu, Oliver, Huot, Bastings, Collier, Gritsenko, Birodkar, Vasconcelos, Tay, Mensink, Kolesnikov, Pavetic, Tran, Kipf, Lucic, Zhai, Keysers, Harmsen, Houlsby, Scaling Vision Transformers to 22 Billion Parameters (ViT-22B, ICML 2023), arXiv:2302.05442. Architectural changes (parallel attention/FFN, QK-norm, no biases) needed for stable training at frontier scale.
  • Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollar, Girshick, Segment Anything (SAM, ICCV 2023), arXiv:2304.02643. Promptable segmentation on a MAE-pretrained ViT-H; SA-1B dataset and data engine.
  • Ravi, Gabeur, Hu, Hu, Ryali, Ma, Khedr, Radle, Rolland, Gustafson, Mintun, Pan, Alwala, Carion, Wu, Girshick, Dollar, Feichtenhofer, SAM 2: Segment Anything in Images and Videos (2024), arXiv:2408.00714. Memory bank for video segmentation on the same ViT image encoder.
  • Dehghani, Mustafa, Djolonga, Heek, Minderer, Caron, Steiner, Puigcerver, Geirhos, Alabdulmohsin, Beyer, Houlsby, Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution (NeurIPS 2023), arXiv:2307.06304. Native-resolution ViT via sequence packing.
  • Ranzinger, Heinrich, Kautz, Molchanov, AM-RADIO: Agglomerative Vision Foundation Model (CVPR 2024), arXiv:2310.16764. Multi-teacher distillation from CLIP + DINOv2 + SAM into one ViT student.
  • Agrawal, Antoniak, Hanna, Bout, Chaplot, Chudnovsky, Costa, De Monicault, Garg, Gervet, Ghosh, Heroux, Jacques, Kembhavi, Kerr, Kuduvalli, Lacroix, Lample, Larionov, Lavaud, Loison, Mahler, Mary, Micheli, Roziere, Sablayrolles, Saxena, Steiner, Stock, Subramanian, Tavares, Tibrewala, Vetrov, Wang, Yang, Yu, Bressand, Lample, Pixtral 12B (2024), arXiv:2410.07073. NaViT-style native-resolution ViT as a VLM image tower.
  • Tschannen, Weers, Kolesnikov, Beyer, Zhai, SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (2025), arXiv:2502.14786. Multilingual coverage and locality-aware decoding objectives close the dense-feature gap with DINOv2.

Critique:

  • Darcet, Oquab, Mairal, Bojanowski, Vision Transformers Need Registers (ICLR 2024), arXiv:2309.16588. Identifies high-norm spatial-token artifacts across DINOv2 / OpenCLIP / DeiT and shows K = 4 register tokens fix dense-task performance and attention-map cleanliness.
  • Xiao, Tian, Chen, Han, Lewis, Efficient Streaming Language Models with Attention Sinks (ICLR 2024), arXiv:2309.17453. Language-side companion: same attention-induced "scratchpad sink" phenomenon, same fix via dedicated sink tokens.
  • Park, Kim, Kim, Suh, Kim, Yoo, What Do Self-Supervised Vision Transformers Learn? (ICLR 2023), arXiv:2305.00729. Probes the difference between MAE and contrastive ViT features; documents the linear-probe gap that motivates DINOv2 over MAE for frozen-feature use.
  • Touvron, Bojanowski, Caron, Cord, El-Nouby, Grave, Izacard, Joulin, Synnaeve, Verbeek, Jegou, ResMLP: Feedforward networks for image classification with data-efficient training (TPAMI 2022), arXiv:2105.03404. Shows MLP-only architectures recover most of ViT's accuracy, suggesting attention is not the only sufficient inductive bias for vision.
  • Wightman, Touvron, Jegou, ResNet strikes back: An improved training procedure in timm (NeurIPS 2021 Workshop), arXiv:2110.00476. Modern training recipes lift ResNet-50 from 76.5% to 80.4% top-1 on ImageNet; relevant context for "ViT vs. CNN" claims.

Last reviewed: April 19, 2026
