
Model Timeline

A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.

Core · Tier 2 · Frontier · Reference · ~70 min

Why This Matters

The landscape of large language models has changed faster than any other area of technology. Understanding the timeline is not just historical trivia: each model introduced ideas (scaling laws, RLHF, mixture-of-experts, multimodal fusion, long context, constitutional AI, inference-time scaling) that define how the field works today.

This page is a structured factual reference. Dates, parameter counts, and capabilities are stated as precisely as public information allows. Frontier-tagged pages can become stale quickly; the currency of each section is noted.

How To Read A Frontier Timeline

This page mixes three kinds of evidence, and they should not be treated as equally strong:

| Claim type | What counts as primary evidence here | How the page states it |
| --- | --- | --- |
| Release date and product surface | official release posts, model cards, system cards, or provider docs | stated directly |
| Architecture and parameter count | technical reports or model cards published by the provider | stated directly only when disclosed |
| Architecture rumors or leaked internals | secondary reporting outside provider docs | labeled as unconfirmed and kept separate from facts |
| Benchmark claims | launch-day reports, system cards, or technical reports | described as release-time evidence, not timeless truth |

The rule is simple: if the provider did not publish the architecture, this page should say that it was not published rather than quietly filling in the blank.

Capability Shifts At A Glance

| Era | What changed | Why it mattered |
| --- | --- | --- |
| 2019-2020 scaling | GPT-2 and GPT-3 made pretraining scale itself a capability story | prompting started replacing task-specific fine-tuning for many use cases |
| 2022-2023 alignment | InstructGPT, ChatGPT, and Claude turned base models into assistants | product behavior became a post-training problem, not just a pretraining one |
| 2023-2025 open-weight expansion | Llama, Qwen, and DeepSeek made strong local and self-hosted models normal | labs no longer had a monopoly on serious model access |
| 2024-2025 reasoning | o-series, DeepSeek-R1, Gemini 2.5, and Claude thinking modes added test-time compute as a capability axis | smaller or cheaper models could trade latency for stronger reasoning |
| 2025-2026 multimodal agents | GPT-5, Claude 4, Gemini 3, and newer open models fused tools, reasoning, and multimodal I/O | the important unit stopped being "chatbot only" and became "model inside an action loop" |

Headline-model spec table (April 2026)

A complementary quantitative view: parameter counts, context windows, and headline release dates for the models the rest of this page treats in detail. Use this to anchor reasoning about deployment cost; the per-section definitions below cover the technical content.

| Model | Org | Release | Active params | Total params | Context | Open weights |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 | OpenAI | Feb 2019 | 1.5B | 1.5B | 1K | yes (eventually) |
| GPT-3 | OpenAI | Jun 2020 | 175B | 175B | 2K | no |
| ChatGPT | OpenAI | Nov 2022 | undisclosed | undisclosed | 4K | no |
| GPT-4 | OpenAI | Mar 2023 | undisclosed | undisclosed | 8K-32K | no |
| Claude 2 | Anthropic | Jul 2023 | undisclosed | undisclosed | 100K | no |
| Llama 2 70B | Meta | Jul 2023 | 70B | 70B | 4K | yes |
| Mixtral 8x7B | Mistral | Dec 2023 | 13B | 47B | 32K | yes (Apache) |
| Gemini 1.5 Pro | Google | Feb 2024 | undisclosed | undisclosed | 1M-2M | no |
| Llama 3.1 405B | Meta | Jul 2024 | 405B | 405B | 128K | yes |
| DeepSeek-V3 | DeepSeek | Dec 2024 | 37B | 671B | 128K | yes |
| o1 | OpenAI | Dec 2024 | undisclosed | undisclosed | 200K | no |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 37B | 671B | 128K | yes |
| Claude 4 Sonnet / Opus | Anthropic | May 2025 | undisclosed | undisclosed | 200K | no |
| Llama 4 Scout / Maverick | Meta | Apr 2025 | 17B / 17B | 109B / 400B | 10M / 1M | yes |
| GPT-5 | OpenAI | Aug 2025 | undisclosed | undisclosed | undisclosed | no |
| Claude Opus 4.5 | Anthropic | Nov 2025 | undisclosed | undisclosed | 200K-1M | no |
| GPT-5.5 / 5.5 Pro | OpenAI | Apr 23, 2026 | undisclosed | undisclosed | undisclosed | no |
| Claude Opus 4.7 | Anthropic | Apr 16, 2026 | undisclosed | undisclosed | 200K-1M | no |

"Undisclosed" is honest reporting: most frontier closed-weight models do not publish parameter counts or training compute, and inferring them from API behavior is unreliable. The system cards focus on capability and safety evaluations, not architectural details.

The Scaling Era (2019-2020)

GPT-2 (February 2019)

Definition

GPT-2

Organization: OpenAI. Parameters: 1.5 billion. Architecture: Decoder-only Transformer. Training data: WebText (40GB of web pages filtered by Reddit upvote quality).

Key contribution: Demonstrated that language models trained at sufficient scale exhibit zero-shot task performance. GPT-2 could generate coherent multi-paragraph text, summarize articles, and perform rudimentary translation without task-specific fine-tuning. OpenAI initially withheld the full model citing misuse concerns, sparking the first major debate about responsible release in the LLM era.

GPT-3 (June 2020)

Definition

GPT-3

Organization: OpenAI. Parameters: 175 billion. Architecture: Decoder-only Transformer (96 layers, 96 heads, $d_{\text{model}} = 12288$). Training data: Approximately 300 billion tokens from Common Crawl, WebText2, Books, and Wikipedia.

Key contribution: In-context learning (ICL). GPT-3 showed that by simply placing examples in the prompt (few-shot), a frozen model could perform tasks it was never explicitly trained on. This was a qualitative shift: instead of fine-tuning a model per task, you write a prompt. The paper also established smooth scaling laws relating model size to performance.

Example

In-Context Learning as Implicit Bayesian Inference

Few-shot prompting works by conditioning the model on examples that implicitly specify the task. Theoretical work (Xie et al., 2022) argues ICL can be understood as implicit Bayesian inference: the model maintains a posterior over latent "concepts" (tasks) given the prompt, and generates completions consistent with the inferred concept.
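The few-shot mechanic is easy to make concrete: the task is specified entirely by examples placed in the prompt, with no weight updates. A minimal sketch (the English-French pairs and prompt format are illustrative, not taken from the GPT-3 paper):

```python
# Sketch: few-shot prompt construction for in-context learning.
# The model stays frozen; the (input, output) pairs in the prompt
# implicitly specify the task (here, English -> French translation).

def build_few_shot_prompt(examples, query):
    """Format demonstration pairs plus a new query into one prompt string."""
    lines = [f"English: {src}\nFrench: {tgt}" for src, tgt in examples]
    lines.append(f"English: {query}\nFrench:")
    return "\n\n".join(lines)

examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
]
prompt = build_few_shot_prompt(examples, "cat")
print(prompt)  # the model completes the final "French:" line
```

Sending this string to a sufficiently large frozen model yields the translation; swapping the examples swaps the task, which is the core of the ICL shift.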

The Alignment Era (2022-2023)

ChatGPT (November 2022)

Definition

ChatGPT

Organization: OpenAI. Base model lineage: GPT-3.5, itself a fine-tuned descendant of InstructGPT (Ouyang et al. 2022), which was in turn built on GPT-3. The training pipeline was: (1) supervised fine-tuning on human-written demonstrations, (2) reward model training on human preference comparisons, (3) PPO optimization against the reward model. ChatGPT is not a simple fine-tune of base GPT-3; it is the end product of this multi-step alignment pipeline. Key innovation: Reinforcement Learning from Human Feedback (RLHF).

Key contribution: The consumer product that brought LLMs to mainstream awareness. The RLHF pipeline (drawing on policy gradient methods) made the model substantially more helpful, harmless, and conversational than the base GPT-3.5.
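The middle stage of that pipeline, reward model training, is usually framed as a pairwise (Bradley-Terry) preference loss. A minimal sketch of that loss under standard formulations (not OpenAI's actual training code):

```python
import math

# Sketch of the pairwise preference loss used to train RLHF reward models:
# the reward model should score the human-preferred response above the
# rejected one; the loss is -log sigmoid(r_chosen - r_rejected).

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Low when the model agrees with the human label, high when it disagrees."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # small loss: chosen scored higher
print(preference_loss(0.0, 2.0))  # large loss: rejected scored higher
```

Minimizing this over many human comparisons yields the scalar reward signal that the PPO stage then optimizes against.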

GPT-4 (March 2023)

Definition

GPT-4

Organization: OpenAI. Parameters: Not officially disclosed. Architecture: Not officially disclosed.

Architecture rumors (not confirmed). External reporting (SemiAnalysis and other secondary sources) has described GPT-4 as a mixture-of-experts Transformer in roughly the 1.8 trillion total-parameter range with a small number of active experts per token. OpenAI has never confirmed these numbers and they should be treated as unverified leaks, not factual architecture details.

Key contribution: Multimodal input (text and images). Substantially improved reasoning, factuality, and instruction following over GPT-3.5. Reported to score in the 90th percentile on the bar exam; this figure has been contested. Martinez (2024, arXiv:2308.07312, "Re-Evaluating GPT-4's Bar Exam Performance") reanalyzed the result and placed the actual percentile at roughly 48-68th, depending on the testing cohort and exam form. The claim should be read as "strong performance on legal reasoning tasks," not as a settled empirical benchmark. Demonstrated that scaling plus RLHF plus careful data curation can produce models with broad expert-level knowledge across many domains.

Claude Family (2023-present)

Definition

Claude (Anthropic)

Organization: Anthropic. Models (2023-2024): Claude 1 (March 2023), Claude 2 (July 2023), Claude 3 family (March 2024: Haiku, Sonnet, Opus), Claude 3.5 Sonnet (June 2024), Claude 3.5 Sonnet upgraded with computer use (October 2024).

Key innovation: Constitutional AI (CAI). Instead of relying solely on human preference labels for alignment, CAI uses a set of written principles (a "constitution") to guide the model. The model critiques and revises its own outputs according to these principles, then trains on the revised outputs. This makes the alignment process more transparent and scalable than pure RLHF.
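The critique-and-revise loop can be sketched schematically. Here `generate` is a placeholder for any model call, and the two principles are illustrative stand-ins, not Anthropic's actual constitution:

```python
# Schematic sketch of the Constitutional AI critique-and-revise loop.
# Both the principles and the `generate` callable are hypothetical
# placeholders to show control flow, not Anthropic's implementation.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could enable dangerous activities.",
]

def cai_revise(prompt, generate, n_rounds=2):
    """Self-critique a draft response against each principle, then revise."""
    response = generate(prompt)
    for principle in CONSTITUTION[:n_rounds]:
        critique = generate(
            f"Critique this response per the principle '{principle}':\n{response}")
        response = generate(
            f"Revise the response to address this critique:\n"
            f"Critique: {critique}\nResponse: {response}")
    return response  # revised (prompt, response) pairs then form training data

# Toy stand-in "model" that truncates its input, just to exercise the loop:
print(cai_revise("Explain lock picking.", lambda p: p[:40]))
```

The key property is that the training signal comes from written principles plus model self-critique, rather than from per-example human preference labels alone.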

Computer use milestone (October 2024): Claude 3.5 Sonnet became the first frontier model to support computer use in public beta: the ability to interpret screenshots and generate mouse/keyboard actions to control desktop software. This marked the beginning of agentic computer interaction as a model capability, distinct from text-only tool use.

Anthropic has emphasized safety research alongside capability, including work on interpretability (circuit analysis, dictionary learning), honest calibration, and refusal behavior.

The Open Model Revolution (2023-present)

Llama Family (2023-present)

Definition

Llama (Meta)

Organization: Meta. Models: Llama 1 (Feb 2023, 7B-65B), Llama 2 (July 2023, 7B-70B), Llama 3 (April 2024, 8B and 70B), Llama 3.1 (July 2024, 8B-405B), Llama 4 Scout and Maverick (April 5, 2025).

Key contribution: Established the open-weight model ecosystem. Llama 1 showed that smaller models trained on more tokens could match larger models (the Chinchilla scaling insight applied aggressively). Llama 2 added RLHF and a permissive license. Llama 3 pushed quality to near-frontier levels. The 405B Llama 3.1 model demonstrated that open-weight models could compete with proprietary systems on many benchmarks.

Llama 4 (April 2025): The first Llama generation to use Mixture-of-Experts and native multimodal pretraining. Scout (109B total, 17B active, 16 experts, 10M context window) fits on a single H100 GPU with int4 quantization. Maverick (approximately 400B total, 17B active, 128 experts, 1M context window for instruct variant) targets frontier-level performance. Both models were pretrained on interleaved text, images, and video as first-class inputs.
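The single-H100 claim for Scout is easy to sanity-check with back-of-envelope arithmetic, assuming ~0.5 bytes per parameter at int4 and ignoring KV cache and activation memory:

```python
# Back-of-envelope check: do Llama 4 Scout's 109B total parameters fit
# on one 80 GB H100 at int4? (Assumption: ~0.5 bytes/param; KV cache
# and activations are ignored, so this is a lower bound on real usage.)

params_total = 109e9
bytes_per_param_int4 = 0.5
weights_gb = params_total * bytes_per_param_int4 / 1e9
print(f"int4 weights: {weights_gb:.1f} GB (H100 has 80 GB)")
```

About 54.5 GB of weights, leaving headroom for cache and activations, which is consistent with the single-GPU positioning.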

The Llama family spawned an enormous ecosystem of fine-tunes, quantizations, and derivative models (Alpaca, Vicuna, Code Llama, and many others).

Qwen Family (2023-present)

Definition

Qwen (Alibaba)

Organization: Alibaba Cloud. Models: Qwen 1 (2023), Qwen 1.5 (2024), Qwen 2 (June 2024), Qwen 2.5 (September 2024), Qwen 3 (April 28, 2025).

Key contribution: Strong multilingual performance across 119 languages, particularly in Chinese and English. Qwen models are open-weight with Apache 2.0 licensing.

Qwen 3 (April 2025): Introduced hybrid thinking modes (switching between chain-of-thought reasoning and direct response). Dense variants from 0.6B to 32B; MoE variants including Qwen3-30B-A3B (30B total, 3B active) and Qwen3-235B-A22B (235B total, 22B active). Also released a proprietary Qwen3-Max with over 1 trillion parameters, available via API. Trained on 36 trillion tokens. Demonstrated competitive performance across coding, mathematics, and general language tasks. The Qwen-VL series added vision capabilities.

Architectural Innovation (2024-2025)

DeepSeek Family (2024-2025)

Definition

DeepSeek (DeepSeek AI)

Organization: DeepSeek AI (China). Key models: DeepSeek-V2 (May 2024, introduced MLA), DeepSeek-V3 (December 2024, 671B total, 37B active), DeepSeek-R1 (January 2025), DeepSeek-V3.1 (August 19, 2025).

Key innovations:

  • Mixture-of-Experts (MoE): DeepSeek-V3 uses MoE with 671 billion total parameters but only 37 billion active per token, achieving frontier performance at dramatically lower inference cost.
  • Multi-head Latent Attention (MLA): Introduced in V2, this KV-cache compression technique projects keys and values into a low-dimensional latent space, reducing memory requirements for long sequences.
  • Reasoning models: DeepSeek-R1 demonstrated that chain-of-thought reasoning can emerge from reinforcement learning on reasoning tasks, achieving competitive performance with OpenAI o1 on math and coding benchmarks. Released open-weight, enabling community study of reasoning model internals.
  • Hybrid reasoning (V3.1): DeepSeek-V3.1 (August 2025) introduced a hybrid architecture that combines thinking and non-thinking modes in a single model, switchable via the chat template. Maintains the 671B/37B MoE structure with 128K context.
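The MLA memory saving can be illustrated with rough cache arithmetic. The dimensions below are assumptions chosen for illustration, not DeepSeek's published configuration:

```python
# Rough KV-cache comparison: standard multi-head attention caches full
# keys and values per layer per token, while MLA caches one compressed
# latent vector per token. All dimensions here are illustrative.

seq_len, n_layers = 128_000, 60
d_model, d_latent = 7168, 512   # assumed model width and latent width
bytes_per = 2                   # fp16/bf16

mha_cache = seq_len * n_layers * 2 * d_model * bytes_per  # K and V
mla_cache = seq_len * n_layers * d_latent * bytes_per     # one latent vector
print(f"MHA-style cache: {mha_cache/1e9:.1f} GB, MLA-style: {mla_cache/1e9:.1f} GB")
```

At these assumed sizes the latent cache is over an order of magnitude smaller, which is why MLA matters specifically for long-context serving.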

DeepSeek demonstrated that architectural innovation (MoE, MLA) and RL-based training can compensate for having less compute than the largest Western labs.

Gemini Family (2023-present)

Definition

Gemini (Google DeepMind)

Organization: Google DeepMind. Models: Gemini 1.0 (December 2023), Gemini 1.5 (February 2024), Gemini 2.0 (December 2024), Gemini 2.5 Pro (March 25, 2025; GA June 2025), Gemini 3 Pro (November 18, 2025).

Key innovations:

  • Natively multimodal: Trained from the ground up on interleaved text, images, audio, and video, rather than bolting vision onto a text model.
  • Long context: Gemini 1.5 Pro demonstrated reliable processing of up to 1 million tokens (later extended to 2 million), enabling processing of entire codebases, hours of video, or thousands of pages of documents in a single context window.
  • Mixture-of-Experts: Gemini 1.5 uses a MoE architecture for efficiency.
  • Reasoning model era (Gemini 2.5): Gemini 2.5 Pro was Google's first explicitly designated thinking model, using chain-of-thought reasoning before responding. Led benchmarks on math, coding, and multimodal tasks at time of release.
  • Gemini 3 (November 2025): Released simultaneously across the Gemini app, Google AI Studio, and Vertex AI. Gemini 3 Pro and Gemini 3 Deep Think were the flagship variants. First time Google launched a new Gemini model across multiple products on day one.

Kimi (2024-present)

Definition

Kimi (Moonshot AI)

Organization: Moonshot AI (China). Key models: Kimi k1.5 (2024-2025), Kimi K2 (July 11, 2025), Kimi K2.5 (January 27, 2026).

Kimi k1.5 (2024-2025): Pioneered extremely long context windows for consumer applications. Demonstrated that long context is not just a benchmark metric but enables qualitatively different use patterns: processing entire books, long legal documents, and extended conversation histories. Extended RL-based reasoning in a long-context setting.

Kimi K2 (July 2025): Open-weight release (Apache 2.0). Strong performance on coding benchmarks. Instruct update in September 2025 doubled context window to 256K and improved agentic coding. K2 Thinking (November 2025) added dedicated chain-of-thought reasoning capability.

Kimi K2.5 (January 2026): Native multimodal architecture supporting visual and text input, thinking and non-thinking modes, dialogue and agent tasks. 1 trillion parameters. Competitive on agent, coding, image, and video tasks.

The Claude 4 Series (May 2025 - Present)

Definition

Claude 4 family (Anthropic)

Organization: Anthropic.

Claude Opus 4 and Sonnet 4 (May 22, 2025): The Claude 4 generation introduced hybrid models offering both near-instant responses and extended thinking modes for deeper reasoning in a single model. Opus 4 focused on sustained performance on complex, long-running agentic and coding tasks. Sonnet 4 provided superior coding and reasoning relative to Sonnet 3.7.

Claude Opus 4.1 (August 5, 2025): Incremental improvement on agentic tasks, real-world coding, and reasoning. Achieved 74.5% on SWE-bench Verified. Extended thinking capability expanded to 64,000 tokens.

Claude Sonnet 4.5 (September 29, 2025): Highest-intelligence Sonnet at its price point. Added 1M token context in beta. Best-in-class coding performance at the Sonnet tier.

Claude Opus 4.5 (November 24, 2025): Next full Opus release, continuing Anthropic's focus on agentic workflows and extended reasoning.

Claude Opus 4.7 (April 16, 2026): Current generally-available Opus tier. Continued improvements in reasoning, coding, and agentic task performance.

Claude Mythos Preview (April 7, 2026): Anthropic's most capable frontier model to date, publicly documented via the Claude Mythos Preview System Card (April 7, 2026). Per the card, Mythos Preview's large increase in capabilities led Anthropic to not make it generally available; instead it is being used as part of a defensive cybersecurity program with a limited set of partners. Findings from the Mythos System Card (RSP evaluations, cyber, alignment assessment, model-welfare assessment) are being used to inform future Claude releases and their safeguards. Not a public deployment, but a publicly documented capability datapoint.

GPT-5 (August 7, 2025)

Definition

GPT-5

Organization: OpenAI. Release: August 7, 2025, via livestream event. Available in ChatGPT, Microsoft Copilot, and the OpenAI API.

Key contribution: Highest published scores at release on benchmarks testing mathematics, programming, finance, and multimodal understanding. Improvements over GPT-4o in faster response times, coding and writing quality, accuracy on health-related questions, and reduced hallucination rates. Publicly accessible across tiers.

GPT-5.5 (April 23, 2026)

Definition

GPT-5.5

Organization: OpenAI. Release: April 23, 2026 in ChatGPT for Plus, Pro, Business, and Enterprise tiers and in Codex; API rollout (gpt-5.5 and gpt-5.5-pro) on April 24, 2026. Internal codename "Spud."

Key contribution: OpenAI's most capable mainline model at release, with emphasis on coding, debugging, online research, document and spreadsheet creation, and multi-tool agentic workflows. OpenAI states GPT-5.5 matches GPT-5.4 per-token serving latency while using fewer tokens to complete the same tasks, so the headline change is token efficiency at the same latency budget rather than a new latency floor. Continues the GPT-5.x trend of folding specialist capabilities (reasoning, Codex-class coding, computer use, tool search) into the mainline product surface.

Pricing at release: gpt-5.5 at 5 USD per million input tokens and 30 USD per million output tokens. gpt-5.5-pro at 30 USD / 180 USD per million input / output tokens, restricted to Pro, Business, and Enterprise.

What remains undisclosed. Parameter count, dense-vs-MoE architecture, training data composition, and training compute. OpenAI publishes a system card focused on capability and safety evaluations, not architectural disclosure.

The Reasoning Model Era (2024-2025)

A distinct capability shift emerged in late 2024: models that explicitly "think" before responding, generating extended chain-of-thought (sometimes thousands of tokens) as a scratchpad before producing the final answer. This inference-time scaling approach trades latency and token cost for accuracy on difficult reasoning tasks.

Example

Inference-Time Scaling Laws

Standard scaling laws (Kaplan et al. 2020) relate pretraining compute to loss. Reasoning models introduce a second scaling axis: test-time compute. Empirically, accuracy on hard math and coding tasks improves with the number of "thinking tokens" the model generates before answering. This axis is distinct from training compute, and it lets smaller pretrained models reach accuracy levels that previously required much larger models, at the cost of increased latency and token spend.
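One simple test-time-compute strategy, self-consistency voting, can be sketched directly. The `noisy_solver` below is a toy stand-in for a model that answers correctly 60% of the time, not a real API:

```python
import collections
import random

# Sketch of self-consistency: sample several independent reasoning paths
# and majority-vote the final answers. Spending more samples (more tokens)
# typically buys more accuracy, which is the test-time scaling trade.

def self_consistency(solve, question, n_samples):
    """Run the solver n_samples times and return the majority answer."""
    answers = [solve(question) for _ in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]

rng = random.Random(0)

def noisy_solver(question, correct="42", p_correct=0.6):
    """Toy model: right 60% of the time, otherwise a random wrong answer."""
    return correct if rng.random() < p_correct else str(rng.randint(0, 99))

# More samples -> more reliable majority answer, at higher token cost:
print(self_consistency(noisy_solver, "hard integral", n_samples=15))
```

With a per-sample accuracy above 50%, the voted answer converges to the correct one as the sample budget grows, at a linear cost in tokens.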

OpenAI o-series:

  • o1-preview and o1-mini (September 12, 2024): first public reasoning models, outperforming standard GPT-4o on math and coding at the cost of latency.
  • o1 full release (December 5, 2024).
  • o3-mini (January 31, 2025): efficient reasoning model.
  • o3 and o4-mini (April 16, 2025): o3 became the strongest publicly available reasoning model at release; o4-mini added multimodal tool use.
  • o3-pro (June 2025): most capable and expensive o-series model.

DeepSeek-R1 (January 2025): Demonstrated that chain-of-thought reasoning can be induced via RL on verifiable tasks without supervised reasoning traces. Open-weight release enabled community study of reasoning model training.

Gemini 2.5 thinking mode (March 2025): Google's entry into the reasoning model category, with Gemini 2.5 Pro as the first Gemini model designated as a thinking model.

Claude extended thinking (2025): Available in Claude Opus 4 and later models, offering a separate high-latency reasoning path with up to 64K thinking tokens.

The 2025 Frontier: xAI and Mistral

xAI Grok Series

Definition

Grok (xAI)

Organization: xAI (Elon Musk). Models: Grok 1 (2023), Grok 2 (August 2024), Grok 3 (February 17, 2025), Grok 4 (July 9, 2025), Grok 4.1 (November 17, 2025).

Key contribution: Grok 3 achieved frontier-level performance on reasoning and coding benchmarks, established xAI as a serious frontier lab. Grok 4 and 4 Heavy released July 2025 with claimed benchmark outperformance over rivals. Real-time integration with X (Twitter) platform provides live information access distinct from other frontier models.

Mistral

Definition

Mistral (Mistral AI)

Organization: Mistral AI (France). European frontier lab producing both open-weight and proprietary models.

2025 releases: Mistral Large 24.11 (November 2024, 123B dense), Codestral 25.01 (January 2025, code-focused, led the LMSYS Copilot Arena at launch), Mistral Small 3 (January 2025), Codestral (August 2025 refresh), Mistral 3 family (December 2025: full open-weight frontier model plus nine smaller offline-capable variants).

Key contribution: Demonstrated that efficient dense models (not requiring MoE) at 7B-123B scale can be competitive with much larger models when trained on high-quality data. Sliding window attention and grouped-query attention introduced in early Mistral 7B remain widely used. Mistral 3 positioned as a serious open-weight European alternative to US and Chinese MoE giants.
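The sliding-window attention pattern from Mistral 7B can be sketched as a mask; the window size and sequence length here are illustrative:

```python
# Sketch of sliding-window causal attention (Mistral 7B style):
# each query token attends only to itself and the previous (window - 1)
# tokens, so per-token attention cost is constant in sequence length.

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True where query position i may attend to key position j."""
    return [[max(0, i - window + 1) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

The printed band shows the fixed-width diagonal: information still propagates beyond the window across layers, but each layer's attention stays cheap.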

GLM-5 (Zhipu AI, February 2026)

Definition

GLM-5 (Zhipu AI)

Organization: Zhipu AI (China; Z.ai). Release: February 11, 2026, open-weight on Hugging Face; the parameter table below lists GLM-5 under 2026 accordingly. Parameters: 744B total, 40B active (MoE; GLM-4.5 was 355B/32B). Training data: 28.5T tokens (vs 23T for GLM-4.5).

Key contribution: Native multimodal, 128K context, designed for agentic tasks. At release approached Claude Opus 4.5 on coding benchmarks and surpassed Gemini 3 Pro on some benchmarks. First major model release from Zhipu after the company went public. Established GLM series as a serious frontier competitor from the Chinese open ecosystem.

Evaluation Benchmark Evolution

A parallel history runs alongside model releases: the benchmarks used to measure capability have changed as models saturated them. Understanding this evolution matters because each benchmark shift revealed something new about what capability means.

| Benchmark | Introduced | What it measures | When frontier models saturated |
| --- | --- | --- | --- |
| MMLU | 2020 | 57-subject multiple-choice knowledge | GPT-4 era (~2023) |
| MATH | 2021 | Competition mathematics (proof and computation) | o1/o3 era (~2024-25) |
| HumanEval | 2021 | Python function synthesis from docstrings | GPT-4 era (~2023) |
| GPQA | 2023 | PhD-level biology, chemistry, physics questions | o3/Gemini 2.5 era (~2025) |
| SWE-bench Verified | 2024 | Real GitHub issue resolution in large codebases | Ongoing frontier (60-80% range as of 2025) |
| ARC-AGI | 2019/updated | Novel visual reasoning requiring abstraction | Ongoing frontier; o3 high-compute crossed 85% in 2024 |

The pattern: each benchmark was designed to be difficult when introduced, then became a target for optimization, then was superseded by a harder one. SWE-bench Verified and ARC-AGI are the primary contested frontiers as of mid-2026.

Watch Out

Benchmark saturation versus benchmark gaming

When a model scores 95% on MMLU, it may reflect genuine world knowledge or fine-tuning on MMLU variants. The two are hard to distinguish from the benchmark number alone. SWE-bench Verified was designed to resist gaming (each issue requires actually running the patched code), which is why it remains the primary coding benchmark despite being harder to interpret than simple accuracy metrics.

Key Architectural Trends

Example

Dense Models vs Mixture-of-Experts

Early models (GPT-2, GPT-3, Llama 1-2) used dense architectures where every parameter activates for every token. MoE models (DeepSeek-V3, Gemini 1.5, Llama 4, Qwen3-235B) activate only a subset of parameters per token. This means a 671B MoE model can have the quality of a very large dense model but the inference cost of a much smaller one. The trade-off: MoE models require more total memory (all experts must be loaded) even though compute per token is lower.
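The core MoE mechanism, top-k expert routing, fits in a few lines. The 8-expert, top-2 configuration mirrors Mixtral-style routing; the router logits are made up for illustration:

```python
import math

# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer:
# a router scores every expert for the current token, but only the k
# highest-scoring experts run, so compute per token tracks k rather
# than the total expert count.

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) for the k top experts, softmax-renormalized."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts, 2 active per token (illustrative logits):
print(top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2))
```

The layer's output is the weighted sum of the selected experts' outputs; all 8 experts must sit in memory even though only 2 run, which is exactly the MoE memory-vs-compute trade described above.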

Example

The RLHF-to-RL-for-Reasoning Pipeline

The alignment methodology evolved from: (1) supervised fine-tuning on human demonstrations, to (2) RLHF with reward models trained on human preferences, to (3) reinforcement learning directly on verifiable reasoning tasks (math, code). DeepSeek-R1 and the OpenAI o-series showed that training models to "think step by step" via RL produces qualitatively stronger reasoning than SFT or standard RLHF alone.
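The "verifiable" part of stage (3) is the key difference from RLHF: the reward is checked, not learned, so it cannot be fooled by plausible-sounding but wrong reasoning. A minimal sketch, where the `ANSWER:` format is an illustrative convention rather than any lab's actual parser:

```python
# Sketch of a verifiable reward for RL on reasoning tasks: instead of a
# learned reward model, the signal comes from checking the final answer
# (for code, from running the tests). Format below is an assumption.

def math_reward(model_output: str, ground_truth: str) -> float:
    """1.0 if the final 'ANSWER: <value>' line matches ground truth, else 0.0."""
    for line in reversed(model_output.strip().splitlines()):
        if line.startswith("ANSWER:"):
            return 1.0 if line.split(":", 1)[1].strip() == ground_truth else 0.0
    return 0.0  # no parseable answer: no reward

print(math_reward("step 1...\nstep 2...\nANSWER: 42", "42"))
print(math_reward("ANSWER: 41", "42"))
```

RL against this kind of binary, checkable signal is what let DeepSeek-R1 induce long chains of thought without supervised reasoning traces.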

Parameter Counts at a Glance

| Model | Year | Parameters | Active Params | Key Innovation |
| --- | --- | --- | --- | --- |
| GPT-2 | 2019 | 1.5B | 1.5B (dense) | Zero-shot generation |
| GPT-3 | 2020 | 175B | 175B (dense) | In-context learning |
| ChatGPT | 2022 | ~175B | ~175B | RLHF alignment |
| Llama 1 | 2023 | 7-65B | Dense | Open-weight ecosystem |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Multimodal, frontier reasoning |
| Claude 3 Opus | 2024 | Undisclosed | Undisclosed | Constitutional AI |
| Llama 3.1 | 2024 | 8-405B | Dense | Open-weight at frontier scale |
| DeepSeek-V3 | 2024 | 671B | 37B (MoE) | MoE efficiency, MLA |
| DeepSeek-R1 | 2025 | 671B | 37B (MoE) | RL-trained reasoning |
| Gemini 1.5 | 2024 | Undisclosed | Undisclosed | 1M+ context, native multimodal |
| Kimi k1.5 | 2025 | Undisclosed | Undisclosed | Long-context + RL reasoning |
| Claude 4 Opus | 2025 | Undisclosed | Undisclosed | Agentic coding, extended thinking |
| Llama 4 Scout | 2025 | 109B | 17B (MoE) | Open-weight MoE, 10M context |
| Llama 4 Maverick | 2025 | ~400B | 17B (MoE) | Open-weight MoE, native multimodal |
| Qwen 3 235B | 2025 | 235B | 22B (MoE) | Hybrid thinking, 119 languages |
| GPT-5 | 2025 | Undisclosed | Undisclosed | Frontier reasoning + coding |
| GPT-5.5 | 2026 | Undisclosed | Undisclosed | Token efficiency + agentic tools at GPT-5.4 latency |
| DeepSeek-V3.1 | 2025 | 671B | 37B (MoE) | Hybrid think/no-think in one model |
| Grok 4 | 2025 | Undisclosed | Undisclosed | Frontier reasoning, real-time X data |
| Kimi K2 | 2025 | Undisclosed | Undisclosed | Open-weight agentic coding |
| Gemini 2.5 Pro | 2025 | Undisclosed | Undisclosed | First Google thinking model |
| Gemini 3 Pro | 2025 | Undisclosed | Undisclosed | Multiproduct day-one launch |
| GLM-5 | 2026 | 744B | 40B (MoE) | Native multimodal, 128K, agentic |
| Claude Opus 4.7 | 2026 | Undisclosed | Undisclosed | Current frontier, agentic |
| Kimi K2.5 | 2026 | ~1T | Undisclosed | Native multimodal MoE |

Common Confusions

Watch Out

Parameter count does not equal capability

A 671B MoE model with 37B active parameters can outperform a 70B dense model despite using similar compute per token. Total parameter count is a poor proxy for capability. What matters is effective compute (active parameters times tokens processed), data quality, and training methodology (RLHF, RL for reasoning, data curation). Llama 1-65B, trained on 1.4T tokens, underperformed Llama 3-8B, trained on 15T+ tokens.
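The Llama comparison can be made concrete with the standard ~6ND rule of thumb for training FLOPs (an approximation, not a lab-published figure):

```python
# Illustrative "effective compute" arithmetic: training FLOPs are commonly
# estimated as roughly 6 * active_params * training_tokens. Token counts
# below are the ones cited in the text; the 6ND factor is a rule of thumb.

def train_flops(active_params, tokens):
    return 6 * active_params * tokens

llama1_65b = train_flops(65e9, 1.4e12)   # large dense model, fewer tokens
llama3_8b  = train_flops(8e9, 15e12)     # much smaller model, far more tokens
print(f"Llama 1 65B: {llama1_65b:.2e} FLOPs, Llama 3 8B: {llama3_8b:.2e} FLOPs")
```

Under this estimate the 8B model actually received more total training compute than the 65B model, which is one concrete reason parameter count alone is a poor capability proxy.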

Watch Out

Open-weight is not open-source

Most "open" models release model weights but not training data, training code, or data processing pipelines. Llama, Qwen, and DeepSeek provide weights and inference code under various licenses, but reproducing the training run from scratch is not feasible without the full data pipeline. True open-source would include data, code, and training recipes.

Watch Out

GPT-3.5 is not a simple GPT-3 fine-tune

ChatGPT's base model (GPT-3.5) went through a multi-step pipeline: starting from a GPT-3-class pretrained model, then supervised fine-tuning on human demonstrations (InstructGPT stage), then reward model training on human preference comparisons, then PPO-based RL optimization. This is a qualitatively different model from base GPT-3, not just a prompt-engineering wrapper or single-pass fine-tune. The Ouyang et al. 2022 InstructGPT paper describes the full pipeline.

Summary

  • GPT-3 (2020) introduced in-context learning, the key capability of the LLM era
  • RLHF (ChatGPT, 2022) turned base models into useful assistants via a multi-step InstructGPT pipeline, not a simple fine-tune
  • Constitutional AI (Claude) made alignment more transparent and principle-based
  • MoE architectures (DeepSeek-V3, Gemini, Llama 4, Qwen 3) decouple total parameters from inference cost
  • RL for reasoning (DeepSeek-R1, o-series, Kimi K2) is a distinct capability tier with inference-time scaling as the new axis
  • Open-weight models (Llama, Qwen, DeepSeek, Kimi K2) have reached near-parity with proprietary systems on many benchmarks
  • Long context (Gemini 1.5, Kimi, Llama 4) enables qualitatively new applications
  • GPT-4 bar exam "90th percentile" is a contested figure; actual estimate from re-analysis is 48-68th percentile
  • Benchmarks saturate: MMLU and HumanEval are no longer discriminative; SWE-bench Verified and ARC-AGI are the current contested frontiers
  • GLM-5 (February 2026, Zhipu AI) and Kimi K2.5 (January 2026) mark Chinese labs releasing 700B-1T scale open-weight models

Exercises

ExerciseCore

Problem

Explain the difference between few-shot in-context learning (GPT-3 style) and fine-tuning. When would you prefer each approach?

ExerciseCore

Problem

DeepSeek-V3 has 671B total parameters but only 37B active per token. Explain why, and state one advantage and one disadvantage compared to a dense 70B model.

ExerciseAdvanced

Problem

Constitutional AI (Claude) and RLHF (ChatGPT) are both alignment methods. Describe the key difference in how they obtain training signal for the preference/reward model.

References

Reference note: this page is a maintained technical timeline, not an exhaustive historiography. The references below prioritize primary release posts, technical reports, and system cards.

Foundational Papers:

  • Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019)
  • Brown et al., "Language Models are Few-Shot Learners" (GPT-3, 2020)
  • Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT/RLHF, 2022)
  • Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)

Primary release anchors:

Open and open-weight lines:

  • Touvron et al., "Llama: Open and Efficient Foundation Language Models" (2023)
  • Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
  • Meta, "The Llama 4 Herd" technical report (2025)
  • Hui et al., "Qwen2.5 Technical Report" (2024)
  • Qwen Team, "Qwen3 Technical Report" (2025)

Architecture Innovation:

Reasoning Models:

  • OpenAI, "OpenAI o1 System Card" (2024)
  • OpenAI, "OpenAI o3 and o4-mini System Card" (2025)

Proprietary Frontier:

  • OpenAI, "GPT-4 Technical Report" (2023)
  • OpenAI, "GPT-4 System Card" (2023)
  • Google DeepMind, "Gemini: A Family of Highly Capable Multimodal Models" (2023)
  • Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" (2024)
  • Anthropic, "Claude 3 Model Card" (2024)
  • Anthropic, "Claude 4 System Card" (2025)
  • Zhipu AI, "GLM-4.5 Technical Report" (2025); "GLM-5" (2026)
  • Moonshot AI, "Kimi K2 Technical Report" (2025)

Contested claims:

  • Martinez, "Re-Evaluating GPT-4's Bar Exam Performance" (2024), arXiv:2308.07312

Next Topics

This is a living reference document. As new models emerge, check primary technical reports for authoritative details.

Last reviewed: April 29, 2026
