
Model Timeline

GPT Series Evolution

The progression from GPT-1 (117M parameters) through GPT-4, the o-series, the GPT-5.x family, gpt-oss, and GPT-Rosalind: what changed through scale, post-training, reasoning-time compute, tool use, and domain specialization.

Core · Tier 2 · Frontier · Reference · ~55 min

Why This Matters

The GPT series is a useful record of what changed as autoregressive language models scaled and then became products. GPT-1 showed that decoder-only pretraining transfers. GPT-2 showed that tasks can be framed as text completion. GPT-3 made in-context learning visible at scale. InstructGPT and ChatGPT showed that post-training changes model behavior as much as pretraining scale does. GPT-4 made multimodal input central. The o-series and GPT-5 family made reasoning-time compute, coding, tool use, and computer operation part of the product surface. Understanding this progression grounds abstract discussions about scaling laws, post-training, and tool-augmented reasoning in concrete model releases.

OpenAI model lineage

GPT, reasoning models, Codex, and open weights

Snapshot current to April 28, 2026. Product labels and API model IDs change faster than the historical architecture story.

GPT foundation line

  • GPT-1 (2018): Decoder-only pretraining showed transfer after task fine-tuning.
  • GPT-2 (2019): Scaled generation and zero-shot task framing.
  • GPT-3 (2020): Few-shot prompting and in-context learning at 175B parameters.
  • GPT-4 (2023): Multimodal input and stronger reasoning; architecture undisclosed.
  • GPT-5 (2025): Reasoning line folded back into the main GPT family.

Reasoning and computer-use line

  • o1 (2024): Reasoning models introduced inference-time compute scaling.
  • o3 / o4-mini (2025): Stronger math, code, and tool-use variants before the GPT-5 consolidation.
  • GPT-5.1 (Nov 2025): Adaptive reasoning, a no-reasoning mode, and developer editing tools.
  • GPT-5.3 (Feb-Mar 2026): Codex for long-running coding; Instant for everyday ChatGPT use.
  • GPT-5.4 (Mar 2026): Mainline reasoning, coding, computer-use, and tool-search model.
  • GPT-5.5 (Apr 2026): Strongest agentic coding and computer use to date; gpt-5.5 and gpt-5.5-pro in the API.

Open-weight branch

  • gpt-oss-120b (Aug 2025): Open-weight reasoning model for local or hosted deployment.
  • gpt-oss-20b (Aug 2025): Smaller open-weight model for lower-latency local use cases.

Domain-specialized research branch

  • GPT-Rosalind (Apr 2026): Trusted-access life-sciences model for biology, drug discovery, and scientific tool workflows.

  • API default (gpt-5.4): OpenAI recommends it for complex reasoning and coding; use the mini or nano variants for lower latency and cost.
  • ChatGPT split (GPT-5.3 Instant + GPT-5.4 Thinking/Pro): Everyday chat and deeper reasoning are separate product surfaces.
  • Codex bridge (GPT-5.3-Codex -> GPT-5.4): GPT-5.4 incorporates the coding capability line that GPT-5.3-Codex pushed forward.
  • Open weights (gpt-oss-120b / gpt-oss-20b): Useful when local control matters; not a disclosed equivalent of the closed GPT-5.4 line.
  • Domain model (GPT-Rosalind): Life-sciences research preview for eligible enterprise research teams, not a public general-purpose model.

Current deployment ledger

  • API starting point (gpt-5.4): Recommended by OpenAI for complex reasoning and coding. The API docs list 1M context, 128K maximum output, vision input, and tool support. Context, tool support, pricing, and model IDs can change; check the official model docs before building.
  • Lower-latency variants (gpt-5.4-mini / gpt-5.4-nano): Smaller GPT-5.4 variants for cost-sensitive or latency-sensitive workloads. They are not the same capability tier as gpt-5.4; use them when the task is narrow enough.
  • Maximum-performance API variant (gpt-5.4-pro): Higher-cost model for the hardest tasks. OpenAI positions it for maximum performance rather than default use. Benchmark wins are benchmark-scoped; do not infer universal superiority.
  • ChatGPT everyday model (GPT-5.3 Instant): The smoother everyday ChatGPT model, released March 3, 2026. API access is listed as gpt-5.3-chat-latest. ChatGPT product labels and API IDs do not always match one-to-one.
  • ChatGPT deeper reasoning (GPT-5.4 Thinking / GPT-5.4 Pro): The March 5, 2026 ChatGPT reasoning surface. OpenAI says GPT-5.4 Thinking replaced GPT-5.2 Thinking for paid users. ChatGPT context limits may differ from API or Codex limits.
  • Open-weight branch (gpt-oss-120b / gpt-oss-20b): Open-weight reasoning models released under Apache 2.0 in August 2025. They are related to OpenAI reasoning work, but are not disclosed equivalents of the closed GPT-5.4 line.
  • Life-sciences research preview (GPT-Rosalind): Purpose-built for biology, drug discovery, and translational medicine workflows, available through trusted access. Eligible U.S. Enterprise research customers only during preview; not available to individual researchers or for external customer-facing products.

Current Snapshot: April 28, 2026

The lineage summary above is about product and API names, not hidden architecture. OpenAI has not disclosed parameter counts, dense-vs-MoE choices, training data composition, or training compute for GPT-4 and later mainline models. Treat public model names as deployment surfaces: they tell you what developers can call and what users see, not exactly what is inside the model.

Two naming rules prevent confusion. First, ChatGPT names are product labels: GPT-5.3 Instant and GPT-5.4 Thinking describe user-facing modes. Second, API IDs are deployment handles: gpt-5.4, gpt-5.4-pro, and gpt-5.3-chat-latest are the strings developers call. A page about the GPT series should explain both, but it should not pretend that the label tells us the architecture.
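To see the distinction in practice, here is a minimal sketch using the OpenAI Python SDK's Responses API. The model ID comes from the ledger above; it is a deployment handle that may change, so check the current model docs rather than treating this as canonical usage.

```python
# Minimal sketch: an API ID is the string a developer actually calls.
# The model ID is taken from the ledger above and may change.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5.4",  # deployment handle, not an architecture claim
    input="Summarize the difference between product labels and API IDs.",
)
print(resp.output_text)
```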

GPT-1 (2018): Pretraining Works for Generation

Architecture. Decoder-only transformer, 12 layers, 768 hidden dimension, 12 attention heads. 117M parameters.
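A back-of-envelope check on the published shape, using the standard ~12·L·d² approximation for transformer-block parameters plus embedding terms. The ~40,000-token BPE vocabulary is approximate; this is an estimate, not an official breakdown.

```python
# Rough parameter count for GPT-1 from its published shape.
# Transformer blocks: ~12 * n_layers * d_model^2 (attention + MLP).
# Embeddings: token vocabulary plus learned positions. Estimate only.
n_layers, d_model, vocab, n_ctx = 12, 768, 40_000, 512

blocks = 12 * n_layers * d_model**2           # ~84.9M
embeddings = (vocab + n_ctx) * d_model        # ~31.1M
print(f"{(blocks + embeddings) / 1e6:.0f}M")  # ~116M, close to 117M
```

The same approximation applied to GPT-3's shape (96 layers, 12288 hidden) gives 12 · 96 · 12288² ≈ 174B, close to the published 175B.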

Training. Autoregressive language modeling on BooksCorpus (~7,000 unpublished books, ~800M words). Standard next-token prediction objective.

Key result. After pretraining, fine-tuning on downstream tasks (textual entailment, question answering, sentiment) improved over training from scratch. This was the decoder-only analog of BERT's finding for encoders.

What it taught us. Unsupervised pretraining with next-token prediction learns representations that transfer to supervised tasks. The decoder-only architecture is viable for understanding tasks, not just generation.

Limitations. 117M parameters is small. The model required task-specific fine-tuning and could not perform tasks zero-shot. Performance was competitive but not the strongest on most benchmarks; BERT, released months later, overtook GPT-1 on understanding tasks.

GPT-2 (2019): Zero-Shot Generation

Architecture. Same decoder-only transformer design, scaled to 1.5B parameters (GPT-2 XL). 48 layers, 1600 hidden dimension, 25 attention heads.

Training. WebText dataset: 8 million web pages (40GB of text) curated by following outbound links from Reddit posts with at least 3 karma. No fine-tuning on downstream tasks.

Key result. GPT-2 performed tasks zero-shot by framing them as text completion. Translation: provide an English sentence followed by "French:" and the model completes with a French translation. Summarization: provide an article followed by "TL;DR:" and the model generates a summary.
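To make the framing concrete, here is a sketch of tasks reduced to completion. The `complete` function is a hypothetical stand-in for any autoregressive sampler, not an actual GPT-2 interface.

```python
# Zero-shot task framing, GPT-2 style: the task lives entirely in the
# prompt text, and the model simply continues it. `complete` is a
# hypothetical stand-in for any autoregressive text sampler.
def translate_en_to_fr(complete, sentence: str) -> str:
    return complete(f"{sentence}\nFrench:")

def summarize(complete, article: str) -> str:
    return complete(f"{article}\nTL;DR:")
```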

What it taught us. A language model trained on diverse text implicitly learns to perform many tasks. Scaling from 117M to 1.5B parameters improved zero-shot performance substantially. The model memorized some training data, raising concerns about privacy and misinformation.

The "too dangerous to release" claim. OpenAI initially released only the 124M version, citing concerns about misuse. The full 1.5B model was released 9 months later. In retrospect, the capabilities were modest by current standards, and the staged release was more about establishing norms than preventing actual harm.

GPT-3 (2020): In-Context Learning

Architecture. 175B parameters. 96 layers, 12288 hidden dimension, 96 attention heads. Context length: 2048 tokens.

Training. Mix of Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia. ~300B tokens total, with data weighted by quality (Wikipedia upweighted, Common Crawl downweighted).

Key result. In-context learning: GPT-3 can perform new tasks by conditioning on a few examples in the prompt, without any gradient updates. Provide 3 examples of English-to-French translation in the prompt, then a new English sentence, and GPT-3 translates it. This was not trained for; it emerged from scale.

Definition

In-Context Learning (ICL)

In-context learning is the ability of a language model to learn a task from examples provided in the prompt at inference time, with no parameter updates. Given prompt examples $(x_1, y_1), \ldots, (x_k, y_k)$ and a new input $x_{k+1}$, the model generates $y_{k+1}$ by conditioning on the full prompt. The "learning" happens through the forward-pass attention mechanism, not through gradient descent.

Few-shot vs. zero-shot vs. one-shot. GPT-3 was evaluated in three settings: zero-shot (task description only), one-shot (one example), and few-shot (multiple examples). Few-shot performance on many tasks approached or exceeded fine-tuned BERT models, without any fine-tuning.

What it taught us. Scale enables qualitative capability jumps. In-context learning is barely present at 1.5B parameters (GPT-2 shows only weak ICL) but is strong at 175B. This is an empirical observation; there is no clean theory explaining why ICL emerges.

Main Theorems

Proposition

In-Context Learning as Implicit Bayesian Inference

Statement

A sufficiently large transformer pretrained on a mixture of tasks can perform approximate Bayesian inference over task identity given in-context examples. Concretely, if the pretraining distribution is a mixture $p(x) = \sum_t p(t)\, p_t(x)$ over tasks $t$, then conditioning on examples $(x_1, y_1), \ldots, (x_k, y_k)$ in the prompt approximates:

$$p(y_{k+1} \mid x_{k+1}, x_{1:k}, y_{1:k}) \approx \sum_t p(t \mid x_{1:k}, y_{1:k}) \cdot p_t(y_{k+1} \mid x_{k+1})$$

The model implicitly infers which task the examples come from and applies that task's input-output mapping.

Intuition

The pretraining corpus contains many implicit "tasks" (translation passages, Q&A pairs, code with comments). When in-context examples look like a particular task, the transformer's attention mechanism routes computation through pathways learned for that task. More examples sharpen the posterior over task identity.

Proof Sketch

Xie et al. (2022) showed that transformers trained on sequences generated by hidden Markov models can perform exact Bayesian prediction in-context. The key condition is that pretraining data contains latent structure (different "concepts" or "tasks") and the transformer has enough capacity to represent the posterior. This is a stylized result; real-world ICL likely involves additional mechanisms.

Why It Matters

This framework explains why ICL improves with more in-context examples (more evidence for task identification), why it fails with out-of-distribution tasks (no matching task in the pretraining mixture), and why larger models are better at ICL (more capacity to represent the task mixture).

Failure Mode

ICL fails when (1) the task was not represented in pretraining data, (2) the prompt format is unfamiliar, (3) the task requires reasoning beyond what the model learned, or (4) the examples are misleading or inconsistent. ICL is also sensitive to example ordering and formatting, which true Bayesian inference would not be.
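A toy numerical illustration of the mixture view, with biased coins standing in for tasks. This only mirrors the posterior update in the statement above; no transformer is involved.

```python
# Toy illustration of the task-mixture view: two "tasks" are coins
# with different heads-probabilities. In-context labels sharpen the
# posterior over which task generated them.
TASKS = {"task_A": 0.9, "task_B": 0.2}  # p(y = 1 | task)

def task_posterior(labels, prior=0.5):
    post = {"task_A": prior, "task_B": 1.0 - prior}
    for y in labels:  # each in-context label y is 0 or 1
        for t, p in TASKS.items():
            post[t] *= p if y == 1 else (1.0 - p)
    z = sum(post.values())
    return {t: v / z for t, v in post.items()}

print(task_posterior([1]))        # one example: posterior moves a little
print(task_posterior([1, 1, 1]))  # more examples: posterior sharpens
```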

InstructGPT and ChatGPT (2022): Alignment via RLHF

The problem. GPT-3 generates fluent text but is difficult to direct. It completes text, not instructions. Ask it a question and it may generate more questions instead of answering. It can produce toxic, biased, or factually wrong text because it was trained to predict internet text, which contains all of these.

The solution. Reinforcement Learning from Human Feedback (RLHF), applied in three stages:

  1. Supervised fine-tuning (SFT). Collect human demonstrations of desired behavior (following instructions, answering questions helpfully). Fine-tune GPT-3 on these demonstrations.

  2. Reward model training. Collect human preferences (which of two responses is better). Train a reward model to predict human preferences.

  3. RL optimization. Use PPO (Proximal Policy Optimization) to fine-tune the SFT model to maximize the reward model's score while staying close to the SFT model (KL penalty).
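A minimal sketch of the two learned objectives in stages 2 and 3, assuming scalar reward-model scores and per-token log-probabilities as inputs. Illustrative shapes only, not OpenAI's training code.

```python
import torch
import torch.nn.functional as F

# Stage 2: Bradley-Terry preference loss for the reward model.
# r_chosen / r_rejected are scalar reward scores for the preferred
# and dispreferred response in each human comparison.
def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stage 3: sequence-level reward shaped by a KL penalty that keeps
# the policy close to the SFT reference model during PPO.
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    kl = (logp_policy - logp_ref).sum(dim=-1)  # per-sequence KL estimate
    return rm_score - kl_coef * kl
```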

Key result. Human raters preferred InstructGPT outputs over GPT-3's in roughly 85% of comparisons, and even the 1.3B-parameter InstructGPT was preferred over the 175B GPT-3. Alignment techniques matter more than raw scale for producing useful outputs.

What it taught us. The pretraining objective (next-token prediction) optimizes for predicting internet text, not for being helpful or truthful. RLHF bridges this gap. ChatGPT applied this recipe to GPT-3.5 and demonstrated that aligned language models have broad commercial viability.

GPT-4 (2023): Multimodal Reasoning

Known details. Accepts both text and image inputs. Significantly better reasoning than GPT-3.5. The widely reported ~90th-percentile bar exam result is contested: Martinez (2024, "Re-evaluating GPT-4's bar exam performance," Artificial Intelligence and Law) reanalyzed it and placed the actual percentile at roughly the 48th to 68th depending on testing cohort and exam form, because the 90th-percentile figure was measured against re-takers rather than all test-takers. Context length: 8K and 32K token variants.

Unknown details. OpenAI did not disclose architecture size, training data, or training compute. Rumored to be a mixture-of-experts architecture, but not confirmed.

What it taught us. Multimodal input (processing images alongside text) extends the capabilities significantly. The gains from GPT-3.5 to GPT-4 on reasoning benchmarks suggest either substantial scaling, architectural improvements, better training data, or some combination. Without disclosure, the ML community cannot determine which factors contributed most.

GPT-4 Turbo through o3 (2023-2025)

Brief timeline of major releases after the original GPT-4:

  • GPT-4 Turbo (Nov 2023): Longer context (128K tokens), updated knowledge cutoff, lower API cost; same multimodal base.
  • GPT-4o (May 2024): Unified multimodal model handling text, audio, and images in a single architecture; faster and cheaper than GPT-4 Turbo.
  • o1-preview / o1 (Sep / Dec 2024): Chain-of-thought reasoning with inference-time compute scaling; trained to generate extended internal reasoning traces before producing a final answer. Substantially stronger on math and coding benchmarks than GPT-4o.
  • o3 and o4-mini (2025): Extended the o1 inference-time scaling approach with larger or cheaper reasoning budgets. By early 2026, ChatGPT had retired several older GPT-4 and o-series product surfaces, while API availability followed separate deprecation notices.
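OpenAI has not disclosed how the o-series spends inference-time compute. One public pattern that captures the idea is self-consistency (Wang et al., 2023): sample several reasoning traces and vote on the final answers. A sketch, with `generate` standing in for any stochastic sampler:

```python
from collections import Counter

# One public inference-time scaling pattern: sample several reasoning
# traces and majority-vote on the final answers (self-consistency).
# `generate` is any stochastic sampler returning (trace, answer);
# this illustrates the idea, not OpenAI's undisclosed o-series method.
def self_consistent_answer(generate, prompt: str, n: int = 16) -> str:
    answers = [generate(prompt)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```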

The GPT-5 Family (2025-2026)

OpenAI shifted naming from the o-series back to GPT-5.x, releasing a rapid sequence of models that folded reasoning capabilities into the main product line.

  • GPT-5 (2025): Initial mainline release after the o-series. Specific architectural details remain undisclosed.
  • GPT-5.1 (November 13, 2025): API release focused on adaptive reasoning for coding and tool-heavy tasks. OpenAI also introduced a no-reasoning mode, longer prompt-cache retention, and developer tools for patch application and shell execution. GPT-5.1 ChatGPT models were retired on March 11, 2026; existing conversations and GPTs moved to GPT-5.3 Instant and GPT-5.4 Thinking/Pro equivalents.
  • GPT-5.3 / GPT-5.3-Codex (February-March 2026): GPT-5.3-Codex released February 5, 2026 for long-running coding and professional computer work. GPT-5.3 Instant released March 3, 2026 and became the default everyday ChatGPT model, with emphasis on factuality, search quality, and conversational flow.
  • GPT-5.4 (March 5, 2026): Unified reasoning, coding, computer use, tool search, and professional document workflows into the mainline model. OpenAI describes GPT-5.4 as the first mainline reasoning model to incorporate GPT-5.3-Codex coding capability, with API availability as gpt-5.4 and gpt-5.4-pro.
  • GPT-5.5 and GPT-5.5 Pro (April 23, 2026; API April 24, 2026): OpenAI's strongest agentic-coding and computer-use models to date. Released in ChatGPT (Plus/Pro/Business/Enterprise) and Codex on April 23, with gpt-5.5 ($5/M input, $30/M output, 1M context) and gpt-5.5-pro ($30/M input, $180/M output) reaching the API on April 24. Co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems; matches GPT-5.4 per-token serving latency despite the higher capability bar, and uses fewer tokens than GPT-5.4 to complete the same Codex tasks. Documented benchmark deltas vs GPT-5.4: Terminal-Bench 2.0 75.1% → 82.7%, Expert-SWE (internal) 68.5% → 73.1%, FrontierMath Tier 4 27.1% → 35.4%, ARC-AGI-2 73.3% → 85.0%, Graphwalks 1M-token BFS f1 9.4% → 45.4%. Cybersecurity and biology rated High under the OpenAI Preparedness Framework (not Critical); cyber-permissive variants (including GPT-5.4-Cyber for critical-infrastructure defenders) are gated through the chatgpt.com/cyber Trusted Access program. An internal version with a custom harness contributed a new asymptotic proof about off-diagonal Ramsey numbers, later verified in Lean.

The naming pattern reveals a shift in release cadence: instead of one major release every few years, OpenAI now iterates in short cycles within a major version. Specialist capabilities move back into the main line: o-series reasoning becomes part of GPT-5 thinking models, and GPT-5.3-Codex coding capability is folded into the mainline 5.4 / 5.5 line.

What remains undisclosed. Parameter counts, architecture details (MoE vs dense), and training data composition for the entire GPT-5 family. Benchmark comparisons are published selectively. The GPT-5.5 announcement names training and serving on NVIDIA GB200 / GB300 NVL72 systems explicitly but discloses no parameter or training-token counts.

gpt-oss and GPT-Rosalind: Branches Off the Mainline

OpenAI also released gpt-oss-120b and gpt-oss-20b in August 2025 as open-weight reasoning models under Apache 2.0. These models matter because they change the deployment story: a developer can run or fine-tune them outside OpenAI's hosted API, which is useful for local control, latency, customization, and research inspection.

They should not be described as open versions of GPT-5.4. OpenAI presents them as open-weight reasoning models informed by its internal systems, not as disclosed replicas of the closed GPT-5 line. Once weights are released, the safety and update model also changes: OpenAI can publish model cards and usage policies, but cannot revoke or patch every downstream copy.
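As a concrete illustration of the deployment difference, a local-inference sketch via Hugging Face transformers. The repo ID openai/gpt-oss-20b is the published model; loading options and chat formatting vary by transformers version, so treat this as a sketch rather than official usage.

```python
# Local-inference sketch for the open-weight branch. Exact options
# vary by transformers version; this mirrors the common chat-pipeline
# pattern and is not official usage documentation.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)
messages = [{"role": "user", "content": "Explain in-context learning in two sentences."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```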

GPT-Rosalind, announced April 16, 2026, is a different kind of branch: a purpose-built life-sciences model for biology, drug discovery, and translational medicine workflows. OpenAI describes it as a research preview available through trusted access for qualified enterprise research customers. It belongs on this page because it shows a shift from one general assistant line toward domain-specialized reasoning systems connected to scientific databases, literature, and tools. It should not be framed as a general public model or as a replacement for GPT-5.4.

Common Confusions

Watch Out

GPT-3 was not fine-tuned for in-context learning

In-context learning was not an explicit training objective. GPT-3 was trained purely on next-token prediction. ICL is an emergent behavior that becomes stronger with scale. This is one of the most surprising findings in modern ML: capabilities appear without being explicitly trained for.

Watch Out

RLHF does not teach new knowledge

RLHF adjusts which knowledge the model expresses and how it expresses it. It does not add new factual knowledge. A model that does not know a fact after pretraining still does not know it after RLHF. RLHF makes the model more likely to admit ignorance instead of confabulating, but the underlying knowledge base is set during pretraining.

Watch Out

Scale alone does not explain GPT-4

While GPT-4 is likely larger than GPT-3, OpenAI also improved training data quality, training methodology, and post-training alignment. Attributing GPT-4's improvements solely to parameter count is speculation. The lack of technical disclosure makes it impossible to isolate the contribution of each factor.

Canonical Examples

Example

In-context learning for sentiment classification

Prompt to GPT-3: "Review: This movie was terrible. Sentiment: Negative. Review: I loved every minute of it. Sentiment: Positive. Review: The acting was mediocre but the plot was engaging. Sentiment:" GPT-3 completes with "Positive" or "Mixed" depending on the model size and sampling. No gradient updates are performed. The model infers the task from the examples and applies it to the new input.
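The same prompt, built programmatically. This is a generic sketch; any completion-style model can consume the resulting string.

```python
# Build the few-shot prompt from the example above. No gradient
# updates occur; the "learning" happens in the forward pass when a
# model conditions on this string.
def few_shot_prompt(examples, query):
    lines = [f"Review: {x} Sentiment: {y}" for x, y in examples]
    lines.append(f"Review: {query} Sentiment:")
    return "\n".join(lines)

print(few_shot_prompt(
    [("This movie was terrible.", "Negative"),
     ("I loved every minute of it.", "Positive")],
    "The acting was mediocre but the plot was engaging.",
))
```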

Exercises

ExerciseCore

Problem

GPT-1 had 117M parameters and was trained on ~800M words. GPT-3 had 175B parameters and was trained on ~300B tokens. Compute the ratio of parameters and training tokens between the two models. What does this suggest about whether GPT-3 was "compute-optimal" according to the Chinchilla scaling law (tokens should scale linearly with parameters)?

ExerciseAdvanced

Problem

Explain why in-context learning is sensitive to the order and format of examples in the prompt, while a fine-tuned model is not sensitive to the order of training examples. What does this imply about the mechanism underlying ICL?

References

Canonical:

  • Radford et al., "Improving Language Understanding by Generative Pre-Training" (2018), GPT-1
  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019), GPT-2
  • Brown et al., "Language Models are Few-Shot Learners" (2020), GPT-3, NeurIPS
  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022), InstructGPT; arXiv:2203.02155
  • Xie et al., "An Explanation of In-context Learning as Implicit Bayesian Inference" (2022), ICLR; arXiv:2111.02080
  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020); arXiv:2001.08361
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022) [Chinchilla]; arXiv:2203.15556


Next Topics

  • Open weight models (LLaMA): the open-source response to closed GPT models
  • Post-training overview: RLHF, DPO, and instruction tuning in detail

Last reviewed: April 28, 2026
