Model Timeline
Gemini and Google Models
Google's model lineage from PaLM through Gemini 3.1: native multimodality, long context, TPU infrastructure, Gemini API preview churn, and the Gemma open-weight branch through Gemma 4.
Why This Matters
Google researchers introduced the transformer architecture, and Google owns the full Gemini stack: research lab, TPU hardware, compiler and serving systems, product surfaces, Gemini API, Vertex AI, and open-weight Gemma releases. That vertical integration matters because model quality is not only a neural-network question. It is also a systems question: training hardware, data pipelines, multimodal serving, long-context retrieval, safety gates, product telemetry, and API stability all shape what users actually experience.
This page is a dated reference snapshot. As of April 22, 2026, the public Gemini API surface has two layers that should not be confused: stable 2.5 models for production, and Gemini 3.x previews for the newest Pro, Flash, Live, TTS, and image-generation endpoints.
Google model lineage
PaLM to Gemini 3.1, with Gemma as the open-weight branch
Snapshot current to April 22, 2026. Preview endpoints change faster than the family structure.
PaLM
2022: 540B dense model; chain-of-thought era.
PaLM 2
2023: Compute-efficient successor; powered early Bard.
Gemini 1.0
Dec 2023: Nano, Pro, Ultra; native multimodal launch.
Gemini 1.5
Feb 2024: MoE and million-token context.
Gemini 2.0
Dec 2024: Flash-first tool-use and multimodal-output line.
Gemini 2.5
2025: Thinking models; stable Pro, Flash, Flash-Lite APIs.
Gemini 3 / 3.1
2025-26: Gemini 3 Pro launched; 3.1 Pro Preview is the April 2026 Pro preview.
Gemma
2024: 2B and 7B open models based on Gemini research.
Gemma 2
2024: 2B, 9B, 27B; stronger distillation and training recipes.
Gemma 3
2025: 1B, 4B, 12B, 27B; vision input and 128K context.
Gemma 4
Apr 2026: Edge E2B/E4B plus 26B MoE and 31B dense under Apache 2.0.
Notes on the timeline:
- Use specific stable model strings when you need fewer API surprises.
- The older gemini-3-pro-preview endpoint was shut down on March 9, 2026.
- Gemma provides open weights; useful when local control matters more than frontier API quality.
Current Snapshot: April 22, 2026
| Family | Public role | Status as of Apr 22, 2026 | What to remember |
|---|---|---|---|
| PaLM / PaLM 2 | Pre-Gemini language-model lineage | Historical; useful for scaling and compute-allocation context | PaLM was the chain-of-thought scale era; PaLM 2 emphasized a more compute-efficient training recipe. |
| Gemini 1.0 | First Gemini multimodal family | Historical | Nano, Pro, and Ultra introduced Google's native multimodal direction. |
| Gemini 1.5 | Long-context Gemini family | Mostly historical, but important conceptually | Gemini 1.5 Pro introduced a production million-token context path and confirmed a mixture-of-experts design. |
| Gemini 2.0 | Agent and multimodal-output transition | Historical-to-current bridge | Flash-first release; tool use, native multimodal output, and early thinking models became central. |
| Gemini 2.5 | Stable production API family | Stable gemini-2.5-pro, gemini-2.5-flash, and gemini-2.5-flash-lite remain listed, with scheduled replacements in 2026 | Best default if an application needs a stable model string rather than a preview endpoint. |
| Gemini 3 / 3.1 | Current preview frontier line | gemini-3.1-pro-preview, gemini-3-flash-preview, gemini-3.1-flash-lite-preview, gemini-3.1-flash-live-preview, and related preview endpoints are listed in the Gemini API docs | Use for experiments or fast-moving products; expect renames, shutdowns, and version churn. |
| Gemma | Open-weight branch | Gemma 4 is the newest official open model family | Use when local deployment, inspectability, or Apache 2.0 licensing matters more than closed frontier capability. |
Preview model
A preview Gemini model is an API endpoint Google makes available before it is treated as a stable model string. Preview models can be useful, but they may change, be renamed, or be shut down with notice. For example, Google lists gemini-3-pro-preview as shut down on March 9, 2026, with gemini-3.1-pro-preview as the replacement.
Stable model
A stable Gemini model points to a specific production model. Stable does not mean permanent. It means the endpoint is safer for production code than a preview alias, because behavior and replacement timelines are documented more conservatively.
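To make the distinction concrete, here is a minimal sketch using the google-genai Python SDK. The model strings are copied from this page's April 22, 2026 snapshot and will eventually rotate; treat them as placeholders, not canonical values.

```python
# Minimal sketch, assuming the google-genai SDK is installed and an API key
# is set in the environment. Model strings are from this page's snapshot.
from google import genai

client = genai.Client()  # reads the API key from the environment

# Stable string: documented deprecation timeline, safer for production.
STABLE_MODEL = "gemini-2.5-flash"
# Preview string: newest capabilities, but may be renamed or shut down.
PREVIEW_MODEL = "gemini-3.1-pro-preview"  # snapshot as of Apr 22, 2026

def generate(prompt: str, use_preview: bool = False) -> str:
    """Route to a pinned stable model unless a preview is explicitly requested."""
    model = PREVIEW_MODEL if use_preview else STABLE_MODEL
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text

print(generate("Summarize the Gemini model lineage in one sentence."))
```

Pinning an explicit string in one place makes the eventual migration a one-line change instead of a codebase-wide search.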
Pre-Gemini: PaLM and PaLM 2
PaLM (April 2022)
PaLM was a 540B-parameter dense language model trained with Google's Pathways infrastructure. It matters historically because it sits at the point where scale, chain-of-thought prompting, and broad benchmark evaluation became tightly coupled in public language-model research.
The emergence story should be read carefully. PaLM showed large jumps on some tasks, but later work argued that several "emergent abilities" can be artifacts of nonlinear metrics or thresholded evaluations rather than literal phase transitions in model cognition.
PaLM 2 (May 2023)
PaLM 2 improved multilingual, reasoning, and coding performance over PaLM. Google did not disclose exact parameter counts. The safer public interpretation is that PaLM 2 emphasized better data mixture and compute allocation rather than a larger announced parameter count.
Gemini 1.0 (December 2023)
Gemini 1.0 shipped as Nano, Pro, and Ultra:
- Nano: on-device models for mobile settings.
- Pro: default mid-tier model for Google products and developers.
- Ultra: largest 1.0 model, reserved for the highest-capability tier at launch.
The important design claim was native multimodality. Gemini 1.0 was trained on interleaved text, image, audio, and video data, rather than only attaching a vision encoder to a text-only model after pretraining. That does not prove superior vision performance on every task; it does explain why Google framed Gemini as a family built around cross-modal reasoning from the start.
Gemini 1.5: Long Context Becomes a Product Feature
Gemini 1.5 Pro introduced the public long-context story. Google described Gemini 1.5 as a mixture-of-experts model and announced a standard 128K-token context window, with up to 1M tokens for selected developers and enterprise users at launch. Google later described production execution at up to 1M tokens and research tests beyond that.
The conceptual shift was not merely "more tokens." A million-token context changes the application shape: full code repositories, long legal records, long video/audio inputs, and multi-document research tasks can be placed in one prompt. The hard part is not advertising the window. The hard part is retrieving the right evidence inside that window reliably and cheaply.
Quadratic Attention Cost for Long Sequences
Statement
Standard self-attention computes a full attention matrix, requiring $O(n^2 d)$ FLOPs and $O(n^2)$ memory per layer. For $n$ tokens, the attention matrix has $n^2$ entries per layer.
Intuition
Every token can attend to every other token. Doubling the context length roughly quadruples the attention-matrix work. A million-token context is therefore not a normal transformer setting; it requires model, systems, and serving tricks.
Proof Sketch
Self-attention computes $\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$. The matrix $QK^{\top}$ has shape $n \times n$. Computing and storing that matrix scales quadratically in $n$.
Why It Matters
Long-context Gemini models are useful because they change the user workflow, but context length alone is not intelligence. The model still has to locate evidence, preserve instruction hierarchy, avoid distraction from irrelevant middle-context material, and pay the latency/cost bill.
Failure Mode
If a model supports a million-token prompt but retrieval quality collapses for evidence placed in the middle, the advertised window overstates the useful window. Always test long-context models with positional retrieval tasks that match the real workload.
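To see the quadratic blowup concretely, here is a back-of-the-envelope Python sketch of naive attention-matrix storage. The numbers are illustrative only: production systems never materialize the full matrix this way, which is exactly why long context is a systems problem.

```python
# Back-of-the-envelope sketch: storage for one n x n attention matrix in
# fp16 (2 bytes per entry), per head and per layer, if stored naively.
def attention_matrix_bytes(n_tokens: int, bytes_per_entry: int = 2) -> int:
    """Naive storage for one n x n attention matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_entry

for n in (128_000, 1_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.0f} GiB per head per layer")

# Expected output (roughly):
#   128,000 tokens -> 31 GiB per head per layer
# 1,000,000 tokens -> 1,863 GiB per head per layer
```

Moving from 128K to 1M tokens is roughly a 61x increase in naive attention-matrix size, which is why million-token serving requires attention approximations, chunking, or memory-efficient kernels rather than vanilla attention.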
Gemini 2.0 and 2.5
Gemini 2.0 made Flash the center of the initial release. Google framed 2.0 around tool use, low-latency serving, native multimodal output, and "thinking" experiments such as Gemini 2.0 Flash Thinking.
Gemini 2.5 made thinking models the main line. Google released Gemini 2.5 Pro Experimental on March 25, 2025 and described it as a model that reasons through internal steps before responding. Google also stated that 2.5 Pro shipped with a 1M-token context window, with a 2M-token window planned at launch.
By April 2026, the Gemini API docs still list stable 2.5 endpoints:
| Stable model | Public positioning | Replacement pressure |
|---|---|---|
| gemini-2.5-pro | Complex reasoning and coding | Google lists a June 17, 2026 shutdown date with gemini-3.1-pro-preview as replacement. |
| gemini-2.5-flash | Low-latency, high-volume tasks that still need reasoning | Google lists a June 17, 2026 shutdown date with gemini-3-flash-preview as replacement. |
| gemini-2.5-flash-lite | Fastest and lowest-cost multimodal 2.5 tier | Google lists a July 22, 2026 shutdown date with gemini-3.1-flash-lite-preview as replacement. |
The right lesson is practical: model family pages and API docs can disagree in spirit if one is a launch narrative and the other is an operational interface. For production code, the API docs and deprecation table matter more.
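One way to operationalize that lesson is to keep the deprecation table in code rather than in people's heads. The sketch below is a hypothetical helper: the dates and replacements are hand-copied from the table above, and real code should re-check the Gemini API deprecations page instead of trusting a hardcoded mapping.

```python
# Minimal sketch of deprecation-aware model selection. The mapping is
# hand-copied from this page's snapshot; verify against the live docs.
from datetime import date

# stable string -> (documented shutdown date, listed replacement)
DEPRECATIONS = {
    "gemini-2.5-pro": (date(2026, 6, 17), "gemini-3.1-pro-preview"),
    "gemini-2.5-flash": (date(2026, 6, 17), "gemini-3-flash-preview"),
    "gemini-2.5-flash-lite": (date(2026, 7, 22), "gemini-3.1-flash-lite-preview"),
}

def resolve_model(model: str, today: date) -> str:
    """Warn ahead of a documented shutdown; switch after it passes."""
    shutdown, replacement = DEPRECATIONS.get(model, (None, None))
    if shutdown is None:
        return model
    if today >= shutdown:
        print(f"{model} is past its shutdown date; using {replacement}")
        return replacement
    if (shutdown - today).days <= 60:
        print(f"warning: {model} shuts down {shutdown}; plan migration")
    return model

print(resolve_model("gemini-2.5-pro", date(2026, 4, 22)))
```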
Gemini 3 and 3.1
Gemini 3 Pro launched on November 18, 2025 across the Gemini app, AI Mode in Search, Google AI Studio, Vertex AI, Gemini Enterprise, Gemini CLI, and Google Antigravity. The Gemini 3 Pro model card says Gemini 3 Pro is a sparse mixture-of-experts transformer with native multimodal support for text, vision, and audio inputs, a context window up to 1M tokens, and text output up to 64K tokens.
Gemini 3.1 Pro Preview launched on February 19, 2026 as the upgraded core Pro preview. The Gemini API model page lists:
| Preview endpoint | Input types | Output | Token limits | Main role |
|---|---|---|---|---|
| gemini-3.1-pro-preview | Text, image, video, audio, PDF | Text | 1,048,576 input; 65,536 output | Highest-complexity Pro preview for reasoning, software engineering, tool use, and multimodal analysis. |
| gemini-3-flash-preview | Text, image, video, audio, PDF | Text | 1,048,576 input; 65,536 output | Faster Gemini 3 line for high-throughput use. |
| gemini-3.1-flash-lite-preview | Text, image, video, audio, PDF | Text | 1,048,576 input; 65,536 output | Lowest-latency and cost-sensitive Gemini 3.1 text model. |
| gemini-3.1-flash-live-preview | Live audio/video dialogue setting | Audio/text depending on Live API path | Live API limits depend on session mode | Real-time voice and dialogue workloads. |
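Because these limits are documented per endpoint, it is worth checking a long prompt against them before sending it. A minimal sketch, assuming the google-genai SDK and the preview string from this snapshot (which only works while that string is live):

```python
# Minimal sketch: count tokens against the documented 1,048,576-token
# input limit before calling a preview endpoint. Assumes google-genai.
from google import genai

client = genai.Client()
MODEL = "gemini-3.1-pro-preview"  # preview string; expect churn
INPUT_LIMIT = 1_048_576           # documented input limit on this page

def safe_generate(contents: str) -> str:
    """Refuse over-limit prompts instead of letting the API reject them."""
    count = client.models.count_tokens(model=MODEL, contents=contents)
    if count.total_tokens > INPUT_LIMIT:
        raise ValueError(
            f"prompt is {count.total_tokens} tokens; limit is {INPUT_LIMIT}"
        )
    return client.models.generate_content(model=MODEL, contents=contents).text
```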
Google's April 21, 2026 Deep Research Max announcement also matters for the model map because it places Gemini 3.1 Pro inside a larger autonomous research agent, not just a chat box. The agent can search, read, use connected sources, synthesize reports, and cite evidence. That is a product architecture shift: the model is becoming one component in a longer research workflow.
Gemma: The Open-Weight Branch
Gemma is Google's open-weight branch built from Gemini research, not the same thing as frontier Gemini API models.
| Family | Release | Public sizes | Main point |
|---|---|---|---|
| Gemma 1 | February 2024 | 2B, 7B | First open-weight Gemma models. |
| Gemma 2 | June 2024 | 2B, 9B, 27B | Better training recipes and distillation from larger models. |
| Gemma 3 | March 2025 | 1B, 4B, 12B, 27B | Vision input, 128K context, function calling, quantized variants, and broad multilingual coverage. |
| Gemma 4 | April 2026 | E2B, E4B, 26B MoE, 31B dense | Apache 2.0 open models with edge and workstation targets; larger models offer up to 256K context. |
Use Gemma when the deployment constraints are local control, privacy, cost, fine-tuning, licensing, or edge hardware. Use closed Gemini API models when the task needs the strongest Google-hosted model and you can accept external API dependency.
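For the local-control case, a minimal inference sketch with Hugging Face transformers is shown below. The checkpoint id is a released Gemma 3 model; Gemma 4 ids are assumed to follow the same naming pattern but are not confirmed here, and Gemma checkpoints require accepting the license on Hugging Face before download.

```python
# Minimal local-inference sketch with transformers. Requires accelerate
# for device_map="auto" and an accepted Gemma license on Hugging Face.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",  # swap in a Gemma 4 id once published
    device_map="auto",             # place weights on available GPU/CPU
)

out = generator(
    "Explain the difference between Gemma and Gemini in two sentences.",
    max_new_tokens=128,
)
print(out[0]["generated_text"])
```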
Google's Infrastructure Advantage
TPUs. Gemini is trained and served on Google's TPU stack. That gives Google unusually tight control over the hardware, compiler, distributed training system, serving path, and model architecture.
Data and products. Search, YouTube, Android, Workspace, Google Books, and other Google products create a large surface for data, evaluation, user feedback, and deployment. This is not automatically a quality guarantee. It does mean that Gemini has a distribution and integration path that most labs cannot match.
API and product coupling. Gemini is not only a model family. It is a product layer across Search, Gemini app, NotebookLM, Android Studio, Vertex AI, Google AI Studio, and Google Cloud. That matters because user-facing capabilities often come from model plus tools plus retrieval plus product constraints.
What Not To Overclaim
- Do not rank Gemini from stale benchmark tables. Public rankings move quickly. Use current benchmark pages only for dated comparisons, and state the date.
- Do not treat context length as solved retrieval. A 1M-token window is useful, but lost-in-the-middle behavior, instruction conflicts, and irrelevant evidence still matter.
- Do not confuse Gemma with Gemini. Gemma is open-weight and local-friendly; Gemini is Google's closed hosted frontier line.
- Do not treat preview endpoints as stable defaults. The Gemini 3 Pro Preview shutdown is the concrete warning: preview strings can be replaced quickly.
- Do not infer hidden parameter counts. Google does not disclose most Gemini frontier parameter counts. The factual page should say "undisclosed" rather than repeat rumors.
Common Confusions
Native multimodality does not automatically mean better vision
Jointly training on text, images, audio, and video can help cross-modal reasoning, but it does not guarantee superior performance on every visual task. Evaluation still depends on the benchmark, prompting setup, image resolution, and whether the task needs OCR, spatial reasoning, world knowledge, or tool use.
A preview model can be stronger and riskier at the same time
A preview endpoint may be the best public Gemini model for a task, but it can also have more churn. For a product that needs reproducible behavior, a stable 2.5 model may be safer until a 3.x model becomes stable.
Long context is not the same as memory
Context is information supplied inside the current request. Memory is information retained across requests by a product or agent system. Gemini can have a large context window, but persistent user memory depends on the product layer and user controls.
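A tiny sketch makes the split visible. The memory store below is a hypothetical product-layer illustration, not a Gemini API feature: the model only ever sees whatever the product assembles into the current request's context.

```python
# Hypothetical product-layer memory: persisted across requests, then
# injected into each request's context. The model itself is stateless.
memory_store: dict[str, list[str]] = {}  # user_id -> remembered facts

def remember(user_id: str, fact: str) -> None:
    """Persist a fact so future requests can include it in context."""
    memory_store.setdefault(user_id, []).append(fact)

def build_contents(user_id: str, new_message: str) -> str:
    """Assemble the per-request context from persisted memory + message."""
    remembered = memory_store.get(user_id, [])
    preamble = "\n".join(f"Known about user: {fact}" for fact in remembered)
    return f"{preamble}\n\nUser message: {new_message}".strip()

remember("u1", "prefers concise answers")
print(build_contents("u1", "What changed in Gemini 3.1?"))
```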
Exercises
Problem
The Gemini API docs list gemini-3.1-pro-preview with a 1,048,576-token input limit and 65,536-token output limit. What practical problem does this solve that a 128K-token model cannot solve as directly?
Problem
Why is gemini-3.1-pro-preview not automatically the correct production choice even if it is the current Pro preview?
Problem
Design a fair evaluation for choosing between Gemini 2.5 Pro, Gemini 3.1 Pro Preview, and Gemma 4 for a private-codebase assistant.
References
Canonical and technical:
- Vaswani et al., "Attention Is All You Need" (2017)
- Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways" (2022)
- Google DeepMind, "Gemini: A Family of Highly Capable Multimodal Models" (2023)
- Google, "Introducing Gemini 1.5, Google's next-generation AI model" (Feb 15, 2024)
- Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context" (2024)
- Google DeepMind, Gemini 3 Pro Model Card (updated Dec 2025)
Current product/API references:
- Google AI for Developers, Gemini models (accessed Apr 22, 2026)
- Google AI for Developers, Gemini 3.1 Pro Preview (last updated Apr 1, 2026)
- Google AI for Developers, Gemini 3.1 Flash-Lite Preview (last updated Apr 1, 2026)
- Google AI for Developers, Gemini API deprecations (accessed Apr 22, 2026)
- Google, "Gemini 2.5: Our most intelligent AI model" (Mar 25, 2025)
- Google, "Gemini 3: Introducing the latest Gemini AI model from Google" (Nov 18, 2025)
- Google, "Gemini 3 Flash: frontier intelligence built for speed" (Dec 17, 2025)
- Google, "Gemini 3.1 Pro: A smarter model for your most complex tasks" (Feb 19, 2026)
- Google, "Deep Research Max: a step change for autonomous research agents" (Apr 21, 2026)
- Google, "Introducing Gemma 3" (Mar 12, 2025)
- Google, "Gemma 4: Our most capable open models to date" (Apr 2, 2026)
- Schaeffer et al., "Are Emergent Abilities of Large Language Models a Mirage?" (2023)
Next Topics
- Model comparison table: compare Gemini against GPT, Claude, DeepSeek, Llama, Qwen, and Kimi with dated caveats
- Sparse attention and long context: why million-token windows require engineering beyond vanilla attention
- Llama and open-weight models: compare Gemma's open-weight branch with Meta's Llama line
Last reviewed: April 22, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Transformer Architecture (layer 4 · tier 2)
Derived topics
- Sparse Attention and Long Context (layer 4 · tier 2)
- LLaMA and Open Weight Models (layer 5 · tier 2)
- Model Comparison Table (layer 5 · tier 2)