
Formal Verification

AlphaProof and AI-Assisted Theorem Proving

AlphaProof (DeepMind, July 2024) earned an IMO silver medal solving 4 of 6 problems by autoformalizing into Lean 4 and running a reinforcement-learning theorem prover. AlphaGeometry 2 solved IMO 2024 Problem 4 in geometry. The state of AI-assisted formal mathematics in 2026: autoformalization, retrieval-augmented theorem proving, and the LLM + Lean kernel pipeline.

Advanced · Tier 1 · Frontier · Supporting · ~28 min

Why This Matters

In July 2024, Google DeepMind announced AlphaProof and AlphaGeometry 2: AI systems that earned an IMO silver medal on the 2024 International Mathematical Olympiad (July 11-22, 2024 in Bath, UK). Together they solved 4 of the 6 problems, scoring 28 of 42 points, one point short of the gold-medal threshold. AlphaProof tackled algebra and number-theory problems via Lean 4; AlphaGeometry 2 solved the geometry problem.

This was the first time an AI system reached medal-level performance on the IMO. Three caveats apply: the problem statements were manually translated into formal language for AlphaProof, some solutions took days of compute rather than the contestants' timed sessions, and the proofs were graded by mathematicians rather than scored by the system itself. Even with those caveats, the result is a structural shift in the AI-assisted theorem-proving landscape and a concrete demonstration that the formal-language + LLM approach scales beyond toy problems.

This page surveys the 2026 landscape:

  • The AlphaProof / AlphaGeometry 2 architecture.
  • The autoformalization pipeline (LLM translates English → Lean 4).
  • The reinforcement-learning theorem prover layer.
  • Prior work the line builds on (GPT-f, Polu-Sutskever 2020; LeanDojo / ReProver, Yang et al. 2023).
  • Where this technology is going through 2026.

The AI-theorem-proving system landscape

| System | Year | Architecture | Substrate | Open / closed | Headline result |
|---|---|---|---|---|---|
| AlphaProof | 2024 | LLM autoformalizer + RL prover + Lean kernel | Lean 4 / Mathlib | closed (DeepMind) | IMO 2024 silver, 4 / 6 problems |
| AlphaGeometry 2 | 2024 | LLM proposer + symbolic geometry engine | custom geometry DSL | closed (DeepMind) | IMO 2024 Problem 4 (geometry) |
| GPT-f | 2020 | Transformer trained on Metamath | Metamath | closed | first credible LM theorem prover |
| HyperTree Proof Search | 2022 | tree search + LLM proposer | Lean / Metamath | closed (Meta) | substantial miniF2F gains |
| LeanDojo / ReProver | 2023 | retrieval-augmented LLM | Lean 4 / Mathlib | open | miniF2F state-of-the-art at release |
| Llemma | 2024 | math-pretrained base model (7B / 34B) | text + Lean / Coq / Isabelle | open weights | base model for downstream provers |
| Lean Copilot | 2024 | tactic-suite LLM integration | Lean 4 / Mathlib | open | interactive premise selection + proof search |
| DeepSeek-Prover-V2 | 2025 | RL self-play on Lean | Lean 4 / Mathlib | open weights (7B and 671B) | competitive with proprietary on miniF2F |
| Goedel-Prover-V2 | 2025 | self-correction loop on Lean | Lean 4 / Mathlib | open weights | state-of-the-art miniF2F-test pass rate |

The pattern is consistent: every system separates a fast, fallible proposer (the LLM or RL agent) from a slow, sound verifier (Lean kernel, Coq kernel, custom DSL, or SMT). The proposer is trained to generate proof attempts; the verifier rejects anything that does not type-check. This is the AlphaZero pattern (Silver et al. 2017) lifted to mathematics: search + neural-network policy + ground-truth verifier.

The AlphaProof Architecture

DeepMind's published description (the Hubert et al. 2024 blog post; the system has not yet appeared in a peer-reviewed paper as of this writing) gives the following architecture:

  1. Autoformalization layer (LLM). A specialized language model translates the problem statement from English mathematical prose into Lean 4 syntax. The output is a theorem ... : ... := by sorry skeleton with the formal statement filled in.

  2. Search layer (RL theorem prover). A reinforcement-learning agent generates Lean 4 proof attempts by predicting tactic sequences. The agent is trained on a curriculum of problem-proof pairs from mathlib4 and synthetic problem generation. AlphaZero-style self-play across millions of problem variants drives the policy.

  3. Verifier layer (Lean kernel). Every candidate proof is checked by Lean 4's kernel. If the proof type-checks, the problem is considered solved. There is no reliance on the model's confidence; kernel verification is the trust layer.
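The three layers compose into a generate-and-check loop. Below is a minimal sketch of that loop, not DeepMind's code; it assumes a `lean` binary on PATH, and the `propose` callable stands in for the RL agent or LLM:

```python
import subprocess
import tempfile
from pathlib import Path

def lean_kernel_check(lean_source: str) -> bool:
    """Write a candidate proof to a file and ask Lean 4 to elaborate and
    kernel-check it.  Exit code 0 means the proof was accepted.
    (Assumes a `lean` binary on PATH; a real pipeline would check inside
    a Mathlib project, e.g. via `lake env lean`.)"""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(lean_source)
        result = subprocess.run(["lean", str(path)], capture_output=True)
        return result.returncode == 0

def prove(statement: str, propose, verify=lean_kernel_check, budget: int = 64):
    """Proposer/verifier loop: `propose` is any fast, fallible guesser
    (an LLM or RL policy); `verify` is the slow, sound judge."""
    for _ in range(budget):
        attempt = propose(statement)   # fast, fallible guess
        if verify(attempt):            # slow, sound check
            return attempt             # kernel-verified proof
    return None                        # budget exhausted: unsolved, not false
```

The design point is that `propose` can be arbitrarily unreliable without affecting soundness; only `verify` is trusted.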

Proposition

Kernel-Verified Soundness Reduction

Statement

For an AlphaProof-style pipeline that emits a Lean term t as a candidate proof of a theorem T, the soundness of the end-to-end system reduces to the soundness of the Lean 4 kernel. Concretely: if kernel.check t T returns true, then T is a theorem of Lean's logic (under the imported environment); the AI components — autoformalizer, RL proposer, retrieval — are not in the trust base.

Intuition

The AI is a guesser; the kernel is the judge. A guess that fails to type-check is rejected and contributes nothing. A guess that type-checks is, by definition, a Lean proof of T. The AI's training process, biases, and even active deception cannot produce a wrong "proof" that the kernel accepts, because acceptance by the kernel is the definition of correctness. This is the same trust-base argument that makes Coq's small kernel design valuable: by minimizing the trusted code, you minimize what can go wrong.

Why It Matters

This is the architectural reason AlphaProof is qualitatively different from a chain-of-thought language model that emits "proofs" as English. Chain-of-thought proofs require human review for correctness; kernel-checked Lean proofs require human review only of the theorem statement. The autoformalization step is therefore the remaining trust hole: if the autoformalizer mistranslates "for all n ≥ 0" as "for all n ∈ ℤ," the kernel will happily verify the wrong theorem.
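A concrete Lean 4 illustration of this trust hole (illustrative code; assumes Mathlib for the ℤ notation):

```lean
import Mathlib

-- Intended statement: "for every integer n ≥ 0, n² + n is even."
theorem intended (n : ℤ) (h : 0 ≤ n) : 2 ∣ n ^ 2 + n := by sorry

-- Mistranslation: the hypothesis 0 ≤ n was silently dropped.  The kernel
-- checks exactly the statement it is given; it never sees the English.
-- (Here the stronger claim happens to remain true, which makes the
-- divergence even easier to miss in review.)
theorem mistranslated (n : ℤ) : 2 ∣ n ^ 2 + n := by sorry
```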

Failure Mode

Soundness fails if (a) the kernel itself has a bug, (b) an imported axiom is inconsistent (e.g., a buggy axiom declaration in user code), or (c) the autoformalized statement diverges from the human-intended statement. (a) and (b) are also failure modes for human Lean users; (c) is specific to the AI pipeline and is the active research frontier.
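Failure mode (b) can at least be audited mechanically: Lean's `#print axioms` command lists every axiom a checked proof depends on. A sketch (the theorem here is illustrative):

```lean
import Mathlib

theorem demo (n : ℤ) : 2 ∣ n * (n + 1) := by sorry

-- Audit the trust base of an accepted proof:
#print axioms demo
-- While the body is `sorry`, this reports `sorryAx`, flagging an
-- incomplete proof.  A finished Mathlib-style proof typically reports
-- only the standard axioms: propext, Classical.choice, Quot.sound.
```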

The system reportedly used "a few days" of compute per problem during the IMO event, orders of magnitude more than a human contestant's 4.5-hour sessions, but a finite budget that suggests the technology is approaching practical applicability.

In the AlphaZero analogy (Silver et al. 2017), the Lean kernel plays the role of the game rules, a ground truth the policy cannot argue with, and the RL agent plays the role of MCTS-guided move selection.

The AlphaGeometry 2 Track

AlphaGeometry 2 is the geometry-specific system; it solved IMO 2024 Problem 4 (a synthetic-geometry problem about a triangle and its circumcircle). Its predecessor, AlphaGeometry (Trinh et al., Nature 2024), was IMO-bronze level on a held-out olympiad-geometry test set.

The architecture differs from AlphaProof:

  • Symbolic engine. A custom geometry solver applies classical geometric construction rules (extending lines, drawing circles, etc.).
  • LLM proposer. A language model proposes auxiliary constructions that guide the symbolic engine toward a solution.
  • Synthetic training. Trained from 100M synthetic geometry problems generated programmatically; no human proofs needed.

This is a neuro-symbolic hybrid rather than a pure RL+kernel system. The symbolic engine plays the same role as Lean's kernel: a ground-truth verifier the LLM cannot fool. AlphaGeometry 2's improvements over the original include better synthetic data and a stronger LLM proposer.
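The alternation between symbolic engine and LLM proposer can be sketched generically. All names and callables below are illustrative stand-ins, not DeepMind's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    premises: frozenset   # known geometric facts
    goal: str             # fact to derive

def solve_geometry(problem, deduce_closure, propose_construction, goal_in,
                   budget=16):
    """Neuro-symbolic loop in the AlphaGeometry style: saturate the
    symbolic engine's deductions; if the goal is still underived, ask
    the LLM proposer for an auxiliary construction (e.g. "add the
    midpoint of AB") and saturate again."""
    facts = deduce_closure(problem.premises)
    for _ in range(budget):
        if goal_in(problem.goal, facts):
            return facts               # goal derived by the symbolic engine
        aux = propose_construction(problem, facts)
        facts = deduce_closure(set(facts) | {aux})
    return None                        # proposer budget exhausted
```

As with AlphaProof, the proposer can be wrong without breaking soundness: only facts derived by the symbolic engine count.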

Background: GPT-f and the LLM Theorem-Proving Line

The AlphaProof line builds on a decade of work on neural theorem proving:

  • GPT-f (Polu and Sutskever, 2020): the first GPT-style Transformer trained on Metamath proofs. Showed that language models could generate proof scripts above chance, but performance was modest.
  • HOList (Bansal et al. 2019): early large-scale supervised learning for theorem proving in HOL Light.
  • PACT (Han et al. 2022): proof-artifact co-training for Lean; auxiliary self-supervised tasks extracted from proof terms improve tactic prediction.
  • HyperTree Proof Search (Lample et al. 2022): tree search on tactic sequences combined with LLM proposers.
  • LeanDojo + ReProver (Yang et al., NeurIPS 2023): retrieval-augmented theorem prover; selects relevant Mathlib lemmas as context for the LLM. Open-source.
  • Llemma (Azerbayev et al. 2024): a 7B / 34B parameter language model trained on math-heavy text + code, designed as a base for theorem-proving fine-tuning.
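Of these, ReProver's central mechanism, premise selection by dense retrieval, is simple to sketch. The embeddings below are stand-ins for a learned encoder (ReProver uses a trained retrieval model; the vectors here are hand-picked for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_premises(goal_embedding, library, k=3):
    """Rank library lemmas by similarity to the current goal state and
    return the top-k names; these are handed to the tactic generator
    as context.  `library` maps lemma name -> embedding."""
    ranked = sorted(library.items(),
                    key=lambda item: cosine(goal_embedding, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```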

The AlphaProof advance, by reports, was substantially improved data efficiency through autoformalization-based synthetic data and stronger RL on the proof-search layer. The exact technical details await a published paper.

Autoformalization

The bottleneck for AlphaProof and similar systems is getting the problem statement into Lean 4 in the first place. This is autoformalization: translating mathematical English into a formal statement in a proof assistant.

Definition

Autoformalization

Autoformalization is the task of translating a natural-language mathematical statement (definition, lemma, theorem) into a syntactically valid term in a formal proof assistant such as Lean 4, Coq, or Isabelle/HOL. Concretely: given an English sentence such as "every continuous function on a compact metric space attains its maximum," produce the corresponding Lean 4 term, typically of the shape theorem extreme_value_theorem ... : ... := by ..., with the type signature filled in (proof body left as sorry for a downstream prover to complete). Modern pipelines also include an autoinformalization loop: translate Lean back to English and check semantic agreement. Autoformalization rates on undergraduate exercises were around 30-40% in 2022 (Wu et al.) and have climbed past 60% on definitions / 50% on theorem statements by 2025 with retrieval-augmented frontier models. Proofs remain unreliable.
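As a sketch, the Lean 4 output for the example sentence above might look like this (the theorem name and exact phrasing are illustrative; Mathlib states the result through lemmas such as IsCompact.exists_isMaxOn):

```lean
import Mathlib

-- Illustrative autoformalization of "every continuous function on a
-- compact metric space attains its maximum"; the proof body is left
-- as `sorry` for a downstream prover to complete.
theorem extreme_value_theorem
    {X : Type*} [MetricSpace X] [CompactSpace X] [Nonempty X]
    (f : X → ℝ) (hf : Continuous f) :
    ∃ x, ∀ y, f y ≤ f x := by
  sorry
```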

The state of autoformalization in 2026:

  • Granite-3-Mathlib (IBM, 2024): a fine-tuned Llama-2 70B model trained on Mathlib4 + retrieval. Reports 60-70% accuracy on autoformalizing real-analysis exercises.
  • ByT5-Mathlib (Wu et al. 2022, updated through 2024): byte-level transformer for autoformalization with improved accuracy on definitions vs. theorems.
  • Wu et al. 2022 ("Autoformalization with Large Language Models"): the foundational paper; showed Codex and PaLM could autoformalize undergraduate math at moderate rates.
  • draft + verify pipelines: an LLM produces a draft Lean statement; the kernel (plus targeted checks such as round-trip translation) verifies it; a human or a second LLM corrects errors.

For now, autoformalization is reliable for definitions and basic theorem statements but unreliable for proofs. The AlphaProof pipeline succeeds because the proof-search component is the heavy lifter; the autoformalization just has to nail the statement.
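A minimal sketch of such a draft + verify loop, with the kernel's error text fed back for repair. All callables are stand-in assumptions; note that passing the type-check establishes only that the statement elaborates, not that it is faithful to the English:

```python
def draft_and_verify(english, llm_draft, llm_repair, type_check, max_rounds=3):
    """Draft a formal statement, then iterate: if the kernel rejects it,
    feed the error message back to a repair model and try again.
    `type_check` returns (ok, error_message)."""
    draft = llm_draft(english)
    for _ in range(max_rounds):
        ok, error = type_check(draft)
        if ok:
            return draft        # elaborates; faithfulness still unchecked
        draft = llm_repair(english, draft, error)
    return None                 # give up; escalate to a human
```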

What This Means for Mathematicians

Several practical consequences in 2026:

  • Mathlib4 contributors increasingly use AI-assisted workflows. Tools like Llemma-tuned aesop extensions, retrieval-based exact? augmentation, and Lean Copilot (Song et al. 2024) are now standard in serious Lean formalization projects.

  • Olympiad-level problems are partially in reach. AlphaProof's IMO performance suggests that within 2-5 years, AI systems may consistently solve IMO problems under standard rules, possibly within the human time limits (today, compute cost rather than capability is the binding constraint).

  • PFR formalization (Tao 2023) showed that a major research-level result (the Polynomial Freiman-Ruzsa theorem) can be formalized in Lean 4 in weeks rather than years. AI assistance shortens this further.

  • AlphaGeometry-style systems for non-geometric domains. The neuro-symbolic pattern (LLM proposer + symbolic verifier) is being applied to combinatorics, number theory, and analysis. The challenge is finding the right symbolic-engine target for each domain.

  • Research mathematicians as Lean operators. The role of "mathematician fluent in Lean" is becoming distinct from "Lean engineer." Mathlib4 contributors are increasingly publishing mathematicians who use Lean as a research tool.

Limitations and Open Problems

Despite the IMO result, several practical limitations remain:

  • Compute cost. AlphaProof reportedly used "days" of compute per problem. Real-time interactive use requires orders of magnitude speedup.
  • Statement coverage. Autoformalization is only reliable for ~60% of typical undergraduate / olympiad statements; the remaining 40% require human assistance.
  • Long proofs. AlphaProof excels at short, tactical proofs (10-100 lines). Multi-page proofs requiring deep conceptual moves remain hard.
  • Domain transfer. Performance on number theory and algebra exceeds performance on combinatorics or analysis. No clear unified architecture yet.
  • Black-box risk. AI-generated proofs can be hundreds of lines of opaque tactic invocations. The kernel verifies them, but humans cannot easily understand or maintain them.

The Path-Network Connection

For TheoremPath and the path-network domains:

  • ProofsPath: AI-assisted theorem proving is the natural successor to traditional olympiad coaching. The craft of mathematical proof writing remains valuable; the AI becomes another collaborator at the proof.
  • TheoremPath: ML theorems formalized in Lean 4 (via the formalizing-ml-theorems-in-lean-and-coq page) form the substrate that AlphaProof and similar systems can prove against.
  • ComputationPath: the architecture of AlphaProof illustrates the RL on a verifiable substrate pattern that defines modern AI-for-math; it sits naturally in the computability + verification toolkit.
  • DSAPath: tree search and retrieval are core algorithmic primitives that AlphaProof relies on.

Common Mistakes

Watch Out

Treating AlphaProof as a 'theorem-proving GPT'

AlphaProof is not just an LLM. It's an AlphaZero-style RL agent with the Lean kernel as the verifier. The LLM component (autoformalization, proposal generation) is necessary but not sufficient.

Watch Out

Confusing IMO performance with general theorem proving

IMO problems are short, well-specified, and have clean solutions. AlphaProof's success on the IMO doesn't mean it can prove arbitrary research mathematics; those proofs are longer, more open-ended, and require domain-specific mathematical insight.

Watch Out

Assuming the Lean kernel can be tricked

The Lean kernel is the trust layer; AlphaProof cannot "convince" it of false claims. Any proof AlphaProof produces is a correct proof in Lean 4. The AI's job is to find the proof, not to be trusted about the proof.

Watch Out

Overestimating autoformalization reliability

For research-level statements, autoformalization is still unreliable. Robust pipelines combine LLM drafting with human review (or with a verifier loop that retries on type-checking failure). Pure LLM autoformalization is not production-ready.

Exercises

ExerciseCore

Problem

Why is the Lean kernel the trust boundary for AlphaProof, not the RL agent or the LLM?

ExerciseAdvanced

Problem

Sketch how an AlphaGeometry-style approach (LLM proposer + symbolic engine) might be adapted to combinatorial olympiad problems, where the symbolic substrate is less obvious.

Cross-Network Links

  • TheoremPath: theorem-proving-in-lean is the foundational prerequisite. ML-theorem case studies, a Lean tactic cookbook, and a treatment of reasoning-data curation (reasoning-data-curation) are planned companion pages.
  • ProofsPath (forthcoming, content/proofspath-topics/): olympiad technique pages cover the human-style techniques AlphaProof must learn (induction, pigeonhole, vieta jumping, invariants and monovariants, the extremal principle). The Lean 4 wrappers landing in TheoremPath via PRs #185-#188 are the bridge between the human craft and the AI prover.
  • ComputationPath (parked): the architecture of AlphaProof is the canonical RL on a verifiable substrate pattern. Until ComputationPath ships, the relevant overview lives on TheoremPath at theorem-proving-in-lean.

References

Canonical:

  • Google DeepMind. "AI achieves silver-medal standard solving International Mathematical Olympiad problems." Blog post, July 25 2024. The official AlphaProof / AlphaGeometry 2 announcement; no peer-reviewed paper published as of this writing.
  • Trinh, Wu, Le, He, Luong. "Solving olympiad geometry without human demonstrations." Nature 625 (2024): 476-482. The AlphaGeometry (predecessor) paper; describes the symbolic-engine + LLM-proposer architecture and the synthetic-data pipeline that AlphaGeometry 2 inherits.
  • Polu and Sutskever. "Generative Language Modeling for Automated Theorem Proving." arXiv:2009.03393 (2020). GPT-f, the foundational paper for the LLM-as-proof-script-generator line.
  • Silver et al. "Mastering the game of Go without human knowledge." Nature 550 (2017), 354-359. The AlphaZero paper; the AlphaProof RL+verifier pattern is its direct descendant.

Current:

  • Yang et al. "LeanDojo: Theorem Proving with Retrieval-Augmented Language Models." NeurIPS 2023. Open-source retrieval-augmented theorem prover; the public benchmark for LLM-driven Lean proving.
  • Azerbayev et al. "Llemma: An Open Language Model for Mathematics." ICLR 2024, arXiv:2310.10631. 7B / 34B base model trained on Proof-Pile-2; the standard open-weight starting point for theorem-proving fine-tuning.
  • Song et al. "Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean." arXiv:2404.12534 (2024). The Lean Copilot tactic suite (select_premises, suggest_tactics, search_proof).
  • Wu et al. "Autoformalization with Large Language Models." NeurIPS 2022, arXiv:2205.12615. The first systematic study of Codex / PaLM autoformalization rates on undergraduate math.
  • Mathlib Community. "Mathlib4: The Lean Mathematical Library." leanprover-community.github.io, ongoing. The substrate every modern Lean-based theorem prover targets; over 1.2M lines as of 2026.
  • Tao. "Formalizing the proof of PFR in Lean 4 - a personal account." Blog post, December 2023. The Polynomial Freiman-Ruzsa formalization; case study for what AI assistance can shorten in research-level formalization.

Frontier:

  • DeepSeek-Prover team. "DeepSeek-Prover-V2." arXiv:2504.14589 (2025). Open-weight RL-trained prover competitive with proprietary frontier on miniF2F; 7B and 671B-MoE variants.
  • Goedel-Prover team. "Goedel-Prover-V2." arXiv:2502.07640 (2025). Self-correction-driven Lean prover; reaches state-of-the-art miniF2F-test pass rates without RL on a held-out set.
  • Tsoukalas et al. "PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition." arXiv:2407.11214 (2024). The Putnam-level evaluation harness across Lean, Coq, Isabelle.


Last reviewed: April 28, 2026
