
LLM Construction

Test-Time Compute and Search

Spending more inference compute through best-of-N, verifier-guided search, tree search, process rewards, and latent/internal computation, with gains limited by verifier quality and base pass rate.


Why This Matters

[Infographic: 'Test-Time Compute and Search'. Six-step pipeline: prompt → initial model proposal → extra test-time compute → search/branching → verification/ranking → final answer. Panels cover why to use test-time compute, common strategies (chain-of-thought, self-consistency, best-of-N, beam/tree search, tool use and retrieval, verifier/reranker guidance, reflection loops), key tradeoffs, and when it helps most. Closing line: 'Test-time compute does not change weights; it changes how much reasoning and selection happens at inference time.']
Test-time compute and search shifts work from training (parameter updates) to inference (more sampling, branching, verification, reflection). Each strategy in the table buys quality at a specific compute and latency cost.

Training scaling laws describe how pretraining compute, data, and parameters affect loss. Test-time compute is a different choice: spend more work on a single query by sampling more candidates, searching over partial reasoning traces, calling verifiers, or allocating extra internal computation before the final answer.

The useful claim is narrower than "more compute means better reasoning." Extra inference compute pays off when three conditions hold together: the model has a nonzero base pass rate on the task, the candidate set adds real diversity rather than echoing the same failure mode, and the selection signal correlates with true success. Drop any one and longer outputs just become longer mistakes.

Mental Model

A model that commits to a single answer in one forward pass spends a fixed compute budget. A model that is uncertain can convert additional inference compute into accuracy through four mechanisms:

  1. Best-of-$N$: generate $N$ candidates, select via verifier or majority vote
  2. Process reward: produce step-level reasoning scored by a process reward model
  3. Tree search: expand and backtrack through partial solutions
  4. Latent reasoning: iterate internally before emitting tokens (see latent reasoning)

Each trades inference cost for accuracy. The empirical questions are how much accuracy improves with budget, where the curve flattens, and how the answer depends on verifier quality.

Inference Compute: The Core Idea

Definition

Test-Time Compute

Test-time compute (or inference compute) is the total computation spent generating a response to a single query. For a standard autoregressive model generating $T$ tokens with $N$ parameters, inference compute is approximately $2NT$ FLOPs. Test-time compute scaling increases this by generating multiple candidates, performing search, or extending the reasoning chain.

The standard approach generates one response: compute $\approx 2NT$. Test-time scaling methods spend a larger multiple of this cost to produce, search, or evaluate candidate answers. The central question is not just whether quality improves, but whether the improvement survives the cost, latency, verifier, and benchmark assumptions.

Best-of-N Sampling

The simplest test-time scaling method: generate $N$ independent responses and select the best one using a verifier or reward model.
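A minimal sketch of the loop, assuming hypothetical `generate` (samples one completion) and `verifier` (scores a candidate) callables standing in for a real model and reward model:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],         # hypothetical: sample one completion
    verifier: Callable[[str, str], float],  # hypothetical: score (prompt, candidate)
    n: int = 16,
) -> str:
    """Generate n independent candidates, return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```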

Proposition

Best-of-N Scaling with a Perfect Verifier

Statement

If the model produces a correct answer with probability $p$ per sample, and the verifier perfectly identifies correctness, then the probability of at least one correct answer among $N$ samples is:

$$P(\text{at least one correct}) = 1 - (1-p)^N$$

For small $p$, this scales approximately linearly: $P \approx Np$ when $Np \ll 1$. To achieve success probability $1 - \delta$, you need:

$$N \geq \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}$$

samples. The cost scales as $O(1/p)$ to achieve high confidence.
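A quick numeric check of the bound, as a minimal sketch (the pass rate and failure probability here are illustrative, not from the source):

```python
import math

def required_samples(p: float, delta: float) -> int:
    """Smallest N with 1 - (1-p)^N >= 1 - delta, i.e. (1-p)^N <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

# Illustrative numbers: p = 0.1 per-sample pass rate, 95% target success.
print(required_samples(0.1, 0.05))   # 29 (exact bound)
print(math.log(1 / 0.05) / 0.1)      # ~30 (the log(1/delta)/p approximation)
```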

Intuition

If the model has even a small chance of producing the correct answer, generating enough samples makes success near-certain. The verifier acts as a filter: it cannot make the model smarter, but it can select the model's best work from many attempts. The bottleneck shifts from generation quality to verifier quality.

Why It Matters

Best-of-N is a simple baseline for test-time compute scaling. It requires no additional training, only a good verifier. For math and code, where verifiers are reliable (run the code, check the proof), best-of-N gives substantial improvements. For open-ended text where verification is hard, the gains are smaller.

Failure Mode

When the verifier is imperfect (a reward model rather than a ground-truth checker), best-of-N suffers from reward hacking at inference time. The selected sample maximizes the verifier's score, not actual quality. As $N$ increases, the selected sample increasingly exploits verifier weaknesses. This is the inference-time analog of reward model overoptimization.

Selection Without an External Verifier

Best-of-N assumes you can score candidates from outside. When no verifier exists, two cheap selection rules remain:

  • Self-consistency / majority voting (Wang et al., 2022). Sample $N$ chains of thought, extract the final answer from each, and return the most common. Works on tasks with a small discrete answer space (math word problems, multiple choice). Equivalent to a verifier that votes for whatever the model most often agrees with itself on. Fails on open-ended outputs and when the model is consistently wrong in the same way.
  • Self-evaluation. Ask the model to score its own candidates. Cheap, but gives a verifier-like signal whose quality is bounded by the model's own judgment. Useful as a baseline; weaker than a trained reward model on hard reasoning.

Self-consistency is strictly weaker than a good verifier: it cannot detect a shared mistake. But on benchmarks where the answer space is small and the model's wrong answers are diverse (so the right one wins the vote), the gap between self-consistency and a perfect verifier is small.
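A minimal majority-vote sketch, assuming hypothetical `generate` (samples one chain of thought) and `extract_answer` (parses the final answer out of it) callables:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical: sample one chain of thought
    extract_answer: Callable[[str], str],  # hypothetical: pull out the final answer
    n: int = 16,
) -> str:
    """Sample n chains, extract each final answer, return the most common one."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```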

Verifier-Guided Search

Best-of-N generates complete responses independently. More sophisticated methods use the verifier to guide generation during the reasoning process.

Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)

  • ORM: Scores the final answer. Binary: correct or not.
  • PRM: Scores each intermediate reasoning step. Guides the search by pruning bad reasoning paths early.

With a PRM, you can perform step-level beam search: at each reasoning step, generate multiple continuations, score them with the PRM, and keep only the top-$k$. This avoids wasting compute on reasoning paths that go wrong early.
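A sketch of step-level beam search under stated assumptions: hypothetical `propose_steps` (samples candidate next steps for a partial trace), `prm_score` (step-level score of a partial trace), and `is_complete` (detects a finished solution) callables:

```python
from typing import Callable, List

def prm_beam_search(
    prompt: str,
    propose_steps: Callable[[str, int], List[str]],  # hypothetical: sample k next steps
    prm_score: Callable[[str], float],               # hypothetical: score a partial trace
    is_complete: Callable[[str], bool],              # hypothetical: is the trace finished?
    beam_width: int = 4,
    branch: int = 4,
    max_steps: int = 20,
) -> str:
    """Step-level beam search: keep the top-k partial traces by PRM score."""
    beams: List[str] = [prompt]
    for _ in range(max_steps):
        if all(is_complete(b) for b in beams):
            break
        expansions: List[str] = []
        for trace in beams:
            if is_complete(trace):
                expansions.append(trace)  # finished traces carry over unchanged
                continue
            for step in propose_steps(trace, branch):
                expansions.append(trace + "\n" + step)
        # Prune early: discard low-scoring partial reasoning paths
        beams = sorted(expansions, key=prm_score, reverse=True)[:beam_width]
    return max(beams, key=prm_score)
```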

Monte Carlo Tree Search (MCTS) for Reasoning

MCTS treats reasoning as a sequential decision problem:

  • State: The reasoning chain so far
  • Action: The next reasoning step
  • Reward: Correctness of the final answer (or PRM score)
  • Rollout: Complete the reasoning chain from the current state

The standard MCTS loop:

  1. Select a promising node using UCB (upper confidence bound)
  2. Expand by generating a new reasoning step
  3. Simulate by completing the chain (or using a value estimate)
  4. Backpropagate the outcome to update node statistics

This is the reason game-search language appears in some reasoning papers. The reasoning trace becomes the "game tree," and each step becomes a "move."
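A compact sketch of that loop, assuming hypothetical `propose_step` (samples one next reasoning step) and `rollout_reward` (completes the chain from a state and scores the outcome) callables:

```python
import math
from typing import Callable, List, Optional

class Node:
    def __init__(self, trace: str, parent: Optional["Node"] = None):
        self.trace = trace              # state: the reasoning chain so far
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0                # running sum of rollout rewards

    def ucb(self, c: float = 1.4) -> float:
        """Upper confidence bound; unvisited nodes are explored first."""
        if self.visits == 0:
            return float("inf")
        exploit = self.value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(
    prompt: str,
    propose_step: Callable[[str], str],      # hypothetical: sample one next step
    rollout_reward: Callable[[str], float],  # hypothetical: finish the chain, score it
    iterations: int = 100,
) -> str:
    root = Node(prompt)
    for _ in range(iterations):
        # 1. Select: descend by UCB until reaching a leaf
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expand: append one newly sampled reasoning step
        child = Node(node.trace + "\n" + propose_step(node.trace), parent=node)
        node.children.append(child)
        # 3. Simulate: complete the chain from the new state and score the outcome
        reward = rollout_reward(child.trace)
        # 4. Backpropagate: update visit counts and values up to the root
        while child is not None:
            child.visits += 1
            child.value += reward
            child = child.parent
    # Return the trace through the most-visited first step
    return max(root.children, key=lambda n: n.visits).trace
```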

Method Matrix

| Method | Where the extra compute goes | Selection signal | Strongest use cases | Main failure mode |
|---|---|---|---|---|
| Single pass | one rollout | none | easy tasks, low latency | first answer is brittle |
| Best-of-N | many complete candidates | final-answer verifier or ORM | code, math, constrained generation | reward overoptimization or duplicated wrong answers |
| Self-consistency / majority vote | many complete candidates | most common final answer | discrete-answer reasoning when no verifier exists | shared failure mode across samples sweeps the vote |
| PRM-guided beam/tree search | branch expansion on partial traces | process reward model | long multi-step reasoning with meaningful step scores | PRM label noise and branch collapse |
| MCTS / verifier-guided search | targeted rollouts on promising prefixes | rollout reward, PRM, or verifier | theorem proving, program synthesis, hard math | expensive search with mismatch between search policy and final verifier |
| Latent or silent compute | extra internal steps per answer | trained controller or reasoning objective | tasks where extra internal depth helps more than extra text | opaque computation and weak audit surface |

These methods can be combined. A modern reasoning system may use long chain-of-thought rollouts, a verifier for best-of-N selection, and a PRM to prune branches during search.

Chain-of-Thought as Inference Compute

Definition

Chain-of-Thought Scaling

Chain-of-thought (CoT) prompting increases inference compute by generating intermediate reasoning tokens before the final answer. A model generating $T$ reasoning tokens plus $T_a$ answer tokens spends approximately $2N(T + T_a)$ FLOPs. More reasoning tokens means more compute.
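A worked instance of the $2N(T + T_a)$ estimate (the model size and token counts below are illustrative, not from the source):

```python
N = 7e9        # parameters (illustrative)
T = 1_000      # reasoning tokens (illustrative)
T_a = 100      # answer tokens (illustrative)

flops = 2 * N * (T + T_a)
print(f"{flops:.2e} FLOPs")  # ~1.54e+13 FLOPs for this single query
```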

The hypothesis: the transformer's fixed per-token computation is insufficient for hard reasoning, and generating intermediate tokens provides additional "serial computation" that enables multi-step inference.

Proposition

When Test-Time Scaling Helps

Statement

For verifiable reasoning tasks, extra test-time compute can improve selected accuracy when it creates useful candidate diversity and the selection signal is better than chance.

There is no universal law of the form

$$\text{accuracy}(C_{\text{inf}}) = a \log(C_{\text{inf}}) + b$$

that applies across models, benchmarks, search methods, and verifiers. Empirical curves are method- and task-dependent: they may improve quickly, improve slowly, or saturate once the model's candidate distribution stops producing useful new solutions.

Intuition

More inference compute is useful when it changes the set of candidate solutions and the system can identify the better ones. If every candidate repeats the same wrong pattern, or if the verifier prefers polished wrong answers, more compute does not buy reliability.

Why It Matters

This is the honest version of inference scaling. Training compute and inference compute can be partially substitutable on some reasoning benchmarks, but the tradeoff is conditional. A smaller model plus search can outperform a larger single-pass model when the smaller model already has a useful pass rate and the search procedure is compute-aware.

Failure Mode

Verifier overoptimization is the main failure mode. As the number of candidates increases, an imperfect reward model is more likely to select responses that exploit its blind spots. The selected answer can look better to the verifier while getting worse by the true task metric.

Latent Reasoning

One active direction is reasoning in hidden states rather than in visible tokens.

Standard chain-of-thought reasoning forces the model to express all reasoning as text tokens. This has limitations:

  • Bandwidth: Natural language is a low-bandwidth communication channel for complex reasoning
  • Faithfulness: The written chain-of-thought may not reflect the model's actual computation
  • Efficiency: Generating tokens is expensive; internal computation is cheaper

Latent reasoning models use special "thinking tokens" that are processed by the transformer but not decoded into text. The model can perform many steps of internal computation (many forward passes through "thinking" positions) before producing an output token. This increases the model's effective depth for hard problems.

This is an active research area. The theoretical framework is developing: latent reasoning can be viewed as giving the model a variable-depth computation graph rather than the fixed depth of a standard transformer.

Build It This Way by Default

For reasoning tasks with verifiable answers (math, code, constrained generation), start with a small best-of-N baseline and report the full setup: $N$, sampling temperature, verifier, wall-clock budget, pass@1, pass@N, selected accuracy, and failure cases. Do not publish a universal default sample count. The right budget depends on base pass rate, verifier quality, latency, and benchmark difficulty.

What To Report Before Claiming A Gain

The recurring mistake in test-time-compute papers is to report only the best selected accuracy. A serious comparison should include:

| Quantity | Why it matters |
|---|---|
| pass@1 | tells you what the base model can do without search |
| pass@N or candidate diversity | shows whether extra samples actually uncover new correct paths |
| selected accuracy | measures what the verifier plus search procedure returns |
| verifier calls / rollout count | captures hidden compute beyond generated tokens |
| wall-clock latency and token budget | keeps the result tied to an actual deployment cost |
| failure cases | reveals whether gains come from better reasoning or better exploitation of the evaluator |
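A minimal sketch of computing the first three quantities over a benchmark, assuming hypothetical inputs: per-problem candidate lists, the index the verifier selected, and a ground-truth `is_correct` check:

```python
from typing import Callable, List

def accuracy_report(
    all_candidates: List[List[str]],         # candidates per problem; first = single pass
    selected_idx: List[int],                 # verifier's pick per problem
    is_correct: Callable[[int, str], bool],  # hypothetical ground-truth check
) -> dict:
    """Compute pass@1, pass@N, and selected accuracy; the pass@N-vs-selected
    gap measures verifier/search weakness."""
    m = len(all_candidates)
    pass_at_1 = sum(is_correct(i, c[0]) for i, c in enumerate(all_candidates)) / m
    pass_at_n = sum(any(is_correct(i, x) for x in c)
                    for i, c in enumerate(all_candidates)) / m
    selected = sum(is_correct(i, all_candidates[i][selected_idx[i]])
                   for i in range(m)) / m
    return {"pass@1": pass_at_1, "pass@N": pass_at_n, "selected_accuracy": selected}
```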

When Test-Time Compute Helps (and When It Doesn't)

Test-time compute is most effective when:

  • The task has a reliable verifier (code execution, math provers)
  • The model has a non-zero pass rate (it sometimes gets the answer right)
  • The task requires multi-step reasoning (search helps explore paths)

Test-time compute is least effective when:

  • Verification is as hard as generation (open-ended writing, subjective tasks)
  • The model never produces correct answers (fundamental capability gap)
  • The task is easy (single-pass generation already succeeds reliably)

Common Confusions

Watch Out

More tokens does not always mean better reasoning

Generating a longer chain-of-thought is not the same as reasoning more deeply. Models can produce long, fluent chains of text that arrive at the wrong answer. The quality of reasoning depends on the model's training, not just the length of the output. Structured search with verification matters more than raw token count.

Watch Out

Best-of-N is not the same as temperature sampling

High temperature increases output diversity but does not select for quality. Best-of-N generates diverse candidates and selects the best one. The selection step is what matters. Without a good verifier, high-temperature sampling just produces more varied wrong answers.

Watch Out

Pass@N is not the same as selected accuracy

pass@N asks whether at least one correct solution appears among the candidates. Selected accuracy asks whether the verifier-search system returns the correct one. The gap between the two is a direct measure of verifier or search weakness.

Watch Out

Test-time and train-time compute are not fully substitutable

A small model with massive test-time compute cannot match a large, well-trained model on all tasks. Test-time compute helps the model use knowledge it already has more effectively. It cannot inject knowledge the model never learned. The substitution works for reasoning tasks where the knowledge is present but the inference path is hard to find.

Summary

  • Test-time compute scaling: generate multiple candidates and select the best
  • Best-of-N with a verifier: success probability $= 1 - (1-p)^N$
  • PRMs guide search at each step; ORMs score only the final answer
  • MCTS adapts game-playing search to reasoning traces
  • Chain-of-thought provides serial compute via intermediate tokens
  • Latent reasoning trades visible tokens for internal computation
  • Verifier quality is the bottleneck: poor verifiers lead to inference-time reward hacking
  • Reported scaling curves are benchmark-, model-, method-, and verifier-dependent

Exercises

ExerciseCore

Problem

A model solves a math problem correctly with probability $p = 0.05$ per attempt. Using best-of-N with a perfect verifier, how many samples do you need to achieve a 95% probability of finding at least one correct solution?

ExerciseAdvanced

Problem

Compare the compute efficiency of two strategies: (A) train a 2x larger model and generate one response; (B) keep the original model and use best-of-$N$ with a perfect verifier. In a toy model where training scaling gives accuracy $\propto N_{\text{params}}^{0.1}$ and best-of-$N$ gives accuracy $1 - (1-p)^N$ with fixed per-sample cost, under what conditions does strategy B dominate?

ExerciseResearch

Problem

The inference-time reward hacking problem: as $N$ increases in best-of-N with an imperfect reward model, the selected response increasingly exploits the reward model rather than improving in true quality. Formalize this using the Gao et al. (2022) framework: if the proxy reward $\hat{r}$ and the true reward $r^*$ have correlation $\rho < 1$, how does the expected true reward of the best-of-N selection behave as $N \to \infty$?



Last reviewed: April 26, 2026
