
LLM Construction

Structured Output and Constrained Generation

Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space improves reliability and can, on some tasks, improve quality.

Advanced · Tier 2 · Frontier watch · ~45 min

Why This Matters

LLMs generate free-form text. Applications need structured data: JSON for APIs, SQL for databases, code that parses, function calls with correct argument types. The gap between "text that looks like JSON" and "valid JSON that matches a schema" is the difference between a demo and a production system.

Structured output techniques guarantee that every generated token sequence is valid according to a specified grammar or schema. This eliminates an entire class of failure modes (malformed output). The effect on content quality is task-dependent: Tam et al. (2024, arXiv:2408.02442) show that strict format constraints can hurt reasoning quality on some tasks, while on others they leave quality unchanged or improve it. Treat the trade-off between reliability and reasoning headroom as empirical, not settled.

Mental Model

At each generation step, the LLM produces a probability distribution over the full vocabulary. Constrained decoding applies a mask that zeros out tokens that would make the output invalid at that point. The model can only choose from tokens that are legal continuations according to the grammar. This is like playing chess where illegal moves are removed from the board before you choose: you cannot make a mistake, and the reduced option set focuses your decision.

Formal Setup

Let $V$ be the vocabulary and $G$ be a context-free grammar (or schema) defining the set of valid outputs. At generation step $t$, the model has produced tokens $y_1, \ldots, y_{t-1}$. The set of valid next tokens is:

$$V_t = \{v \in V : \exists \text{ valid completion of } y_1 \ldots y_{t-1} v \text{ under } G\}$$

Definition

Constrained Decoding

Constrained decoding modifies the generation process by applying a mask at each step:

$$p_{\text{constrained}}(y_t | y_{<t}) = \frac{p(y_t | y_{<t}) \cdot \mathbf{1}[y_t \in V_t]}{\sum_{v \in V_t} p(v | y_{<t})}$$

Tokens not in $V_t$ receive zero probability. The remaining probabilities are renormalized. The result is a valid probability distribution over only legal continuations.
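A minimal sketch of this renormalization in plain Python (the toy distribution, token strings, and valid set are invented for illustration):

```python
def constrain(p, valid):
    """Apply the constrained-decoding mask: zero out invalid tokens,
    then renormalize over the surviving (valid) tokens."""
    masked = {tok: prob for tok, prob in p.items() if tok in valid}
    total = sum(masked.values())
    if total == 0:
        raise ValueError("grammar admits no token the model can emit")
    return {tok: prob / total for tok, prob in masked.items()}

# Toy next-token distribution: the model prefers to start with prose,
# but the grammar only allows tokens that begin a JSON value.
p = {"Sure": 0.50, "{": 0.30, "[": 0.15, "the": 0.05}
valid = {"{", "["}

print(constrain(p, valid))  # {'{': 0.666..., '[': 0.333...}
```

Because renormalization scales all surviving probabilities by the same factor, it preserves their relative order at each step; what changes is where the removed probability mass ends up.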

Definition

Grammar-Guided Generation

Grammar-guided generation uses a formal grammar (regular expression, context-free grammar, or JSON schema compiled to a grammar) to compute $V_t$ at each step. The grammar is maintained as a parser state that advances with each generated token. At each position, the parser determines which tokens would lead to a valid parse continuation.
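As a toy illustration of maintaining a parser state, here is a hand-built DFA for the regex -?[0-9]+ (the vocabulary, token strings, and state names are invented for this sketch); a token is in $V_t$ if every one of its characters keeps the automaton alive:

```python
# Hand-built DFA for the toy regex -?[0-9]+ (an optional minus sign,
# then one or more digits). States: "start", "signed", "digits".
DIGITS = set("0123456789")

def step(state, ch):
    """Advance the automaton by one character; return None on a dead end."""
    if state == "start":
        if ch == "-":
            return "signed"
        if ch in DIGITS:
            return "digits"
    elif state == "signed":
        if ch in DIGITS:
            return "digits"
    elif state == "digits":
        if ch in DIGITS:
            return "digits"
    return None

def valid_next(state, vocab):
    """V_t: tokens whose every character keeps the automaton alive."""
    out = set()
    for tok in vocab:
        s = state
        for ch in tok:
            s = step(s, ch)
            if s is None:
                break
        else:
            out.add(tok)
    return out

vocab = {"-", "0", "12", "3a", "abc"}
print(valid_next("start", vocab))   # {'-', '0', '12'}
print(valid_next("signed", vocab))  # {'0', '12'}
```

In this automaton every live state can still reach an accepting state, so prefix viability coincides with the existence of a valid completion. Production systems such as Outlines take the same approach at scale: compile the pattern to an automaton and precompute the valid token set for each state.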

Main Theorems

Proposition

Constrained Decoding Guarantees Valid Output

Statement

If $V_t$ is the exact set of tokens that can lead to a valid completion under grammar $G$, then constrained decoding produces a valid output with probability 1. Formally, for any sequence $y_1, \ldots, y_T$ generated by constrained decoding with $p_{\text{constrained}}(y_t | y_{<t}) > 0$ for all $t$, the output $y_1 \ldots y_T$ is in the language $L(G)$.

Intuition

At every step, only tokens leading to valid completions are allowed. Since each step maintains validity, the full output is valid by induction. This is the same principle as a parser that rejects invalid input at the earliest possible point, but applied in reverse: we prevent invalid output at the earliest possible point.

Proof Sketch

By induction on $t$. Base case: $V_1$ contains only tokens that start a valid string in $L(G)$. Inductive step: if $y_1 \ldots y_{t-1}$ is a valid prefix (has at least one valid completion in $L(G)$), then any $y_t \in V_t$ maintains this property by definition of $V_t$. At the final step, the EOS token is in $V_T$ only if the current prefix is a complete valid string.

Why It Matters

Unconstrained generation of JSON exhibits a non-trivial parse failure rate that grows with schema complexity, while grammar-constrained generation drives this rate to near zero by construction. Exact rates depend on model, prompt, and schema, and should be measured per application rather than quoted as a universal number. For production systems making millions of API calls, even a low percentage of malformed outputs translates into a meaningful reliability cost.

Failure Mode

The guarantee is only as good as the grammar. If the grammar is more permissive than the intended schema (e.g., allows any string value when you need a specific enum), the output will be syntactically valid but semantically wrong. Constrained decoding guarantees form, not content.
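A small sketch of the difference (schema, field name, and enum values invented for illustration): both outputs below are syntactically valid JSON with the right field, but only a check that encodes the intended enum catches the second.

```python
import json

ALLOWED_STATUSES = {"pending", "shipped", "delivered"}  # the intended enum

def loose_check(text):
    """Permissive grammar: any JSON object with a string 'status' field."""
    obj = json.loads(text)
    return isinstance(obj.get("status"), str)

def strict_check(text):
    """Grammar that also encodes the enum values."""
    obj = json.loads(text)
    return obj.get("status") in ALLOWED_STATUSES

good = '{"status": "shipped"}'
bad = '{"status": "on its way!"}'   # syntactically fine, semantically wrong

print(loose_check(good), loose_check(bad))    # True True
print(strict_check(good), strict_check(bad))  # True False
```

A grammar compiled from the permissive schema would happily generate the second output; the enum has to be pushed into the grammar itself to be enforced at decoding time.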

Empirical observation: grammar constraints reduce the effective search space

At each generation step, unconstrained generation chooses from $|V|$ tokens (typically 32K-128K). Grammar-guided generation restricts this to $|V_t|$ tokens. For structured formats:

  • JSON with a fixed schema: $|V_t|$ ranges from 1 (only a colon after a key) to a fraction of $|V|$ inside a string value
  • SQL with a fixed table schema: $|V_t|$ is often on the order of tens to a couple of hundred tokens at most positions
  • Regular expressions: $|V_t|$ is determined by the automaton state

The reduction in $|V_t|$ is exact: the grammar acts as a hard filter on the next-token distribution. The effect of this filter on downstream quality is a different question. Masking redistributes probability mass across the surviving tokens, which can shift which semantically meaningful option the model ultimately picks. Some reports claim that grammar constraints match or improve unconstrained quality on structured tasks; others (Tam et al. 2024) show format constraints hurting reasoning on benchmarks. Frame the reduction as a mechanism, not as a theorem about generation quality.

Watch Out

Mask-induced probability shifts are not neutral

Zeroing out invalid tokens and renormalizing is not a projection onto a subspace of the model's "intended" distribution. The model was trained without the mask. When a mask removes the token the model would have preferred (say, the start of an explanation in natural language before a JSON payload), probability mass flows to tokens the model would not have chosen on their own. Sometimes that sharpens output. Sometimes it forces the model into a reasoning path it handles poorly.
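A toy two-step example makes the sequence-level effect concrete (all probabilities and the two-string language are invented): greedy constrained decoding commits to the model's preferred first token even when most of that branch's probability mass is about to be masked away, so it can end up on a different sequence than the most probable valid one.

```python
# Toy two-step model. Valid language under the grammar: {"ab", "cd"}.
step1 = {"a": 0.6, "c": 0.3, "x": 0.1}          # unconstrained next-token probs
step2 = {"a": {"b": 0.05, "e": 0.95},           # after "a", the model wants "e",
         "c": {"d": 0.90, "e": 0.10}}           # which the grammar will mask

def greedy_constrained():
    # Step 1: mask to {a, c}, pick the argmax among survivors.
    first = max((t for t in step1 if t in {"a", "c"}), key=lambda t: step1[t])
    # Step 2: the grammar forces "b" after "a" and "d" after "c".
    second = {"a": "b", "c": "d"}[first]
    return first + second

def most_probable_valid():
    scores = {"ab": step1["a"] * step2["a"]["b"],
              "cd": step1["c"] * step2["c"]["d"]}
    return max(scores, key=scores.get)

print(greedy_constrained())    # 'ab'  (joint mass 0.6 * 0.05 = 0.03)
print(most_probable_valid())   # 'cd'  (joint mass 0.3 * 0.90 = 0.27)
```

The per-step mask preserves the relative order of surviving tokens, but it cannot look ahead: the branch the model prefers locally may be one the grammar forces into a low-probability continuation.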

Failure mode. If the grammar is too restrictive, it can force the model into outputs it would not naturally produce, degrading quality. For example, if the schema requires an integer field but the correct answer requires a float, the grammar will force an incorrect integer. If the schema forces immediate JSON emission and the model benefits from a chain of thought first, the constraint can cut off reasoning that would have improved the answer.

Implementation Approaches

Token-level masking (Outlines, llama.cpp GBNF grammars, XGrammar, SynCode). Compile the grammar into a state machine. At each step, query the state machine for valid next tokens, build a bitmask over the vocabulary, and apply it to logits before softmax. The cost per step is the state machine transition plus the mask application, typically negligible compared to the model forward pass. XGrammar (Dong et al. 2024, arXiv:2411.15100) focuses on low-overhead mask computation for pushdown automata corresponding to JSON schemas. SynCode (Ugare et al. 2024, arXiv:2403.01632) couples LLM decoding with an incremental parser for context-free grammars in code generation.
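In logit space, the mask is typically applied by setting invalid positions to negative infinity before the softmax, which is equivalent to zeroing and renormalizing in probability space. A stdlib-only sketch (the logits and valid set are invented):

```python
import math

def masked_softmax(logits, valid):
    """Set grammar-invalid logits to -inf, then softmax: invalid tokens
    get exactly zero probability and survivors renormalize automatically."""
    masked = [x if i in valid else -math.inf for i, x in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]   # toy logits over a 4-token vocabulary
valid = {1, 2}                   # bitmask from the grammar's state machine

probs = masked_softmax(logits, valid)
print(probs)  # tokens 0 and 3 get 0.0; tokens 1 and 2 share the mass
```

Real engines build the valid set as a vocabulary-sized bitmask per automaton state, so the per-step cost is one table lookup plus an elementwise mask over the logits.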

Programmatic prompting DSLs (LMQL, Guidance, Instructor). LMQL (Beurer-Kellner et al. 2023, arXiv:2212.06094) treats prompting as a query language with constraints that compile to token-level masks at inference time. Microsoft Guidance interleaves fixed template tokens with model-generated spans under regex or grammar constraints. Instructor (jxnl/instructor) is a Python library that wraps LLM clients and maps Pydantic schemas to OpenAI function-calling or structured-outputs calls, with client-side validation and retries.

OpenAI function calling vs. structured outputs. These are distinct. Function calling (since 2023) passes a JSON schema in the request; the model is trained to emit arguments that usually match the schema, but conformance is validated after generation rather than enforced during decoding. Structured outputs mode (OpenAI, August 2024) uses constrained decoding server-side to guarantee that every sampled sequence is a member of the schema's language. Anthropic's tool-use API similarly relies on the model following the schema, with client-side validation rather than a hard guarantee.

Guided generation with retries. Generate unconstrained output, validate it, and retry on failure. This is simpler to implement but wastes compute on invalid generations and provides no formal guarantee of eventual success. Expected retry cost is $1/(1-p)$ attempts, where $p$ is the per-attempt failure rate; this stays cheap when $p$ is small and becomes expensive on complex schemas where $p$ can approach $0.5$.
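A sketch of the retry pattern with a stubbed generator standing in for the LLM call (the stub outputs, validator, and attempt limit are invented for illustration):

```python
import json

def generate_with_retries(generate, validate, max_attempts=5):
    """Validate-and-retry: no decoding-time guarantee, just resampling."""
    for attempt in range(1, max_attempts + 1):
        out = generate()
        if validate(out):
            return out, attempt
    raise RuntimeError(f"no valid output in {max_attempts} attempts")

# Stub that fails twice before producing valid JSON
# (stands in for an unconstrained LLM call).
outputs = iter(['{"age": }', "not json at all", '{"age": 31}'])

def fake_llm():
    return next(outputs)

def is_valid(text):
    try:
        return isinstance(json.loads(text).get("age"), int)
    except (ValueError, AttributeError):
        return False

result, attempts = generate_with_retries(fake_llm, is_valid)
print(result, attempts)  # {"age": 31} 3
```

Note the open-ended failure mode: if the model systematically produces invalid output for a given prompt, the loop exhausts its budget rather than converging, which is exactly the guarantee constrained decoding provides and this pattern does not.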

Practical Considerations

Grammar compilation cost. Converting a JSON schema to a state machine is a one-time cost. For complex schemas with nested objects, arrays, and recursive types, the state machine can have thousands of states. This is acceptable because the compilation happens once per schema, not per request.

Interaction with sampling. Constrained decoding works with any sampling strategy: greedy, top-k, top-p, temperature scaling. The constraint is applied as a hard mask on logits before the sampling strategy. The sampling strategy only sees the valid tokens.
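A sketch of that ordering (invented logits; the top-k and temperature values are arbitrary): the mask is applied first, so top-k and temperature only ever see grammar-valid tokens.

```python
import math
import random

def sample_constrained(logits, valid, temperature=0.7, k=2, seed=0):
    """Mask first, then apply an arbitrary sampling strategy to survivors."""
    # 1. Hard mask: grammar-invalid tokens are removed before sampling.
    pool = [(i, x) for i, x in enumerate(logits) if i in valid]
    # 2. Top-k over the surviving tokens only.
    pool = sorted(pool, key=lambda p: p[1], reverse=True)[:k]
    # 3. Temperature-scaled softmax and sampling.
    m = max(x for _, x in pool)
    weights = [math.exp((x - m) / temperature) for _, x in pool]
    rng = random.Random(seed)
    return rng.choices([i for i, _ in pool], weights=weights)[0]

logits = [3.0, 2.5, 2.0, -1.0, 0.0]
valid = {1, 2, 4}                 # from the grammar's current state
token = sample_constrained(logits, valid)
print(token)  # always one of {1, 2}: the top-2 of the valid tokens
```

Token 0 has the highest logit but is never sampled: the mask removed it before top-k ran.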

Streaming. Constrained decoding is compatible with token-by-token streaming. Each token is guaranteed valid in context, so partial outputs can be streamed to the client. The client can parse partial JSON as it arrives.

Strict JSON numbers and mid-token rejections. RFC 8259 forbids leading zeros in JSON numbers: the grammar accepts 0, 7, and 1.5, but not 007 or 01. LLM tokenizers often emit leading-zero sequences naturally (numeric strings, zero-padded identifiers, dates). A strict JSON grammar will mask the second 0 of an emerging 007, forcing the sampler onto a different branch mid-number. This appears as "grammar constraints slowing generation" or "the model gets stuck," and is a common cause of degraded throughput and quality under strict JSON decoding. Relaxing the number production to accept leading zeros (non-standard JSON) or post-processing numbers as strings are both common workarounds.
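The RFC 8259 number production can be written as a regex to see the rejection directly (this is a character-level check, not a tokenizer-level mask):

```python
import re

# JSON number grammar per RFC 8259: no leading zeros in the integer part.
JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")

for s in ["0", "7", "1.5", "-0.25", "1e10", "007", "01"]:
    print(s, bool(JSON_NUMBER.fullmatch(s)))
# 0, 7, 1.5, -0.25, 1e10 -> True; 007, 01 -> False
```

Under a token-level mask, a model that has emitted 0 and proposes another digit has that digit masked, which is the mid-number rejection described above: the sampler is forced onto a different branch partway through the number it intended to write.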

Common Confusions

Watch Out

Constrained decoding is not fine-tuning

Constrained decoding modifies the inference-time sampling procedure without changing the model weights. The model is not retrained to produce structured output. The constraint is applied externally. Fine-tuning on structured data can improve the model's natural tendency to produce valid output, but constrained decoding provides the hard guarantee.

Watch Out

JSON mode is weaker than schema-constrained generation

JSON mode (as offered by some APIs) guarantees the output is valid JSON. Schema-constrained generation guarantees the output matches a specific JSON schema (correct field names, types, required fields). Valid JSON that does not match your schema is useless. Always prefer schema-level constraints when available.

Watch Out

Constrained decoding can still produce wrong content

A model constrained to output JSON matching a schema {"temperature": number} will always produce valid JSON with a numeric temperature field. But the number might be wrong. Constrained decoding guarantees syntax, not semantics. Factual accuracy still depends on the model's knowledge and reasoning.

Summary

  • Constrained decoding masks invalid tokens at each step, guaranteeing valid output
  • Grammar-guided generation maintains a parser state to compute valid token sets
  • The constraint reduces the effective search space; the effect on content quality is task-dependent and can go either way
  • Token-level masking adds negligible overhead to the model forward pass
  • Constrained decoding works with any sampling strategy (greedy, top-k, top-p)
  • JSON mode guarantees valid JSON; schema mode guarantees conformance to a specific schema
  • Syntax is guaranteed; semantics (correctness of content) is not

Exercises

ExerciseCore

Problem

A JSON schema requires output of the form {"name": string, "age": integer}. At the position immediately after {"name": "Alice", "age": , what tokens are in the valid set $V_t$? Assume the vocabulary includes digits 0-9, letters, quotes, braces, brackets, commas, colons, and minus sign.

ExerciseAdvanced

Problem

You are building a constrained decoding system for SQL generation. The grammar allows only SELECT queries on a database with tables users (columns: id, name, email) and orders (columns: id, user_id, amount, date). After the model generates SELECT name, email FROM users WHERE , what is $V_t$? Consider both column names and SQL keywords that can follow WHERE.

References

Canonical:

  • Willard and Louf, "Efficient Guided Generation for Large Language Models" (2023, arXiv:2307.09702). The Outlines library and its regex-to-FSM compilation
  • Scholak et al., "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" (EMNLP 2021). SQL-constrained generation via incremental parsing
  • Beurer-Kellner et al., "Prompting Is Programming: A Query Language for Large Language Models" (LMQL, 2023, arXiv:2212.06094)
  • IETF RFC 8259, "The JavaScript Object Notation (JSON) Data Interchange Format" (2017). The grammar that strict JSON masks enforce

Current:

  • Dong et al., "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models" (2024, arXiv:2411.15100)
  • Ugare et al., "SynCode: LLM Generation with Grammar Augmentation" (2024, arXiv:2403.01632)
  • Tam et al., "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models" (2024, arXiv:2408.02442). Evidence that format constraints can hurt reasoning quality
  • OpenAI, "Structured Outputs" documentation (Aug 2024). Server-side constrained decoding for schema conformance, distinct from function-calling validation
  • Microsoft Guidance (github.com/guidance-ai/guidance). Template-plus-constraint DSL for controlled generation
  • Instructor (github.com/jxnl/instructor). Pydantic-schema wrapper over OpenAI and Anthropic structured-output and tool-use APIs
  • GBNF grammar support in llama.cpp (2023-2024)


Last reviewed: April 18, 2026
