LLM Construction
Structured Output and Constrained Generation
Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space guarantees reliability while its effect on content quality is task-dependent.
Prerequisites
- Transformer architecture
Why This Matters
LLMs generate free-form text. Applications need structured data: JSON for APIs, SQL for databases, code that parses, function calls with correct argument types. The gap between "text that looks like JSON" and "valid JSON that matches a schema" is the difference between a demo and a production system.
Structured output techniques guarantee that every generated token sequence is valid according to a specified grammar or schema. This eliminates an entire class of failure modes (malformed output). The effect on content quality is task-dependent: Tam et al. (2024, arXiv:2408.02442) show that strict format constraints can hurt reasoning quality on some tasks, while on others they leave quality unchanged or improve it. Treat the trade-off between reliability and reasoning headroom as empirical, not settled.
Mental Model
At each generation step, the LLM produces a probability distribution over the full vocabulary. Constrained decoding applies a mask that zeros out tokens that would make the output invalid at that point. The model can only choose from tokens that are legal continuations according to the grammar. This is like playing chess where illegal moves are removed from the board before you choose: you cannot make a mistake, and the reduced option set focuses your decision.
Formal Setup
Let $V$ be the vocabulary and $G$ be a context-free grammar (or schema) defining the set of valid outputs. At generation step $t$, the model has produced tokens $y_1, \dots, y_{t-1}$. The set of valid next tokens is:

$$V_{\text{valid}}(y_{1:t-1}) = \{\, v \in V : y_1 \dots y_{t-1}\, v \text{ is a prefix of some string in } L(G) \,\}$$
Constrained Decoding
Constrained decoding modifies the generation process by applying a mask at each step:

$$\tilde{p}(v \mid y_{1:t-1}) = \frac{p(v \mid y_{1:t-1}) \cdot \mathbb{1}[v \in V_{\text{valid}}(y_{1:t-1})]}{\sum_{v' \in V_{\text{valid}}(y_{1:t-1})} p(v' \mid y_{1:t-1})}$$

Tokens not in $V_{\text{valid}}$ receive zero probability. The remaining probabilities are renormalized. The result is a valid probability distribution over only legal continuations.
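A minimal NumPy sketch of this masking-and-renormalizing step; the toy vocabulary and the `valid_ids` set are illustrative stand-ins for whatever the grammar engine computes:

```python
import numpy as np

def constrained_distribution(logits: np.ndarray, valid_ids: set[int]) -> np.ndarray:
    """Zero out invalid tokens and renormalize over the legal continuations.

    `logits` are the model's raw next-token scores over the full vocabulary;
    `valid_ids` plays the role of V_valid at this step.
    """
    mask = np.full_like(logits, -np.inf)
    mask[list(valid_ids)] = 0.0              # keep only legal tokens
    masked = logits + mask                   # -inf logits become zero probability
    probs = np.exp(masked - masked.max())    # numerically stable softmax
    return probs / probs.sum()

# Toy example: 5-token vocabulary, grammar allows only tokens 1 and 3 here.
logits = np.array([2.0, 0.5, 1.0, 0.2, -1.0])
print(constrained_distribution(logits, {1, 3}))  # mass only on indices 1 and 3
```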
Grammar-Guided Generation
Grammar-guided generation uses a formal grammar (regular expression, context-free grammar, or JSON schema compiled to a grammar) to compute $V_{\text{valid}}$ at each step. The grammar is maintained as a parser state that advances with each generated token. At each position, the parser determines which tokens would lead to a valid parse continuation.
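A toy sketch of the parser-state idea, using a hand-written character-level automaton for the tiny language "one or more digits, then a closing brace". Real engines (Outlines, XGrammar) compile a full schema and operate over tokenizer tokens, not characters; this only shows the advance-and-query shape:

```python
# Transition table for: one or more digits, then '}'.
TRANSITIONS = {
    "start":  {c: "digits" for c in "0123456789"},
    "digits": {**{c: "digits" for c in "0123456789"}, "}": "done"},
    "done":   {},
}

def valid_next(state: str) -> set[str]:
    """The automaton's answer to 'which symbols keep the output valid here?'"""
    return set(TRANSITIONS[state])

def advance(state: str, symbol: str) -> str:
    return TRANSITIONS[state][symbol]

state = "start"
for ch in "42":
    assert ch in valid_next(state)   # the mask would allow this symbol
    state = advance(state, ch)
print(valid_next(state))             # another digit or '}' are both legal here
```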
Main Theorems
Constrained Decoding Guarantees Valid Output
Statement
If $V_{\text{valid}}(y_{1:t-1})$ is the exact set of tokens that can lead to a valid completion under grammar $G$, then constrained decoding produces a valid output with probability 1. Formally, for any sequence $y_1, \dots, y_T$ generated by constrained decoding with $y_t \in V_{\text{valid}}(y_{1:t-1})$ for all $t$, the output is in the language $L(G)$.
Intuition
At every step, only tokens leading to valid completions are allowed. Since each step maintains validity, the full output is valid by induction. This is the same principle as a parser that rejects invalid input at the earliest possible point, but applied in reverse: we prevent invalid output at the earliest possible point.
Proof Sketch
By induction on $t$. Base case: $V_{\text{valid}}(\emptyset)$ contains only tokens that start a valid string in $L(G)$. Inductive step: if $y_{1:t-1}$ is a valid prefix (has at least one valid completion in $L(G)$), then any $y_t \in V_{\text{valid}}(y_{1:t-1})$ maintains this property by definition of $V_{\text{valid}}$. At the final step, the EOS token is only in $V_{\text{valid}}$ if the current prefix is a complete valid string.
Why It Matters
Unconstrained generation of JSON exhibits a non-trivial parse failure rate that grows with schema complexity, while grammar-constrained generation drives this rate to near zero by construction. Exact rates depend on model, prompt, and schema, and should be measured per application rather than quoted as a universal number. For production systems making millions of API calls, even a low percentage of malformed outputs translates into a meaningful reliability cost.
Failure Mode
The guarantee is only as good as the grammar. If the grammar is more permissive than the intended schema (e.g., allows any string value when you need a specific enum), the output will be syntactically valid but semantically wrong. Constrained decoding guarantees form, not content.
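For instance, a schema that types a field as a free string accepts any text, while an enum pins it down. A small illustration using the `jsonschema` package (the field names and values here are made up):

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

permissive = {"type": "object", "properties": {"status": {"type": "string"}}}
strict = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["pending", "shipped", "delivered"]}
    },
}

doc = {"status": "I'd say it's probably on the way"}

validate(doc, permissive)          # passes: any string is a "valid" status
try:
    validate(doc, strict)          # fails: value is not one of the enum members
except ValidationError as e:
    print("rejected:", e.message)
```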
Empirical observation: grammar constraints reduce the effective search space
At each generation step, unconstrained generation chooses from $|V|$ tokens (typically 32K-128K). Grammar-guided generation restricts this to $|V_{\text{valid}}| \le |V|$ tokens. For structured formats:
- JSON with a fixed schema: $|V_{\text{valid}}|$ ranges from 1 (only a colon after a key) to a large fraction of $|V|$ inside a string value
- SQL with a fixed table schema: $|V_{\text{valid}}|$ is often on the order of tens to a couple of hundred tokens at most positions
- Regular expressions: $|V_{\text{valid}}|$ is determined by the automaton state
The reduction in $|V_{\text{valid}}|$ is exact: the grammar acts as a hard filter on the next-token distribution. The effect of this filter on downstream quality is a different question. Masking redistributes probability mass across the surviving tokens, which can shift which semantically meaningful option the model ultimately picks. Some reports claim that grammar constraints match or improve unconstrained quality on structured tasks; others (Tam et al. 2024) show format constraints hurting reasoning on benchmarks. Frame the reduction as a mechanism, not as a theorem about generation quality.
Mask-induced probability shifts are not neutral
Zeroing out invalid tokens and renormalizing is not a projection onto a subspace of the model's "intended" distribution. The model was trained without the mask. When a mask removes the token the model would have preferred (say, the start of an explanation in natural language before a JSON payload), probability mass flows to tokens the model would not have chosen on their own. Sometimes that sharpens output. Sometimes it forces the model into a reasoning path it handles poorly.
Failure mode. If the grammar is too restrictive, it can force the model into outputs it would not naturally produce, degrading quality. For example, if the schema requires an integer field but the correct answer requires a float, the grammar will force an incorrect integer. If the schema forces immediate JSON emission and the model benefits from a chain of thought first, the constraint can cut off reasoning that would have improved the answer.
Implementation Approaches
Token-level masking (Outlines, llama.cpp GBNF grammars, XGrammar, SynCode). Compile the grammar into a state machine. At each step, query the state machine for valid next tokens, build a bitmask over the vocabulary, and apply it to logits before softmax. The cost per step is the state machine transition plus the mask application, typically negligible compared to the model forward pass. XGrammar (Dong et al. 2024, arXiv:2411.15100) focuses on low-overhead mask computation for pushdown automata corresponding to JSON schemas. SynCode (Ugare et al. 2024, arXiv:2403.01632) couples LLM decoding with an incremental parser for context-free grammars in code generation.
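A sketch of how such a mask plugs into a Hugging Face-style decoding loop. `GrammarState` and its two methods are placeholders for whatever the grammar compiler exposes (Outlines' FSM, XGrammar's matcher), not a real API:

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogitsProcessor(LogitsProcessor):
    """Applies a grammar-derived bitmask to the logits before sampling.

    `grammar_state` is assumed (hypothetically) to expose:
      - allowed_token_ids(): token ids that are legal continuations now
      - advance(token_id):   move the parser state after a token is chosen
    """
    def __init__(self, grammar_state):
        self.grammar_state = grammar_state

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        allowed = list(self.grammar_state.allowed_token_ids())
        mask[:, allowed] = 0.0      # keep legal tokens, forbid the rest
        return scores + mask        # forbidden tokens sample with probability 0
```

In practice the processor would be passed to `model.generate(..., logits_processor=...)`, and the decode loop must advance the parser state after each sampled token; the libraries named above handle both sides for you.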
Programmatic prompting DSLs (LMQL, Guidance, Instructor). LMQL (Beurer-Kellner et al. 2023, arXiv:2212.06094) treats prompting as a query language with constraints that compile to token-level masks at inference time. Microsoft Guidance interleaves fixed template tokens with model-generated spans under regex or grammar constraints. Instructor (jxnl/instructor) is a Python library that wraps LLM clients and maps Pydantic schemas to OpenAI function-calling or structured-outputs calls, with client-side validation and retries.
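A minimal Instructor-style sketch. The Pydantic model is the essential part; the client-wrapping call has changed across Instructor versions, so treat the exact wiring as approximate:

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total_usd: float
    line_items: list[str]

# Wraps the OpenAI client so that `response_model=` drives schema-backed
# generation plus client-side validation and retries.
client = instructor.from_openai(OpenAI())

invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[{"role": "user",
               "content": "Extract the invoice: ACME, $1,200, 3 widgets"}],
)
print(invoice.total_usd)   # a validated float, not a string that looks like one
```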
OpenAI function calling vs. structured outputs. These are distinct. Function calling (since 2023) passes a JSON schema in the request; the model is trained to emit arguments that usually match the schema, but conformance is validated after generation rather than enforced during decoding. Structured outputs mode (OpenAI, August 2024) uses constrained decoding server-side to guarantee that every sampled sequence is a member of the schema's language. Anthropic's tool-use API similarly relies on the model following the schema, with client-side validation rather than a hard guarantee.
Guided generation with retries. Generate unconstrained output, validate it, and retry on failure. This is simpler to implement but wastes compute on invalid generations and provides no formal guarantee of eventual success. The expected number of attempts is $1/(1-q)$, where $q$ is the per-attempt failure rate; this stays cheap when $q$ is small and becomes expensive on complex schemas where $q$ can approach 1.
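A sketch of the retry pattern with Pydantic validation; `call_llm` is a placeholder for whatever unconstrained completion call the application makes:

```python
import json
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    city: str
    population: int

def call_llm(prompt: str) -> str:
    """Placeholder for an unconstrained completion call."""
    raise NotImplementedError

def generate_with_retries(prompt: str, max_attempts: int = 3) -> Answer:
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return Answer.model_validate(json.loads(raw))  # parse, then schema-check
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = e                                  # invalid: burn a retry
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```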
Practical Considerations
Grammar compilation cost. Converting a JSON schema to a state machine is a one-time cost. For complex schemas with nested objects, arrays, and recursive types, the state machine can have thousands of states. This is acceptable because the compilation happens once per schema, not per request.
Interaction with sampling. Constrained decoding works with any sampling strategy: greedy, top-k, top-p, temperature scaling. The constraint is applied as a hard mask on logits before the sampling strategy. The sampling strategy only sees the valid tokens.
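A toy sketch of the ordering: the grammar mask is applied first, and temperature and top-k only ever operate on the surviving tokens (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_top_k(logits: np.ndarray, valid_ids: set[int],
                 temperature: float = 0.8, k: int = 2) -> int:
    masked = logits.copy()
    invalid = [i for i in range(len(logits)) if i not in valid_ids]
    masked[invalid] = -np.inf                  # hard grammar mask first
    masked = masked / temperature              # then the usual sampling knobs
    top = np.argsort(masked)[-k:]              # top-k over surviving tokens only
    probs = np.exp(masked[top] - masked[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = np.array([3.0, 1.0, 2.5, 0.1, -0.5])
print(sample_top_k(logits, valid_ids={1, 3, 4}))  # never returns 0 or 2
```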
Streaming. Constrained decoding is compatible with token-by-token streaming. Each token is guaranteed valid in context, so partial outputs can be streamed to the client. The client can parse partial JSON as it arrives.
Strict JSON numbers and mid-token rejections. RFC 8259 forbids leading
zeros in JSON numbers: the grammar accepts 0, 7, and 1.5, but not
007 or 01. LLM tokenizers often emit leading-zero sequences naturally
(numeric strings, zero-padded identifiers, dates). A strict JSON grammar
will mask the second 0 of an emerging 007, forcing the sampler onto a
different branch mid-number. This appears as "grammar constraints slowing
generation" or "the model gets stuck," and is a common cause of degraded
throughput and quality under strict JSON decoding. Relaxing the number
production to accept leading zeros (non-standard JSON) or post-processing
numbers as strings are both common workarounds.
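A quick way to see the strictness, writing the RFC 8259 number production as a regex (a standard rendering of the RFC's grammar, worth checking against the RFC before relying on it):

```python
import re

# RFC 8259 number: optional minus, "0" or nonzero-led digits, optional frac/exp.
JSON_NUMBER = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?')

for s in ["0", "7", "1.5", "007", "01", "-0.25", "1e10"]:
    print(s, "valid" if JSON_NUMBER.fullmatch(s) else "rejected")
# "007" and "01" are rejected: after the first "0" the only legal continuations
# are ".", "e", "E", or the end of the number, so a strict mask forbids a second digit.
```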
Common Confusions
Constrained decoding is not fine-tuning
Constrained decoding modifies the inference-time sampling procedure without changing the model weights. The model is not retrained to produce structured output. The constraint is applied externally. Fine-tuning on structured data can improve the model's natural tendency to produce valid output, but constrained decoding provides the hard guarantee.
JSON mode is weaker than schema-constrained generation
JSON mode (as offered by some APIs) guarantees the output is valid JSON. Schema-constrained generation guarantees the output matches a specific JSON schema (correct field names, types, required fields). Valid JSON that does not match your schema is useless. Always prefer schema-level constraints when available.
Constrained decoding can still produce wrong content
A model constrained to output JSON matching a schema
{"temperature": number} will always produce valid JSON with a numeric
temperature field. But the number might be wrong. Constrained decoding
guarantees syntax, not semantics. Factual accuracy still depends on the
model's knowledge and reasoning.
Summary
- Constrained decoding masks invalid tokens at each step, guaranteeing valid output
- Grammar-guided generation maintains a parser state to compute valid token sets
- The constraint reduces the effective search space; its effect on content quality is task-dependent (Tam et al. 2024)
- Token-level masking adds negligible overhead to the model forward pass
- Constrained decoding works with any sampling strategy (greedy, top-k, top-p)
- JSON mode guarantees valid JSON; schema mode guarantees conformance to a specific schema
- Syntax is guaranteed; semantics (correctness of content) is not
Exercises
Problem
A JSON schema requires output of the form {"name": string, "age": integer}.
At the position immediately after {"name": "Alice", "age": , what tokens
are in the valid set $V_{\text{valid}}$? Assume the vocabulary includes digits 0-9,
letters, quotes, braces, brackets, commas, colons, and minus sign.
Problem
You are building a constrained decoding system for SQL generation. The
grammar allows only SELECT queries on a database with tables users
(columns: id, name, email) and orders (columns: id, user_id,
amount, date). After the model generates SELECT name, email FROM users WHERE ,
what is $V_{\text{valid}}$? Consider both column names and SQL keywords that can follow WHERE.
References
Canonical:
- Willard and Louf, "Efficient Guided Generation for Large Language Models" (2023, arXiv:2307.09702). The Outlines library and its regex-to-FSM compilation
- Scholak et al., "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" (EMNLP 2021). SQL-constrained generation via incremental parsing
- Beurer-Kellner et al., "Prompting Is Programming: A Query Language for Large Language Models" (LMQL, 2023, arXiv:2212.06094)
- IETF RFC 8259, "The JavaScript Object Notation (JSON) Data Interchange Format" (2017). The grammar that strict JSON masks enforce
Current:
- Dong et al., "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models" (2024, arXiv:2411.15100)
- Ugare et al., "SynCode: LLM Generation with Grammar Augmentation" (2024, arXiv:2403.01632)
- Tam et al., "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models" (2024, arXiv:2408.02442). Evidence that format constraints can hurt reasoning quality
- OpenAI, "Structured Outputs" documentation (Aug 2024). Server-side constrained decoding for schema conformance, distinct from function-calling validation
- Microsoft Guidance (github.com/guidance-ai/guidance). Template-plus-constraint DSL for controlled generation
- Instructor (github.com/jxnl/instructor). Pydantic-schema wrapper over OpenAI and Anthropic structured-output and tool-use APIs
- GBNF grammar support in llama.cpp (2023-2024)
Next Topics
The natural next steps from structured output:
- Tool-augmented reasoning: structured output enables reliable tool calling
Last reviewed: April 18, 2026