LLM Construction
Hallucination Theory
Why language models confabulate, framed by the Kalai-Nachum-Vempala-Zhang (2025) statistical lower bound on pretraining error and the evaluation-incentive equilibrium that prevents post-training from removing it. Calibration, conformal prediction, structural failure modes (reversal curse, snowballing), and the measurement gap (FActScore, SAFE).
Why This Matters
Language models produce fluent, confident, and sometimes false text. This is not a defect that scale, data quality, or alignment will quietly remove. Kalai, Nachum, Vempala, and Zhang (2025) give the first sharp statistical reason: the rate at which a base language model emits confident wrong answers on rare facts is bounded below by the fraction of facts in its training corpus that appear exactly once. Post-training does not fix the problem because every standard benchmark scores abstention as wrong, so the policy that maximizes benchmark score is the one that bluffs.
The page makes both halves of this argument precise, then organizes the rest of the hallucination literature around them: when calibration helps and when it cannot, where conformal prediction earns its place, and the structural failure modes (reversal curse, snowballing, long-tail entities) that are not calibration phenomena at all.
The Statistical Origin: A Lower Bound on Pretraining Error
Pretraining minimizes cross-entropy on a corpus drawn from a true distribution $p$. The model is then used generatively: sample tokens, return text. A "hallucination" is a confidently sampled output that lies outside the valid set of correct responses to the prompt. The first question is whether any base model, however well trained, can drive this rate to zero.
Kalai et al. answer no, and give a quantitative bound by reducing generation to binary classification.
The Is-It-Valid (IIV) reduction
For a prompt $x$, let $V_x$ be the set of valid responses and $E_x$ be a set of plausible-but-wrong "error" responses. The model $\hat p$ induces the binary classifier $\hat f(r) = 1$ iff $\hat p(r \mid x) > 1/|E_x|$, that is, iff the generative model assigns the response above-uniform mass relative to the error set. The IIV problem is to predict whether $r \in V_x$ given a sample $r$ drawn from a 50/50 mixture of valid responses and uniform errors. A density-estimation model therefore induces an IIV classifier; bounding generation error in terms of IIV error is the path from "good density estimator" to "few hallucinations."
Generation-to-classification reduction (Kalai-Nachum-Vempala-Zhang 2025, Theorem 1)
Statement
The base model's generation error rate (probability of sampling an invalid response) satisfies
$$\mathrm{err}_{\mathrm{gen}} \;\ge\; 2\,\mathrm{err}_{\mathrm{iiv}} \;-\; \frac{|V_x|}{|E_x|} \;-\; \delta,$$
where $\mathrm{err}_{\mathrm{iiv}}$ is the binary classification error of the induced IIV classifier and $\delta$ is a calibration term.
Intuition
A model that generates valid text well must, in particular, distinguish valid from invalid responses well. The bound makes the converse precise: generation error is at least twice classification error, modulo a ratio of valid-to-error set sizes and a calibration slack. There is no way to be a perfect generator without being a good classifier; classifier error sets a floor on generation error.
Proof Sketch
The argument is a direct calculation. The IIV classifier predicts "valid" exactly when $\hat p(r \mid x) > 1/|E_x|$. Errors split into false positives (invalid responses assigned high mass) and false negatives (valid responses assigned low mass). False positives lower-bound the probability that the generator emits an invalid response with above-uniform mass; false negatives lower-bound the probability that it under-samples valid responses. Summing the two contributions gives the $2\,\mathrm{err}_{\mathrm{iiv}}$ term. The subtracted $|V_x|/|E_x|$ term accounts for prompts where the valid set is unusually large relative to the error set (so a uniform generator would already classify well). The $\delta$ term absorbs the gap between $\hat p$ and the true marginal on the valid set.
Why It Matters
This converts an unsupervised question (does my generative model hallucinate?) into a supervised one (can a classifier separate valid from invalid completions?). Whatever lower bound applies to IIV classification immediately applies to generation. The next theorem turns this into a concrete bound that depends only on the corpus statistics.
Failure Mode
The bound is loose when the calibration term $\delta$ is large or when the valid set is comparable in size to the error set. It is meant as a floor on hallucination rate under reasonable conditions, not a tight characterization. The interesting case is when $|E_x|$ is much larger than $|V_x|$ (most prompts have a few correct answers and many plausible-looking wrong ones), where the subtracted ratio is negligible.
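A quick arithmetic check of the reduction, using the bound as reconstructed above; the function name and the example numbers are illustrative, not taken from the paper.

```python
def generation_error_floor(err_iiv: float, valid_size: float,
                           error_size: float, delta: float) -> float:
    """Floor implied by Theorem 1: err_gen >= 2*err_iiv - |V|/|E| - delta."""
    return max(0.0, 2.0 * err_iiv - valid_size / error_size - delta)

# Example: the induced IIV classifier errs 15% of the time, each prompt has
# 3 valid answers and 300 plausible errors, and the calibration slack is 0.02.
print(generation_error_floor(0.15, 3, 300, 0.02))  # 0.27
```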
The reduction is a setup. The payoff is the next bound, which says you can lower-bound hallucination rate from properties of the training corpus alone, before training ever runs.
Singleton rate
A prompt $x$ is a singleton in a training corpus if and only if exactly one response to $x$ appears in the corpus. The singleton rate $\mathrm{sr}$ is the fraction of prompts in the corpus that are singletons. Intuitively, singletons are facts the model has seen exactly one example of; the corpus gives no redundancy and no triangulation. Kalai et al. give the precise definition in Section 3.3.1.
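A minimal sketch of estimating the singleton rate from a corpus of prompt-response pairs. Whether the denominator counts distinct prompts or corpus entries is a convention choice (distinct prompts below), and the toy corpus is made up.

```python
from collections import Counter

def singleton_rate(corpus: list[tuple[str, str]]) -> float:
    """Fraction of distinct prompts that occur exactly once in the corpus."""
    prompt_counts = Counter(prompt for prompt, _ in corpus)
    singletons = sum(1 for c in prompt_counts.values() if c == 1)
    return singletons / len(prompt_counts)

corpus = [
    ("capital of France", "Paris"),
    ("capital of France", "Paris"),                     # repeated fact: not a singleton
    ("birthday of Ada Lovelace", "10 December 1815"),   # appears once
    ("thesis advisor of an obscure 1930s chemist", "…"),  # appears once
]
print(singleton_rate(corpus))  # 2 of 3 distinct prompts are singletons -> 0.67
```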
Singleton-rate lower bound on hallucination (Kalai-Nachum-Vempala-Zhang 2025, Theorem 2)
Statement
The base model's generation error rate satisfies
$$\mathrm{err}_{\mathrm{gen}} \;\ge\; \mathrm{sr} \;-\; \frac{|V_x|}{|E_x|} \;-\; \delta \;-\; \tilde O\!\left(\frac{1}{\sqrt{N}}\right),$$
where $\mathrm{sr}$ is the singleton rate of the corpus and $N$ is its size.
Intuition
If a fact appears exactly once in the training corpus, the model has no statistical lever to distinguish "rare-but-true" from "rare-and-wrong." The best it can do on singletons is guess. The fraction of guessable prompts is the singleton rate, and generation error on the corpus inherits this floor. The other terms shrink with corpus size $N$; the singleton rate does not.
Proof Sketch
Combine the IIV reduction (Theorem 1) with a counting argument on singletons. For singleton prompts the IIV classifier has at-chance accuracy because the empirical distribution gives no signal beyond a single sample. So $\mathrm{err}_{\mathrm{iiv}} \gtrsim \mathrm{sr}/2$ by a uniform-convergence argument on the empirical measure. Plugging into Theorem 1 and rearranging gives the stated bound, with the $\tilde O(1/\sqrt{N})$ term coming from a standard generalization bound (the explicit constants are derived in the paper).
Why It Matters
Three immediate consequences.
First, scaling the corpus shrinks every term in the bound except $\mathrm{sr}$ and $\delta$. The singleton-rate floor is a property of how facts are distributed in the world, not how much text you collect. World knowledge has a long tail: most named entities, most events, and most academic results appear in only a handful of sources. As you scrape more text, you discover more singletons at roughly the same rate, so $\mathrm{sr}$ does not vanish.
Second, the bound is on the base model's error, before any post-training. RLHF, Constitutional AI, and DPO act on a fixed pretrained model; they cannot lower the singleton-rate floor by themselves.
Third, the bound is unconditional in model architecture. A trillion-parameter transformer trained on the same corpus inherits the same floor. Capacity is not the bottleneck; data redundancy is.
Failure Mode
The bound is not tight on every slice of the data. Hallucination on frequent facts (entities that appear many times, well-tested benchmarks) can be much lower than $\mathrm{sr}$ because the relevant prompts are not singletons. The bound concerns the average over the corpus distribution; subset-level rates can be far smaller or larger. The bound also assumes the IIV setup with a discrete error set; it does not directly apply to continuous outputs (regression, ranking).
The empirical complement of this theorem is Mallen et al. (ACL 2023), who built PopQA: 14,000 entity questions stratified by Wikipedia popularity. They report that LM accuracy on long-tail entities barely improves with scale, while retrieval-augmented models dominate on the same slice. The shape of their plot is what the singleton-rate bound predicts: scaling fixes the head of the distribution, not the tail.
Calibration Theory: What It Buys and What It Does Not
If we cannot eliminate hallucination, can we at least know when the model is wrong? Calibration is the formal version of this question. A calibrated model whose top-class confidence is $q$ is correct a fraction $q$ of the time among predictions of confidence $q$.
Calibration
A predictor with top-class confidence $\hat p$ and prediction $\hat Y$ is calibrated if and only if for all $q$,
$$\Pr\bigl[\hat Y = Y \mid \hat p(X) = q\bigr] = q.$$
Calibration says the model's stated confidence equals its empirical accuracy at that confidence. It is a marginal property over the joint distribution of $(X, Y)$, not a guarantee about any particular prompt.
Expected Calibration Error
With top-class confidence $\hat p(X)$ and prediction $\hat Y(X)$, the population ECE is
$$\mathrm{ECE} = \mathbb{E}\Bigl[\,\bigl|\Pr(\hat Y = Y \mid \hat p(X)) - \hat p(X)\bigr|\,\Bigr].$$
The standard binned estimator (Guo et al. 2017) partitions the confidence range $[0,1]$ into $M$ bins $B_1, \dots, B_M$ and computes
$$\widehat{\mathrm{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|.$$
Almost every paper reports the binned estimator. It is not the same object as the population ECE; binning bias is well documented (Nixon et al. 2019, Roelofs et al. 2022).
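A reference implementation of the binned estimator described above; a minimal sketch with the usual equal-width bins, where the bin count and the synthetic data are arbitrary choices.

```python
import numpy as np

def binned_ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += (mask.sum() / n) * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# An overconfident toy model: claims 0.9 confidence but is right 60% of the time.
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
corr = (rng.random(1000) < 0.6).astype(float)
print(binned_ece(conf, corr))  # close to |0.6 - 0.9| = 0.3
```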
Brier score calibration-refinement decomposition (DeGroot-Fienberg 1983)
Statement
Let $\pi(f) = \Pr[Y = 1 \mid f(X)]$ be the conditional probability of $Y = 1$ given the forecast value. Then
$$\mathbb{E}\bigl[(f(X) - Y)^2\bigr] \;=\; \underbrace{\mathbb{E}\bigl[(f(X) - \pi(f))^2\bigr]}_{\text{calibration}} \;+\; \underbrace{\mathbb{E}\bigl[\pi(f)\,(1 - \pi(f))\bigr]}_{\text{refinement}}.$$
The forecast is calibrated when the first term vanishes ($\pi(f) = f(X)$ a.s.) and sharp when $\pi(f)$ concentrates near $0$ or $1$.
Intuition
Predictive quality decomposes into two components. Calibration asks whether the model's confidence is matched by accuracy. Refinement asks how much the forecast pins down the outcome: the term $\mathbb{E}[\pi(f)(1-\pi(f))]$ is small exactly when the conditional outcome probability given the forecast sits near $0$ or $1$. A constant predictor that always says $\bar\pi$, where $\bar\pi$ is the base rate, is perfectly calibrated and useless: its refinement term is the maximal $\bar\pi(1-\bar\pi)$, so the forecast carries no information about individual outcomes. A forecaster whose predictions cleanly separate the classes drives the refinement term toward zero and is sharp; whether it is also calibrated depends on how often it is right.
Why It Matters
This decomposition explains what temperature scaling and Platt scaling do: they fix calibration without changing refinement. It also explains why a calibrated model is not necessarily a good model. Both components matter. For LLMs the decomposition makes a sharper claim: post-RLHF degradation of calibration (Tian et al. 2023) does not necessarily reduce refinement, so the underlying capability is preserved while the uncertainty signal is destroyed. The fix is to recover calibration without retraining the policy.
Failure Mode
The clean additive split is specific to the Brier score. The log score and other Bregman-divergence proper scoring rules admit analogous but different decompositions (Gneiting and Raftery 2007). There is no universal additive calibration-refinement split that holds for every proper scoring rule.
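An empirical version of the decomposition, grouping by distinct forecast values (exact only when the forecaster emits finitely many values); the data below are made up.

```python
import numpy as np

def brier_decomposition(forecasts: np.ndarray, outcomes: np.ndarray) -> tuple[float, float]:
    """DeGroot-Fienberg split: calibration = E[(f - pi(f))^2], refinement = E[pi(f)(1 - pi(f))]."""
    calibration = refinement = 0.0
    n = len(forecasts)
    for f in np.unique(forecasts):
        mask = forecasts == f
        pi = outcomes[mask].mean()        # empirical P(Y = 1 | forecast = f)
        w = mask.sum() / n
        calibration += w * (f - pi) ** 2
        refinement += w * pi * (1 - pi)
    return calibration, refinement

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])    # base rate 5/8
f_const = np.full(8, 5 / 8)               # constant base-rate forecaster
print(brier_decomposition(f_const, y))    # (0.0, 0.234...): calibrated, maximal refinement term
print(np.mean((f_const - y) ** 2))        # Brier score equals calibration + refinement
```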
The Kadavath-Tian-Kalai synthesis. Three findings, in chronological order, told a coherent story only after Kalai et al. supplied the mechanism:
- Kadavath et al. (2022) showed that pre-RLHF base models are well calibrated on multiple-choice questions where the answer set is fixed. Self-evaluation prompts ("Is your answer correct? Yes / No") track the base model's actual accuracy.
- Tian et al. (EMNLP 2023) showed that RLHF substantially degrades calibration on the same multiple-choice tasks; verbalized confidence elicited via prompting partly recovers it. The OpenAI GPT-4 system card reported the same pattern: the base model was better calibrated than the post-trained one.
- Kalai et al. (2025) Section 4 explained why this happens. RLHF maximizes preference-model score, and preference data is built from human comparisons of answers; humans rank confident-correct answers above hedged-correct answers. The reward model inherits this preference, so KL-penalized RL trains the policy to be confident. Calibration is collateral damage from a reward shape that does not value uncertainty.
The fix is not "use better preference data." Confident-correct beats hedged-correct in any human-comparison protocol that does not actively penalize overconfidence. The fix has to come from the rubric.
The Evaluation-Incentive Equilibrium
Standard benchmarks (MMLU, GPQA, HumanEval, BBH, MATH, GSM8K) score $1$ for a correct answer and $0$ for a wrong answer or an abstention. Under this rubric, the policy that maximizes benchmark score is the policy that always guesses, even when uncertain. Kalai et al. formalize this in Section 4.
Binary grading makes abstention strictly suboptimal (Kalai-Nachum-Vempala-Zhang 2025, Observation 1)
Statement
For any binary grading rubric and any prompt where the model's confidence in some answer is $q > 0$, the expected score from emitting that answer is $q$ and the expected score from abstaining is $0$. Abstention is therefore strictly dominated by guessing whenever the model has any nonzero confidence in any candidate.
Intuition
Binary grading collapses two distinct outcomes (wrong and "I don't know") into the same score of zero. The model has no incentive to express uncertainty; expressing it is identical to being wrong. The optimal policy under such a rubric is to commit to the most likely candidate at every prompt, which is exactly the behavior we call hallucination.
Proof Sketch
Let $C$ be the event "the answer is correct" and $a$ the candidate answer. The score is $\mathbf{1}[C]$ given that $a$ is emitted, and $0$ given abstention. The expected score from emitting is $\Pr(C) = q$; the expected score from abstaining is $0$. Whenever $q > 0$, emitting strictly dominates abstention.
Why It Matters
This reframes hallucination as an equilibrium response to the field's measurement protocol, not a property of models in isolation. A model that hallucinates on PopQA is doing the only thing the benchmark rewards. Changing the model without changing the rubric will not change the equilibrium. The leverage is on the evaluation side.
Failure Mode
Some benchmarks already include partial credit, weighted scoring, or abstention rewards (TruthfulQA's MC2 averages probability mass across correct options, for instance). The observation applies strictly to the dominant binary-graded benchmarks; it does not say all benchmarks are misdesigned.
Behavioral calibration as the proposed fix. Kalai et al. propose that benchmarks state an explicit confidence target in the prompt:
Answer only if you are more than $t$ confident, since mistakes are penalized $t/(1-t)$ points while correct answers receive $1$ point.
Under this rubric, the expected score from emitting a candidate with confidence $q$ is $q - (1-q)\,\tfrac{t}{1-t}$, which is positive iff $q > t$. Abstention is now optimal whenever confidence falls below the target. The model becomes incentivized to track its own uncertainty across thresholds $t$; the audit asks whether reported confidence matches accuracy across the threshold sweep. This is what they call behavioral calibration: not just numerical calibration of softmax outputs, but the broader property that the model's decisions to commit or abstain track its actual reliability.
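A small numerical check of the incentive flip, using the rubric as reconstructed above; the threshold and confidence values are arbitrary examples.

```python
def score_binary(q: float) -> float:
    """Binary rubric: expected score from answering is q; abstaining scores 0."""
    return q

def score_confidence_target(q: float, t: float) -> float:
    """Confidence-target rubric: +1 if correct, -t/(1-t) if wrong, 0 if abstaining."""
    return q - (1.0 - q) * t / (1.0 - t)

t = 0.75
for q in (0.30, 0.75, 0.90):
    s = score_confidence_target(q, t)
    decision = "answer" if s > 0 else "abstain"
    print(f"q={q:.2f}  binary={score_binary(q):+.2f}  target={s:+.2f}  -> {decision}")
# q=0.30: binary rubric rewards guessing (+0.30); target rubric says abstain (-1.80)
# q=0.75: expected score 0.00 under the target rubric (indifference point)
# q=0.90: answering is positive under both rubrics (+0.60 under the target rubric)
```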
The reform is cheap. It requires changing the grading rubric and the prompt, not the model. Whether benchmarks adopt it is a coordination problem in the field, not a research problem.
Structural Failures Beyond Calibration
Calibration explains a slice of hallucination but misses several phenomena entirely.
The reversal curse (Berglund et al. 2023). Models trained on "$A$ is $B$" do not generalize to "$B$ is $A$." On real-world celebrity examples, GPT-4 answered "Who is Tom Cruise's mother?" correctly 79% of the time but answered "Who is Mary Lee Pfeiffer's son?" correctly only 33% of the time, despite the same fact being involved. The asymmetry persists across model sizes and families, and it is not fixed by data augmentation. This is a structural property of next-token prediction: the gradient signal for "$A$ is $B$" updates the conditional $p(B \mid A)$, not the conditional $p(A \mid B)$. No amount of calibration repair touches it.
Snowballing hallucinations (Zhang, Press, Merrill, Liu, Smith 2023). When a model commits to an early wrong answer, it generates fluent supporting justification consistent with the wrong answer rather than catching the mistake. The striking finding is that the same model, asked separately to verify each claim in its own snowballed answer, recognizes the errors at 67% accuracy for ChatGPT and 87% for GPT-4. The model has the relevant knowledge; sequential decoding committed to a path that knowledge could have ruled out. This is a decoding-procedure phenomenon, not a knowledge gap, and it is invisible to single-token calibration analysis.
Internal-state probes (Azaria and Mitchell 2023). Linear classifiers trained on hidden-layer activations distinguish true from false statements at 71% to 83% accuracy across model layers, outperforming methods that use only output probabilities. Models often "know" they are emitting falsehoods in a representational sense but do not surface this in their decoded output. This suggests calibration is at least partly a decoding problem, not a knowledge problem, and points toward decoding-time interventions (probe-conditioned generation, abstain-on-low-probe-confidence) as a research direction.
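A sketch of the probe idea, not the paper's code: fit a linear classifier on hidden-state vectors labeled true or false. The activation matrix below is a synthetic stand-in; in practice you would collect hidden states from forward passes over labeled statements.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_truth_probe(activations: np.ndarray, labels: np.ndarray):
    """Fit a linear probe on hidden-layer activations to predict statement truth."""
    X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)      # held-out accuracy

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 64))                                    # stand-in hidden states
labels = (acts[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)   # stand-in truth labels
probe, acc = train_truth_probe(acts, labels)
print(f"held-out probe accuracy: {acc:.2f}")
```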
These three findings sit awkwardly with the standard "the model does not know what it does not know" story. The reversal curse shows the model genuinely lacks a piece of structure that a human would derive automatically. Snowballing and internal-state probes show the model has knowledge that decoding fails to use. Calibration repair addresses neither.
Measurement: Atomic Decomposition
How do you score hallucination on free-form generation, where the model emits a paragraph that mixes correct and incorrect claims? The dominant approach since 2023 is to decompose generations into atomic facts and verify each.
FActScore (Min, Krishna, Lyu, Lewis, Yih, Koh, Iyyer, Zettlemoyer, Hajishirzi, EMNLP 2023). A long-form generation is split into atomic facts; each fact is verified against a reliable knowledge source. The reported score is the fraction of supported atomic facts. Their automated estimator agrees with human verification within 2%. On biography generation, ChatGPT scored only 58%, providing the first quantitative baseline for what "factual" means for paragraph-length generation.
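The scoring arithmetic is simple once the decomposition and verification steps exist. A schematic sketch where `decompose` and `is_supported` are hypothetical stand-ins for the paper's LM-based splitter and retrieval-backed verifier:

```python
from typing import Callable

def factscore(generation: str,
              decompose: Callable[[str], list[str]],
              is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts in the generation that the verifier supports."""
    facts = decompose(generation)
    if not facts:
        return 0.0
    return sum(is_supported(f) for f in facts) / len(facts)

# Toy stand-ins: a fixed decomposition and a lookup-table verifier.
verdicts = {
    "Ada Lovelace was born in 1815": True,
    "Ada Lovelace was born in Paris": False,   # she was born in London
}
score = factscore(
    "Ada Lovelace was born in 1815 in Paris.",
    decompose=lambda text: list(verdicts),
    is_supported=lambda fact: verdicts[fact],
)
print(score)  # 0.5: one of the two atomic facts is supported
```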
SAFE / LongFact (Wei et al., NeurIPS 2024). The Search-Augmented Factuality Evaluator extends FActScore by issuing Google searches to verify each atomic fact, reaching 72% agreement with crowdsourced human annotations (and prevailing on 76% of adjudicated disagreement cases), at roughly 20x lower cost than human annotation. The companion LongFact benchmark provides thousands of long-form questions across 38 topics.
The measurement gap matters because the singleton-rate bound and the evaluation-incentive observation both make claims about "the rate" of hallucination, but until atomic decomposition there was no operational definition of that rate for free-form text. FActScore and SAFE are the de facto standards. They are not perfect (the Google-search verifier inherits Google's coverage gaps), but they make the rate measurable.
Mitigation: Where Each Approach Fits
Retrieval-augmented generation (RAG). Lewis et al. (2020) introduced the standard RAG architecture: retrieve passages from a knowledge base, condition generation on them. Mallen et al. (2023) showed RAG dominates parametric memory on long-tail entities. The architecture targets the singleton-rate bound directly: by moving rare facts from parametric storage to the retrieved context, the model no longer needs to memorize them. The well-known limits: the retriever can fail; the model can ignore retrieved context (Yoran et al. 2024 quantify this); retrieved documents themselves can be wrong; and RAG does not address structural failures like the reversal curse or snowballing.
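A skeletal version of the retrieve-then-generate loop; `toy_embed` and the echo generator are stand-ins so the sketch runs, not real components.

```python
import numpy as np

def retrieve(query_vec: np.ndarray, passage_vecs: np.ndarray, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most cosine-similar to the query embedding."""
    sims = passage_vecs @ query_vec / (
        np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [passages[i] for i in np.argsort(-sims)[:k]]

def rag_answer(question, embed, generate, passages, passage_vecs) -> str:
    """Condition generation on retrieved passages instead of parametric memory."""
    context = "\n".join(retrieve(embed(question), passage_vecs, passages))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# Stand-ins so the sketch executes end to end.
vocab = ["capital", "france", "paris", "einstein", "1879"]
toy_embed = lambda text: np.array([float(w in text.lower()) for w in vocab])
passages = ["Paris is the capital of France.", "Einstein was born in 1879."]
pvecs = np.stack([toy_embed(p) for p in passages])
print(rag_answer("What is the capital of France?", toy_embed, lambda prompt: prompt, passages, pvecs))
```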
Conformal language modeling (Quach et al., ICLR 2024). Apply conformal prediction to LLM generation: sample candidate responses, calibrate a stopping rule on a held-out set, and return a prediction set with a marginal coverage guarantee. The set is small when the model is sure and large when it is not. The guarantee is distribution-free under exchangeability.
Conformal factuality (Mohri and Hashimoto 2024). A back-off algorithm: progressively make LM outputs less specific until conformal coverage holds. The reported result is 80% to 90% correctness guarantees while retaining most of the original output. This is the version of conformal prediction that concretely targets the hallucination problem, and it earns its place on this page.
The underlying guarantee is the standard split-conformal coverage bound; see the marginal-coverage theorem on the calibration page for the full statement and proof. The LM-specific contribution of Quach et al. and Mohri-Hashimoto is the construction of a non-conformity score for free-form text — token-level log-probabilities are uninformative, so both groups build scores out of semantic-equivalence checks (NLI for Quach, atomic-claim entailment for Mohri-Hashimoto). The coverage guarantee transfers; the engineering content is the score.
Marginal coverage is not conditional coverage on hard prompts. A factuality-conformal LLM with marginal coverage can run well below the nominal level on the long-tail entities where you most want a guarantee, balanced by above-nominal coverage on easy ones. Per-prompt guarantees require either Mondrian conformal prediction (partitioning by a known difficulty signal) or distribution-conditional methods like Romano-Patterson-Candès CQR. Neither is free, and both require defining the difficulty bins. This is the gap between "the algorithm works on average" and "the algorithm works on the cases that scared you into using it."
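A split-conformal sketch in the spirit of these papers: calibrate a score threshold on held-out prompts so that retained outputs are acceptable with marginal probability at least 1 − α. How the non-conformity score is built from free-form text is the paper-specific part and is only simulated here with random numbers.

```python
import math
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Finite-sample (1 - alpha) conformal quantile of calibration scores (lower score = better output)."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))       # rank of the conformal quantile
    return float(np.sort(cal_scores)[min(k, n) - 1])

rng = np.random.default_rng(0)
cal_scores = rng.random(500)                   # stand-in non-conformity scores
tau = conformal_threshold(cal_scores, alpha=0.10)

# Deployment rule: keep an output (or an atomic claim) only if its score <= tau.
test_scores = rng.random(2000)
print(f"tau = {tau:.3f}, empirical coverage = {(test_scores <= tau).mean():.3f}")  # about 0.90
```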
Self-consistency and sample-then-verify. Sample $k$ generations and take the majority answer (Wang et al. 2023, ICLR). This is cheap and helps on tasks with a single correct answer; it is irrelevant on open-ended generation.
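Self-consistency in a few lines; a sketch assuming answers have already been normalized to comparable strings.

```python
from collections import Counter

def self_consistent_answer(samples: list[str]) -> str:
    """Majority vote over sampled final answers."""
    return Counter(samples).most_common(1)[0][0]

print(self_consistent_answer(["42", "41", "42", "42", "17"]))  # "42"
```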
Decoding interventions from internal-state probes. Conditional generation that abstains when an Azaria-Mitchell-style probe flags low truthfulness. This direction is active but not yet standard practice.
Common Confusions
Hallucination is not a defect to be patched; it is a statistical floor
The Kalai singleton-rate bound says that pretraining error is bounded below by the fraction of facts in the corpus that appear exactly once. No amount of training data, model capacity, or alignment can drive base-model hallucination on the long tail to zero. The hope is to push the rate down on the head and to install honest abstention on the tail; the hope is not to eliminate it.
Calibration degrading after RLHF is not a tuning bug
Tian et al. (2023) and the GPT-4 system card both document the post-RLHF calibration drop. Kalai Section 4 explains the mechanism: human preference data ranks confident-correct above hedged-correct, the reward model inherits this, and the policy is trained to be confident. Better preference data does not fix this; only a rubric that rewards calibrated abstention does.
Hallucination is not the same as lying
Lying requires intent. A language model executes a sampled distribution over tokens; intent is not a meaningful attribute. The closer human analogue is confabulation: filling in gaps with plausible patterns. Anthropomorphizing into "the model knows it is wrong" mostly fails as a frame, with one wrinkle: Azaria-Mitchell shows internal probes do detect falsehoods at well-above-chance rates, so a weaker version of "the model has access to its own unreliability" is empirically defensible.
Low perplexity does not bound factual correctness
Perplexity measures how well the model predicts the next token in a held-out reference text. It does not measure whether the model can generate true statements when prompted to. A model with state-of-the-art perplexity can still fail PopQA on rare entities, fail the reverse direction of facts it was trained on, and snowball wrong answers into fluent paragraphs. The metric and the property are different.
Conformal prediction is not a free lunch for LLMs
Marginal coverage is not conditional coverage. A conformal LLM that achieves marginal coverage can run well above the nominal level on one slice of prompts and well below it on another. The guarantee is honest about what it is and is not. For high-stakes per-input use, the relevant tool is some flavor of conditional or distributional conformal prediction (Romano-Sesia-Candès 2019, Cauchois et al. 2024), not the marginal version.
Exercises
Problem
A pretraining corpus contains $N$ facts (prompt-response pairs). Suppose the singleton rate is $\mathrm{sr}$ and that all error sets satisfy $|E_x| \gg |V_x|$, so the set-size ratio is negligible. Ignoring the calibration term $\delta$, give a lower bound on the base model's generation error rate in terms of $\mathrm{sr}$ and $N$.
Problem
A benchmark has $n$ questions. A model outputs a probability $q_i$ on its top candidate for each question $i$. Under a binary rubric (right $= 1$, wrong or abstain $= 0$), the expected score from always answering with the top candidate is $\sum_i q_i$. Now consider Kalai's confidence-target rubric: the prompt sets a threshold $t$, and the model gets $1$ for correct, $-\tfrac{t}{1-t}$ for wrong, and $0$ for abstention.
(a) Write the model's expected score under the confidence-target rubric as a function of the $q_i$ and the abstention decisions, assuming the model abstains when $q_i \le t$ and emits its top candidate when $q_i > t$.
(b) Show that this abstention threshold is the optimal policy given the rubric, that is, no other abstention rule yields a higher expected score.
(c) Pick a threshold $t$ and a distribution of confidences in which some questions have $q_i$ well above $t$ and the rest have $q_i$ well below $t$. Compute the expected score under both rubrics. Which produces a more honest measurement of the model's calibrated knowledge?
Problem
The reversal curse (Berglund et al. 2023) says a model trained on "$A$ is $B$" does not learn to predict "$A$" given "$B$." Argue, from the structure of the cross-entropy loss on autoregressive next-token prediction, why this is the expected behavior. Specifically: write the loss contributed by the sentence "Mary Lee Pfeiffer is the mother of Tom Cruise" and identify which conditional distributions the gradient updates and which it does not. Then describe an architectural or training-data intervention that would address the asymmetry without breaking standard left-to-right generation.
Problem
Design an evaluation protocol that measures behavioral calibration in the Kalai sense for a frontier LLM on a long-tail factual QA task. Your protocol must:
(a) Use the explicit-confidence-target rubric across multiple thresholds $t$. (b) Distinguish models that hedge appropriately (high score at every $t$) from models that hedge indiscriminately. (c) Be auditable by a third party without access to the model weights. (d) Produce a single summary statistic that ranks models meaningfully.
Address: how do you communicate the rubric to the model? How do you measure the model's effective threshold? What is the failure mode of your protocol?
Frequently Asked Questions
- Why do LLMs hallucinate?
- Two reasons. First, a statistical lower bound: Kalai-Nachum-Vempala-Zhang (2025) prove the base-model error rate on rare facts is at least the singleton rate, the fraction of facts that appear exactly once in training. Second, an incentive equilibrium: every standard benchmark scores abstention as wrong, so post-training optimizes for confident bluffing rather than honest uncertainty.
- Will scaling fix hallucinations?
- No, not entirely. Scaling reduces the singleton rate by repeating more facts, but the long tail of rare facts persists in any finite training set. The deeper issue: scaling does not change the benchmark incentive structure that rewards bluffing. Both halves of the problem need to move.
- Does RLHF make hallucinations better or worse?
- Worse on calibration, often. RLHF fine-tunes against human preference signals; if humans (or learned reward models) reward confident answers more than hedged ones, RLHF amplifies bluffing. This is a documented failure mode (Lin et al. 2022, Sharma et al. 2023). The fix is to include explicit calibration objectives, not assume preference data captures calibration.
- How is conformal prediction useful here?
- Conformal prediction gives distribution-free coverage guarantees: a confidence set that contains the true answer with provable probability at least $1-\alpha$. It does not reduce hallucination per se, but it lets a deployed system fail safely by abstaining when the conformal set is too large. It addresses the symptom (deploy-time consequences) rather than the cause.
- What problems can calibration not solve?
- Structural failures. The reversal curse (knowing 'A is B' does not imply knowing 'B is A'), snowballing within a single chain (one early error compounds across reasoning steps), and long-tail entity confusion are not calibration problems. Even a perfectly calibrated model exhibits them. These need retrieval, tool use, or architectural changes.
References
- Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang. Why Language Models Hallucinate. September 2025. The Is-It-Valid reduction (Section 3.1, Theorem 1), the singleton-rate lower bound on pretraining error (Section 3.3.1, Theorem 2), and the binary-grading-incentive observation that explains why post-training does not fix hallucination (Section 4). arXiv:2509.04664
- Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". 2023. Forward 79% vs reverse 33% on celebrity examples; persists across model sizes; not fixed by data augmentation. arXiv:2309.12288
- Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith. How Language Model Hallucinations Can Snowball. 2023. Models over-commit to early mistakes and fluently elaborate on them; the same models recognize the errors when asked separately at 67% (ChatGPT) and 87% (GPT-4). arXiv:2305.13534
- Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023. Introduces PopQA (14k entity questions stratified by popularity); scaling fails to improve long-tail accuracy; retrieval-augmented smaller models beat much larger parametric ones on rare entities. arXiv:2212.10511
- Amos Azaria, Tom Mitchell. The Internal State of an LLM Knows When It's Lying. 2023. Hidden-layer linear probes distinguish true from false statements at 71-83% accuracy, outperforming probability-based methods. arXiv:2304.13734
- Saurav Kadavath et al. Language Models (Mostly) Know What They Know. 2022. Pre-RLHF base models are calibrated on multiple-choice tasks; self-evaluation prompts track actual accuracy. arXiv:2207.05221
- Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023. Documents post-RLHF calibration degradation; verbalized confidence prompting partially recovers it. arXiv:2305.14975
- Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023. Atomic-fact decomposition metric for long-form generation; automated estimator within 2% of human verification; ChatGPT scores 58% on biographies. arXiv:2305.14251
- Jerry Wei et al. Long-form Factuality in Large Language Models. NeurIPS 2024. SAFE (Search-Augmented Factuality Evaluator) verifies each atomic fact via Google search; LongFact benchmark (38 topics); 72% agreement with crowdsourced annotators at 20x lower cost. arXiv:2403.18802
- Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, Regina Barzilay. Conformal Language Modeling. ICLR 2024. Calibrated sampling of candidate sets for free-form generation with marginal coverage guarantee. arXiv:2306.10193
- Christopher Mohri, Tatsunori Hashimoto. Language Models with Conformal Factuality Guarantees. 2024. A back-off algorithm delivering 80-90% correctness guarantees while retaining most of the original LM output, by progressively making outputs less specific. arXiv:2402.10978
- Patrick Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. The original RAG architecture combining a parametric generator with a non-parametric retriever. arXiv:2005.11401
- Morris H. DeGroot, Stephen E. Fienberg. The Comparison and Evaluation of Forecasters. The Statistician 32(1-2), 1983. Brier score calibration-refinement decomposition.
- Tilmann Gneiting, Adrian E. Raftery. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association 102(477), 2007. General theory of proper scoring rules and their decompositions.
- Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger. On Calibration of Modern Neural Networks. ICML 2017. The standard binned ECE estimator; temperature scaling as a post-hoc fix. arXiv:1706.04599
- Vladimir Vovk, Alex Gammerman, Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. The original conformal prediction monograph.
- Anastasios N. Angelopoulos, Stephen Bates. Conformal Prediction: A Gentle Introduction. Foundations and Trends in Machine Learning 16(4), 2023. Practitioner-friendly survey. arXiv:2107.07511
- Ziwei Ji et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2023. Taxonomy and broad literature survey predating the Kalai et al. statistical analysis.
Next Topics
- RLHF and alignment: why RLHF reduces blatant hallucination but does not target the singleton-rate floor
- RLHF deep dive: the reward-shape mechanism behind post-training calibration degradation
- Mechanistic interpretability: can we localize the circuits that hidden-state truth probes are reading?
- Reward hacking: the broader Goodhart frame for why proxy optimization fails
Last reviewed: April 19, 2026
Required prerequisites
- Empirical Risk Minimization (layer 2, tier 1)
- Chain-of-Thought and Reasoning (layer 5, tier 1)
- Reinforcement Learning from Human Feedback (layer 5, tier 1)
- Out-of-Distribution Detection (layer 3, tier 2)
- RLHF and Alignment (layer 4, tier 2)
Derived topics
- Reward Hacking (layer 5, tier 2)