LLM Construction
Tokenization and Information Theory
Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.
Prerequisites
Why This Matters
Before a language model sees any text, a tokenizer chops it into pieces. This seemingly mundane preprocessing step has profound consequences. The tokenizer determines what the model can represent, how efficiently it compresses text, whether it handles non-English languages well, and even whether it can do arithmetic.
A model with a 32K-token context window is limited not by characters or words but by tokens. If your tokenizer wastes tokens on inefficient encodings (e.g., splitting every Chinese character into 3 byte-level tokens), your effective context shrinks by 3x for that language. Information theory gives us the tools to reason about what "efficient" means.
Mental Model
Think of tokenization as building a codebook. You have a source language (raw text) and you want to compress it into a shorter sequence of symbols (tokens). Shannon's source coding theorem says the best you can do is the entropy of the source. BPE approximately achieves this by greedily building a codebook that captures frequent patterns.
A good tokenizer compresses common text heavily (short token sequences) while gracefully handling rare text (falling back to character or byte-level tokens). The vocabulary size is a design choice that trades off between compression (large means shorter sequences) and the difficulty of learning token embeddings (large means more parameters and sparser training signal).
Formal Setup and Notation
Let $\Sigma$ be an alphabet (e.g., UTF-8 bytes) and $\Sigma^*$ the set of all strings over $\Sigma$. A tokenizer is a function $T: \Sigma^* \to \mathcal{V}^*$ that maps strings to sequences of tokens from a vocabulary $\mathcal{V}$ with $|\mathcal{V}| \gg |\Sigma|$.
Byte Pair Encoding (BPE)
BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of symbols:
1. Start with a vocabulary of individual bytes (or characters)
2. Count all adjacent pairs in the training corpus
3. Merge the most frequent pair into a single new token
4. Repeat steps 2-3 until the vocabulary reaches the desired size
At inference time, apply the learned merges greedily in order of priority to tokenize new text.
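A minimal sketch of this training loop and the greedy encoder, assuming character-level starting symbols and a toy corpus; production tokenizers add pre-tokenization (e.g., splitting on whitespace), operate on bytes, and track merge priorities more carefully, all of which is omitted here.

```python
from collections import Counter

def apply_merge(tokens, a, b):
    """Replace every adjacent (a, b) pair in the token list with the fused token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(text)              # start from individual characters
    merges = []                      # learned rules, in priority order
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                # no pair repeats; further merges are pointless
            break
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges

def encode(text, merges):
    """Tokenize new text by replaying the learned merges in priority order."""
    tokens = list(text)
    for a, b in merges:
        tokens = apply_merge(tokens, a, b)
    return tokens

merges = train_bpe("abab cdcd abab", num_merges=3)
print(merges)                        # first merge is ('a', 'b'); later merges depend on tie-breaking
print(encode("abab", merges))        # collapses to fewer tokens as more merges apply
```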
The algorithm is originally due to Gage (1994) in a compression context (C Users Journal 12:2). Sennrich-Haddow-Birch (2016) adapted it to neural machine translation, which is the form used in modern LLMs. WordPiece (Schuster-Nakajima 2012, used in BERT) is a closely related subword algorithm that picks merges to maximize training-corpus likelihood rather than raw frequency.
Compression Ratio
The compression ratio of a tokenizer $T$ on a text corpus $x$ is:

$$\mathrm{CR}(T, x) = \frac{|x| \ \text{(characters)}}{|T(x)| \ \text{(tokens)}}$$
A higher compression ratio means each token carries more information on average. English text typically achieves CR around 3.5-4.5 with modern tokenizers.
Fertility
The fertility of a tokenizer for a language or domain is the average number of tokens per word (or per character):

$$\text{fertility} = \frac{\text{number of tokens}}{\text{number of words}}$$
Lower fertility means more efficient representation. English has fertility around 1.3 with GPT-4's tokenizer; some languages exceed 3.0.
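Both metrics can be measured for any tokenizer exposed as an `encode` function; a sketch is below. The `tiktoken` usage at the bottom is an optional illustration and assumes that package is installed.

```python
def compression_ratio(text, encode):
    """Characters per token: higher means each token carries more text."""
    return len(text) / len(encode(text))

def fertility(text, encode):
    """Tokens per whitespace-separated word: lower means a more efficient encoding."""
    words = text.split()
    return len(encode(text)) / len(words)

# Example with a real tokenizer (assumes the optional `tiktoken` package):
# import tiktoken
# enc = tiktoken.get_encoding("cl100k_base")
# sample = "The quick brown fox jumps over the lazy dog."
# print(compression_ratio(sample, enc.encode))
# print(fertility(sample, enc.encode))
```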
Core Definitions
SentencePiece is a widely used tokenization library that treats the input
as a raw stream of Unicode characters (not raw bytes), encoding whitespace as
the meta-character ▁ so it can be reversed losslessly without
pre-tokenization rules. This is the key distinction from byte-level BPE
(GPT-2/GPT-4): SentencePiece's base alphabet is Unicode characters with
optional byte-fallback for unseen codepoints, while byte-level BPE always
operates on the 256-symbol UTF-8 byte alphabet. SentencePiece supports BPE
and unigram language model tokenization. The unigram approach assigns a
probability to each candidate subword and searches for the segmentation that
maximizes the likelihood of the full string, which is more principled than
greedy BPE but gives similar results in practice.
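A sketch of the unigram decoding step only: given fixed token log-probabilities, Viterbi search finds the maximum-likelihood segmentation. The EM procedure that learns those probabilities, and SentencePiece's own implementation details, are omitted, and the toy vocabulary below is made up for illustration.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` maximizing the sum of token log-probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n       # best[i]: best score for text[:i]
    back = [0] * (n + 1)                 # back[i]: start index of the last token on the best path
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):   # cap candidate token length at 10 chars
            piece = text[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i] = best[j] + log_probs[piece]
                back[i] = j
    # walk backwards to recover the token sequence
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Toy, hand-picked log-probabilities (purely illustrative):
log_probs = {"un": -2.0, "believ": -3.0, "able": -2.5, "u": -5.0, "n": -5.0,
             "b": -5.0, "e": -4.0, "l": -5.0, "i": -5.0, "v": -5.0, "a": -4.5}
print(viterbi_segment("unbelievable", log_probs))   # ['un', 'believ', 'able']
```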
The information-theoretic view: let $H$ be the per-character entropy of the source text, in bits. A tokenizer with vocabulary size $|\mathcal{V}|$ and average token length $\ell$ characters produces sequences of roughly $n/\ell$ tokens for an input of $n$ characters. Each token carries at most $\log_2 |\mathcal{V}|$ bits. The minimum achievable sequence length is bounded by the entropy:

$$\frac{n}{\ell} \ \ge\ \frac{nH}{\log_2 |\mathcal{V}|}, \qquad \text{equivalently} \qquad \ell \ \le\ \frac{\log_2 |\mathcal{V}|}{H}.$$

Good tokenization pushes $\ell$ toward this bound. In practice, BPE gets surprisingly close for the language it was trained on.
Main Theorems
Tokenization Length and Source Entropy
Statement
For a stationary ergodic source with entropy rate $H$ (bits per character), any tokenizer with vocabulary size $|\mathcal{V}|$ satisfies:

$$\mathbb{E}[\text{tokens per character}] \ \ge\ \frac{H}{\log_2 |\mathcal{V}|}$$
Equality is achieved by an optimal variable-length code. BPE approximates this bound for frequent patterns in the training distribution.
Intuition
Each token is one of $|\mathcal{V}|$ symbols, so it carries at most $\log_2 |\mathcal{V}|$ bits of information. The source produces $H$ bits per character. You need at least $H / \log_2 |\mathcal{V}|$ tokens per character to avoid losing information, just as you cannot compress a file below its entropy.
Proof Sketch
This follows from Shannon's source coding theorem. The total information in $n$ characters is approximately $nH$ bits. Encoding this into a sequence of $m$ tokens from an alphabet of size $|\mathcal{V}|$ provides at most $m \log_2 |\mathcal{V}|$ bits. For lossless encoding, $m \log_2 |\mathcal{V}| \ge nH$, giving $m/n \ge H / \log_2 |\mathcal{V}|$.
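To make the bound concrete, here is a back-of-the-envelope calculation under assumed values: an English entropy rate near 1.3 bits per character and a 100K-token vocabulary. Both numbers are illustrative, not measurements from this page.

```python
import math

H = 1.3                          # assumed entropy rate, bits per character (illustrative)
vocab_size = 100_000             # assumed vocabulary size (illustrative)

bits_per_token = math.log2(vocab_size)           # ~16.6 bits a token can carry
min_tokens_per_char = H / bits_per_token         # ~0.078 tokens/char at the bound
max_chars_per_token = 1 / min_tokens_per_char    # ~12.8 chars/token at the bound

print(f"{min_tokens_per_char:.3f} tokens/char minimum, "
      f"{max_chars_per_token:.1f} chars/token maximum")
```

Real tokenizers sit well inside this bound (around 4 characters per token for English), because natural text is not an i.i.d. character stream and the vocabulary cannot hold every long substring.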
Why It Matters
This bound tells you the theoretical limit of tokenization efficiency. If your tokenizer is far from this bound for some language, it is wasting context window on that language. This directly explains the multilingual tokenization gap: tokenizers trained mostly on English are near-optimal for English but far from optimal for underrepresented languages.
Empirical observation: BPE approaches the entropy rate
BPE with vocabulary size $|\mathcal{V}|$ empirically achieves a compression ratio that approaches the entropy bound $\log_2 |\mathcal{V}| / H$ as $|\mathcal{V}|$ grows, on the distribution BPE was trained on. Measured compression-ratio gaps shrink roughly monotonically in $|\mathcal{V}|$ for English text across GPT-2, GPT-3.5, GPT-4, Llama-2, and Llama-3 tokenizers (Rust et al. 2021; Petrov et al. 2023 for the multilingual picture).
This is stated as an empirical regularity, not as a rigorous theorem. A clean asymptotic redundancy bound for BPE specifically (as opposed to LZ-family dictionary coders) is not established in the literature. The closest formal results are Zouhar-Meister-Gastaldi-Cotterell (Formal Perspectives on BPE, ACL 2023), which shows BPE is an instance of greedy dictionary coding and shares its approximation guarantees under assumptions that are strong for natural language (stationary memoryless source, unique merge ties). For natural text, the correct framing is "empirically competitive with optimal dictionary codes, no closed-form redundancy bound known."
Intuition. BPE greedily captures the most frequent pairs first. As the vocabulary grows, it captures longer common substrings, shrinking the gap to the entropy rate. Each merge step removes the most redundancy per added vocabulary entry, which is a greedy approximation to optimal dictionary coding.
Caveat. BPE is optimized for its training distribution. On out-of-distribution text (rare languages, code, mathematical notation), it can produce very long token sequences. Any near-optimality claim holds in expectation over the training distribution, not worst-case.
Canonical Examples
BPE merge example
Starting vocabulary: {a, b, c, d, space}. Corpus: "abab cdcd abab".
Step 1: Most frequent pair is (a, b) with count 4. Merge into token "ab". The corpus is now tokenized as: ab ab _ c d c d _ ab ab (with _ marking spaces). Step 2: Most frequent pair is (c, d) with count 2. Merge into "cd". Step 3: Most frequent pair is (ab, ab) with count 2; merging it makes "abab" a single token.
After enough merges, common words become single tokens. The string "abab" that initially took 4 tokens takes 2 after step 1 and just 1 after step 3.
Tokenization failure: arithmetic
GPT-style tokenizers merge digit sequences inconsistently. The number "123456" might be tokenized as ["123", "456"] or ["12", "3456"] depending on context. The model must learn that "123" followed by "456" represents one hundred twenty-three thousand four hundred fifty-six, not two separate numbers. This is one reason LLMs struggle with arithmetic: the tokenizer imposes an unnatural decomposition of numbers.
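To see the digit-splitting behavior directly, the snippet below inspects how one widely used tokenizer segments numbers; it assumes the optional `tiktoken` package, and the exact splits vary across tokenizers.

```python
import tiktoken  # optional dependency: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["123456", "123,456", "3.14159", "2+2=4"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r:12} -> {pieces}")
# Digit strings come back as multi-digit chunks rather than single digits,
# so the model never sees a consistent place-value decomposition.
```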
Common Confusions
BPE is not word-level tokenization
BPE operates on subword units. Common words become single tokens, but rare words are split into pieces. This is not a bug but a feature: it gives the model a fallback for any string (down to individual bytes) while keeping common text compact, so the model never encounters an out-of-vocabulary token.
Larger vocabulary is not always better
Increasing $|\mathcal{V}|$ improves compression (fewer tokens per text) but increases the embedding matrix size ($|\mathcal{V}| \times d_{\text{model}}$ parameters) and makes each token's training signal sparser. There is an optimal $|\mathcal{V}|$ that balances compression, parameter count, and training efficiency. Most modern LLMs use $|\mathcal{V}|$ between 32K and 256K.
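A rough illustration of the parameter side of this tradeoff; the hidden size and vocabulary sizes below are arbitrary example values, not recommendations.

```python
d_model = 4096                        # assumed hidden size (illustrative)
for vocab_size in (32_000, 128_000, 256_000):
    embed_params = vocab_size * d_model       # input embedding matrix
    unembed_params = vocab_size * d_model     # output projection (if untied)
    total = embed_params + unembed_params
    print(f"|V|={vocab_size:>7,}: {total / 1e9:.2f}B embedding parameters")
```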
Tokenization is not language-neutral
A tokenizer trained on 90% English text learns English-optimal merges. Non-Latin scripts may be split at the byte level, tripling token counts. This means the model's effective context length and per-token information content vary by language. Multilingual tokenizers address this by balancing the training corpus or using character-aware methods.
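A quick experiment makes the disparity visible: count tokens per character for comparable sentences in different scripts. This again assumes the optional `tiktoken` package; the sentences are arbitrary examples and the exact ratios depend on the tokenizer.

```python
import tiktoken  # optional dependency

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The weather is nice today.",
    "Russian": "Сегодня хорошая погода.",
    "Chinese": "今天天气很好。",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang:8}: {n_tokens} tokens for {len(text)} characters "
          f"({n_tokens / len(text):.2f} tokens/char)")
```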
Summary
- BPE builds a vocabulary by greedily merging frequent character pairs; it approximates entropy-optimal compression
- Good tokenization minimizes expected sequence length, approaching the information-theoretic bound of $H / \log_2 |\mathcal{V}|$ tokens per character
- Tokenizer quality varies by language: underrepresented languages get worse compression and shorter effective context
- Tokenization failures (numbers, rare scripts, code) directly cause model failures because the model only sees the tokenized representation
- Vocabulary size is a tradeoff: larger means better compression but more parameters and sparser training
Exercises
Problem
English text has an entropy rate of roughly $H \approx 1.3$ bits per character (Shannon's estimate). A tokenizer with vocabulary size $|\mathcal{V}| = 100{,}000$ achieves an average token length of 4.1 characters. How close is this to the entropy bound?
Problem
Explain why byte-level BPE (starting from 256 byte tokens) is strictly more general than character-level BPE (starting from Unicode characters). What does it gain and what does it lose?
Problem
The unigram language model tokenizer (used in SentencePiece) finds the tokenization $t_1, \dots, t_m$ that maximizes the log-likelihood $\sum_{i=1}^{m} \log p(t_i)$, where the $p(t_i)$ are learned token probabilities. How does this relate to the information-theoretic view of tokenization as compression?
References
Pre-canonical (compression lineage):
- Gage, A New Algorithm for Data Compression (C Users Journal 12:2, 1994). The original BPE algorithm in a compression context, 22 years before its NLP adaptation.
- Schuster & Nakajima, Japanese and Korean Voice Search (ICASSP 2012). The original WordPiece algorithm, later adopted by BERT.
Canonical (NLP adaptations):
- Sennrich, Haddow, Birch, Neural Machine Translation of Rare Words with Subword Units (ACL 2016, arXiv:1508.07909). Adapts Gage BPE to NMT; the form used in modern LLM tokenizers.
- Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (ACL 2018, arXiv:1804.10959). Introduces the unigram LM tokenizer with EM training, used by SentencePiece alongside BPE.
- Kudo & Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer (EMNLP 2018, arXiv:1808.06226). The widely used tokenization library.
Current (formal and fairness analysis):
- Rust et al., How Good is Your Tokenizer? (ACL 2021, arXiv:2012.15613). Cross-linguistic tokenizer evaluation.
- Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (NeurIPS 2023, arXiv:2305.15425). Multilingual fertility disparities on GPT-4's tokenizer, some languages exceeding 10x English.
- Ahia et al., Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models (EMNLP 2023, arXiv:2305.13707). API-pricing disparity induced by tokenizer fertility.
- Zouhar, Meister, Gastaldi, Cotterell, Formal Perspectives on BPE (ACL 2023, arXiv:2306.16837). The closest thing to a formal approximation analysis of BPE as dictionary coding.
Frontier (tokenizer-free alternatives):
- Xue et al., ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (TACL 2022, arXiv:2105.13626). Byte-level sequences, no tokenizer.
- Yu et al., MegaByte: Predicting Million-byte Sequences with Multiscale Transformers (NeurIPS 2023, arXiv:2305.07185). Patch-based byte modeling.
- Pagnoni et al., Byte Latent Transformer: Patches Scale Better Than Tokens (BLT, 2024, arXiv:2412.09871). Dynamic patch-size byte modeling.
Reference:
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 2 (regular expressions, tokenization) and Chapter 3 (n-gram language models).
Next Topics
Tokenization feeds directly into:
- Distributed training theory: how token sequence length affects batch size and communication
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Information Theory Foundations (layer 0B · tier 2)
Derived topics
- Cohere Models (layer 4 · tier 2)
- Mistral Models (layer 4 · tier 2)
- GPT Series Evolution (layer 5 · tier 2)
- LLaMA and Open Weight Models (layer 5 · tier 2)
- Byte-Level Language Models (layer 4 · tier 3)
+1 more on the derived-topics page.