LLM Construction

Tokenization and Information Theory

Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.

Advanced · Tier 3 · Current · Supporting · ~50 min

Why This Matters

Before a language model sees any text, a tokenizer chops it into pieces. This seemingly mundane preprocessing step has profound consequences. The tokenizer determines what the model can represent, how efficiently it compresses text, whether it handles non-English languages well, and even whether it can do arithmetic.

A model with a 32K-token context window is limited by tokens, not characters or words. If your tokenizer wastes tokens on inefficient encodings (e.g., splitting every Chinese character into three byte-level tokens), your effective context for that language shrinks by 3x. Information theory gives us the tools to reason about what "efficient" means.

Mental Model

Think of tokenization as building a codebook. You have a source language (raw text) and you want to compress it into a shorter sequence of symbols (tokens). Shannon's source coding theorem says the best you can do is the entropy of the source. BPE approximately achieves this by greedily building a codebook that captures frequent patterns.

A good tokenizer compresses common text heavily (short token sequences) while gracefully handling rare text (falling back to character- or byte-level tokens). The vocabulary size V is a design choice that trades off between compression (large V means shorter sequences) and the difficulty of learning token embeddings (large V means more parameters and sparser training signal).

Formal Setup and Notation

Let \Sigma be an alphabet (e.g., UTF-8 bytes) and \Sigma^* the set of all strings over \Sigma. A tokenizer is a function \text{tok}: \Sigma^* \to \mathcal{V}^* that maps strings to sequences of tokens from a vocabulary \mathcal{V} with |\mathcal{V}| = V.

Definition

Byte Pair Encoding (BPE)

BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of symbols:

  1. Start with a vocabulary of individual bytes (or characters)
  2. Count all adjacent pairs in the training corpus
  3. Merge the most frequent pair into a single new token
  4. Repeat steps 2-3 until the vocabulary reaches the desired size V

At inference time, apply the learned merges greedily in order of priority to tokenize new text.
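The four training steps above can be sketched as a minimal character-level BPE trainer. This is a toy illustration of the algorithm, not the implementation of any particular library (real tokenizers work on bytes and precompute merge priorities for fast inference):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn BPE merges from a corpus, starting from single characters."""
    symbols = list(corpus)  # step 1: the base vocabulary is individual characters
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent pairs in the current symbol sequence.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a single new token.
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once; merging gains nothing
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged  # step 4: repeat on the updated sequence
    return merges, symbols

merges, tokens = train_bpe("abab cdcd abab", num_merges=3)
print(merges)  # learned merge rules, highest-frequency first
print(tokens)  # the corpus after applying the learned merges
```

Ties between equally frequent pairs are broken here by first occurrence (Counter insertion order); real implementations fix an explicit tie-breaking rule, which is one reason different BPE implementations can learn different vocabularies from the same corpus.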

The algorithm is originally due to Gage (1994) in a compression context (C Users Journal 12:2). Sennrich-Haddow-Birch (2016) adapted it to neural machine translation, which is the form used in modern LLMs. WordPiece (Schuster-Nakajima 2012, used in BERT) is a closely related subword algorithm that picks merges to maximize training-corpus likelihood rather than raw frequency.

Definition

Compression Ratio

The compression ratio of a tokenizer on a text corpus is:

\text{CR} = \frac{\text{number of bytes in raw text}}{\text{number of tokens after tokenization}}

A higher compression ratio means each token carries more information on average. English text typically achieves CR around 3.5-4.5 with modern tokenizers.

Definition

Fertility

The fertility of a tokenizer for a language or domain is the average number of tokens per word (or per character):

\text{fertility} = \frac{\text{number of tokens}}{\text{number of words}}

Lower fertility means more efficient representation. English has fertility around 1.3 with GPT-4's tokenizer; some languages exceed 3.0.
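Both metrics are simple ratios once you can count tokens. A minimal sketch, with a hand-written token list standing in for the output of a real tokenizer:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Bytes of raw UTF-8 text per token: higher means denser tokens."""
    return len(text.encode("utf-8")) / len(tokens)

def fertility(text: str, tokens: list[str]) -> float:
    """Tokens per whitespace-separated word: lower means more efficient."""
    return len(tokens) / len(text.split())

# Toy illustration: pretend a tokenizer split this sentence into these tokens.
text = "the cat sat on the mat"
tokens = ["the", " cat", " sat", " on", " the", " mat"]
print(compression_ratio(text, tokens))  # 22 bytes / 6 tokens ≈ 3.67
print(fertility(text, tokens))          # 6 tokens / 6 words = 1.0
```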

Core Definitions

SentencePiece is a widely used tokenization library that treats the input as a raw stream of Unicode characters (not raw bytes), encoding whitespace as the meta symbol ▁ (U+2581) so tokenization can be reversed losslessly without pre-tokenization rules. This is the key distinction from byte-level BPE (GPT-2/GPT-4): SentencePiece's base alphabet is Unicode characters with optional byte-fallback for unseen codepoints, while byte-level BPE always operates on the 256-symbol UTF-8 byte alphabet. SentencePiece supports both BPE and unigram language model tokenization. The unigram approach assigns probabilities to candidate subword units and picks the segmentation that maximizes the likelihood, which is more principled than greedy BPE but gives similar results in practice.
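The unigram search can be sketched as a Viterbi-style dynamic program over substrings. The token log-probabilities below are made up for illustration; SentencePiece learns them with EM:

```python
import math

def unigram_tokenize(text: str, logp: dict[str, float]) -> list[str]:
    """Find the segmentation maximizing sum of log p(token), via dynamic programming."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = best score for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start index of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):  # cap candidate tokens at 10 chars
            piece = text[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Recover the best segmentation by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical vocabulary: one "unigram" token (-6) beats "uni" + "gram" (-8).
logp = {"uni": -4.0, "gram": -4.0, "unigram": -6.0, "u": -9.0, "n": -9.0,
        "i": -9.0, "g": -9.0, "r": -9.0, "a": -9.0, "m": -9.0}
print(unigram_tokenize("unigram", logp))  # ['unigram']
```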

The information-theoretic view: let H be the per-character entropy of the source text. A tokenizer with vocabulary size V and average token length L characters produces sequences of n/L tokens for an input of n characters. Each token carries at most \log_2 V bits. The minimum achievable sequence length is bounded by the entropy:

\text{tokens per character} \geq \frac{H}{\log_2 V}

Good tokenization pushes toward this bound. In practice, BPE gets surprisingly close for the language it was trained on.
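Plugging in rough numbers makes the bound concrete. The figures below are illustrative (H ≈ 1.2 bits/char is a common estimate for English; the vocabulary size and average token length are assumptions, not measurements of any real tokenizer):

```python
import math

H = 1.2      # common estimate of English entropy, bits per character
V = 32_000   # illustrative vocabulary size

# Lower bound on tokens per character from the source coding argument.
bound = H / math.log2(V)

# An illustrative tokenizer averaging 3.8 characters per token.
actual = 1 / 3.8

print(f"bound:  {bound:.4f} tokens/char ({1 / bound:.1f} chars/token at best)")
print(f"actual: {actual:.4f} tokens/char ({actual / bound:.2f}x above the bound)")
```

The gap looks large, but remember the bound assumes an optimal code over the entire source; practical tokenizers must also handle rare strings, whitespace conventions, and reversibility.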

Main Theorems

Proposition

Tokenization Length and Source Entropy

Statement

For a stationary ergodic source with entropy rate H (bits per character), any tokenizer with vocabulary size V satisfies:

\mathbb{E}\left[\frac{\text{number of tokens}}{\text{number of characters}}\right] \geq \frac{H}{\log_2 V}

Equality is achieved by an optimal variable-length code. BPE approximates this bound for frequent patterns in the training distribution.

Intuition

Each token is one of V symbols, so it carries at most \log_2 V bits of information. The source produces H bits per character. You need at least H / \log_2 V tokens per character to avoid losing information, just as you cannot compress a file below its entropy.

Proof Sketch

This follows from Shannon's source coding theorem. The total information in n characters is approximately nH bits. Encoding this into a sequence of m tokens from an alphabet of size V provides at most m \log_2 V bits. For lossless encoding, m \log_2 V \geq nH, giving m/n \geq H / \log_2 V.

Why It Matters

This bound tells you the theoretical limit of tokenization efficiency. If your tokenizer is far from this bound for some language, it is wasting context window on that language. This directly explains the multilingual tokenization gap: tokenizers trained mostly on English are near-optimal for English but far from optimal for underrepresented languages.
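The practical cost is easy to quantify: with a fixed token budget, higher fertility means fewer words fit in context. A sketch using the ~1.3 English fertility cited earlier and a hypothetical 3.0-fertility language (the helper name is mine, for illustration):

```python
def effective_context_words(context_tokens: int, fertility: float) -> float:
    """Words that fit in a fixed token budget, given tokens-per-word fertility."""
    return context_tokens / fertility

budget = 32_000  # tokens
print(effective_context_words(budget, 1.3))  # English-like: ~24,615 words
print(effective_context_words(budget, 3.0))  # high-fertility language: ~10,667 words
```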

Empirical observation: BPE approaches the entropy rate

BPE with vocabulary size V empirically achieves a compression ratio that approaches the entropy rate as V grows, on the distribution BPE was trained on. Measured compression-ratio gaps shrink roughly monotonically in V for English text across GPT-2, GPT-3.5, GPT-4, Llama-2, and Llama-3 tokenizers (Rust et al. 2021; Petrov et al. 2023 for the multilingual picture).

This is stated as an empirical regularity, not as a rigorous theorem. A clean asymptotic redundancy bound of the form O(1/\log V) for BPE specifically (as opposed to LZ-family dictionary coders) is not established in the literature. The closest formal results are Zouhar-Meister-Gastaldi-Cotterell (Formal Perspectives on BPE, ACL 2023), which shows BPE is an instance of greedy dictionary coding and shares its approximation guarantees under assumptions that are strong for natural language (stationary memoryless source, unique merge ties). For natural text, the correct framing is "empirically competitive with optimal dictionary codes, no closed-form redundancy bound known."

Intuition. BPE greedily captures the most frequent pairs first. As the vocabulary grows, it captures longer common substrings, shrinking the gap to the entropy rate. Each merge step removes the most redundancy per added vocabulary entry, which is a greedy approximation to optimal dictionary coding.

Caveat. BPE is optimized for its training distribution. On out-of-distribution text (rare languages, code, mathematical notation), it can produce very long token sequences. Any near-optimality claim holds in expectation over the training distribution, not worst-case.
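The caveat is visible in the greedy inference step itself: merges learned on one distribution simply fail to fire on another, and the tokenization degrades toward character level. A toy sketch with a hand-written merge list (not learned from any real corpus):

```python
def apply_merges(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize by applying BPE merges in learned priority order."""
    symbols = list(text)
    for a, b in merges:  # earlier merges have higher priority
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merges resembling English-trained rules (hand-written for illustration).
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("a", "n"), ("an", "d")]
print(apply_merges("the and", merges))  # in-distribution: compact tokens
print(apply_merges("xyzqw", merges))    # out-of-distribution: one token per character
```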

Canonical Examples

Example

BPE merge example

Starting vocabulary: {a, b, c, d, ␣} (the individual characters). Corpus: "abab cdcd abab".

Step 1: The most frequent pair is (a, b) with count 4. Merge it into the token "ab"; the corpus is now tokenized as [ab, ab, ␣, c, d, c, d, ␣, ab, ab]. Step 2: The most frequent pairs are now (c, d) and (ab, ab), each with count 2; say the tie-break picks (c, d), giving [ab, ab, ␣, cd, cd, ␣, ab, ab]. Step 3: Merging (ab, ab) turns each occurrence of "abab" into a single token.

After enough merges, common words become single tokens. The string "abab" that initially took 4 tokens now takes 2 (or 1 with one more merge).

Example

Tokenization failure: arithmetic

GPT-style tokenizers merge digit sequences inconsistently. The number "123456" might be tokenized as ["123", "456"] or ["12", "3456"] depending on context. The model must learn that "123" followed by "456" represents one hundred twenty-three thousand four hundred fifty-six, not two separate numbers. This is one reason LLMs struggle with arithmetic: the tokenizer imposes an unnatural decomposition of numbers, hiding place value and digit alignment.
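The inconsistency is easy to reproduce with a toy longest-match segmenter (a simplification of real BPE inference, not the merge procedure itself). With two hypothetical digit vocabularies, the same number splits differently:

```python
def greedy_segment(text: str, vocab: set[str]) -> list[str]:
    """Left-to-right longest-match segmentation over a token vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first (capped at 3 chars here);
        # fall back to a single character if nothing matches.
        for j in range(min(len(text), i + 3), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Two hypothetical vocabularies split the same number differently.
print(greedy_segment("123456", {"123", "456"}))  # ['123', '456']
print(greedy_segment("123456", {"12", "34", "56"}))  # ['12', '34', '56']
```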

Common Confusions

Watch Out

BPE is not word-level tokenization

BPE operates on subword units. Common words become single tokens, but rare words are split into pieces. This is not a bug but a feature: it gives the model a fallback for any string (down to individual bytes) while keeping common text compact. The tokenizer never produces an out-of-vocabulary token.

Watch Out

Larger vocabulary is not always better

Increasing V improves compression (fewer tokens per text) but increases the embedding matrix size (V \times d parameters) and makes each token's training signal sparser. There is an optimal V that balances compression, parameter count, and training efficiency. Most modern LLMs use V between 32K and 256K.
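The parameter cost is concrete: the input embedding matrix alone has V × d weights, doubled if input and output embeddings are untied. Illustrative sizes (d = 4096 is an assumption, typical of 7B-scale models):

```python
def embedding_params(V: int, d: int, tied: bool = True) -> int:
    """Parameters in the embedding matrix (x2 if input/output embeddings are untied)."""
    return V * d * (1 if tied else 2)

d = 4096  # illustrative model width
for V in (32_000, 128_000, 256_000):
    print(f"V={V:>7,}: {embedding_params(V, d) / 1e6:,.0f}M embedding parameters")
```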

Watch Out

Tokenization is not language-neutral

A tokenizer trained on 90% English text learns English-optimal merges. Non-Latin scripts may be split at the byte level, tripling token counts. This means the model's effective context length and per-token information content vary by language. Multilingual tokenizers address this by balancing the training corpus or using character-aware methods.

Summary

  • BPE builds a vocabulary by greedily merging frequent character pairs; it approximates entropy-optimal compression
  • Good tokenization minimizes expected sequence length, approaching the information-theoretic bound H / \log_2 V tokens per character
  • Tokenizer quality varies by language: underrepresented languages get worse compression and shorter effective context
  • Tokenization failures (numbers, rare scripts, code) directly cause model failures because the model only sees the tokenized representation
  • Vocabulary size V is a tradeoff: larger V means better compression but more parameters and sparser training

Exercises

ExerciseCore

Problem

English text has approximately H \approx 1.2 bits per character. A tokenizer with V = 50{,}000 tokens achieves an average token length of 4.1 characters. How close is this to the entropy bound?

ExerciseAdvanced

Problem

Explain why byte-level BPE (starting from 256 byte tokens) is strictly more general than character-level BPE (starting from Unicode characters). What does it gain and what does it lose?

ExerciseResearch

Problem

The unigram language model tokenizer (used in SentencePiece) finds the tokenization that maximizes the log-likelihood \sum_i \log p(t_i), where the p(t_i) are learned token probabilities. How does this relate to the information-theoretic view of tokenization as compression?

References

Pre-canonical (compression lineage):

  • Gage, A New Algorithm for Data Compression (C Users Journal 12:2, 1994). The original BPE algorithm in a compression context, 22 years before its NLP adaptation.
  • Schuster & Nakajima, Japanese and Korean Voice Search (ICASSP 2012). The original WordPiece algorithm, later adopted by BERT.

Canonical (NLP adaptations):

  • Sennrich, Haddow, Birch, Neural Machine Translation of Rare Words with Subword Units (ACL 2016, arXiv:1508.07909). Adapts Gage BPE to NMT; the form used in modern LLM tokenizers.
  • Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (ACL 2018, arXiv:1804.10959). Introduces the unigram LM tokenizer with EM training, used by SentencePiece alongside BPE.
  • Kudo & Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer (EMNLP 2018, arXiv:1808.06226). The widely used tokenization library.

Current (formal and fairness analysis):

  • Rust et al., How Good is Your Tokenizer? (ACL 2021, arXiv:2012.15613). Cross-linguistic tokenizer evaluation.
  • Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (NeurIPS 2023, arXiv:2305.15425). Multilingual fertility disparities on GPT-4's tokenizer, some languages exceeding 10x English.
  • Ahia et al., Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models (EMNLP 2023, arXiv:2305.13707). API-pricing disparity induced by tokenizer fertility.
  • Zouhar, Meister, Gastaldi, Cotterell, Formal Perspectives on BPE (ACL 2023, arXiv:2306.16837). The closest thing to a formal approximation analysis of BPE as dictionary coding.

Frontier (tokenizer-free alternatives):

  • Xue et al., ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models (TACL 2022, arXiv:2105.13626). Byte-level sequences, no tokenizer.
  • Yu et al., MegaByte: Predicting Million-byte Sequences with Multiscale Transformers (NeurIPS 2023, arXiv:2305.07185). Patch-based byte modeling.
  • Pagnoni et al., Byte Latent Transformer: Patches Scale Better Than Tokens (BLT, 2024, arXiv:2412.09871). Dynamic patch-size byte modeling.

Reference:

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 2 (regular expressions, tokenization) and Chapter 3 (n-gram language models).

Last reviewed: April 26, 2026
