
Model Timeline

Qwen and Chinese Models

The Chinese open-weight model ecosystem: Qwen (Alibaba), Yi (01.AI), Baichuan, GLM (Zhipu AI), and Kimi (Moonshot AI), with a focus on multilingual capability and independent scaling.

Core · Tier 3 · Frontier · Reference · ~35 min

Why This Matters

China has developed an independent LLM ecosystem that is competitive with Western models on many benchmarks. Qwen (Alibaba) is the most prominent family, with open weights, strong multilingual support, and sizes from sub-billion to 235B MoE. Understanding this ecosystem matters because it represents a parallel development path with different data, hardware constraints (US export controls on top-tier GPUs), and design priorities.

Qwen Family (Alibaba Cloud)

Qwen 1 (August 2023)

Sizes. 1.8B, 7B, 14B, 72B. Training data. Over 2.4T tokens, with strong coverage of Chinese and English. Architecture. Dense decoder-only transformer with RoPE, SwiGLU activations, and RMSNorm.

Key feature. Multilingual tokenizer with a large vocabulary (approximately 150K tokens) covering Chinese, English, and code. Chinese-optimized tokenizers are significantly more efficient for Chinese text than tokenizers designed primarily for English (like LLaMA's).

Qwen 1.5 (February 2024)

Incremental improvement over Qwen 1. Better alignment, improved long-context handling (up to 32K tokens for larger sizes). Released under a more permissive license (Apache 2.0 for smaller sizes).

Qwen 2 (June 2024)

Sizes. 0.5B, 1.5B, 7B, 57B-A14B (MoE, 57B total with 14B active), 72B. Training data. Over 7T tokens. Context length. Up to 128K tokens for the 7B and 72B models.

Improvements. Significant quality gains across math, coding, and reasoning. The 72B model was competitive with Llama 3 70B on English benchmarks and stronger on Chinese. The MoE variant (57B-A14B) demonstrated that Alibaba was also pursuing sparse architectures alongside dense models.

Qwen 2.5 (September 2024)

Sizes. 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B. Training data. Over 18T tokens (a major increase from Qwen 2's 7T). Specialized variants: Qwen2.5-Coder for code generation and Qwen2.5-Math for mathematical reasoning.

Result. Qwen 2.5 72B was competitive with Llama 3.1 70B and Claude 3.5 Sonnet on several benchmarks. The breadth of model sizes (from 0.5B to 72B) makes the family useful across deployment scenarios from edge devices to servers.

Qwen 3 (2025)

Extended the family with a 235B MoE model (estimated 22B active per token). Continued improvements in reasoning, code, and multilingual performance. The MoE variant follows the pattern established by DeepSeek-V2/V3: large total parameter count with a fraction active per token.
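The MoE pattern above is easiest to see as a ratio. A quick sketch, using the total/active parameter counts quoted in this section (the Qwen 3 active count is an estimate, as noted):

```python
# Active-parameter fraction for the MoE models discussed in this section.
# Figures are the totals/actives quoted in the text, not official specs
# beyond what the papers report.
models = {
    "Qwen2 57B-A14B": (57e9, 14e9),
    "Qwen3 235B MoE": (235e9, 22e9),   # active count estimated in text
    "DeepSeek-V2":    (236e9, 21e9),
    "DeepSeek-V3":    (671e9, 37e9),
}

for name, (total, active) in models.items():
    # Per-token compute scales with active params; memory with total.
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

The trend is toward sparser activation: later MoE models activate a smaller fraction of their total parameters per token, trading memory footprint for per-token compute.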

Tokenizer Efficiency

Proposition

Tokenizer Compression Ratio for Multilingual Text

Statement

The compression ratio of a BPE tokenizer is the ratio of characters (or bytes) to tokens for a given text. A tokenizer trained primarily on English text achieves a compression ratio of approximately 4:1 for English (4 characters per token) but only 1.5-2:1 for Chinese (1.5-2 characters per token), because Chinese characters are underrepresented in the vocabulary. A tokenizer with adequate Chinese coverage achieves 2.5-3.5:1 compression for Chinese. Since API pricing and context limits are denominated in tokens, a 2x difference in compression ratio means 2x higher effective cost and 2x shorter effective context for the same text.
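The effective-context and cost consequences can be computed directly. A minimal sketch, using midpoint compression ratios from the ranges stated above (illustrative figures, not measurements of any specific tokenizer):

```python
# Effective context and relative cost under two compression ratios
# (characters per token). Ratios are midpoints of the ranges in the
# proposition above, chosen for illustration.

def effective_chars(context_tokens: int, chars_per_token: float) -> float:
    """How many characters of text fit in a fixed token budget."""
    return context_tokens * chars_per_token

context = 128_000        # 128K-token context window
english_heavy = 1.75     # ~1.5-2 chars/token for Chinese text
chinese_aware = 3.0      # ~2.5-3.5 chars/token for Chinese text

for name, ratio in [("English-heavy tokenizer", english_heavy),
                    ("Chinese-aware tokenizer", chinese_aware)]:
    print(f"{name}: {effective_chars(context, ratio):,.0f} "
          f"Chinese characters fit in context")

# Per-character cost is inversely proportional to chars/token:
print(f"cost ratio: {chinese_aware / english_heavy:.2f}x "
      f"more tokens (and cost) with the English-heavy tokenizer")
```

The same document thus fits in roughly 224K versus 384K characters of context, and costs about 1.7x more to process with the English-heavy tokenizer.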

Intuition

BPE builds vocabulary by merging frequent character pairs. If the training corpus is mostly English, most vocabulary entries represent English subwords. Chinese characters that appear less frequently in training get fewer vocabulary entries, so each Chinese character often maps to multiple tokens. Qwen's tokenizer allocates more vocabulary to Chinese, giving better compression for Chinese text.

Proof Sketch

Empirical measurement. Tokenize the same Chinese text with LLaMA's tokenizer (32K vocab, English-heavy) and Qwen's tokenizer (150K vocab, Chinese-balanced). LLaMA produces roughly 2x more tokens for the same Chinese text. The theoretical basis is that BPE compression approaches the empirical entropy of the training corpus distribution; if Chinese is underrepresented, the tokenizer's "language model" of Chinese is poor, yielding longer encodings.

Why It Matters

For a multilingual application, tokenizer efficiency directly affects cost and capability. If your tokenizer uses 2x as many tokens for Chinese as for English, then Chinese users get half the effective context window and pay twice as much per character. Qwen's large multilingual vocabulary addresses this directly.

Failure Mode

A larger vocabulary increases the embedding matrix size (V × d) and the output projection size. For a 150K vocabulary with d = 4096, the embedding and output matrices together use 2 × 150,000 × 4,096 ≈ 1.2B parameters. For smaller models (0.5B-1.5B), this is a significant fraction of total parameters. There is a trade-off between tokenizer efficiency and model parameter allocation.
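The arithmetic above, plus the standard mitigation for small models (weight tying, i.e. sharing the embedding matrix with the LM head):

```python
# Parameter cost of a large vocabulary: the input embedding and the
# output projection (LM head) are each of shape (vocab_size, d).
# Numbers follow the worked figure in the text above.

def vocab_params(vocab_size: int, d: int, tied: bool = False) -> int:
    """Parameters spent on the embedding and output projection."""
    matrices = 1 if tied else 2  # tying shares one matrix for both
    return matrices * vocab_size * d

print(f"untied: {vocab_params(150_000, 4096) / 1e9:.2f}B parameters")
print(f"tied:   {vocab_params(150_000, 4096, tied=True) / 1e9:.2f}B")
```

For a 0.5B-parameter model, an untied 150K vocabulary at d = 4096 would exceed the entire parameter budget, which is why small models shrink d, tie weights, or both.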

Other Chinese Model Families

Yi (01.AI)

Organization. 01.AI, founded by Kai-Fu Lee. Models. Yi-6B, Yi-34B (November 2023), Yi-1.5 (2024) with sizes up to 34B. Yi-Lightning and Yi-Large (2024) as proprietary API models.

Approach. High-quality English and Chinese performance from relatively small models. Yi-34B at launch was competitive with much larger models on both English and Chinese benchmarks. The focus is on data quality over raw scale: careful curation and filtering of training data.

Baichuan (Baichuan Intelligence)

Models. Baichuan-7B, Baichuan-13B (2023), Baichuan 2 (2023). Open-weight models with strong Chinese performance. Less prominent internationally than Qwen or Yi but widely used in Chinese industry applications.

GLM (Zhipu AI)

Models. ChatGLM, ChatGLM2, ChatGLM3, GLM-4 (2024), GLM-5 (February 2026). Based on the General Language Model architecture, which originally used a bidirectional attention prefix followed by autoregressive generation. GLM-4 was competitive with GPT-4 on Chinese benchmarks.

GLM-5 (announced by Zhipu AI in 2026) is a native multimodal model (text, image, video, audio, code in a single architecture), supports long context (128K tokens), and includes built-in tool use and agentic capabilities. The technical report describes a three-stage training pipeline: multimodal pretraining, supervised fine-tuning with diverse instruction data, and RLHF with both human and AI feedback. Zhipu AI released it under a permissive license, continuing the Chinese open-weight trend. Exact benchmark numbers are vendor-reported; independent replication has been partial.

DeepSeek (DeepSeek AI / High-Flyer)

Why it deserves its own row. DeepSeek is the Chinese lab that has pushed hardest on the MoE + efficiency frontier, and its papers are the main Chinese source other labs (including Qwen on its MoE variants) explicitly cite.

Models. DeepSeek-LLM (2023, dense 7B and 67B), DeepSeek-Coder (2023, code-specialized), DeepSeek-V2 (2024, 236B MoE with 21B active per token, introducing multi-head latent attention for KV-cache compression), DeepSeek-V3 (late 2024, 671B MoE with 37B active), DeepSeek-R1 (early 2025, reasoning-specialized model trained with large-scale rule-based RL and released with open weights). See DeepSeek-R1 and reasoning RL for the reasoning pipeline.

Architectural contributions. Multi-head latent attention (MLA) compresses keys and values to a smaller latent, reducing KV-cache memory at long context at the cost of a small quality hit relative to GQA. The V3 paper also documents aggressive FP8 training and fine-grained expert routing.
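The memory benefit of caching a compressed latent instead of full keys and values can be sketched with round numbers. The dimensions below are hypothetical, chosen only to show the scale of the effect, not DeepSeek's actual configuration:

```python
# Rough KV-cache comparison: standard attention caches K and V (each
# d_model wide per layer per token); an MLA-style scheme caches one
# compressed latent per layer per token. All sizes are hypothetical
# round numbers for illustration.

def kv_cache_bytes(seq_len: int, n_layers: int,
                   per_token_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache one sequence (bf16/fp16 by default)."""
    return seq_len * n_layers * per_token_dim * bytes_per_elem

seq_len, n_layers = 128_000, 60
d_model, latent_dim = 5120, 512       # hypothetical dimensions

std = kv_cache_bytes(seq_len, n_layers, 2 * d_model)  # K and V
mla = kv_cache_bytes(seq_len, n_layers, latent_dim)   # one latent

print(f"standard: {std / 2**30:.1f} GiB")
print(f"latent:   {mla / 2**30:.1f} GiB ({std / mla:.0f}x smaller)")
```

Under these assumptions the cache shrinks by roughly 20x, which is what makes 128K-token contexts feasible on limited-memory hardware; GQA achieves a smaller reduction by sharing KV heads rather than compressing them.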

Kimi (Moonshot AI)

Key innovation. Long context and reasoning. Kimi launched with 200K token context support in early 2024, making it one of the first consumer-facing products with context windows beyond 100K tokens. In 2025, Kimi k1.5 introduced multi-step reasoning with reinforcement learning, following the approach pioneered by DeepSeek-R1 and OpenAI o1. Kimi demonstrated that long context combined with chain-of-thought reasoning creates qualitatively different user experiences: the model can read an entire codebase and reason about architectural changes across files.

Kimi K2 and successors. In mid-2025 Moonshot AI shifted to an MoE architecture with K2, followed by K2 Thinking (late 2025, released with INT4 quantization and 256K context), K2.5 (early 2026, a multimodal iteration trained on mixed vision-text tokens), and K2.6 Code (April 2026, a coding-specialized variant). The lab has publicly discussed a K3 successor in the trillion-parameter range but no technical report is out at the time of writing.

Other Notable Models (2025-2026)

MiniMax-01 (MiniMax, 2025): 456B parameter MoE model with "lightning attention" (a hybrid of linear attention and softmax attention). Strong on long-context tasks.

Step-2 (StepFun, 2025): Multimodal model with native video understanding. One of the first Chinese models to reach strong video comprehension scores on public benchmarks.

Seed (ByteDance, 2025): TikTok's parent company entered the frontier model race. Limited public details but deployed in ByteDance products.

The Chinese Open-Weight Ecosystem

Several factors shape the Chinese model landscape:

Hardware constraints. US export controls restrict Chinese labs' access to Nvidia H100 GPUs. Labs use H800 (a lower-bandwidth variant) or domestic alternatives. This may incentivize efficiency-focused architectures (MoE, quantization-friendly designs) and is one hypothesis for why Chinese labs like DeepSeek have led on MoE innovation.

Data landscape. Chinese internet data is distinct from English internet data in scale, quality distribution, and content. Models trained primarily on Chinese data perform better on Chinese tasks but may lag on English, and vice versa. Qwen's strategy of training on balanced multilingual data addresses this directly.

Regulatory environment. China's AI regulations require government review of public-facing AI services. This affects how models are deployed commercially but has not slowed research output.

Open-weight norms. Qwen, Yi, Baichuan, and DeepSeek all release model weights, often under permissive licenses. The Chinese open-weight ecosystem is growing independently of and in parallel with the Western ecosystem (Llama, Mistral, Gemma).

Common Confusions

Watch Out

Chinese models are not just for Chinese language

Qwen 2.5 72B performs competitively with Llama 3.1 70B on English benchmarks. Modern Chinese models are multilingual and competitive across languages. The "Chinese model" label refers to the organization's origin, not a language limitation.

Watch Out

Open weights does not mean identical training

Qwen, Yi, and DeepSeek all release weights, but their training data, procedures, and filtering criteria differ. The models are not interchangeable; they have different strengths even at similar parameter counts. Evaluate on your specific task before choosing between them.

Exercises

ExerciseCore

Problem

A tokenizer with 32K vocabulary tokenizes Chinese text at a compression ratio of 1.5 characters per token. A tokenizer with 150K vocabulary achieves 3.0 characters per token. If a model has a 128K token context window, how many Chinese characters can each tokenizer fit? What is the equivalent in approximate page count (assuming 800 Chinese characters per page)?

ExerciseAdvanced

Problem

Qwen 2.5 has a vocabulary of approximately 150K tokens with embedding dimension d = 5120 (for the 72B model). Compute the number of parameters in the embedding layer and the output projection (LM head). What fraction of the total 72B parameters do these represent? Compare to Llama 3 70B with a 128K vocabulary and d = 8192.

References

Canonical:

  • Bai et al., "Qwen Technical Report" (Alibaba, 2023)
  • Yang et al., "Qwen2 Technical Report" (2024)
  • Qwen Team, "Qwen2.5 Technical Report" (2024)
  • DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024). Multi-head latent attention.
  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024). 671B MoE with 37B active.

Tokenizer inequity (pre-canonical):

  • Petrov et al., "Language Model Tokenizers Introduce Unfairness Between Languages," NeurIPS (2023). Quantifies the per-language token-cost gap.
  • Ahia et al., "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models," EMNLP (2023). Shows API pricing is regressive in language.

Current:

  • 01.AI, "Yi: Open Foundation Models by 01.AI" (2024)
  • Zhipu AI, "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4" (2024)
  • Zhipu AI, "GLM-5: from Vibe Coding to Agentic Engineering," arXiv:2602.15763 (February 2026). 745B total / 44B active MoE with DeepSeek-style sparse attention, async RL infrastructure, released under MIT license as zai-org/GLM-5.
  • Kimi Team, "Kimi k1.5: Scaling Reinforcement Learning with LLMs" (2025)
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)

Next Topics

  • Model comparison table: structured comparison across frontier model families
  • DeepSeek models: the MoE architecture and R1 reasoning approach

Last reviewed: April 17, 2026
