ML Methods
Hyperbolic Embeddings for Graphs
Negatively curved space embeds tree-like graphs with exponentially less distortion than Euclidean space at the same dimension. The Poincare ball model, Mobius gyrovector arithmetic, and Sala et al.'s sharp distortion bound make this concrete; Ganea et al. lift the construction to neural network layers.
Why This Matters
A perfect binary tree of depth $d$ has $2^{d+1}-1$ nodes, so the number of nodes within graph distance $r$ of the root grows like $2^r$. A Euclidean ball of radius $r$ in $\mathbb{R}^n$ has volume $\Theta(r^n)$, polynomial in $r$. To fit exponentially many nodes into a polynomial volume while preserving distances, an embedding must compress: tree leaves get shoved together and pairwise distances are crushed. This is not a tuning issue. It is a volume mismatch between the data and the space.
Hyperbolic space has volume that grows like $e^{(n-1)r}$ in the metric radius $r$ (see non-Euclidean and hyperbolic geometry). The exponential matches the tree. Sala, De Sa, Gu, and Re (2018) made this quantitative: any tree on $N$ nodes embeds into the Poincare disk with distortion $1+\epsilon$ using $O(\frac{1}{\epsilon}\log N)$ bits of precision, while no Euclidean embedding of the worst-case tree into any fixed dimension achieves distortion bounded independently of $N$.
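The mismatch is easy to see numerically. A plain-Python sketch comparing the node count of a depth-$r$ binary tree against the area $\pi r^2$ of a Euclidean disk and the area $2\pi(\cosh r - 1)$ of a hyperbolic disk of the same metric radius:

```python
import math

# Nodes within graph distance r of the root of a perfect binary tree,
# vs. the area available at metric radius r in each geometry.
for r in [5, 10, 20]:
    tree_nodes = 2 ** (r + 1) - 1                        # grows like 2^r
    euclidean_area = math.pi * r ** 2                    # polynomial in r
    hyperbolic_area = 2 * math.pi * (math.cosh(r) - 1)   # grows like e^r
    print(f"r={r:2d}  nodes={tree_nodes:>9d}  "
          f"euclidean={euclidean_area:9.1f}  hyperbolic={hyperbolic_area:.3g}")
```

The tree and the hyperbolic disk grow in lockstep; the Euclidean disk falls behind immediately.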
This was not folklore. Nickel and Kiela (2017) reported 5-dimensional Poincare embeddings of WordNet hypernymy beating 200-dimensional Euclidean embeddings on link prediction. Ganea, Becigneul, and Hofmann (2018) then showed how to build neural-network layers (linear maps, biases, activations, attention) that operate inside the hyperbolic space rather than around it.
Setup: the Poincare ball
Poincare ball model
The Poincare ball is the open unit ball
$$\mathbb{B}^n = \{x \in \mathbb{R}^n : \|x\| < 1\},$$
equipped with the Riemannian metric
$$g_x = \lambda_x^2\, g^{E}, \qquad \lambda_x = \frac{2}{1-\|x\|^2},$$
where $g^E$ is the Euclidean metric. The induced distance between $x, y \in \mathbb{B}^n$ is
$$d_{\mathbb{B}}(x, y) = \operatorname{arcosh}\!\left(1 + 2\,\frac{\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right).$$
The boundary $\partial\mathbb{B}^n$ is at infinite distance from any interior point. The model has constant sectional curvature $-1$.
Two facts make $\mathbb{B}^n$ usable as an embedding space:
- Geodesics through the origin are straight lines. The geodesic from $0$ to a point $x$ is the Euclidean segment, with hyperbolic arclength $d_{\mathbb{B}}(0, x) = 2\operatorname{artanh}\|x\|$.
- Other geodesics are circular arcs orthogonal to the boundary $\partial\mathbb{B}^n$. Picture: a "straight line" curves toward the center because pushing along the boundary is metrically expensive (the conformal factor $\lambda_x \to \infty$ as $\|x\| \to 1$).
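The distance formula is short enough to sanity-check in code. A minimal sketch (the helper name `poincare_dist` is mine) verifying that it reduces to $2\operatorname{artanh}\|x\|$ on geodesics through the origin:

```python
import numpy as np

def poincare_dist(x, y):
    """Distance in the Poincare ball model (curvature -1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - x @ x) * (1.0 - y @ y)
    return np.arccosh(1.0 + 2.0 * sq / denom)

x = np.array([0.3, 0.4])          # ||x|| = 0.5
d = poincare_dist(np.zeros(2), x)
print(d, 2 * np.arctanh(0.5))     # the two values agree
```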
The distortion bound for tree embeddings
The central result. Distortion of an embedding $f : (X, d_X) \to (Y, d_Y)$ is the multiplicative gap between graph and embedded distances:
$$D(f) = \left(\max_{u \neq v} \frac{d_Y(f(u), f(v))}{d_X(u, v)}\right) \cdot \left(\max_{u \neq v} \frac{d_X(u, v)}{d_Y(f(u), f(v))}\right).$$
A perfect (isometric) embedding has distortion $1$.
Sala-De Sa-Gu-Re distortion bound for tree embeddings
Theorem (Sala et al., 2018). Any weighted tree on $N$ nodes embeds into the Poincare disk $\mathbb{B}^2$ with distortion at most $1+\epsilon$, and the embedding can be represented with $O(\frac{1}{\epsilon}\log N)$ bits of precision.
Intuition
Embed the root at the origin. For each node already placed, reserve an angular cone for each child and place the child at hyperbolic distance equal to the edge weight along the bisector of its cone. Because the circumference of a hyperbolic circle of radius $r$ scales like $2\pi\sinh r \approx \pi e^r$, the room available to each child grows exponentially with its distance from the origin, so even a high-branching tree fits without children colliding. Scaling the whole embedding by a large factor $\tau$ pushes nodes outward where there is exponential room, controlling angular interference.
Proof Sketch
Embedding construction (Sala et al., Algorithm 1). Place the root at $0$. Process nodes in BFS order. For a node $v$ at hyperbolic distance $r$ from the origin with $k$ children, partition the angular cone available at $v$ (the full $2\pi$ minus the angle reserved back toward its parent) into $k$ sectors. Place each child $c$ at hyperbolic distance $w(v, c)$ from $v$ along the bisector of its sector, where $w(v, c)$ is the edge weight.
Why the construction has low distortion. Two nodes $u, v$ in the tree have tree distance $d_T(u, v) = d_T(u, a) + d_T(a, v)$, where $a$ is their lowest common ancestor. After scaling by $\tau$, the embedded points satisfy $\tau\, d_T(u, v) - C(\deg_{\max}) \le d_{\mathbb{B}}(f(u), f(v)) \le \tau\, d_T(u, v)$, where $\deg_{\max}$ is the max degree. The negative correction term $C(\deg_{\max})$ is the angular wiggle from approximating "go up then down" by a single hyperbolic geodesic. It is bounded uniformly in $N$, so for $\tau$ large enough relative to $1/\epsilon$, every pairwise ratio lies in $[\frac{1}{1+\epsilon},\, 1]$.
Why $O(\frac{1}{\epsilon}\log N)$ precision bits are needed. Hyperbolic distances near the boundary saturate floating-point precision. A point at hyperbolic distance $r$ from the origin has Euclidean norm $\tanh(r/2) = 1 - 2e^{-r} + O(e^{-2r})$, so distinguishing two points a constant hyperbolic distance apart requires $\Omega(r)$ bits to keep $1 - \|x\|$ from rounding to zero. Scaling pushes typical embedded distances to $\Theta(\frac{\log N}{\epsilon})$, hence the bound. Sala et al. give an exact ulp-level analysis.
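The float64 saturation is easy to demonstrate (a sketch; the crossover near $r \approx 37$ comes from half an ulp of 1.0 being about $1.1 \times 10^{-16}$):

```python
import numpy as np

# A point at hyperbolic distance r from the origin has Euclidean norm
# tanh(r/2) = 1 - 2e^{-r} + O(e^{-2r}). Once 2e^{-r} drops below half an
# ulp of 1.0, the norm rounds to exactly 1.0: the point leaves the open ball
# and every distance involving it evaluates to inf.
for r in [10.0, 20.0, 40.0]:
    norm = np.tanh(r / 2)
    print(f"r={r:5.1f}  norm={norm!r}  rounds_to_boundary={norm == 1.0}")
```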
Euclidean lower bound. Bourgain (1985) and Linial-London-Rabinovich (1995) show that any $N$-point metric embeds into $\ell_2$ with distortion $O(\log N)$, and that this is tight for constant-degree expanders, which require $\Omega(\log N)$ distortion in any Euclidean space. For trees the relevant bound is Bourgain's (1986): the complete binary tree on $N$ nodes requires distortion $\Omega(\sqrt{\log\log N})$ even with unbounded dimension, and in any fixed dimension the volume argument above forces distortion growing polynomially in $N$. So scaling dimension at fixed distortion does not save Euclidean embeddings of trees: distortion growing with $N$ is unavoidable in $\mathbb{R}^n$ for every $n$.
Why It Matters
This is the only result that cleanly justifies "use hyperbolic space for hierarchies." The constructive direction tells you the minimum dimension and precision needed. The Euclidean lower bound tells you that no clever Euclidean parametrization can match it asymptotically. Together they explain Nickel and Kiela's 5D-vs-200D result without appealing to mystery: the Euclidean baseline was paying distortion that grows with the size of the hierarchy, which a 200-dimensional bag-of-features partially absorbed but did not remove.
Failure Mode
The theorem is about trees and tree-like graphs (low Gromov hyperbolicity $\delta$). Grids, lattices, and dense graphs have $\delta$ on the order of their diameter and gain nothing. The same construction applied to a 2D grid embeds it with distortion that grows with the grid size, much worse than its trivial Euclidean embedding, which is near-isometric. Always estimate $\delta$ on a sample of 4-tuples before committing to a hyperbolic embedding (Adcock-Sullivan-Mahoney 2013 give a sampling estimator).
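The 4-point estimator can be sketched with standard-library Python. This is a simplified version under my own naming (`sample_delta`, `bfs_dists`); it recomputes BFS per quadruple, which is wasteful but keeps the sketch short, whereas the estimator in Adcock et al. amortizes the distance computations:

```python
import random
from collections import deque

def bfs_dists(adj, src):
    """Single-source shortest paths in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def sample_delta(adj, n_samples=1000, seed=0):
    """Four-point estimate of Gromov hyperbolicity from random quadruples."""
    rng = random.Random(seed)
    nodes = list(adj)
    delta = 0.0
    for _ in range(n_samples):
        w, x, y, z = rng.sample(nodes, 4)
        d = {s: bfs_dists(adj, s) for s in (w, x, y)}
        sums = sorted([d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]], reverse=True)
        delta = max(delta, (sums[0] - sums[1]) / 2)  # gap of the two largest sums
    return delta

# Trees are 0-hyperbolic; a path is a tree, so the estimate must be 0.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 30] for i in range(30)}
print(sample_delta(path))   # 0.0
```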
Mobius gyrovector arithmetic
To build neural networks in $\mathbb{B}^n$ you need an analogue of vector addition that respects the manifold. Ungar's (2008) gyrovector formalism provides one.
Mobius addition on the Poincare ball
For $x, y \in \mathbb{B}^n$, the Mobius sum is
$$x \oplus y = \frac{(1 + 2\langle x, y\rangle + \|y\|^2)\, x + (1 - \|x\|^2)\, y}{1 + 2\langle x, y\rangle + \|x\|^2\|y\|^2}.$$
It is not associative or commutative, but it is a smooth left-loop operation: $0 \oplus x = x$, $(-x) \oplus x = 0$, and for each $x$ the map $y \mapsto x \oplus y$ is a smooth bijection $\mathbb{B}^n \to \mathbb{B}^n$. The left inverse of $x$ is $-x$.
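These properties are mechanical to check (a sketch; `mobius_add` is a local helper name, and the sampled points are kept well inside the ball):

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition on the Poincare ball (curvature -1)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

rng = np.random.default_rng(0)
x, y, z = (rng.uniform(-0.5, 0.5, 3) for _ in range(3))  # norms <= 0.87 < 1

print(np.allclose(mobius_add(np.zeros(3), x), x))        # left identity: True
print(np.allclose(mobius_add(-x, x), np.zeros(3)))       # left inverse: True
lhs = mobius_add(mobius_add(x, y), z)
rhs = mobius_add(x, mobius_add(y, z))
print(np.allclose(lhs, rhs))                             # associativity fails: False
```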
Exponential and logarithmic maps at the origin
The exponential map at the origin is
$$\exp_0(v) = \tanh(\|v\|)\,\frac{v}{\|v\|},$$
and the logarithm is its inverse:
$$\log_0(y) = \operatorname{artanh}(\|y\|)\,\frac{y}{\|y\|}.$$
$\exp_0$ takes a tangent vector $v$ at $0$ and returns the point in $\mathbb{B}^n$ reached by moving along the geodesic from $0$ in that direction for hyperbolic distance equal to the tangent vector's Riemannian length $\lambda_0\|v\| = 2\|v\|$.
A Ganea-style hyperbolic linear layer is then
$$f(x) = \exp_0\!\big(W \log_0(x)\big) \oplus b,$$
where $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{B}^m$. The construction reads as: lift $x$ to the tangent space at the origin, apply a Euclidean linear map, project back to the manifold, then translate by Mobius addition. Hyperbolic activations, attention scores, and MLR classifiers follow the same lift-act-project pattern.
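A minimal numpy sketch of the layer under the conventions above (curvature $-1$, $\exp_0(v) = \tanh(\|v\|)\, v/\|v\|$; all names are local to the sketch, not an official API):

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def exp0(v):
    """Exponential map at the origin (curvature -1)."""
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 1e-15 else v

def log0(p):
    """Logarithmic map at the origin; inverse of exp0."""
    n = np.linalg.norm(p)
    return np.arctanh(n) * p / n if n > 1e-15 else p

def hyp_linear(W, b, x):
    """Lift to the tangent space, apply W, project back, Mobius-translate by b."""
    return mobius_add(exp0(W @ log0(x)), b)

W = np.array([[0.5, -0.3], [0.2, 0.8]])
b = np.array([0.1, -0.2])           # the bias must itself lie in the ball
x = np.array([0.3, 0.4])
y = hyp_linear(W, b, x)
print(np.linalg.norm(y) < 1.0)      # output stays inside the ball: True
```

With $W = I$ and $b = 0$ the layer is the identity, which is a quick correctness check on the exp/log pair.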
Hyperbolic distance is not a rescaling of Euclidean distance
The Poincare distance formula is not just Euclidean distance rescaled by a function of position. Two points with small Euclidean separation near the boundary can be very far apart in $\mathbb{B}^n$ because the conformal factor $\lambda_x = \frac{2}{1 - \|x\|^2} \to \infty$ as $\|x\| \to 1$. This is why hyperbolic embeddings have capacity: the boundary is at infinity, so there is always more room to push deeper subtrees outward.
Hyperbolic neural networks are not Euclidean networks with a tanh squash
A squashing nonlinearity can force outputs into the unit ball, so a Euclidean MLP with a final tanh-style squash produces points in $\mathbb{B}^n$ as a set. That is not a hyperbolic neural network. The intermediate operations (matrix multiplication, addition, normalization) treat the points as Euclidean and ignore the metric. A hyperbolic network uses Mobius operations and Riemannian gradients computed via $\exp_0$ and $\log_0$, so each parameter step respects the geometry. Treating points as Euclidean, then re-projecting after each step, loses exactly the structure the model was supposed to exploit.
Curvature is a hyperparameter, not a fixed truth
Most papers set the curvature to $-1$ for convenience. Curvature scales distances and gradients in ways that interact with optimization and numerical precision. Chami et al. (2019) make the curvature learnable per layer; the learned values often drift toward zero on graphs with low Gromov $\delta$ but high local clustering, indicating that the model wants Euclidean geometry in places. If your model insists on near-zero curvature everywhere, that is evidence the dataset is not actually tree-like.
When hyperbolic embeddings help
The right diagnostic is Gromov hyperbolicity $\delta$, the smallest constant such that every geodesic triangle is $\delta$-thin (each side lies within distance $\delta$ of the union of the other two). Trees have $\delta = 0$; trees with extra cross-edges have small $\delta$; expanders and 2D grids have $\delta$ on the order of their diameter.
Empirically:
| Graph family | Typical $\delta$ | Hyperbolic gain over Euclidean at fixed dim |
|---|---|---|
| WordNet hypernymy, Gene Ontology, citation chains | Near $0$ (tree-like) | Large; Nickel-Kiela report 30%+ MAP improvements at 5-50 dim |
| Social networks (Facebook, citation networks) | Small to moderate | Moderate; Chami et al. show small but consistent gains |
| Web graphs, biological PPI | Moderate | Marginal; depends on subgraph and task |
| Road networks, 2D meshes, regular grids | Large (order of the diameter) | None or negative; Euclidean is the right choice |
| Random Erdos-Renyi | Large for typical densities | None |
The takeaway: estimate $\delta$ before committing. It is a one-pass sampling computation.
Worked Example: a 7-node binary tree in
Take the perfect binary tree with root $r$, children $a, b$, and grandchildren $a_1, a_2, b_1, b_2$. All edges have weight $1$.
Place $f(r) = 0$. Place the two children on opposite rays at hyperbolic distance $1$:
$$f(a) = \big(\tanh\tfrac{1}{2},\, 0\big) \approx (0.4621,\, 0), \qquad f(b) = \big({-\tanh\tfrac{1}{2}},\, 0\big).$$
For $a$, the angular cone facing away from $r$ is the right half-plane. Split it into two $90°$ sectors and place the grandchildren on the bisectors at hyperbolic distance $1$ from $f(a)$. This requires Mobius translation: $f(a_1) = f(a) \oplus \tanh\tfrac{1}{2}\,(\cos 45°,\, \sin 45°)$, and similarly for $a_2, b_1, b_2$.
The hyperbolic distance between the sibling grandchildren $f(a_1)$ and $f(a_2)$ comes out to roughly $1.51$ (hyperbolic law of cosines with a right angle at $f(a)$: $\cosh d = \cosh^2 1$), and the tree distance is $2$. With scale factor $\tau = 10$ (each edge embedded with length $10$), the same construction gives $d = \operatorname{arcosh}(\cosh^2 10) \approx 19.3$ versus tree distance $20$, distortion about $1.04$. Pushing $\tau$ higher pushes the ratio further toward $1$, illustrating the construction in the proof sketch.
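The numbers can be checked directly (a sketch with local helper names; since left Mobius translation is an isometry, the constructed distance should match the law-of-cosines value $\operatorname{arcosh}(\cosh^2 1)$ exactly):

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def poincare_dist(x, y):
    sq = np.sum((x - y) ** 2)
    return np.arccosh(1 + 2 * sq / ((1 - x @ x) * (1 - y @ y)))

t = np.tanh(0.5)                    # Euclidean norm at hyperbolic distance 1
fa = np.array([t, 0.0])             # child a
u1 = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])
u2 = np.array([np.cos(np.pi / 4), -np.sin(np.pi / 4)])
fa1 = mobius_add(fa, t * u1)        # grandchild a1, distance 1 from f(a)
fa2 = mobius_add(fa, t * u2)        # grandchild a2

d = poincare_dist(fa1, fa2)
print(round(float(d), 4))           # ~1.513, vs tree distance 2
print(round(float(np.arccosh(np.cosh(1.0) ** 2)), 4))   # law-of-cosines value
```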
Problem
Verify directly that the Poincare ball model has constant curvature $-1$ at the origin. Specifically: compute the geodesic distance from $0$ to a point $x$ along the Euclidean segment, and show it equals $2\operatorname{artanh}\|x\|$. Then compute the circumference of the hyperbolic circle of radius $r$ centered at $0$ and show it equals $2\pi\sinh r$, the signature of curvature $-1$.
Problem
Prove the Euclidean lower bound fails for the path graph: the unweighted path on $N$ nodes embeds into Euclidean space with distortion $1$ (isometrically, into $\mathbb{R}$), but any embedding of the complete binary tree on $N$ nodes has distortion $\Omega(\sqrt{\log\log N})$.
Then explain why the complete binary tree is the right witness: what specific feature of it defeats Euclidean embedding that paths and stars do not have?
Problem
Implement Mobius addition in numpy and verify three properties on random points in $\mathbb{B}^n$:
- Left identity: $0 \oplus x = x$.
- Left inverse: $(-x) \oplus x = 0$.
- Non-associativity: there exist $x, y, z$ such that $(x \oplus y) \oplus z \neq x \oplus (y \oplus z)$.
Then describe what the gyration $\operatorname{gyr}[x, y]$ corrects for and where it appears in the Ganea hyperbolic-MLP construction (in Mobius scalar and matrix multiplication versus pointwise activation).
References
- Maximilian Nickel, Douwe Kiela. Poincare Embeddings for Learning Hierarchical Representations. NeurIPS 2017. The empirical opener: 5D Poincare beats 200D Euclidean on WordNet hypernymy. arXiv:1705.08039
- Frederic Sala, Christopher De Sa, Albert Gu, Christopher Re. Representation Tradeoffs for Hyperbolic Embeddings. ICML 2018. Sharp distortion bound (1+epsilon at O(log N / epsilon) precision) and an information-theoretic lower bound matching it. arXiv:1804.03329
- Octavian Ganea, Gary Becigneul, Thomas Hofmann. Hyperbolic Neural Networks. NeurIPS 2018. Mobius addition, exp/log maps, and hyperbolic linear, GRU, and MLR layers; the construction reused by every later hyperbolic NN paper. arXiv:1805.09112
- Ines Chami, Rex Ying, Christopher Re, Jure Leskovec. Hyperbolic Graph Convolutional Neural Networks. NeurIPS 2019. Hyperbolic GCNs with learnable curvature; gains on tree-like graphs, parity on others. The empirical evidence that curvature is dataset-dependent. arXiv:1910.12933
- James W. Anderson. Hyperbolic Geometry (2nd ed.). Springer Undergraduate Mathematics Series, 2005. Chapters 1 to 4: Poincare disk and upper-half-plane models, isometries, distance formulas, Mobius transformations as the conformal automorphism group.
- Abraham A. Ungar. Analytic Hyperbolic Geometry and Albert Einstein's Special Theory of Relativity. World Scientific, 2008. Chapters 3 and 6: gyrovector spaces, Mobius gyrogroup, gyration as the associativity defect.
- Martin R. Bridson, Andre Haefliger. Metric Spaces of Non-Positive Curvature. Springer Grundlehren 319, 1999. Chapter II.1 (CAT(0) spaces) and III.H (hyperbolic groups and Gromov hyperbolicity).
- Aaron B. Adcock, Blair D. Sullivan, Michael W. Mahoney. Tree-Like Structure in Large Social and Information Networks. ICDM 2013. Sampling estimator for Gromov hyperbolicity delta on graphs with millions of nodes, with case studies.
- Albert Gu, Frederic Sala, Beliz Gunel, Christopher Re. Learning Mixed-Curvature Representations in Product Spaces. ICLR 2019. Per-component curvature on product manifolds for graphs with mixed Gromov $\delta$.
- Maximillian Nickel, Douwe Kiela. Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry. ICML 2018. The Lorentz (hyperboloid) model is numerically better behaved than the Poincare ball for optimization near the boundary; preferred in modern hyperbolic NN implementations. arXiv:1806.03417
- Tao Yu, Christopher De Sa. Numerically Accurate Hyperbolic Embeddings Using Tiling-Based Models. NeurIPS 2019. Multicomponent floating-point representations that avoid the rounding pathology of the Poincare ball. arXiv:1911.08543
- Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser-Nam Lim, Christopher De Sa. Differentiating through the Frechet Mean. ICML 2020. Closed-form Riemannian centroids and gradients on hyperbolic manifolds, enabling layer-wise hyperbolic batch normalization. arXiv:2003.00335
- Yiding Yang, Liang Wu, Wei Xu et al. Hyperbolic Graph Convolutional Networks: A Survey (2024). Synthesis of post-2019 hyperbolic GNN work covering Lorentz GCN, hyperbolic attention, hyperbolic graph transformers, and benchmarks. arXiv:2406.16438
Related Topics
- Non-Euclidean and Hyperbolic Geometry — the geometric grounder this page builds on
- Distance Metrics Compared — where hyperbolic distance fits among other metric choices
- Metric Spaces, Convergence, Completeness
- Riemannian Optimization — the optimizer side of hyperbolic NNs
- Graph Neural Networks — the Euclidean baseline
- Representation Learning Theory
Last reviewed: April 26, 2026