Applied Math
Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Prerequisites
Why This Matters
Classical SLAM systems use laser range finders to measure distances. These sensors are expensive, heavy, and produce geometric point clouds with no notion of what objects are present. Cameras are cheap, lightweight, and capture rich appearance information. Visual SLAM extracts geometric structure from image sequences. Semantic SLAM goes further: it labels map elements with object categories, producing maps that a robot can reason about ("navigate to the chair" rather than "navigate to point (3.2, 1.7)").
Mental Model
Visual SLAM solves two problems simultaneously from camera images: where is the robot, and what does the world look like? Feature points tracked across frames give geometric constraints. When the robot revisits a location, matching features trigger loop closures that correct accumulated drift. Semantic SLAM adds a third question: what are the objects in the scene? Attaching labels like "door", "table", or "wall" to map regions enables higher-level planning and more robust data association.
Visual SLAM Foundations
Feature Extraction
Visual SLAM begins with detecting and describing keypoints in images. A keypoint detector finds distinctive image locations (corners, blobs). A descriptor encodes the local appearance around each keypoint as a vector.
Feature Descriptor
A feature descriptor $d_i \in \mathbb{R}^n$ (or $\{0,1\}^n$ for binary descriptors) is an $n$-dimensional vector encoding the local image patch around keypoint $i$. Two keypoints $i$ and $j$ in different images are matched when the descriptor distance $\|d_i - d_j\|$ is below a threshold. Standard descriptors include ORB (binary, 256 bits), SIFT (float, 128 dimensions), and SuperPoint (learned, 256 dimensions).
ORB (Oriented FAST and Rotated BRIEF) is fast and produces binary descriptors matched with Hamming distance. SuperPoint is a learned detector/descriptor trained on synthetic homographies, offering better performance under illumination changes at higher compute cost.
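As a concrete sketch of ORB-style matching, the toy example below packs 256-bit binary descriptors into 32 bytes and matches them by brute-force Hamming distance. The descriptors are synthetic random bit strings (not real ORB output), and the threshold of 64 bits is an illustrative choice, not a canonical value.

```python
import numpy as np

def hamming(d1, d2):
    """Hamming distance between two packed binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match_descriptors(desc_a, desc_b, max_dist=64):
    """Brute-force nearest-neighbour matching, ORB-style.

    desc_a, desc_b: (N, 32) uint8 arrays -- 32 bytes = 256-bit descriptors.
    Returns (i, j) index pairs whose Hamming distance is below max_dist.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = [hamming(da, db) for db in desc_b]
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches

rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)
# Descriptors in frame B: the same patches with ~2% of bits flipped (noise).
noise = (rng.random((5, 32 * 8)) < 0.02).astype(np.uint8)
desc_b = np.packbits(np.unpackbits(desc_a, axis=1) ^ noise, axis=1)
print(match_descriptors(desc_a, desc_b))
```

Because random 256-bit strings differ in roughly 128 bits while noisy copies differ in only a handful, each descriptor matches its own perturbed copy and nothing else.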
Visual Odometry
Visual odometry estimates the camera motion between consecutive frames using matched features.
Essential Matrix
For calibrated cameras, the essential matrix $E$ encodes the relative rotation $R$ and translation $t$ between two views. It satisfies $E = [t]_\times R$, where $[t]_\times$ is the skew-symmetric matrix of $t$. For corresponding points $x_1, x_2$ in normalized coordinates: $x_2^\top E x_1 = 0$.
Epipolar Constraint for Visual Odometry
Statement
Given point correspondences $(x_i, x_i')$ in normalized image coordinates from two calibrated views, the essential matrix $E$ satisfying $x_i'^\top E x_i = 0$ for all $i$ can be recovered up to scale. Decomposing $E$ via SVD yields the relative rotation $R$ and translation direction $t / \|t\|$.
Intuition
Each point correspondence constrains the camera motion by one equation. Five equations suffice to pin down the five degrees of freedom of $E$ (3 rotation, 2 translation direction). Translation magnitude is unrecoverable from images alone without additional information.
Proof Sketch
Stack the epipolar constraints into a linear system $A\,\mathrm{vec}(E) = 0$, where each row comes from one correspondence. With 8 or more points, the null space of $A$ is one-dimensional (generically), giving $E$ up to scale; the minimal five-point solver instead exploits the polynomial constraints on $E$ directly. The constraint that $E$ has two equal singular values and one zero singular value resolves remaining ambiguity. SVD of $E$ gives $R$ and $[t]_\times$.
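The linear (eight-point) version of this recovery can be sketched on noise-free synthetic data: project random 3D points into two views, stack the epipolar constraints, take the null vector via SVD, and project onto the essential manifold. This is a minimal illustration, not a production solver (no RANSAC, no five-point minimal case, no cheirality check to pick among the four $(R, t)$ candidates).

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

rng = np.random.default_rng(1)
# Synthetic scene: 3D points in front of both cameras.
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))
# Ground-truth relative motion: small rotation about y, translation mostly along x.
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.0])

# Normalized image coordinates: x1 ~ X in view 1, x2 ~ R X + t in view 2.
x1 = X / X[:, 2:]
X2 = X @ R.T + t
x2 = X2 / X2[:, 2:]

# Each correspondence gives one row of A vec(E) = 0 (row-major vec).
A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])
_, _, Vt = np.linalg.svd(A)
E = Vt[-1].reshape(3, 3)            # null vector of A, up to scale

# Project onto the essential manifold: singular values (1, 1, 0).
U, S, Vt = np.linalg.svd(E)
E = U @ np.diag([1, 1, 0]) @ Vt

# Compare against the ground-truth E = [t]_x R, modulo scale and sign.
E_true = skew(t) @ R
E_true /= np.linalg.norm(E_true)
E /= np.linalg.norm(E)
if np.sum(E * E_true) < 0:
    E = -E
print(np.allclose(E, E_true, atol=1e-6))
```

With exact correspondences the recovered matrix agrees with $[t]_\times R$ to numerical precision; the overall scale (and hence translation magnitude) is fixed here only by normalizing both matrices.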
Why It Matters
This is the geometric foundation of visual odometry. Every visual SLAM system uses this or a variant (the fundamental matrix for uncalibrated cameras) to estimate frame-to-frame motion. RANSAC is applied on top to handle outlier matches.
Failure Mode
Fails when the scene is planar (degenerate configuration requiring a homography instead), when there is pure rotation with no translation (epipolar geometry is undefined), or when feature matches are predominantly incorrect. Also degrades with rolling shutter cameras where the pinhole model is violated.
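The RANSAC loop mentioned above is model-agnostic; the sketch below runs it on a deliberately simple model (a 2-D line, with two-point minimal samples) rather than the essential matrix, to keep the consensus logic visible. The data, thresholds, and iteration count are illustrative choices.

```python
import numpy as np

def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Fit y = a x + b to points with gross outliers.

    Same hypothesize-and-verify loop visual odometry runs with the
    five/eight-point solver as the minimal model instead of a line.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)  # minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < tol                 # consensus set
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the winning consensus set with least squares.
    x, y = points[best_inliers].T
    a, b = np.polyfit(x, y, 1)
    return a, b, best_inliers

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.02, 40)       # inliers on y = 2x + 1
outliers = rng.uniform(0, 20, (10, 2))            # gross mismatches
pts = np.vstack([np.column_stack([x, y]), outliers])
a, b, inl = ransac_line(pts)
print(round(a, 1), round(b, 1))
```

A plain least-squares fit over all 50 points would be dragged off by the 10 outliers; the consensus step rejects them before refitting.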
Loop Closure Detection
As the robot moves, visual odometry accumulates drift. Loop closure detects when the robot returns to a previously visited location by matching current features against a database of past keyframes.
The standard approach uses bag-of-visual-words: feature descriptors are quantized into a visual vocabulary (trained offline via k-means on a large descriptor set), and each keyframe is represented as a histogram over visual words. A new frame queries the database using tf-idf scoring. When a match is found, a geometric verification step (computing the relative pose) confirms the loop closure, and the pose graph is optimized.
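The tf-idf scoring step can be sketched with a toy vocabulary. The histograms below are made up, and for brevity the idf statistics include the query frame (real systems compute idf over the database only); geometric verification is omitted.

```python
import numpy as np

def tfidf_bow(word_histograms):
    """Turn per-keyframe visual-word histograms into unit tf-idf vectors."""
    H = np.asarray(word_histograms, dtype=float)      # (num_frames, vocab_size)
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1)
    df = (H > 0).sum(axis=0)                          # frames containing each word
    idf = np.log(len(H) / np.maximum(df, 1))          # rare words weigh more
    V = tf * idf
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

# Toy vocabulary of 6 visual words, 4 past keyframes in the database.
db = [[5, 0, 1, 0, 0, 2],
      [0, 4, 0, 3, 0, 0],
      [1, 0, 6, 0, 2, 0],
      [0, 3, 0, 4, 1, 0]]
query_frame = [4, 0, 1, 0, 0, 3]                      # current frame's histogram
vecs = tfidf_bow(db + [query_frame])
scores = vecs[:4] @ vecs[4]                           # cosine similarity
print(int(np.argmax(scores)))                         # best loop-closure candidate
```

The query shares its rare words with keyframe 0, so that frame wins; in a full pipeline this candidate would then go through geometric verification before the pose graph is updated.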
Semantic SLAM
Semantic Map
A semantic map augments a geometric map with object-level labels. Each map element (point, surface, or region) carries a category label $c$ with an associated probability $p(c)$. The map encodes both where things are and what they are.
Why Semantics Helps
Semantic labels improve SLAM in three ways. First, data association becomes more robust: matching a detected "chair" in the current frame to a "chair" in the map is more reliable than matching anonymous point clouds. Second, semantic maps enable task-level planning: a robot told to "go to the kitchen" can use semantic labels to identify the goal. Third, semantic priors constrain geometry: floors are horizontal, walls are vertical, and objects have expected sizes.
Architectures
ORB-SLAM (Mur-Artal et al., 2015) is a feature-based visual SLAM system with three parallel threads: tracking (visual odometry), local mapping (bundle adjustment over recent keyframes), and loop closing (bag-of-words detection + pose graph optimization). ORB-SLAM2 extends this to stereo and RGB-D cameras. ORB-SLAM3 adds IMU integration and multi-map support.
RTAB-MAP (Labbe and Michaud, 2019) is a graph-based SLAM system designed for long-term operation. It manages memory by transferring old nodes to long-term storage and retrieving them when loop closures are detected, allowing it to run indefinitely without growing memory usage.
Semantic SLAM systems typically run a CNN-based object detector or segmentation network (Mask R-CNN, YOLO) in parallel with the geometric SLAM pipeline and fuse the outputs. The object detections provide semantic labels that are projected into the 3D map.
Beyond Feature-Based SLAM
Direct and Photometric Methods
Feature-based SLAM throws away most of the image, keeping only keypoint descriptors. Direct methods instead optimize photometric error across raw pixel intensities, tracking and mapping without explicit features. LSD-SLAM (Engel, Schöps, Cremers 2014) performs large-scale semi-dense direct monocular SLAM, producing a semi-dense depth map at high-gradient pixels. DSO (Direct Sparse Odometry, Engel, Koltun, Cremers 2016, arXiv:1607.02565) sparsifies this further by selecting a fixed set of active pixels and jointly optimizing camera poses, point depths, and photometric calibration parameters in a windowed bundle adjustment. Direct methods handle low-texture scenes where feature detectors fail but are sensitive to photometric violations (auto-exposure, rolling shutter, specularities).
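The photometric-error idea reduces, in one dimension, to Lucas-Kanade-style alignment: warp the target intensities by a shift and drive the residual to zero with Gauss-Newton. This toy on synthetic 1-D "images" is an analogue of direct alignment, not the multi-view, depth-coupled optimization LSD-SLAM or DSO actually performs.

```python
import numpy as np

def estimate_shift(I_ref, I_tgt, iters=20):
    """Estimate a 1-D translation between two intensity profiles by
    minimizing photometric error with Gauss-Newton."""
    x = np.arange(len(I_ref), dtype=float)
    dx = 0.0
    for _ in range(iters):
        warped = np.interp(x + dx, x, I_tgt)   # sample target at shifted coords
        grad = np.gradient(warped, x)          # photometric Jacobian dI/ddx
        r = warped - I_ref                     # photometric residual
        # Gauss-Newton step: dx <- dx - (J^T J)^{-1} J^T r
        dx -= np.sum(grad * r) / np.sum(grad * grad)
    return dx

x = np.arange(200, dtype=float)
I_ref = np.exp(-((x - 80.0) ** 2) / 200.0)     # smooth "image": a Gaussian bump
true_shift = 3.0
I_tgt = np.exp(-((x - 80.0 - true_shift) ** 2) / 200.0)
print(round(estimate_shift(I_ref, I_tgt), 2))
```

Note the dependence on image gradients: on a constant-intensity region `grad` vanishes and the update is undefined, which is the 1-D version of direct methods' sensitivity to photometric violations and texture-free surfaces.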
Neural SLAM
DROID-SLAM (Teed and Deng 2021, arXiv:2108.10869) replaces the hand-engineered front end with a recurrent neural network trained end-to-end. A ConvGRU iteratively refines optical flow and revises camera poses and pixel-wise depth through a differentiable dense bundle adjustment layer. It runs on monocular, stereo, and RGB-D input with the same weights and substantially outperforms ORB-SLAM3 on TartanAir and EuRoC.
3D Gaussian Splatting SLAM
The current frontier uses 3D Gaussian splatting as the underlying map representation. MonoGS (Matsuki, Murai, Kelly, Davison 2024, arXiv:2312.06741) performs incremental Gaussian splatting SLAM from live monocular, stereo, or RGB-D streams, optimizing camera poses and Gaussian parameters jointly by rasterizing differentiable renderings and minimizing photometric loss. The resulting map supports photorealistic novel-view synthesis alongside geometric tracking.
Visual-Inertial Odometry
Fusing a camera with an IMU resolves monocular scale ambiguity and stabilizes tracking through fast motion and brief visual degradation. VINS-Mono (Qin, Li, Shen 2018, arXiv:1708.03852) is a tightly coupled optimization-based monocular visual-inertial estimator with online extrinsic calibration, loop closure, and four-degree-of-freedom pose graph optimization. OpenVINS (Geneva et al. 2020) is a filter-based alternative built on MSCKF, widely used as a production VIO backbone on drones and AR devices. On phones and headsets, VIO is typically the default, not pure visual SLAM.
Factor Graphs and Backends
The SLAM back end is almost always a factor graph. Variables (poses, landmarks, calibration) are nodes; measurements (odometry, loop closures, IMU pre-integration, GPS) are factors encoding constraints. MAP inference becomes sparse nonlinear least squares. Dellaert and Kaess (2017), Factor Graphs for Robot Perception, is the canonical reference. GTSAM with iSAM2 (Kaess et al. 2012) exploits the Bayes tree for incremental updates, relinearizing only the affected subtree on each new measurement. Ceres Solver and g2o are competing backends used by ORB-SLAM, DSO, and many others.
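A minimal factor-graph back end fits in a few lines when the poses are one-dimensional, because every factor is then linear and MAP inference is ordinary least squares. The graph below (odometry factors with a consistent bias, one loop closure, one prior fixing the gauge) is invented for illustration; real backends like GTSAM handle SE(3) variables, robust kernels, and incremental relinearization.

```python
import numpy as np

# Toy 1-D pose graph: 4 poses, odometry factors between consecutive poses,
# plus one loop-closure factor between p0 and p3.
factors = [        # (i, j, measured p_j - p_i)
    (0, 1, 1.1),   # odometry, biased 10% long
    (1, 2, 1.1),
    (2, 3, 1.1),
    (0, 3, 3.0),   # loop closure: drift-free constraint
]
n = 4
J = np.zeros((len(factors) + 1, n))
z = np.zeros(len(factors) + 1)
for row, (i, j, meas) in enumerate(factors):
    J[row, i], J[row, j], z[row] = -1.0, 1.0, meas
J[-1, 0], z[-1] = 1.0, 0.0          # prior anchoring p0 = 0 (fixes gauge freedom)

# MAP estimate = sparse linear least squares over the whole graph
# (all factors unit weight here; real systems weight by measurement covariance).
poses, *_ = np.linalg.lstsq(J, z, rcond=None)
print(np.round(poses, 3))
```

The loop closure pulls the final pose from the drifted odometry-only estimate of 3.3 back toward 3.0, and the correction is distributed smoothly across the intermediate poses — exactly the behavior pose graph optimization produces at scale.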
Common Confusions
Visual odometry is not visual SLAM
Visual odometry estimates the local trajectory from frame to frame. Visual SLAM also builds a persistent map and performs loop closures to correct global drift. A system doing only visual odometry will accumulate unbounded drift over time.
Semantic SLAM does not require perfect segmentation
The semantic labels do not need to be correct in every frame. Bayesian fusion over multiple observations from different viewpoints allows the map to converge to correct labels even when individual detections are noisy. A chair misclassified as a table in one frame will be corrected by ten frames that classify it correctly.
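That fusion argument can be made concrete: under a uniform prior and conditional independence, the posterior over the true class is proportional to the product of per-frame likelihoods, accumulated in log space for stability. The detection distributions below are made-up numbers illustrating the chair/table scenario.

```python
import numpy as np

def fuse_labels(detections):
    """Fuse per-frame class likelihoods into a posterior over the true class.

    detections: (N, K) array, row i = the segmentation network's distribution
    over K classes for observation i. Assumes a uniform prior and conditionally
    independent observations, so the posterior is proportional to the product
    of likelihoods; the product is accumulated as a sum of logs.
    """
    log_post = np.sum(np.log(np.asarray(detections)), axis=0)
    log_post -= log_post.max()        # shift before exp to avoid underflow
    post = np.exp(log_post)
    return post / post.sum()

# One frame misclassifies the chair (class 0) as a table (class 1)...
bad = [[0.3, 0.6, 0.1]]
# ...but ten frames classify it correctly.
good = [[0.7, 0.2, 0.1]] * 10
posterior = fuse_labels(np.array(bad + good))
print(int(np.argmax(posterior)))      # fused label recovers class 0
```

The single wrong frame contributes one mildly misleading factor, while the ten correct frames each multiply in strong evidence for the chair, so the fused posterior concentrates on the correct class.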
Summary
- Visual SLAM replaces lidar with cameras, using feature matching and epipolar geometry for pose estimation
- Loop closure detection corrects accumulated drift by recognizing revisited locations
- Semantic SLAM attaches category labels to map elements, improving data association and enabling task-level planning
- ORB-SLAM is the standard open-source feature-based visual SLAM system
- Semantic labels are fused probabilistically over time, so individual detection errors are tolerable
Exercises
Problem
A monocular visual odometry system estimates translation only up to scale. Explain why, and describe two ways to recover absolute scale.
Problem
In a semantic SLAM system, you observe an object from $N$ viewpoints. Each viewpoint $i$ produces a categorical distribution $z_i$ over $K$ classes from a segmentation network. Assuming observations are conditionally independent given the true class $c$, write the Bayesian update for the posterior $p(c \mid z_1, \dots, z_N)$ and show how it simplifies to a product of likelihoods.
References
Canonical:
- Mur-Artal, Montiel, Tardos, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, IEEE T-RO (2015)
- Hartley and Zisserman, Multiple View Geometry in Computer Vision (2004), Chapters 9-11
- Dellaert and Kaess, Factor Graphs for Robot Perception, Foundations and Trends in Robotics (2017)
- Engel, Schöps, Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV (2014)
- Engel, Koltun, Cremers, Direct Sparse Odometry, IEEE TPAMI (2018), arXiv:1607.02565
Current:
- Campos et al., ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, IEEE T-RO (2021)
- Labbe and Michaud, RTAB-MAP as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics (2019)
- Qin, Li, Shen, VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator, IEEE T-RO (2018), arXiv:1708.03852
- Geneva et al., OpenVINS: A Research Platform for Visual-Inertial Estimation, ICRA (2020)
- Teed and Deng, DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras, NeurIPS (2021), arXiv:2108.10869
Frontier:
- Matsuki, Murai, Kelly, Davison, Gaussian Splatting SLAM (MonoGS), CVPR (2024), arXiv:2312.06741
Further directions
- Dense visual SLAM (ElasticFusion, BundleFusion)
- NeRF-based SLAM (iMAP, NICE-SLAM)
- Event cameras for SLAM
- Scale from monocular-depth networks
- Keyframe selection strategies (practical detail)
- Bundle adjustment mathematical treatment
- Ceres Solver and g2o backend comparison
Next Topics
- Active SLAM and POMDPs: choosing actions to maximize map quality
Last reviewed: April 18, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
1. GraphSLAM and Factor Graphs (layer 3 · tier 2)
Derived topics
1. Active SLAM and POMDPs (layer 4 · tier 3)