Applied Math
Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Prerequisites
Why This Matters
Classical SLAM systems use laser range finders to measure distances. These sensors are expensive, heavy, and produce geometric point clouds with no notion of what objects are present. Cameras are cheap, lightweight, and capture rich appearance information. Visual SLAM extracts geometric structure from image sequences. Semantic SLAM goes further: it labels map elements with object categories, producing maps that a robot can reason about ("navigate to the chair" rather than "navigate to point (3.2, 1.7)").
Mental Model
Visual SLAM solves two problems simultaneously from camera images: where is the robot, and what does the world look like? Feature points tracked across frames give geometric constraints. When the robot revisits a location, matching features trigger loop closures that correct accumulated drift. Semantic SLAM adds a third question: what are the objects in the scene? Attaching labels like "door", "table", or "wall" to map regions enables higher-level planning and more robust data association.
Visual SLAM Foundations
Feature Extraction
Visual SLAM begins with detecting and describing keypoints in images. A keypoint detector finds distinctive image locations (corners, blobs). A descriptor encodes the local appearance around each keypoint as a vector.
Feature Descriptor
A feature descriptor $d_i \in \mathbb{R}^n$ (or $\{0,1\}^n$ for binary descriptors) is an $n$-dimensional vector encoding the local image patch around keypoint $i$. Two keypoints $i$ and $j$ in different images are matched when the descriptor distance $\|d_i - d_j\|$ is below a threshold. Standard descriptors include ORB (binary, 256 bits), SIFT (float, 128 dimensions), and SuperPoint (learned, 256 dimensions).
ORB (Oriented FAST and Rotated BRIEF) is fast and produces binary descriptors matched with Hamming distance. SuperPoint is a learned detector/descriptor trained on synthetic homographies, offering better performance under illumination changes at higher compute cost.
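As a concrete sketch of ORB-style matching, the toy example below packs 256-bit binary descriptors into 32 bytes and matches them by brute-force Hamming distance. The descriptors are synthetic random bit strings (not real ORB output), and the threshold of 64 bits is an illustrative choice, not a canonical value.

```python
import numpy as np

def hamming(d1, d2):
    """Hamming distance between two packed binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def match_descriptors(desc_a, desc_b, max_dist=64):
    """Brute-force nearest-neighbour matching, ORB-style.

    desc_a, desc_b: (N, 32) uint8 arrays -- 32 bytes = 256-bit descriptors.
    Returns (i, j) index pairs whose Hamming distance is below max_dist.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = [hamming(da, db) for db in desc_b]
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches

rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)
# Descriptors in frame B: the same patches with ~2% of bits flipped (noise).
noise = (rng.random((5, 32 * 8)) < 0.02).astype(np.uint8)
desc_b = np.packbits(np.unpackbits(desc_a, axis=1) ^ noise, axis=1)
print(match_descriptors(desc_a, desc_b))
```

Because random 256-bit strings differ in roughly 128 bits while noisy copies differ in only a handful, each descriptor matches its own perturbed copy and nothing else.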
Visual Odometry
Visual odometry estimates the camera motion between consecutive frames using matched features.
Essential Matrix
For calibrated cameras, the essential matrix $E$ encodes the relative rotation $R$ and translation $t$ between two views. It satisfies $E = [t]_\times R$, where $[t]_\times$ is the skew-symmetric matrix of $t$. For corresponding points $x_1, x_2$ in normalized coordinates: $x_2^\top E x_1 = 0$.
Epipolar Constraint for Visual Odometry
Statement
Given point correspondences $(x_i, x_i')$ in normalized image coordinates from two calibrated views, the essential matrix $E$ satisfying $x_i'^\top E x_i = 0$ for all $i$ can be recovered up to scale. Decomposing $E$ via SVD yields the relative rotation $R$ and translation direction $t / \|t\|$.
Intuition
Each point correspondence constrains the camera motion by one equation. Five equations suffice to pin down the five degrees of freedom of $E$ (3 rotation, 2 translation direction). Translation magnitude is unrecoverable from images alone without additional information.
Proof Sketch
Stack the epipolar constraints into a linear system $A\,\mathrm{vec}(E) = 0$, where each row comes from one correspondence. With 8 or more points, the null space of $A$ is one-dimensional (generically), giving $E$ up to scale; the minimal five-point solver instead exploits the polynomial constraints on $E$ directly. The constraint that $E$ has two equal singular values and one zero singular value resolves remaining ambiguity. SVD of $E$ gives $R$ and $[t]_\times$.
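The linear (eight-point) version of this recovery can be sketched on noise-free synthetic data: project random 3D points into two views, stack the epipolar constraints, take the null vector via SVD, and project onto the essential manifold. This is a minimal illustration, not a production solver (no RANSAC, no five-point minimal case, no cheirality check to pick among the four $(R, t)$ candidates).

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

rng = np.random.default_rng(1)
# Synthetic scene: 3D points in front of both cameras.
X = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))
# Ground-truth relative motion: small rotation about y, translation mostly along x.
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.0])

# Normalized image coordinates: x1 ~ X in view 1, x2 ~ R X + t in view 2.
x1 = X / X[:, 2:]
X2 = X @ R.T + t
x2 = X2 / X2[:, 2:]

# Each correspondence gives one row of A vec(E) = 0 (row-major vec).
A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])
_, _, Vt = np.linalg.svd(A)
E = Vt[-1].reshape(3, 3)            # null vector of A, up to scale

# Project onto the essential manifold: singular values (1, 1, 0).
U, S, Vt = np.linalg.svd(E)
E = U @ np.diag([1, 1, 0]) @ Vt

# Compare against the ground-truth E = [t]_x R, modulo scale and sign.
E_true = skew(t) @ R
E_true /= np.linalg.norm(E_true)
E /= np.linalg.norm(E)
if np.sum(E * E_true) < 0:
    E = -E
print(np.allclose(E, E_true, atol=1e-6))
```

With exact correspondences the recovered matrix agrees with $[t]_\times R$ to numerical precision; the overall scale (and hence translation magnitude) is fixed here only by normalizing both matrices.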
Why It Matters
This is the geometric foundation of visual odometry. Every visual SLAM system uses this or a variant (the fundamental matrix for uncalibrated cameras) to estimate frame-to-frame motion. RANSAC is applied on top to handle outlier matches.
Failure Mode
Fails when the scene is planar (degenerate configuration requiring a homography instead), when there is pure rotation with no translation (epipolar geometry is undefined), or when feature matches are predominantly incorrect. Also degrades with rolling shutter cameras where the pinhole model is violated.
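The RANSAC loop mentioned above is model-agnostic; the sketch below runs it on a deliberately simple model (a 2-D line, with two-point minimal samples) rather than the essential matrix, to keep the consensus logic visible. The data, thresholds, and iteration count are illustrative choices.

```python
import numpy as np

def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Fit y = a x + b to points with gross outliers.

    Same hypothesize-and-verify loop visual odometry runs with the
    five/eight-point solver as the minimal model instead of a line.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)  # minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < tol                 # consensus set
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the winning consensus set with least squares.
    x, y = points[best_inliers].T
    a, b = np.polyfit(x, y, 1)
    return a, b, best_inliers

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.02, 40)       # inliers on y = 2x + 1
outliers = rng.uniform(0, 20, (10, 2))            # gross mismatches
pts = np.vstack([np.column_stack([x, y]), outliers])
a, b, inl = ransac_line(pts)
print(round(a, 1), round(b, 1))
```

A plain least-squares fit over all 50 points would be dragged off by the 10 outliers; the consensus step rejects them before refitting.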
Loop Closure Detection
As the robot moves, visual odometry accumulates drift. Loop closure detects when the robot returns to a previously visited location by matching current features against a database of past keyframes.
The standard approach uses bag-of-visual-words: feature descriptors are quantized into a visual vocabulary (trained offline via k-means on a large descriptor set), and each keyframe is represented as a histogram over visual words. A new frame queries the database using tf-idf scoring. When a match is found, a geometric verification step (computing the relative pose) confirms the loop closure, and the pose graph is optimized.
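The tf-idf scoring step can be sketched with a toy vocabulary. The histograms below are made up, and for brevity the idf statistics include the query frame (real systems compute idf over the database only); geometric verification is omitted.

```python
import numpy as np

def tfidf_bow(word_histograms):
    """Turn per-keyframe visual-word histograms into unit tf-idf vectors."""
    H = np.asarray(word_histograms, dtype=float)      # (num_frames, vocab_size)
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1)
    df = (H > 0).sum(axis=0)                          # frames containing each word
    idf = np.log(len(H) / np.maximum(df, 1))          # rare words weigh more
    V = tf * idf
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

# Toy vocabulary of 6 visual words, 4 past keyframes in the database.
db = [[5, 0, 1, 0, 0, 2],
      [0, 4, 0, 3, 0, 0],
      [1, 0, 6, 0, 2, 0],
      [0, 3, 0, 4, 1, 0]]
query_frame = [4, 0, 1, 0, 0, 3]                      # current frame's histogram
vecs = tfidf_bow(db + [query_frame])
scores = vecs[:4] @ vecs[4]                           # cosine similarity
print(int(np.argmax(scores)))                         # best loop-closure candidate
```

The query shares its rare words with keyframe 0, so that frame wins; in a full pipeline this candidate would then go through geometric verification before the pose graph is updated.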
Semantic SLAM
Semantic Map
A semantic map augments a geometric map with object-level labels. Each map element (point, surface, or region) carries a category label $c$ with an associated probability $p(c)$. The map encodes both where things are and what they are.
Why Semantics Helps
Semantic labels improve SLAM in three ways. First, data association becomes more robust: matching a detected "chair" in the current frame to a "chair" in the map is more reliable than matching anonymous point clouds. Second, semantic maps enable task-level planning: a robot told to "go to the kitchen" can use semantic labels to identify the goal. Third, semantic priors constrain geometry: floors are horizontal, walls are vertical, and objects have expected sizes.
Architectures
ORB-SLAM (Mur-Artal et al., 2015) is a feature-based visual SLAM system with three parallel threads: tracking (visual odometry), local mapping (bundle adjustment over recent keyframes), and loop closing (bag-of-words detection + pose graph optimization). ORB-SLAM2 extends this to stereo and RGB-D cameras. ORB-SLAM3 adds IMU integration and multi-map support.
RTAB-MAP (Labbe and Michaud, 2019) is a graph-based SLAM system designed for long-term operation. It manages memory by transferring old nodes to long-term storage and retrieving them when loop closures are detected, allowing it to run indefinitely without growing memory usage.
Semantic SLAM systems typically run a CNN-based object detector or segmentation network (Mask R-CNN, YOLO) in parallel with the geometric SLAM pipeline and fuse the outputs. The object detections provide semantic labels that are projected into the 3D map.
Beyond Feature-Based SLAM
Direct and Photometric Methods
Feature-based SLAM throws away most of the image, keeping only keypoint descriptors. Direct methods instead optimize photometric error across raw pixel intensities, tracking and mapping without explicit features. LSD-SLAM (Engel, Schöps, Cremers 2014) performs large-scale semi-dense direct monocular SLAM, producing a semi-dense depth map at high-gradient pixels. DSO (Direct Sparse Odometry, Engel, Koltun, Cremers 2016, arXiv:1607.02565) sparsifies this further by selecting a fixed set of active pixels and jointly optimizing camera poses, point depths, and photometric calibration parameters in a windowed bundle adjustment. Direct methods handle low-texture scenes where feature detectors fail but are sensitive to photometric violations (auto-exposure, rolling shutter, specularities).
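The photometric-error idea reduces, in one dimension, to Lucas-Kanade-style alignment: warp the target intensities by a shift and drive the residual to zero with Gauss-Newton. This toy on synthetic 1-D "images" is an analogue of direct alignment, not the multi-view, depth-coupled optimization LSD-SLAM or DSO actually performs.

```python
import numpy as np

def estimate_shift(I_ref, I_tgt, iters=20):
    """Estimate a 1-D translation between two intensity profiles by
    minimizing photometric error with Gauss-Newton."""
    x = np.arange(len(I_ref), dtype=float)
    dx = 0.0
    for _ in range(iters):
        warped = np.interp(x + dx, x, I_tgt)   # sample target at shifted coords
        grad = np.gradient(warped, x)          # photometric Jacobian dI/ddx
        r = warped - I_ref                     # photometric residual
        # Gauss-Newton step: dx <- dx - (J^T J)^{-1} J^T r
        dx -= np.sum(grad * r) / np.sum(grad * grad)
    return dx

x = np.arange(200, dtype=float)
I_ref = np.exp(-((x - 80.0) ** 2) / 200.0)     # smooth "image": a Gaussian bump
true_shift = 3.0
I_tgt = np.exp(-((x - 80.0 - true_shift) ** 2) / 200.0)
print(round(estimate_shift(I_ref, I_tgt), 2))
```

Note the dependence on image gradients: on a constant-intensity region `grad` vanishes and the update is undefined, which is the 1-D version of direct methods' sensitivity to photometric violations and texture-free surfaces.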
Neural SLAM
DROID-SLAM (Teed and Deng 2021, arXiv:2108.10869) replaces the hand-engineered front end with a recurrent neural network trained end-to-end. A ConvGRU iteratively refines optical flow and revises camera poses and pixel-wise depth through a differentiable dense bundle adjustment layer. It runs on monocular, stereo, and RGB-D input with the same weights and substantially outperforms ORB-SLAM3 on TartanAir and EuRoC.
3D Gaussian Splatting SLAM
The current frontier uses 3D Gaussian splatting as the underlying map representation. MonoGS (Matsuki, Murai, Kelly, Davison 2024, arXiv:2312.06741) performs incremental Gaussian splatting SLAM from live monocular, stereo, or RGB-D streams, optimizing camera poses and Gaussian parameters jointly by rasterizing differentiable renderings and minimizing photometric loss. The resulting map supports photorealistic novel-view synthesis alongside geometric tracking.
Visual-Inertial Odometry
Fusing a camera with an IMU resolves monocular scale ambiguity and stabilizes tracking through fast motion and brief visual degradation. VINS-Mono (Qin, Li, Shen 2018, arXiv:1708.03852) is a tightly coupled optimization-based monocular visual-inertial estimator with online extrinsic calibration, loop closure, and four-degree-of-freedom pose graph optimization. OpenVINS (Geneva et al. 2020) is a filter-based alternative built on MSCKF, widely used as a production VIO backbone on drones and AR devices. On phones and headsets, VIO is typically the default, not pure visual SLAM.
Factor Graphs and Backends
The SLAM back end is almost always a factor graph. Variables (poses, landmarks, calibration) are nodes; measurements (odometry, loop closures, IMU pre-integration, GPS) are factors encoding constraints. MAP inference becomes sparse nonlinear least squares. Dellaert and Kaess (2017), Factor Graphs for Robot Perception, is the canonical reference. GTSAM with iSAM2 (Kaess et al. 2012) exploits the Bayes tree for incremental updates, relinearizing only the affected subtree on each new measurement. Ceres Solver and g2o are competing backends used by ORB-SLAM, DSO, and many others.
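A minimal factor-graph back end fits in a few lines when the poses are one-dimensional, because every factor is then linear and MAP inference is ordinary least squares. The graph below (odometry factors with a consistent bias, one loop closure, one prior fixing the gauge) is invented for illustration; real backends like GTSAM handle SE(3) variables, robust kernels, and incremental relinearization.

```python
import numpy as np

# Toy 1-D pose graph: 4 poses, odometry factors between consecutive poses,
# plus one loop-closure factor between p0 and p3.
factors = [        # (i, j, measured p_j - p_i)
    (0, 1, 1.1),   # odometry, biased 10% long
    (1, 2, 1.1),
    (2, 3, 1.1),
    (0, 3, 3.0),   # loop closure: drift-free constraint
]
n = 4
J = np.zeros((len(factors) + 1, n))
z = np.zeros(len(factors) + 1)
for row, (i, j, meas) in enumerate(factors):
    J[row, i], J[row, j], z[row] = -1.0, 1.0, meas
J[-1, 0], z[-1] = 1.0, 0.0          # prior anchoring p0 = 0 (fixes gauge freedom)

# MAP estimate = sparse linear least squares over the whole graph
# (all factors unit weight here; real systems weight by measurement covariance).
poses, *_ = np.linalg.lstsq(J, z, rcond=None)
print(np.round(poses, 3))
```

The loop closure pulls the final pose from the drifted odometry-only estimate of 3.3 back toward 3.0, and the correction is distributed smoothly across the intermediate poses — exactly the behavior pose graph optimization produces at scale.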
Common Confusions
Visual odometry is not visual SLAM
Visual odometry estimates the local trajectory from frame to frame. Visual SLAM also builds a persistent map and performs loop closures to correct global drift. A system doing only visual odometry will accumulate unbounded drift over time.
Semantic SLAM does not require perfect segmentation
The semantic labels do not need to be correct in every frame. Bayesian fusion over multiple observations from different viewpoints allows the map to converge to correct labels even when individual detections are noisy. A chair misclassified as a table in one frame will be corrected by ten frames that classify it correctly.
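That fusion argument can be made concrete: under a uniform prior and conditional independence, the posterior over the true class is proportional to the product of per-frame likelihoods, accumulated in log space for stability. The detection distributions below are made-up numbers illustrating the chair/table scenario.

```python
import numpy as np

def fuse_labels(detections):
    """Fuse per-frame class likelihoods into a posterior over the true class.

    detections: (N, K) array, row i = the segmentation network's distribution
    over K classes for observation i. Assumes a uniform prior and conditionally
    independent observations, so the posterior is proportional to the product
    of likelihoods; the product is accumulated as a sum of logs.
    """
    log_post = np.sum(np.log(np.asarray(detections)), axis=0)
    log_post -= log_post.max()        # shift before exp to avoid underflow
    post = np.exp(log_post)
    return post / post.sum()

# One frame misclassifies the chair (class 0) as a table (class 1)...
bad = [[0.3, 0.6, 0.1]]
# ...but ten frames classify it correctly.
good = [[0.7, 0.2, 0.1]] * 10
posterior = fuse_labels(np.array(bad + good))
print(int(np.argmax(posterior)))      # fused label recovers class 0
```

The single wrong frame contributes one mildly misleading factor, while the ten correct frames each multiply in strong evidence for the chair, so the fused posterior concentrates on the correct class.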
Summary
- Visual SLAM replaces lidar with cameras, using feature matching and epipolar geometry for pose estimation
- Loop closure detection corrects accumulated drift by recognizing revisited locations
- Semantic SLAM attaches category labels to map elements, improving data association and enabling task-level planning
- ORB-SLAM is the standard open-source feature-based visual SLAM system
- Semantic labels are fused probabilistically over time, so individual detection errors are tolerable
Exercises
Problem
A monocular visual odometry system estimates translation only up to scale. Explain why, and describe two ways to recover absolute scale.
Problem
In a semantic SLAM system, you observe an object from $N$ viewpoints. Each viewpoint $i$ produces a categorical distribution $z_i$ over $K$ classes from a segmentation network. Assuming observations are conditionally independent given the true class $c$, write the Bayesian update for the posterior $p(c \mid z_1, \dots, z_N)$ and show how it simplifies to a product of likelihoods.
References
Canonical:
- Mur-Artal, Montiel, Tardos, ORB-SLAM: A Versatile and Accurate Monocular SLAM System, IEEE T-RO (2015)
- Hartley and Zisserman, Multiple View Geometry in Computer Vision (2004), Chapters 9-11
- Dellaert and Kaess, Factor Graphs for Robot Perception, Foundations and Trends in Robotics (2017)
- Engel, Schöps, Cremers, LSD-SLAM: Large-Scale Direct Monocular SLAM, ECCV (2014)
- Engel, Koltun, Cremers, Direct Sparse Odometry, IEEE TPAMI (2018), arXiv:1607.02565
Current:
- Campos et al., ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM, IEEE T-RO (2021)
- Labbe and Michaud, RTAB-MAP as an Open-Source Lidar and Visual SLAM Library for Large-Scale and Long-Term Online Operation, Journal of Field Robotics (2019)
- Qin, Li, Shen, VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator, IEEE T-RO (2018), arXiv:1708.03852
- Geneva et al., OpenVINS: A Research Platform for Visual-Inertial Estimation, ICRA (2020)
- Teed and Deng, DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras, NeurIPS (2021), arXiv:2108.10869
Frontier:
- Matsuki, Murai, Kelly, Davison, Gaussian Splatting SLAM (MonoGS), CVPR (2024), arXiv:2312.06741
Further directions
- Dense visual SLAM (ElasticFusion, BundleFusion)
- NeRF-based SLAM (iMAP, NICE-SLAM)
- Event cameras for SLAM
- Scale from monocular-depth networks
- Keyframe selection strategies (practical detail)
- Bundle adjustment mathematical treatment
- Ceres Solver and g2o backend comparison
Next Topics
- Active SLAM and POMDPs: choosing actions to maximize map quality
Last reviewed: April 18, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
1. GraphSLAM and Factor Graphs (layer 3 · tier 2)
Derived topics
1. Active SLAM and POMDPs (layer 4 · tier 3)