In recent years, SLAM (Simultaneous Localization and Mapping) has advanced rapidly. LiDAR-based SLAM has already matured and been deployed across many real-world scenarios. Visual SLAM (vSLAM) is not yet as widely adopted in production as LiDAR SLAM, but it remains one of the most active research directions. This article provides a focused, technical overview of visual SLAM.
Visual SLAM primarily relies on cameras to perceive the environment. Cameras are relatively low-cost, easy to integrate into mass-market hardware, and provide information-rich imagery—making vSLAM an attractive approach.
In general, visual SLAM can be categorized into three major types according to the camera used: monocular SLAM (a single camera), stereo SLAM (a pair of cameras), and RGB-D SLAM (a depth camera).
There are also special camera types such as fisheye and panoramic cameras, though they are less common in both research and products. In addition, visual–inertial SLAM (vSLAM fused with an IMU, Inertial Measurement Unit) is another major research hotspot.
From an implementation difficulty standpoint, the typical ranking is:
Monocular > Stereo > RGB-D
Monocular SLAM uses only a single camera. Its primary advantage is simple sensor hardware and low cost. However, it has a fundamental limitation: absolute depth cannot be directly observed.
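One way to see this limitation: under the pinhole projection with intrinsics K, rotation R, and translation t, scaling the entire scene and the translation by the same factor leaves every pixel unchanged, so a single camera can recover trajectory and map only up to an unknown global scale:

```latex
s\,\tilde{u} = K\,(R\,P + t)
\quad\Longrightarrow\quad
(\alpha s)\,\tilde{u} = K\bigl(R\,(\alpha P) + \alpha t\bigr)
\quad \text{for any } \alpha > 0 .
```

Here \(\tilde{u}\) is the homogeneous pixel coordinate of world point P and s its depth in the camera frame; since the images alone cannot determine \(\alpha\), monocular systems can fix scale only through extra cues such as an IMU or objects of known size.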
Stereo vision can estimate depth both during motion and at standstill, removing many monocular constraints. However, configuration and calibration are more involved, the measurable depth range and precision are limited by the baseline and image resolution, and computing disparity is expensive, often requiring GPU or FPGA acceleration to run in real time.
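For a rectified stereo pair with focal length f (in pixels) and baseline b, depth follows directly from the disparity d of a matched point:

```latex
Z = \frac{f\,b}{d}
```

Because disparity is resolved only to a fraction of a pixel, depth error grows roughly quadratically with distance, which is why the usable range and precision are tied to the baseline and image resolution.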
RGB-D cameras became popular around 2010. They can directly measure per-pixel depth using structured light or time-of-flight (ToF), producing richer information than conventional cameras—without requiring monocular/stereo depth reconstruction.
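As a minimal sketch of what this buys you, the following back-projects a depth image into a 3D point cloud using the pinhole model; the intrinsics here are placeholder values, and a real system would read them from the camera's calibration.

```python
import numpy as np

# Placeholder pinhole intrinsics; real values come from the RGB-D camera's calibration.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5

def depth_to_points(depth):
    """Back-project a depth image (meters, shape HxW) into an Nx3 point cloud
    in the camera frame, using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]      # drop pixels with no valid depth reading

# Example: a synthetic 480x640 depth image at a constant 2 m.
cloud = depth_to_points(np.full((480, 640), 2.0))
```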
A typical visual SLAM pipeline includes the following modules: sensor data acquisition, visual odometry (the front-end), back-end optimization, loop closure detection, and mapping.
This stage handles camera frame acquisition and preprocessing. In robotic systems, it may also include reading and synchronizing wheel encoders and inertial sensors (IMU).
Visual odometry (VO) estimates the camera motion between adjacent frames, together with the local structure of the observed scene (a local map).
VO is often called the front-end. It behaves like an “odometer” because it estimates motion only between consecutive timestamps, without directly using long-term historical constraints. Chaining these incremental motions yields the robot trajectory (localization). With estimated camera poses, the system can also reconstruct 3D structure and build a map.
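As a sketch, assuming each VO increment is expressed as a 4x4 homogeneous transform, the trajectory is just the running product of those increments; because every increment carries some error, the product also accumulates drift, which is what the back-end and loop closure later correct.

```python
import numpy as np

def compose_trajectory(relative_motions):
    """Chain per-frame relative motions T_{k-1,k} (4x4 homogeneous matrices)
    into world-frame poses T_{w,k}, starting from the identity."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T_rel in relative_motions:
        pose = pose @ T_rel      # T_{w,k} = T_{w,k-1} @ T_{k-1,k}
        trajectory.append(pose.copy())
    return trajectory
```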
To compute camera motion from images, the system must model the geometric relationship between 3D points and their 2D projections in the image plane (camera model + projection geometry).
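In the standard pinhole model, a world point P_w maps to pixel coordinates (u, v) through the camera pose (R, t) and the intrinsic matrix K, where s is the point's depth in the camera frame:

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \bigl( R\,P_w + t \bigr),
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} .
```

Estimating (R, t) from matched points is then the core front-end problem: epipolar geometry for 2D–2D matches, PnP for 3D–2D matches, and ICP for 3D–3D matches.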
The back-end addresses noise and uncertainty. Since all sensors are noisy, the system must not only estimate motion but also manage error accumulation and optimize the global consistency of poses and map.
In vSLAM, the front-end is closely tied to computer vision tasks such as feature extraction and matching, while the back-end mainly involves filtering and nonlinear optimization.
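In the optimization view, the back-end is commonly written as a nonlinear least-squares problem over all poses x_i and landmarks p_j, weighted by the measurement covariances and solved iteratively with Gauss–Newton or Levenberg–Marquardt; the filtering view instead propagates the state with an (extended) Kalman filter.

```latex
\min_{\{x_i\},\,\{p_j\}} \;\sum_{i,j} \bigl\| z_{ij} - h(x_i, p_j) \bigr\|_{\Sigma_{ij}}^{2}
```

Here z_{ij} is the observation of landmark p_j from pose x_i, h is the measurement (projection) model, and Σ_{ij} the measurement noise covariance.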
Loop closure (closed-loop detection) is the ability to recognize previously visited places. When successful, it can significantly reduce accumulated drift.
Loop closure is essentially a similarity-detection algorithm over observations. Many visual SLAM systems use a mature Bag-of-Words (BoW) model: feature descriptors are clustered into a vocabulary of "visual words," each image is summarized as a histogram of word occurrences, and two images whose histograms are sufficiently similar are flagged as a loop-closure candidate.
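A toy sketch of the mechanics follows, using random vectors as stand-ins for feature descriptors; production systems typically cluster binary descriptors such as ORB with dedicated vocabularies (e.g., DBoW2), which this example does not attempt.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
past_descriptors = rng.random((5000, 32))    # stand-in for descriptors from many frames

# 1. Build a visual vocabulary by clustering descriptors into "visual words".
vocab = MiniBatchKMeans(n_clusters=200, random_state=0).fit(past_descriptors)

def bow_vector(descriptors):
    """L2-normalized histogram of visual-word occurrences for one image."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

# 2. Score the current frame against a candidate keyframe by cosine similarity;
#    a score close to 1 marks the pair as a loop-closure candidate, which is then
#    verified geometrically before being accepted.
current = bow_vector(rng.random((300, 32)))
candidate = bow_vector(rng.random((300, 32)))
score = float(current @ candidate)
```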
Some approaches formulate loop closure as a classification problem and train a classifier using traditional pattern recognition methods.
Mapping constructs an environment representation consistent with task requirements. Common map representations in robotics include occupancy grid maps, feature (landmark) maps, topological maps, and dense point-cloud or mesh maps.
Visual SLAM commonly uses feature-point maps, representing the environment with geometric primitives (points, lines, planes). Such maps are typically produced by sparse vSLAM, sometimes aided by GPS or UWB positioning; they have low storage and compute cost, which made them typical of early SLAM systems.
Most visual SLAM systems process a continuous stream of camera frames, track a set of keypoints, and use triangulation to estimate their 3D positions. Using these constraints, the system estimates the camera pose over time and builds an environment map aligned to the robot’s trajectory. In simplified terms, the goal is to produce a map and a consistent estimate of the robot’s location within it—enabling navigation.
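For instance, with two known projection matrices and one matched keypoint, triangulation can be done with OpenCV as below; the intrinsics, poses, and the point are synthetic values chosen for illustration.

```python
import numpy as np
import cv2

# Synthetic pinhole intrinsics and two camera poses (world frame = first camera frame).
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # first camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])  # second camera 0.2 m along +x (t = -R C)

# A known 3D point, projected into both views to simulate a matched keypoint pair.
X = np.array([0.5, -0.3, 4.0, 1.0])
u1 = (P1 @ X)[:2] / (P1 @ X)[2]
u2 = (P2 @ X)[:2] / (P2 @ X)[2]

# Triangulate the match back into 3D (OpenCV returns homogeneous coordinates).
X_h = cv2.triangulatePoints(P1, P2, u1.reshape(2, 1), u2.reshape(2, 1))
X_est = (X_h[:3] / X_h[3]).ravel()
print(X_est)    # approximately [0.5, -0.3, 4.0]
```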
By tracking enough keypoints across video frames, the system can infer sensor orientation and the structure of the surrounding physical environment. Visual SLAM continuously minimizes reprojection error (the difference between the 2D projections of estimated 3D points and their observed image locations), typically using Bundle Adjustment (BA).
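Written out, the BA objective jointly refines all camera poses (R_i, t_i) and 3D points P_j by minimizing the summed reprojection error over every observation u_{ij}, with π denoting perspective division; this is the measurement model h from the back-end objective above:

```latex
\min_{\{R_i, t_i\},\,\{P_j\}} \;\sum_{i,j}
\bigl\| u_{ij} - \pi\!\bigl( K ( R_i P_j + t_i ) \bigr) \bigr\|^{2} .
```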
Because vSLAM must operate in real time and BA is computationally intensive, optimization is often split into a fast pose-tracking step and a slower map-refinement step that run separately and are fused periodically, which keeps the system responsive.
Whether visual SLAM or LiDAR SLAM is “better” depends heavily on requirements and constraints. Below is a comparison across cost, scenarios, map accuracy, and usability.
LiDAR sensors are typically higher cost, though lower-cost LiDAR solutions now exist. vSLAM relies mainly on cameras, which are significantly cheaper. However, LiDAR can directly measure obstacle distance and angle with high precision, making localization and navigation simpler and often more robust.
vSLAM supports a broader range of indoor and outdoor scenarios, but it depends strongly on lighting and visual texture: it cannot operate in darkness or in texture-less regions, and performance degrades under strong lighting changes.
LiDAR SLAM typically yields higher mapping accuracy. For example, maps built using SLAMTEC’s RPLIDAR series can reach roughly 2 cm accuracy.
For vSLAM, a common example is the depth camera Kinect (range ~3–12 m), with mapping accuracy around 3 cm. In general, LiDAR SLAM provides higher map accuracy and can be used directly for localization and navigation.
LiDAR SLAM and RGB-D-based vSLAM can obtain point cloud data directly and compute obstacle distances from measured depth.
In contrast, monocular/stereo/fisheye-camera vSLAM cannot directly obtain a point cloud. It must estimate depth indirectly through motion, feature extraction/matching, and triangulation—making the pipeline more sensitive to motion patterns, texture, and lighting.
Overall, LiDAR SLAM is currently more mature and remains one of the most reliable solutions for localization and navigation. Visual SLAM continues to be a major research direction, and multi-sensor fusion (LiDAR + vision + IMU) is widely viewed as an inevitable future trend.
In the fields of autonomous driving and robotics, LiDAR and vision technologies have traditionally operated in isolation, often at odds with each other. However, we’ve taken a different approach, successfully merging the two to achieve technological harmony. SLAMTEC Aurora integrates a 2D LiDAR, binocular (stereo) vision, a 6-DoF IMU, and an AI processor into one compact module, ready to use right out of the box.