Building a Stereo Visual SLAM
Engine from Scratch in C++ and CUDA
I never really understood how an autonomous system could navigate a complex city without causing complete chaos. How does it weave through pedestrians or moving vehicles? I found the problem so puzzling that I wanted to take a stab at it without any outside help.
That meant NO pre-built perception stacks. NO black-box libraries. NO deep learning priors. Just raw pixels in, and a 6-DOF pose out.
How hard could it be?
Cut to four months later, and I've pretty much lost all my hair. Between late nights reading SLAM textbooks, studying branches of mathematics I didn't know existed, and experiencing integration hell for the tenth time in a single week, I finally have something to show for it.
System Architecture
To a computer, a city isn't a physical space. It's a stream of frames that need to be aggressively manipulated into a 3D coordinate system. To get from raw pixels to a reliable trajectory, I built a six-stage gauntlet:
Stereo Vision & The Depth Problem
The first decision was which camera setup to use. In monocular SLAM you have a single camera and no depth perception, so you can't tell if an object is 2, 4, or 6 meters away. I started there because it felt like SLAM in its purest form. It was also a total disaster. Mono SLAM doesn't know what a "meter" is, so every time the tracker fails and resets, it just guesses a new scale. Two resets later, the system might think it travelled 400 meters when it only moved 10. Although there are other workarounds, I won't be focusing on them here.
Switching to two cameras was a cheat code.
Think about it intuitively: raise your thumb and look at it through only your left eye. Now, without moving your thumb, switch to your right eye. Your thumb jumps. That horizontal shift is called disparity \(d\). Repeat the same experiment with something far away, and the disparity is much smaller. This is exactly the logic we use to recover depth.
Given the focal length \(f_x\) and the physical distance between the two lenses (the baseline \(b\)), the math works out cleanly:

\[
Z = \frac{f_x \, b}{d}
\]
On the very first frame, we run this for hundreds of distinct feature points. Instantly, we have 300–500 metric-scale 3D map points with no motion required. That metric anchor is what separates a working system from an elaborate guess.
The Depth Uncertainty Ceiling
Stereo depth isn't free at all distances. The uncertainty in depth grows quadratically with range:

\[
\sigma_Z = \frac{Z^2}{f_x \, b}\,\sigma_d
\]

where \(\sigma_d\) is the disparity matching error in pixels.
On KITTI, at 43 meters, that uncertainty already balloons to ~4.8 meters. A triangulated point "at 43 m" could realistically sit anywhere in a 10-meter window. Putting those points into the optimizer doesn't help. It actively hurts, pulling poses toward phantom geometry. I cap triangulation at 80× the baseline. Anything further gets discarded.
Fixing the Teleportation Bug
Initialization has one more subtlety that isn't obvious until it breaks on you. When tracking fails and the system re-initializes, the naive approach sets the camera pose \(T_{cw}\) to Identity, snapping the trajectory back to the origin even though the physical car is still moving down the road.
The fix is simple in hindsight: on re-initialization, propagate the last known pose forward. The new map seeds from where the old one died, and the trajectory stays continuous.
Feature Extraction & The Cost of Greed
A 3D map point is a ghost unless you can recognize it again in the next frame. That's the job of the descriptor.
I used ORB because it produces binary descriptors. Matching two of them is just an XOR followed by a popcount: one 32-bit instruction per word, eight words per 256-bit descriptor. That's the kind of arithmetic that scales inside a CUDA kernel without breaking a sweat.
But KITTI sequences have long stretches of grey, textureless road where the FAST corner detector would occasionally just give up. I spent a week tuning thresholds and adding contrast normalization before realizing the problem wasn't the thresholds. I was being greedy, running the full descriptor pipeline on every single frame.
The fix was to stop. ORB now runs only at keyframe creation. Every standard frame between keyframes tracks via KLT Optical Flow, which never touches a descriptor. Run the expensive matcher only on the frames you've already decided are worth keeping.
Tracking: Optical Flow & Pose Recovery
The robot is moving. So those 3D points we triangulated: where did they go in the next frame? You could search the entire image again, but that's far too slow for real time.
The smarter approach is to track them. Think of it like following a specific branch on a tree: instead of scanning the whole landscape, you isolate a small patch of pixels and watch them slide across successive frames. That's KLT Optical Flow, which tracks local patches through an image pyramid for efficiency at multiple scales.
Reality is messy though. Occlusions, lighting changes, and reflections cause tracks to drift silently. To catch this, every track goes through a Forward-Backward Filter: flow the pixel from frame A to B, then immediately flow it back to A. If it doesn't land within 1 pixel of where it started, kill the track. It's a blunt instrument and it works.
Recovering the Pose
Now we have surviving 2D tracks paired to known 3D map points. The question becomes: from what exact position and orientation was this photo taken?
The answer comes from minimizing reprojection error: the distance between where we observed a point in the image and where our current pose predicts it should appear:

\[
E = \sum_{i,j} \left\| u_{ij} - \pi\!\left(T_{cw}\, X_w\right) \right\|^2
\]
Here \(u_{ij}\) is the observed pixel, \(X_w\) is the 3D world point, \(T_{cw}\) is our camera pose, and \(\pi\) is the projection function that maps 3D to 2D. We feed all these correspondences into a Ceres solver, hold the map points fixed, and wiggle the 6-DOF pose until this error is minimized. The result is our camera's position and orientation for this frame.
Expanding the Map: GPU Stereo Matching
If we only tracked the same initial points forever, the map would run dry as the robot drives past them. A new keyframe is inserted roughly every 5 meters of travel, or earlier if the number of surviving KLT tracks drops below 80. On insertion, we triangulate a fresh batch of 3D points by matching features across the left and right cameras.
Brute-force matching 2,000 features is 4 million comparisons. But stereo cameras have a useful physical property: epipolar geometry. Because both lenses are on the same horizontal plane, a feature in the left image is guaranteed to lie on the exact same horizontal row in the right image. Instead of searching the whole frame, we search a ±2 px vertical band, discarding the vast majority of candidates before computing a single Hamming distance.
To hit real-time throughput, I offloaded this to a custom CUDA kernel. Each block handles one left-image descriptor. Threads inside the block then use warp shuffle reductions (__shfl_down_sync) to find the minimum Hamming distance across all candidate right-image descriptors in a tree structure, with threads exchanging values directly through registers rather than writing to shared or global memory. Multiple candidates within the epipolar band are evaluated in parallel.
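The core of that kernel looks roughly like this. It's a sketch, not the production kernel: the names and launch configuration are mine, candidates are assumed already band-filtered, and the real version also tracks the argmin index, not just the distance:

```cuda
#include <cstdint>

// 256-bit Hamming distance: XOR each 32-bit word, hardware popcount.
__device__ int hamming256(const uint32_t* a, const uint32_t* b) {
    int d = 0;
    for (int w = 0; w < 8; ++w) d += __popc(a[w] ^ b[w]);
    return d;
}

// One block (a single 32-thread warp) per left descriptor. Each lane
// scans a strided slice of the candidate right descriptors, then the
// lanes fold their partial minima together through registers.
__global__ void bestStereoMatch(const uint32_t* left,   // [numLeft  * 8]
                                const uint32_t* right,  // [numRight * 8]
                                int numRight, int* bestDist) {
    const uint32_t* desc = left + blockIdx.x * 8;
    int lane = threadIdx.x;

    int best = 256;  // worst possible distance for a 256-bit descriptor
    for (int j = lane; j < numRight; j += 32)
        best = min(best, hamming256(desc, right + j * 8));

    // Tree reduction: five shuffle steps collapse 32 lane minima into lane 0.
    for (int offset = 16; offset > 0; offset >>= 1)
        best = min(best, __shfl_down_sync(0xffffffffu, best, offset));

    if (lane == 0) bestDist[blockIdx.x] = best;
}
```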
Bundle Adjustment & The Manifold
Frame-by-frame pose estimation is a local fix. It doesn't see drift accumulating. Over thousands of frames, tiny reprojection errors compound into a trajectory that's visibly wrong. Local Bundle Adjustment is the corrective lens.
The intuition: imagine your 12 camera poses and 3,000 map points are connected by springs. Every imperfect measurement creates tension. LBA jointly adjusts both the poses and the map points until that tension is globally minimized across the entire window.
The Hessian & Schur Complement
Mathematically, this tension maps into a system of equations whose structure is captured by the Hessian matrix. For 3,000 points and 12 poses, solving that system directly is a computational nightmare. But there's a structural quirk we can exploit: most cameras only observe a small fraction of the total map points, so the Hessian is overwhelmingly sparse. Most of it is empty.
The Schur Complement is a mathematical shorthand that exploits this. By "pre-solving" and eliminating the thousands of point variables from the system, we reduce the problem to just the 12 camera poses: a \(72 \times 72\) dense system. That's the difference between a problem that takes minutes and one that takes milliseconds.
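Written out, with \(B\) the pose block, \(C\) the point block, and \(E\) the pose-point coupling, the normal equations and their reduced form are:

\[
\begin{bmatrix} B & E \\ E^\top & C \end{bmatrix}
\begin{bmatrix} \delta x_{\text{pose}} \\ \delta x_{\text{point}} \end{bmatrix}
=
\begin{bmatrix} v \\ w \end{bmatrix}
\quad\Longrightarrow\quad
\left(B - E\,C^{-1}E^\top\right)\delta x_{\text{pose}} = v - E\,C^{-1}w
\]

Because \(C\) is block-diagonal (one independent \(3 \times 3\) block per map point), inverting it is cheap, and for 12 poses at 6 DOF each the reduced system is exactly that \(72 \times 72\) matrix. The point updates are then recovered by back-substitution.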
Deriving the Jacobians
For the solver to know which direction to nudge each camera, it needs a Jacobian: a map of exactly how the reprojection error changes with each small perturbation to the pose. I derived these analytically rather than using Ceres AutoDiff, which costs performance and gives you no intuition when things go wrong.
The derivation breaks into two layers. The first is the projection layer: how does a 2D pixel error respond to a point moving in 3D space? The key insight is that small changes in depth matter far more when the object is close. This shows up mathematically as \(Z^2\) in the denominator of the Jacobian. A point at 2 meters is four times more sensitive to depth error than a point at 4 meters.
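Concretely, for a point \(P_c = (X, Y, Z)\) in the camera frame under a standard pinhole model, the projection-layer Jacobian works out to (a reconstruction from the description above, not copied from my code):

\[
\frac{\partial \pi}{\partial P_c} =
\begin{bmatrix}
\frac{f_x}{Z} & 0 & -\frac{f_x X}{Z^2} \\
0 & \frac{f_y}{Z} & -\frac{f_y Y}{Z^2}
\end{bmatrix}
\]

The \(Z^2\) in the rightmost column is the depth-sensitivity term: halve the depth and those entries quadruple.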
The second is the rotation layer, and this is where it gets genuinely hard. 3D rotations live on a curved manifold called \(SO(3)\). You can't simply add a small number to a rotation the way you can to a position. Imagine standing on a large sphere and needing to take a small step: if you move in a straight Euclidean line, you step off the sphere. You have to follow the curve.
Ceres handles staying on that manifold automatically, keeping the quaternion on the unit sphere. What I had to derive manually was how each 2D pixel moves for every tiny twist of the quaternion's four components: twelve explicit partial derivatives in total.
Performance & The Honest Truth
RTX 3050 laptop, 4 GB VRAM, 8-core CPU. Full 4541-frame run on KITTI sequence 00.
| Pipeline Stage | Cost |
|---|---|
| Tracking, non-KF frame (avg) | 30.7 ms |
| Tracking, KF frame (avg) | 117.7 ms |
| Local Bundle Adjustment (avg / max) | 257 ms / 683 ms |
| Total pipeline (avg) | 110.5 ms |
| Effective throughput | 9.0 FPS |
906 keyframes over 4541 frames, roughly 1 KF every 5 frames. LBA dominates latency on KF frames.
Trajectory accuracy: 12.99 m ATE RMSE over a 3.7 km drive, a 0.35% error rate.
Building this from scratch was a rollercoaster of "I'm a genius" and "I have no idea what a matrix is." But watching that final computed trajectory match the real road? Totally worth it.