Pushing the Boundaries of 3D Spatial Understanding
Understanding the 3D world from data collected by sensors such as cameras and LiDAR is a fundamental problem in computer vision, with applications in robotics, augmented and virtual reality, and autonomous systems. Although current algorithms can produce high-quality 3D reconstructions, they often struggle under real-world conditions: when data is sparse, when visual overlap between views is limited, or when scenes contain strong ambiguities such as repetition and symmetry. This thesis explores how learned priors can help overcome these challenges and improve 3D spatial understanding.

I begin with shape generation and reconstruction from sparse point clouds, proposing a method that learns shape priors by modeling gradient fields over 3D point cloud distributions. Next, I tackle extreme camera pose estimation between image pairs with little or no overlap, using dense correlation volumes to extract both semantic and geometric cues. Building on this, I further improve extreme pose estimation by leveraging priors about the visual world captured by generative video models, which hallucinate plausible intermediate frames that provide useful context. Finally, I address visual ambiguity in structure-from-motion, using 3D priors from feature-matching models to disambiguate "doppelgangers": image pairs that look alike but depict different surfaces, as arise in symmetric scenes like the Arc de Triomphe. Brief sketches of the core mechanism behind each contribution follow.
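
To make the first idea concrete, here is a minimal sketch of drawing a point cloud from a learned gradient field via Langevin dynamics. The names `score_net` and `z` are assumptions for illustration: `score_net(x, z)` is taken to approximate the gradient of the log point density of the shape encoded by latent `z`.

```python
import torch

def sample_shape(score_net, z, n_points=2048, n_steps=100, step_size=2e-4):
    """Sample a point cloud from a learned gradient field via Langevin dynamics.

    Assumption for this sketch: score_net(x, z) approximates the gradient of
    the log point density of the shape encoded by latent z.
    """
    # Start from random points in a bounding cube.
    x = torch.rand(n_points, 3) * 2.0 - 1.0
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        with torch.no_grad():
            grad = score_net(x, z)  # estimated gradient of log p(x | shape)
        # Langevin update: drift toward high-density regions, plus noise.
        x = x + step_size * grad + (2.0 * step_size) ** 0.5 * noise
    return x
```

Intuitively, points drift toward the high-density regions that the prior associates with the shape's surface, while the injected noise keeps the samples diverse.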
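
For extreme pose estimation from low-overlap pairs, a dense correlation volume records the similarity between every pair of feature locations across the two images. The sketch below assumes a shared feature extractor producing `(C, H, W)` maps; a pose regressor would then consume the resulting 4D volume.

```python
import torch

def correlation_volume(feat1, feat2):
    """All-pairs feature correlation between two images.

    feat1, feat2: (C, H, W) feature maps from a shared encoder (an assumption
    of this sketch). Returns corr[h1, w1, h2, w2] = <feat1[:, h1, w1],
    feat2[:, h2, w2]>, exposing both matching and contextual cues that remain
    informative even when the two views barely overlap.
    """
    c = feat1.shape[0]
    corr = torch.einsum('chw,cij->hwij', feat1, feat2)
    return corr / c ** 0.5  # scale by sqrt(C) for numerical stability
```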
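
One simple way hallucinated intermediate frames can help, sketched here as an illustrative reading of the pipeline rather than the thesis method verbatim, is that each neighboring pair of frames overlaps enough for a standard two-view estimator, and the resulting per-step relative rotations can be chained into the extreme end-to-end rotation.

```python
import numpy as np

def chain_relative_rotations(rotations):
    """Compose per-step relative rotations into an end-to-end rotation.

    rotations: list of 3x3 matrices, where rotations[i] maps frame i to
    frame i+1 (estimated between consecutive real or generated frames).
    Returns the rotation from the first frame to the last.
    """
    total = np.eye(3)
    for R in rotations:
        total = R @ total  # left-multiply: later steps applied after earlier
    return total
```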
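
Finally, a hypothetical pair classifier illustrates how doppelganger detection plugs into a structure-from-motion pipeline: score every visually similar image pair, and drop low-scoring pairs from the scene graph before reconstruction so that look-alike surfaces are never fused. The `classifier` interface and threshold here are assumptions for illustration.

```python
def prune_doppelgangers(image_pairs, classifier, threshold=0.8):
    """Filter the scene graph before running structure-from-motion.

    Assumption: classifier(pair) returns the probability that a visually
    similar image pair depicts the same physical surface (a true match)
    rather than a look-alike, e.g. opposite faces of a symmetric landmark.
    """
    return [pair for pair in image_pairs if classifier(pair) >= threshold]
```

Pruning the match graph up front means the downstream reconstruction never has to untangle a model in which two distinct facades have been collapsed into one.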