Modeling the 3D World and Its Motion
The human visual system has a remarkable ability to perceive the physical environment, providing a foundation for intelligent behaviors such as navigation and interaction. A crucial challenge in advancing artificial intelligence is therefore to improve machines' ability to perceive and model the dynamic 3D world from images and videos. This ability not only forms a fundamental basis for advanced intelligence but also enables numerous real-world applications such as augmented reality (AR), virtual reality (VR), and content creation.

In this thesis, I focus on developing algorithms that improve how machines understand and model geometry and motion from everyday photos and videos. I first consider static scenes, where I tackle the problem of computing more accurate correspondences between image pairs, which provide a crucial foundation for estimating camera poses and scene structure. I also propose a generalizable approach to modeling the dense geometry and appearance of static scenes for view synthesis. Shifting to dynamic scenes, I study how to represent moving 3D scenes from the simple case of two views, and I explore methods to model dense, complete motion from video sequences.
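To make concrete how image correspondences underpin camera pose estimation, the sketch below shows a standard two-view pipeline (not the method proposed in this thesis): matched keypoints and known intrinsics yield an essential matrix, which is decomposed into a relative rotation and translation. The function name, array shapes, and the example intrinsics are illustrative assumptions; real matches would come from a feature matcher or a learned correspondence network.

```python
import numpy as np
import cv2


def relative_pose_from_matches(pts1: np.ndarray, pts2: np.ndarray, K: np.ndarray):
    """Estimate relative camera pose from matched 2D points.

    pts1, pts2: (N, 2) arrays of matched pixel coordinates in images 1 and 2.
    K: (3, 3) camera intrinsics matrix.
    """
    # Robustly fit the essential matrix with RANSAC; wrong or noisy matches are
    # rejected as outliers, which is why correspondence quality directly
    # affects the accuracy of the recovered pose.
    E, inlier_mask = cv2.findEssentialMat(
        pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0
    )
    # Decompose E into relative rotation R and unit-scale translation t,
    # keeping the solution that places triangulated points in front of both cameras.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, inlier_mask


# Hypothetical pinhole intrinsics for a 1280x720 camera (illustrative only).
K = np.array([
    [1000.0, 0.0, 640.0],
    [0.0, 1000.0, 360.0],
    [0.0, 0.0, 1.0],
])
```

The same inlier correspondences can subsequently be triangulated to recover sparse scene structure, which is the sense in which better matches improve both pose and structure estimation.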