Learning Geometry, Appearance and Motion in the Wild
Physics-based computer vision can be formulated as the inverse of the graphics rendering process: given RGB images, we seek to recover the intrinsic properties of a scene, including geometry, material, illumination, and object motion. Computer vision as inverse graphics plays an important role in numerous real-world applications such as virtual reality, in which the recovered scene intrinsics can be used to render images from novel viewpoints with plausible lighting. However, most previous techniques either require a multi-camera setup or assume that the underlying scene is static, i.e., that its appearance and geometry do not change over time. In contrast, the photos we see on the Internet constitute only a single-view observation of each scene, and the videos often exhibit dynamics arising from a variety of time-varying factors such as illumination changes and object motion. In this thesis, I address these problems in in-the-wild scenarios by leveraging a compelling source of data: the massive quantities of unlabeled photos and videos that people take and upload to the Internet every day. I demonstrate how to use this massive but noisy visual data to capture scene geometry, appearance, lighting, and motion from a single RGB image or from videos of dynamic scenes, which further enables me to synthesize photo-realistic novel views in both space and time.