Capturing and Understanding Photos Autonomously
This work considers autonomous photo capture and improving visual understanding from large collections of photos. In terms of photo capture, we are interested in capturing aesthetically pleasing photos autonomously. The process of capturing a well-composed photo is difficult and takes years of experience to master. We propose a novel pipeline for an autonomous agent to automatically capture an aesthetic photograph by navigating within a local region in a scene. Instead of classical optimization over heuristics such as the rule-of-thirds, we adopt a data-driven aesthetics estimator to assess photo quality. A reinforcement learning framework is used to optimize the model with respect to the learned aesthetics metric. We train our model in simulation with indoor scenes, and we demonstrate that our system can capture aesthetic photos in both simulation and real-world environments on a ground robot. To our knowledge, this is the first system that can automatically explore an environment to capture an aesthetic photo with respect to a learned aesthetic estimator. While reinforcement learning works well when realistic simulated environments are available, a more versatile approach is to learn directly from collections of videos. Video clips are abundant and easy to obtain in any domain of interest, unlike 3D environment reconstructions, which can be expensive and tedious to obtain. We discuss how autonomous photo capture can be done without relying on reinforcement learning, in a way that allows us to train an autonomous photo capture model using videos.
An orthogonal goal we consider in this work is improving visual understanding from large collections of photos. Specifically, we focus on the task of activity inference: given a photo of a scene, we are interested in predicting the activities that could be performed at or around that scene. Instead of using user-collected annotations for this task, we construct a dataset by taking advantage of geotags from online photos. We rely on the assumption that if there is a photo of a user engaging in an activity, then photos taken nearby can use that activity as one of their labels for the activity inference task. We propose a pipeline to construct an activity inference dataset, and we present an analysis of the constructed dataset and of the behavior of models trained on it. We show that our trained models attend to regions of the image that correspond to our intuitive understanding of the activity, unlike baseline methods. We hope that autonomous photography will help users capture well-composed photos of their environments or assist users in the process. We expect that interest in photography and sharing photos will continue to rise, and that the photos users share can help improve the visual understanding of autonomous systems.
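The geotag assumption described in the abstract can be illustrated with a minimal sketch. This is not the thesis pipeline: the function names, the tuple layout, and the 500 m radius are hypothetical choices made only for illustration, and the great-circle (haversine) distance is one plausible way to define "nearby". Each scene photo inherits the activity labels of every activity photo taken within the radius.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def propagate_activity_labels(activity_photos, scene_photos, radius_m=500.0):
    """Assign to each scene photo the activities observed within radius_m.

    activity_photos: iterable of (activity, lat, lon)   -- hypothetical layout
    scene_photos:    iterable of (photo_id, lat, lon)   -- hypothetical layout
    radius_m:        assumed neighborhood radius, not a value from the thesis
    """
    labels = defaultdict(set)
    for photo_id, slat, slon in scene_photos:
        for activity, alat, alon in activity_photos:
            if haversine_m(slat, slon, alat, alon) <= radius_m:
                labels[photo_id].add(activity)
    # Scene photos with no nearby activity photo get an empty label list.
    return {photo_id: sorted(labels.get(photo_id, ())) for photo_id, _, _ in scene_photos}


activity_photos = [("surfing", 34.0100, -118.4960), ("hiking", 34.1000, -118.3000)]
scene_photos = [("beach", 34.0105, -118.4965), ("downtown", 34.0500, -118.2500)]
result = propagate_activity_labels(activity_photos, scene_photos)
```

The brute-force double loop keeps the sketch short; for a large photo collection a spatial index would be the more practical choice.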
activity inference; aesthetics; autonomous photography; photography
Acharya, Jayadev; Hariharan, Bharath
M.S., Computer Science
Master of Science
dissertation or thesis