Cornell University
Library

eCommons


Capturing and Understanding Photos Autonomously

File(s)
Alzayer_cornell_0058O_11503.pdf (6.63 MB)
Permanent Link(s)
https://doi.org/10.7298/g6h0-3r62
https://hdl.handle.net/1813/112110
Collections
Cornell Theses and Dissertations
Author
Alzayer, Hadi
Abstract

This work considers autonomous photo capture and improving visual understanding from large collections of photos. For photo capture, we are interested in capturing aesthetically pleasing photos autonomously. Capturing a well-composed photo is difficult, and it takes years of experience to master. We propose a novel pipeline for an autonomous agent that captures an aesthetic photograph by navigating within a local region of a scene. Instead of classical optimization over heuristics such as the rule of thirds, we adopt a data-driven aesthetics estimator to assess photo quality, and we use a reinforcement learning framework to optimize the model with respect to the learned aesthetics metric. We train our model in simulation with indoor scenes, and we demonstrate that our system can capture aesthetic photos in both simulated and real-world environments on a ground robot. To our knowledge, it is the first system that can automatically explore an environment to capture a photo that is aesthetic with respect to a learned aesthetics estimator.

While reinforcement learning works well when realistic simulated environments are available, a more versatile approach is to learn directly from collections of videos. Video clips are abundant and easy to obtain in any domain of interest, unlike 3D environment reconstructions, which can be expensive and tedious to acquire. We discuss how autonomous photo capture can be performed without relying on reinforcement learning, in a way that allows us to train an autonomous photo capture model using videos.

An orthogonal goal we consider in this work is improving visual understanding from large collections of photos. Specifically, we focus on the task of activity inference: given a photo of a scene, we are interested in predicting the activities that could be performed at or around that scene.
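The core loop of the capture approach described above — an agent navigating a local region while maximizing a learned aesthetics reward — can be illustrated with a minimal sketch. Everything here is hypothetical: the toy `aesthetics_score` stands in for the thesis's trained neural estimator, and a simple epsilon-greedy search stands in for the full reinforcement learning policy.

```python
import random

# Toy stand-in for a learned aesthetics estimator (hypothetical);
# the actual system scores rendered camera views with a trained network.
def aesthetics_score(position):
    best = (5, 5)  # pretend views near (5, 5) are well composed
    return -((position[0] - best[0]) ** 2 + (position[1] - best[1]) ** 2)

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # navigation actions on a grid

def capture_photo(start, steps=50, epsilon=0.2, seed=0):
    """Navigate a local region, keeping the best-scoring view seen.

    Epsilon-greedy exploration is a simplification of the RL policy
    described in the abstract, not the thesis's actual method.
    """
    rng = random.Random(seed)
    pos, best_pos = start, start
    best_score = aesthetics_score(start)
    for _ in range(steps):
        if rng.random() < epsilon:
            move = rng.choice(MOVES)  # explore
        else:
            # exploit: step toward the highest-scoring neighboring view
            move = max(MOVES, key=lambda m: aesthetics_score(
                (pos[0] + m[0], pos[1] + m[1])))
        pos = (pos[0] + move[0], pos[1] + move[1])
        score = aesthetics_score(pos)
        if score > best_score:
            best_score, best_pos = score, pos
    return best_pos, best_score
```

With pure exploitation (`epsilon=0.0`), the agent walks from `(0, 0)` to the best view at `(5, 5)` and "captures" there; the learned-estimator-as-reward idea is the same regardless of the optimizer used.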
Instead of using user-collected annotations for this task, we construct a dataset by taking advantage of geotags from online photos. Specifically, we rely on the assumption that if there is a photo of a user engaged in an activity, then photos taken nearby can use that activity as one of their labels for the activity inference task. We propose a pipeline to construct an activity inference dataset, and we present an analysis of the constructed dataset and of the behavior of models trained on it. We show that, unlike baseline methods, our trained models attend to regions of the image that correspond to an intuitive understanding of the activity. We hope that autonomous photography will help users capture well-composed photos of their environments or assist users in the process. We expect that interest in photography and photo sharing will continue to rise, and that the photos users share can help improve the visual understanding of autonomous systems.
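The geotag assumption above — a photo taken near a photo of someone engaged in an activity can inherit that activity as a label — can be sketched as a simple nearest-neighbor label propagation. The function names, the 0.5 km radius, and the data layout are illustrative assumptions, not the thesis's actual pipeline.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def propagate_labels(tagged, untagged, radius_km=0.5):
    """Label each untagged photo with activities of tagged photos nearby.

    tagged:   list of ((lat, lon), activity) pairs from activity photos
    untagged: dict mapping photo id -> (lat, lon) geotag
    Returns a dict mapping photo id -> sorted list of inherited labels.
    """
    labeled = {}
    for pid, loc in untagged.items():
        acts = {act for (tloc, act) in tagged
                if haversine_km(loc, tloc) <= radius_km}
        labeled[pid] = sorted(acts)
    return labeled
```

A photo geotagged ~100 m from a hiking photo would inherit "hiking", while one a degree of latitude away (~111 km) would not; the real dataset-construction pipeline would add filtering and de-duplication on top of this basic idea.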

Description
66 pages
Date Issued
2022-08
Keywords
activity inference • aesthetics • autonomous photography • photography
Committee Chair
Bala, Kavita
Committee Member
Acharya, Jayadev
Hariharan, Bharath
Degree Discipline
Computer Science
Degree Name
M.S., Computer Science
Degree Level
Master of Science
Type
dissertation or thesis
Link(s) to Catalog Record
https://newcatalog.library.cornell.edu/catalog/15578922
