Computer vision models are conventionally trained on, and benchmarked against, curated datasets. While very high performance can be achieved in these settings, deploying these models in the real world is more challenging. When computer vision models are used in new environments, they must overcome the distribution shift between their training data and their test environment. We consider computer vision models to be robust when they perform well on a variety of images captured from different environments. This dissertation explores several research problems toward building perception systems that are robust in the real world.

Building and labeling datasets is a key step in training a strong computer vision system. Capturing and representing a large, diverse set of images gives our models experience with images from a variety of environments. However, data annotation is difficult, so labeling data efficiently is an important challenge. It is also useful to consider the human visual system as a gold-standard point of reference for a robust visual system, as humans can readily understand new images. Paintings are an interesting type of image, created by humans for human consumption: a key property of artwork is its ability to convey perceptual realism without necessarily being physically realistic. We would like our computer vision models to understand paintings as well, and learning from paintings may allow computer vision models to be more robust. Lastly, even our best efforts to build robust models will not lead to a perfect model, and it is important to reason about failures and uncertainties when these models are used. We touch upon each of these challenges in the dissertation.

First, a direct way to overcome distribution shift is to make the datasets we use mimic the real-world distribution of (image, label) pairs as closely as possible. However, labeling data is very expensive and time-consuming. In the first work, we propose a new efficient annotation method for semantic segmentation. Our pipeline divides the traditional per-pixel annotation task for an entire image into per-pixel annotation of image subregions. This task is more palatable for annotators and leads to higher label quality at a lower cost. Furthermore, we find that annotating only a small number of image subregions is sufficiently informative for models to outperform those trained with conventional annotation and inexpensive weakly supervised annotation methods.

Next, we explore paintings as a medium for studying perception systems, and their utility as an alternative source of training data to natural images. The image distribution of paintings is different from the distribution of natural photographs, but paintings often depict meaningful objects and scenery that parallel those found in the real world. Common computer vision models are developed for natural photographs, and most existing labeled datasets focus on natural images. We explore several use cases of such models applied to paintings instead. Furthermore, paintings may emphasize features or invariances that humans utilize for robust perception, as they can be perceptually realistic without being physically realistic. Indeed, for fine-grained fabric classification, we find evidence that models trained on paintings focus on cues that are both more interpretable and more generalizable.

The next work extends our previous findings by systematically exploring the invariances encoded in paintings for natural image recognition. We study how models behave when trained with real paintings and style transfer. Style transfer is a type of data augmentation that promises to create painting-like images from natural photographs. We train models on natural photographs, paintings, and/or stylized images, and evaluate them on test data representing real-world distribution shifts. Perception models must overcome such shifts to successfully understand different environments: for example, the test photographs may contain noise, or be drawn from another dataset that was sampled from different viewpoints than the training images. Our results show that learning from a combination of natural photographs and paintings leads to models that are far more robust than learning from natural photographs alone. Interestingly, we also find that style transfer does not capture the same invariances as paintings, and that paintings are unique among various art forms in enabling recognition models to learn useful invariances for natural image recognition.

Finally, we conclude with a real-world case study in using visual perception for autonomous navigation. Even the most accurate perception model will not be perfect, and it is important to account for failures when using these models as a building block in an autonomous system. We propose a planning pipeline that reasons about uncertainty in semantic segmentations to find a safe path in unknown environments. Given predictions of terrain and obstacles from a view of a scene, the autonomous agent determines which subsequent views of the environment to capture. By taking multiple views of the environment, the agent increases its confidence in successfully selecting a viable path across safe terrain to a goal location. Our results show that this pipeline allows safe paths to be planned when deployed in real-world environments with noisy and inaccurate model predictions.

These works represent meaningful steps toward tackling some key challenges in building robust perception systems: acquiring high-quality and diverse training data, learning robust features for recognition, and reasoning about perception uncertainties within broader autonomous systems. However, the problem of building a robust perception system is multifaceted in its complexities and challenges, and many exciting research problems remain to be explored in future work.

190 pages


Computer Vision


Committee Chair

Bala, Kavita

Committee Member

Bindel, David S.
Hariharan, Bharath

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

dissertation or thesis
