Learning Conditional Models for Visual Perception
In recent years, the field of computer vision has seen a series of major advances, made possible by rapid developments in algorithms, data collection, and computing infrastructure. As a result, vision systems have started to be broadly adopted in everyday applications. Progress has been particularly promising in image recognition, where algorithms now often match human performance. Nevertheless, vision systems still largely lag behind humans in their ability to understand the complexities of the visual world and its apparent contradictions. For example, an image can carry different meanings for different people in different contexts. However, because they are often limited to a single point of view, vision systems tend to focus on the meaning that dominates in the training data. In this dissertation, we address this limitation by building conditional vision models that can learn from multiple points of view and adapt their results to account for different conditions. First, we address the related tasks of image tagging and tag-based image retrieval. In particular, we build a system that takes into account the fact that people may associate different meanings with certain images and tags. Thus, the system can personalize outputs for ambiguous tags such as #rock, which could refer to a music genre, a geological object, or even outdoor climbing. Next, we focus on the task of image-based similarity search. Specifically, we design a system that can understand multiple notions of similarity. For example, when searching for items related to an input image of a shoe, users might be interested in shoes of similar color or style, or shoes for the same kind of activity. By capturing the multitude of aspects in terms of which objects can be compared, our system can find the right set of related items. Lastly, we explore how the underlying convolutional networks themselves can be made aware of the context in which they are used.
In a study, we first develop a new understanding of the roles that individual layers take on in modern convolutional networks. We then leverage these insights to design a network that adaptively defines its own topology conditioned on the input image to increase both accuracy and efficiency.
computer vision; machine learning; Computer science
Belongie, Serge J.
Kleinberg, Jon M.; Naaman, Mor
Ph. D., Computer Science
Doctor of Philosophy
Attribution-NonCommercial-ShareAlike 4.0 International
dissertation or thesis