Learning from Fine-Grained and Long-Tailed Visual Data
The visual world is fine-grained and long-tailed: many classes are difficult to distinguish, and a few classes account for most of the data while most classes are under-represented. Given the remarkable advances in computer vision fueled by large-scale datasets and deep learning, a central question is whether we can quantitatively model such visual data and design deep networks that learn from it. In this dissertation, we address this question from several perspectives. First, we propose a framework for growing a dataset and learning the corresponding model for fine-grained visual recognition with combined human and machine effort. We use deep metric learning to capture the relatively high intra-class variance in fine-grained visual data, assisted by human-annotated hard negatives collected during the labeling process. We then address how to design a deep network for fine-grained visual recognition. Specifically, we find that nonlinearities in the classifier help the network, and we therefore explicitly incorporate higher-order nonlinearities into the classifier with our proposed kernel pooling. Further, we focus on methods for fine-grained visual recognition when large-scale, long-tailed data is available. In particular, we show how to measure domain similarity in order to select a suitable subset of the source domain for improved transfer learning to specific target domains. Next, we present a characterization of long-tailed data distributions based on the effective number of samples, in which we quantify data overlap using a small neighborhood centered around each sample. Finally, we explore how to measure dataset granularity based on clustering theory, as a step toward a more precise definition of "fine-grained."
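For reference, the effective number of samples mentioned above has a closed form in the class-balanced loss formulation associated with this dissertation; the LaTeX sketch below restates it under that formulation's conventions (n is the number of samples of a class and N is the volume of the class's sample space, symbols which are not defined in this abstract itself):

\[
  E_n \;=\; \frac{1 - \beta^{\,n}}{1 - \beta},
  \qquad \beta = \frac{N - 1}{N},
  \qquad \text{class-balanced weight} \;\propto\; \frac{1}{E_n} \;=\; \frac{1 - \beta}{1 - \beta^{\,n}}.
\]

As n grows, E_n saturates toward N, which is what makes re-weighting by 1/E_n gentler than naive inverse-frequency weighting on long-tailed data.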
Belongie, Serge J.
Snavely, Keith Noah; Weinberger, Kilian Quirin; Azenkot, Shiri
Ph.D., Computer Science
Doctor of Philosophy
Attribution 4.0 International
dissertation or thesis