Mining Visual Knowledge from Pre-trained Models
Computer vision has made significant progress over the past decade, driven primarily by the dominant supervised learning paradigm, in which large-scale neural networks are trained on extensive datasets for each task. However, collecting data and annotations at that scale is often intractable. In contrast, humans can adapt to new vision tasks with very little data or few labels. This thesis aims to bridge this gap by presenting a practical solution: pre-training deep neural networks on readily accessible large-scale internet images, and then employing various techniques to adapt these pre-trained models to diverse downstream tasks with minimal or no additional data. In the pre-training stage, I introduce two meta-learning methods that yield better pre-trained image representations, which generalize to novel classes with minimal extra annotations. In the adaptation stage, I demonstrate multiple techniques for effectively adapting pre-trained models to data-constrained downstream tasks such as recognition, dense prediction, 3D generation, and reference-based image completion.