Learning with and beyond visual knowledge
Over the course of the last decades, we have witnessed the significant progress of machine learning with neural models in computer vision. An essential part of computer vision is learning visual knowledge, which allows artificial agents to perceive, understand and interact with the world. However, in the pursuit of learning visual knowledge, researchers often face two kinds of challenges. On the one hand, a lack of training sets makes it hard for neural models to generalize to unseen data or downstream tasks. On the other hand, learning only from visual information might hinder the flexibility and generality of visual models. Indeed, existing models require additional labeled data to specify any other visual concept and make it hard for users to interact with a system. To solve these challenges, I explore beyond these existing computer vision challenges and design novel data-efficient machine learning algorithms as well as intelligent systems to enable advances in the model's generalization ability and flexibility. The ultimate goal is to make computers understand the environment comprehensively and interact appropriately with people. This dissertation has made progress toward building data-efficient intelligent systems with highly competitive accuracy, generality, and flexibility through two parts: learning with and learning beyond visual knowledge. For the first part, I primarily focus on two topics: 1) improving neural model performance by reusing features for image generation (Chapter 2) and exchanging features for data augmentation (Chapter 3); 2) studying and achieving a light-weight neural network to translate textures between single image pairs for data augmentation, especially for rare data (Chapter 4). For the second part, my research targets at designing a multimodal framework for language-driven semantic segmentation, where joint embeddings from language and images allow the creation of semantic segmentation systems that can segment an image with any label set (Chapter 5). In the end, I will discuss several promising research directions such as common sense reasoning and multimodal modeling with 3D scenes.