Learning from Multimodal Web Data
Abstract
Learning from large, unlabeled web corpora has proven effective for a variety of multimodal understanding tasks. However, algorithms that leverage this type of data often assume literal visual-textual correspondences, ignoring the non-literal ways in which users actually communicate online. As user attention is increasingly dominated by multimedia content (e.g., combinations of text, images, and videos), community moderators require tools capable of processing these complex forms of communication. In this work, we detail our progress towards two related research goals.

The first goal is to leverage multimodal web data in settings of weak (or "web") supervision. The ultimate aim of this line of work is to build models capable of drawing connections between different modes of data, e.g., images and text. To this end, we present algorithms that discover grounded image-text relationships from noisy, long documents, e.g., Wikipedia articles and the images they contain. We also demonstrate that noisy web signals, such as speech recognition tokens from user-generated web videos, can be leveraged to improve performance on caption generation tasks. While these results show that multimodal web data can be leveraged to build more powerful machine-learning-based tools, the communicative intent of multimodal posts, which extends significantly beyond literal visual description, is not well understood.

Thus, the second goal is to better understand communication in a non-textual web. We first conduct an in-vivo study of several Reddit communities that focus on sharing and discussing image+text content; we train algorithms that can predict popularity in this setting, even after controlling for important, non-content factors like post timing. Finally, inspired by the observation that text accompanying images online rarely serves as pure literal visual description (an assumption enforced by most curated image captioning datasets), we introduce algorithms capable of quantifying the visual concreteness of concepts in multimodal corpora. We find not only that our scoring method aligns with human judgements, but also that concreteness is context specific: our method discovers that "London" is a consistent, identifiable visual concept in an image captioning dataset (because post-hoc annotators mention "London" only when the image depicts iconic London imagery), but not in a Flickr image tagging dataset (because users may tag any image that happens to be taken in London with the geotag "London").
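To make the concreteness idea more tangible, the sketch below shows one way such a score could be computed: a concept is treated as visually concrete in a given corpus if the images associated with it cluster together in a visual feature space, relative to chance. This is a minimal nearest-neighbor formulation for illustration only, not necessarily the thesis's exact measure; the names (visual_features, word_to_image_ids, concreteness_scores) and the neighborhood size k are assumptions.

```python
# Illustrative sketch only (not the thesis's exact method): score a word as
# visually concrete if the visual nearest neighbors of its images are also
# tagged/captioned with that word more often than chance would predict.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def concreteness_scores(visual_features, word_to_image_ids, k=25):
    """visual_features: (n_images, d) array of image embeddings (assumed given).
    word_to_image_ids: dict mapping a word to the set of image indices whose
    associated text (caption or tags) mentions it."""
    n = len(visual_features)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(visual_features)
    # neighbors[i] holds the k nearest images to image i, excluding i itself
    # (assuming no duplicate feature vectors).
    neighbors = nn.kneighbors(visual_features, return_distance=False)[:, 1:]

    scores = {}
    for word, image_ids in word_to_image_ids.items():
        ids = np.array(sorted(image_ids))
        member = np.zeros(n, dtype=bool)
        member[ids] = True
        # Observed rate: fraction of each tagged image's visual neighbors that
        # are also tagged with the word.
        observed = member[neighbors[ids]].mean()
        # Chance rate: probability a uniformly random other image is tagged.
        chance = (len(ids) - 1) / (n - 1)
        scores[word] = observed / max(chance, 1e-12)
    return scores
```

Under this formulation, a context-dependent concept such as "London" would score high in a corpus where its images share iconic visual content, and near 1 (chance level) in a corpus where the tag is applied to visually unrelated photos.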
Committee Member
Marschner, Stephen Robert
Mimno, David