Learning from Multimodal Web Data

Learning from large, unlabeled web corpora has proven effective for a variety of multimodal understanding tasks. But algorithms that leverage this type of data often assume literal visual-textual correspondences, ignoring the non-literal ways in which users actually communicate online. As user attention is increasingly dominated by multimedia content (e.g., combinations of text, images, and videos), community moderators require tools capable of processing these complex forms of communication. In this work, we detail our progress towards two related research goals. The first goal is to leverage multimodal web data in settings of weak (or "web") supervision. The ultimate aim of this line of work is to build models capable of drawing connections between different modes of data, e.g., images+text. To this end, we present algorithms that discover grounded image-text relationships from noisy, long documents, e.g., Wikipedia articles and the images they contain. We also demonstrate that noisy web signals, such as speech recognition tokens from user-generated web videos, can be leveraged to improve performance in caption generation tasks. While these results show that multimodal web data can be harnessed to build more powerful machine learning-based tools, the communicative intent of multimodal posts, which extends well beyond literal visual description, is not well understood. Thus, the second goal is to better understand communication in a non-textual web. We first conduct an in vivo study of several Reddit communities that focus on sharing and discussing image+text content; we train algorithms that are able to predict popularity in this setting, even after controlling for important, non-content factors like post timing.
Finally, inspired by the fact that when text accompanies images online, the text rarely serves as pure literal visual description (an assumption enforced by most curated image captioning datasets), we introduce algorithms capable of quantifying the visual concreteness of concepts in multimodal corpora. We find not only that our scoring method aligns with human judgements, but also that concreteness is context-specific: our method discovers that "London" is a consistent, identifiable visual concept in an image captioning dataset (because post-hoc annotators mention "London" in captions only when the image depicts the city iconically), but not in a Flickr image tagging dataset (because users may tag any image that happens to be taken in London with the geotag "London").

182 pages

Committee Chair

Lee, Lillian

Committee Member

Sridharan, Karthik
Marschner, Stephen Robert
Mimno, David

Degree Discipline

Computer Science

Degree Name

Ph.D., Computer Science

Degree Level

Doctor of Philosophy

Type

dissertation or thesis
