eCommons

 

Learning to Represent and Recognize Multimodal Videos

Abstract

In today's digital landscape, the staggering growth of video resources has resulted in a wealth of visual, auditory, and textual information readily available on the internet. To fully harness the potential of learning from multimodal videos, it is crucial to develop efficient techniques for processing and analyzing this information. This dissertation delves into two primary areas: label-efficient representation learning from videos and multimodal video recognition. We leverage the advances in large-scale deep learning to fully exploit the potential of internet videos. Label-efficient representation learning from videos is motivated by the fact that internet videos often lack high-quality labels. We first propose a contrastive learning-based framework to learn features in an unsupervised manner. This approach pulls together feature representations from the same video and pushes apart those from different videos. We then investigate how to learn more refined temporal features for time-sensitive tasks, such as temporal event classification and detection, and more precise spatial features for location-aware tasks, including tracking and detection. Lastly, we explore the use of stronger transformer backbones to integrate multimodal data and form a unified, robust representation. Regarding multimodal recognition for videos, we initially examine the challenging case of fine-grained video recognition to determine if different modalities can assist each other in ambiguous scenarios. We create a new expert-curated audiovisual benchmark and discover that multimodal fusion significantly benefits performance. We subsequently investigate open-vocabulary multimodal video recognition and propose a framework that can effectively classify videos from any given category.
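The contrastive objective summarized in the abstract (pull together features from the same video, push apart features from different videos) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes an InfoNCE-style loss over cosine similarities, and the function names, the temperature value, and the toy feature vectors are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(v):
    # Scale a feature vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Mean InfoNCE-style loss; view_a[i] and view_b[i] come from the same video."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every clip in the other view.
        logits = [
            sum(x * y for x, y in zip(anchor, other)) / temperature
            for other in b
        ]
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # The positive is the clip from the same video (index i);
        # clips from every other video act as negatives.
        total += log_denom - logits[i]
    return total / len(a)


# Toy features (made up): two videos, one clip per video in each view.
clips_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]   # same-video clips are similar
swapped_b = [[0.1, 0.9], [0.9, 0.1]]   # same-video clips are dissimilar

loss_aligned = info_nce_loss(clips_a, aligned_b)
loss_swapped = info_nce_loss(clips_a, swapped_b)
print(loss_aligned < loss_swapped)
```

In this sketch the loss is small when clips from the same video already have similar features and large when they do not, which is the behavior that training on such an objective would encourage.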

Description

153 pages

Date Issued

2023-08

Keywords

Audiovisual; Multimodal Learning; Open-vocabulary recognition; Representation Learning; Self-supervised Learning; Video Understanding

Committee Chair

Belongie, Serge

Committee Member

Hariharan, Bharath
Lee, Clarence

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

Rights

Attribution 4.0 International

Types

dissertation or thesis
