Improving the Quality, Efficiency and Understanding of Generative Models
Visual generation lies at the intersection of perception and imagination. Humans can effortlessly create, edit, and reason about complex visual scenes, an ability that enables storytelling, communication, and design. Building artificial systems with similar generative capacities, ones that connect understanding with creativity, is a central challenge in computer vision and graphics. Such capability also underpins numerous applications in digital media, virtual reality, and content creation. In recent years, generative models have made remarkable progress toward this goal, producing high-fidelity images, animations, and 3D scenes. Despite this success, existing methods still face key challenges in controllability, computational efficiency, and reasoning ability. This dissertation explores how to improve the quality, efficiency, and understanding of generative models across diverse visual modalities.

The first part of this work addresses visual understanding through factorized video generation. FactorMatte introduces a counterfactual formulation of video matting that decomposes scenes into physically meaningful layers, enabling realistic re-composition and editing.

The second part investigates how to improve the quality and efficiency of diffusion-based image generation. Filter-Guided Diffusion develops a training-free, architecture-independent guidance method that accelerates sampling while preserving fidelity. AniDiffusion enhances the controllability of pose-conditioned animation generation by learning automatic rigging from minimal examples, and ArtiScene advances language-driven 3D scene generation by leveraging 2D diffusion intermediaries to achieve stylistic consistency and diverse layouts without additional training.

The final part focuses on improving the reasoning capabilities and efficiency of autoregressive multimodal models. ShortCoTI introduces an optimization based on RLHF (Reinforcement Learning from Human Feedback) that reduces redundant reasoning steps in autoregressive image generation while maintaining or improving output quality.

Together, these contributions advance the development of generative systems that are high-quality, computationally efficient, and capable of structured understanding across modalities such as images, videos, and 3D scenes.