Research Challenges and Opportunities for Open Generative Modeling
This dissertation develops methods to make generative modeling more accessible, reliable, and legally grounded across vision, biology, and language. I introduce CommonCanvas, an open latent diffusion pipeline trained solely on Creative-Commons-licensed images, which uses a lightweight “telephoning” caption-synthesis step to reach quality competitive with models trained on far more data. I then present PlantCaduceus, a genomics foundation model that exploits domain structure to outperform much larger generic models on plant-biology tasks. Finally, I apply a principled probabilistic framework to quantify memorization and the extraction of copyrighted training data from large language models, moving beyond coarse averages to surface per-work risk and to inform safer curation and deployment. Together, these contributions combine technical innovation with openness and risk-aware practice, lowering barriers and challenging prevailing conventions in modern AI.