LANGUAGE MODELS FOR ROBOT TASK PLANNING WITH HUMAN DEMONSTRATIONS
Two major challenges exist in high-level robot task planning when the goal is underspecified: humans have implicit assumptions and preferences they may not articulate when specifying a goal or reward, and visual demonstrations are hard to ground in a form robots can understand. This thesis addresses these challenges by leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs) to convert task demonstrations to robot code offline, then adapt to changes by planning online. This work is divided into three connected components. First, we develop DEMO2CODE, which generates robot code from demonstrations that are assumed to be grounded in text form; here, we focus on its ability to capture preferences from real-world demonstrations. Second, we address grounding visual input in text form with VIDEO2DEMO, focusing on open-vocabulary predicate and action recognition in kitchen tasks. Lastly, we build and deploy an AI task planner that enables collaborative cooking within our MOSAIC framework; here, we evaluate the task planner's safety violations while it interacts with participants in our user study. In summary, this thesis advances robot task planning and human-robot interaction by using demonstrations for effective code generation, grounding visual information in text, and capturing and adhering to human preferences, all with generative language models.