Generalizable Learning for Natural Language Instruction Following on Physical Robots
Robot applications in unstructured human-inhabited environments, such as households, assistive scenarios, and collaborative industrial tasks, require a human-robot interaction interface through which humans can instruct robots what to do. Natural language is an accessible, expressive, and efficient interface for specifying a wide array of instructions and goals to robots. This thesis presents a modular and interpretable representation learning approach to following natural language instructions on physical robots by mapping raw visual observations to continuous control. The approach reconsiders the full robotics stack, addressing perception, planning, mapping, and control challenges, all with a focus on enabling behavior specification in natural language. The presented approach achieved the first demonstration of a physical robot following natural language instructions by mapping RGB images to continuous control. The thesis also introduces a hierarchical approach to following mobile manipulation instructions, showing promising early results on a virtual mobile manipulation benchmark.