Improving Data Efficiency and Availability for Alignment
AI systems have become increasingly powerful by applying ever-larger amounts of compute to increasingly complex tasks. For example, writing a rules-based search engine that explicitly computes which links a specific user would find relevant for a given query is prohibitively difficult, but machine learning can predict which links a user is most likely to click on. Reinforcement learning from human feedback (RLHF) takes ML further: rather than trying to specify good behavior directly, it learns a reward model from preference comparisons and uses it to improve an LLM's responses. In both of these settings, human feedback is critical for aligning ML systems, whether for personalization or for tasks that are difficult to specify directly. While compute keeps getting cheaper, human attention remains expensive. How can human feedback be made more efficient? How can it be made more available?

This thesis presents several projects that improve data efficiency and data availability for aligning AI systems with human feedback. First, it analyzes the tradeoff between exploration and exploitation that arises when observing rewards carries an explicit cost in a bandit setting. Second, it provides methods for allocating a fixed interaction budget to improve the data efficiency of off-policy learning and evaluation in realistic settings appropriate for search and recommendation systems. Third, it shows how to improve LLM-based assistants using implicit feedback gathered from user interactions. Finally, it outlines preliminary results and promising directions for future work on learning from human feedback.
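To make the reward-modeling step above concrete, the sketch below shows the standard Bradley-Terry preference loss on which RLHF reward models are commonly built. It is a minimal illustration in PyTorch, not the implementation used in this thesis; the RewardModel class, the random embedding inputs, and all dimensions are assumptions made for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Illustrative reward model: maps a response embedding to a scalar score."""
        def __init__(self, dim: int):
            super().__init__()
            self.head = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(x).squeeze(-1)

    def preference_loss(model, chosen, rejected):
        """Bradley-Terry loss: push the preferred response's score above the rejected one's."""
        margin = model(chosen) - model(rejected)
        return -F.logsigmoid(margin).mean()

    # Toy usage on random "embeddings" standing in for encoded responses.
    model = RewardModel(dim=16)
    chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
    preference_loss(model, chosen, rejected).backward()

Minimizing this loss raises the reward margin between preferred and dispreferred responses, which is exactly the signal a preference comparison provides.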
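The bandit setting in the first project can likewise be sketched in a few lines. The toy policy below is a hedged illustration, not the algorithm analyzed in the thesis: it pays a fixed cost to observe the reward only on exploratory pulls, so the price of feedback directly limits how much exploration is worthwhile. The Gaussian rewards and the epsilon-greedy rule are assumptions for the example.

    import random

    def costly_observation_bandit(true_means, horizon, obs_cost, eps=0.1):
        """Epsilon-greedy bandit where rewards always accrue but are only
        seen when the learner pays obs_cost to observe them."""
        k = len(true_means)
        counts, estimates = [0] * k, [0.0] * k
        payoff = 0.0
        for _ in range(horizon):
            explore = random.random() < eps
            arm = random.randrange(k) if explore else max(range(k), key=estimates.__getitem__)
            reward = random.gauss(true_means[arm], 1.0)
            payoff += reward
            if explore:  # pay to observe only when exploring
                payoff -= obs_cost
                counts[arm] += 1
                estimates[arm] += (reward - estimates[arm]) / counts[arm]
        return payoff

    print(costly_observation_bandit([0.2, 0.5, 0.9], horizon=10_000, obs_cost=0.3))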
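For the second project, the core object is off-policy evaluation from logged interactions. The textbook inverse-propensity-scoring (IPS) estimator is sketched below only to fix ideas; the tuple format of the logs and the target_prob interface are assumptions for the example.

    def ips_value(logs, target_prob):
        """Inverse-propensity-scoring estimate of a target policy's value.

        logs: iterable of (context, action, logging_propensity, reward)
            tuples collected by a logging policy.
        target_prob: function giving the target policy's probability of
            taking `action` in `context`.
        """
        logs = list(logs)
        total = sum(
            (target_prob(context, action) / propensity) * reward
            for context, action, propensity, reward in logs
        )
        return total / len(logs)

Because the estimator's variance depends on how the logging policy spreads its propensities, how a fixed interaction budget is allocated when collecting the logs directly determines the data efficiency of the resulting estimates.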