Evaluating language models applied to student thinking about experiments
Recent advances in natural language processing (NLP) have enabled education researchers to analyze text-based open-response data efficiently and at scale. Here we investigate the use of these methods for a variety of education research questions across multiple data sets. First, we compare several language models on large-scale identification of experimental skills in students' typed lab notes through sentence-level labeling. We find that fine-tuned higher-resource models often outperform fine-tuned lower-resource models, but that few-shot implementations of the higher-resource models do not outperform their fine-tuned counterparts.

Second, we investigate methods for assessing the trustworthiness of education claims made with machine-coded data. We propose a four-part method for making such claims with supervised NLP, grounded in the practices of experimental physics labs and including quantification of uncertainty to calibrate measurements. We provide evidence for this method using data from two short-response survey questions with two distinct coding schemes, and we work through a real-world example of using these practices to machine-code a data set unseen by human coders.

We then implement both a supervised and an unsupervised analysis of data from a survey that assesses student thinking about measurement in quantum and classical contexts. In the supervised portion, we train models to apply a single consistent coding scheme across five open-ended questions on the survey, and we build on the preceding results by demonstrating a second method for calibrating measurements. In the unsupervised portion, we perform a cluster analysis of student responses to a question that probes reasoning about sources of uncertainty in four experimental physics scenarios.

We conclude our investigations into automating the analysis of open-response questions with a study that does not use NLP at all. Instead, this study evaluates closed-response versions of questions that assess students' reasoning against their open-response counterparts. We conduct an experiment to test whether asking students to explain their reasoning in an open-response version of a question changes their answer to a subsequent closed-response version, and we find no such effect. In the same experiment, we investigate whether the closed-response version measures students' experimental reasoning differently than the open-response version. The differences we find suggest that the two formats measure different aspects of student thinking.

Together, these analyses form a foundation for education researchers to evaluate a range of established and emerging methods for conducting research at speed and scale. At the same time, they expand the toolkit available to education researchers for assessing student thinking about experiments. Short illustrative sketches of several of these methods follow.
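The first study's core technique is sentence-level labeling of lab notes with a fine-tuned language model. A minimal inference sketch with the Hugging Face transformers library is below; the model name is a generic placeholder (a real application would load a checkpoint fine-tuned on human-coded lab-note sentences), and the sentences and labels are invented.

```python
# Minimal sketch: sentence-level labeling with a transformer classifier.
# "bert-base-uncased" is a placeholder; a real run would point at a model
# fine-tuned on human-coded lab-note sentences, with meaningful label names.
from transformers import pipeline

classifier = pipeline("text-classification", model="bert-base-uncased")

sentences = [
    "We repeated the measurement five times to estimate the uncertainty.",
    "The pendulum period was longer than we expected.",
]

for sentence in sentences:
    prediction = classifier(sentence)[0]  # dict with "label" and "score"
    print(f"{prediction['label']} ({prediction['score']:.2f}): {sentence}")
```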
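The second study's calibration step, quantifying uncertainty in machine-coded measurements, can be illustrated with a standard misclassification correction. The sketch below uses the Rogan-Gladen prevalence estimator plus a bootstrap interval; the sensitivity, specificity, and labels are invented, and this is not necessarily the study's own procedure.

```python
# Minimal sketch: calibrating a machine-coded prevalence estimate.
# Uses the standard Rogan-Gladen misclassification correction; all
# numbers here are illustrative, not results from the studies above.
import numpy as np

def corrected_prevalence(p_obs, sensitivity, specificity):
    """Correct an observed machine-coded label rate for classifier error."""
    return (p_obs + specificity - 1.0) / (sensitivity + specificity - 1.0)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)  # hypothetical machine-coded labels
p_obs = labels.mean()

# Bootstrap the observed rate to attach a statistical uncertainty.
boot = np.array([rng.choice(labels, size=labels.size).mean() for _ in range(2000)])
p_corr = corrected_prevalence(p_obs, sensitivity=0.90, specificity=0.85)
low, high = np.percentile(corrected_prevalence(boot, 0.90, 0.85), [2.5, 97.5])
print(f"calibrated prevalence: {p_corr:.3f} (95% CI {low:.3f}-{high:.3f})")
```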
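The unsupervised portion's cluster analysis can be sketched in a similar spirit. This version assumes TF-IDF features and k-means; the study may well use different text representations and a different clustering algorithm, and the responses below are invented.

```python
# Minimal sketch: clustering short student responses.
# TF-IDF + k-means is one common baseline, not necessarily the study's choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

responses = [
    "The uncertainty comes from the timer's resolution.",
    "Human reaction time affects when we start the stopwatch.",
    "Quantum measurements are inherently probabilistic.",
    "Each photon collapses to one outcome at random.",
]

features = TfidfVectorizer().fit_transform(responses)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for response, cluster in zip(responses, kmeans.labels_):
    print(cluster, response)
```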
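Finally, the open- versus closed-response experiment turns on whether answer distributions differ between conditions. One conventional way to test that, sketched here with an invented contingency table (not the study's data), is a chi-squared test of independence:

```python
# Minimal sketch: testing whether seeing the open-response prompt first
# shifts closed-response answers. The counts are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: open-response prompt shown first vs. not.
# Columns: closed-response choice A vs. choice B.
table = [[48, 52],
         [45, 55]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```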