“NOTHING TOO SERIOUS”: CORPUS RESOURCES AND METHODS FOR DATA-DRIVEN APPROACHES TO POLARITY SENSITIVITY
This dissertation introduces the Polar Bigrams Resource (PBR), a large-scale corpus-based dataset designed to support data-driven investigations of polarity sensitivity. A major challenge for bottom-up processing is the lack of overt indicators of polarity environments. To address it, this work employs polarity approximations paired with aggressive post-processing, which the statistical power of large-scale data makes feasible. Asymmetry is a recurring theme, seen both in the empirical polarity landscape (positive polarity outnumbers negative 22:1 in PBR) and in the theoretical structure of polarity licensing (polarity-sensitive items require licensing, but not vice versa). These asymmetries motivate the use of (backwards) asymmetric association measures to quantify the polarity sensitivity of lexical units.
The 72 million bigram tokens of PBR are drawn from dependency-parsed corpora, including a novel corpus, Puddin, derived from a portion of The Pile. Contributing roughly 1.4 billion words, Puddin is largely responsible for PBR's empirical scale. Bigram polarity is approximated under two alternative definitions of positive polarity: the absence of a negative or the presence of a positive. A secondary sampling method balances the sizes of the two polarity sets by down-sampling the positive data. Together, these methods define four comparison spaces, offering flexibility in analytical design. To support transparency and replication, detailed descriptions of data-processing procedures are provided for both Puddin and PBR.
A detailed case study of the adverb exactly illustrates the practical application of PBR. It provides a fine-grained analysis of exactly across comparison spaces and relevant lexical units (exactly alone, bigrams containing exactly, and the most relevant adjectives considered independently) and includes PBR-sourced examples to elucidate the quantitative findings. The patterns uncovered suggest a preliminary theory in which pre-adjectival exactly is contingently polarity sensitive, depending on the scalar semantics of its adjective argument. Together, these methodological contributions and findings offer a new empirical foundation for polarity research and demonstrate both the power and the limits of data-driven approaches to evaluating complex semantic and pragmatic phenomena.
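To make the notion of an asymmetric association measure concrete, the following minimal Python sketch computes ΔP in both directions from a 2×2 table of counts. ΔP is used here only as one familiar directional measure; the abstract does not say which measures the dissertation adopts. The class name, function names, and counts are all hypothetical: the counts are invented toy values chosen to echo the 22:1 skew and are not drawn from PBR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolarityCounts:
    """2x2 token counts for one lexical item (hypothetical structure)."""
    item_neg: int   # target item in negative environments
    item_pos: int   # target item in positive environments
    other_neg: int  # all other items in negative environments
    other_pos: int  # all other items in positive environments


def delta_p_neg_given_item(c: PolarityCounts) -> float:
    """P(neg | item) - P(neg | other): how strongly the item predicts negation."""
    return (c.item_neg / (c.item_neg + c.item_pos)
            - c.other_neg / (c.other_neg + c.other_pos))


def delta_p_item_given_neg(c: PolarityCounts) -> float:
    """P(item | neg) - P(item | pos): how strongly negation predicts the item."""
    return (c.item_neg / (c.item_neg + c.other_neg)
            - c.item_pos / (c.item_pos + c.other_pos))


# Invented toy counts (NOT from PBR): 1,000 negative vs. 22,000 positive tokens.
toy = PolarityCounts(item_neg=80, item_pos=20, other_neg=920, other_pos=21980)

item_to_env = delta_p_neg_given_item(toy)
env_to_item = delta_p_item_given_neg(toy)
print(f"dP(neg | item) = {item_to_env:.4f}")
print(f"dP(item | neg) = {env_to_item:.4f}")
```

With these toy counts the two directions diverge sharply: the item is a strong cue for a negative environment, while a negative environment is only a weak cue for the item. A symmetric measure would collapse the two into a single score, which is the kind of information a directional measure is meant to preserve.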