MODELING PERSONAL EXPERIENCES SHARED IN ONLINE COMMUNITIES

A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

by Maria Alexandra Antoniak
August 2022

© 2022 Maria Alexandra Antoniak
ALL RIGHTS RESERVED

MODELING PERSONAL EXPERIENCES SHARED IN ONLINE COMMUNITIES
Maria Alexandra Antoniak, Ph.D.
Cornell University 2022

Written communications about personal experiences, such as giving birth or reading a book, can be both rhetorically powerful and statistically difficult to model. My research explores unsupervised natural language processing (NLP) models to represent complex personal experiences and self-disclosures communicated in online communities, while also re-examining these models for biases and instabilities. I seek to reliably represent individual experiences within their social contexts and model interpretive dimensions that illuminate both patterns and outliers, while addressing social and humanistic questions. Through this work, I develop a data science practice that emphasizes cross-disciplinary collaborations and care for datasets and their authors. In this dissertation, I’ll share case studies that highlight both the opportunities and the risks in reusing NLP models for context-specific research questions.

BIOGRAPHICAL SKETCH

Maria Antoniak was born and raised in Richland, Washington. Prior to her Ph.D., Maria received her B.A. from the University of Notre Dame in the Program of Liberal Studies, with a minor in the Glynn Family Honors Program. Maria spent one year teaching English at the Ukrainian Catholic University in Lviv, Ukraine, and then completed her M.S. in Computational Linguistics at the University of Washington, where she collaborated with and was supported in her research by the Pacific Northwest National Laboratory. Maria worked as a data scientist for the startup Maana for one year before beginning her Ph.D.
in Information Science at Cornell University, where she was advised by David Mimno. During her time as a Ph.D. student, she completed research internships at Microsoft Research, Facebook Core Data Science, and Twitter Cortex. In Fall 2022, Maria will join the Allen Institute for Artificial Intelligence in Seattle, Washington, as a Young Investigator.

ACKNOWLEDGEMENTS

Thank you to my advisor, David Mimno. I have been incredibly fortunate to work with you, and I am not only a better scientist but a better person for having followed in your footsteps for these six years. At every twist and turn in this journey, you supported me without hesitation, and I will always be grateful for your wisdom, generosity, expertise, and kindness.

Thank you to my committee: Lillian Lee, Jeff Rzeszotarski, and Richard Jean So. You inspired me, you gave me room to explore, and you championed me at key moments when I needed your support. Your advice, feedback, and attention were an honor to receive.

Thank you to my collaborators, including Karen Levy, Melanie Walsh, LeAnn McDowall, A. Feder Cooper, and Robert Griffin. In particular, thank you to Karen Levy for honest conversations and thoughtful teaching that showed me new perspectives; your brilliance has been a guiding light. Thank you to Melanie Walsh for your creativity and rigor as a coauthor and for your steady friendship.

Thank you to my labmates: Moontae Lee, Jack Hessel, Alexandra Schofield, Laure Thompson, Gregory Yauney, Rosamond Thalken, Katherine Lee, and Federica Bologna. I have learned so much from each one of you and was lucky to work, read, and discuss (and enjoy brunch) together. Thank you especially to Alexandra Schofield and Jack Hessel for your mentorship and friendship and for always blazing the trail ahead of me. Thank you also to my extended labmates, with whom I shared an office space and long conversations: Justine Zhang, Liye Fu, and Jonathan Chang.
Thank you to my many friends made at Cornell: Lauren Kilgour, Sharifa Sultana, Jen Liu, Briana Vecchione, Emily Tseng, Bradi Heaberlin, Malcolm Bare, Natalie Tong, Anthony Poon, and many others who journeyed with me through dark winters, late nights, and moments of joy and celebration. I am sure we have many more adventures ahead of us. Thank you also to friends and colleagues in my “extended cohort” across different institutions, whom I met at conferences, at internships, and on social media, who grew up academically with me and supported me from afar.

Thank you to my allies, teachers, and friends in Graduate Women in Science (GWiS) and Graduate Students for Gender Inclusion in Computing (GSGIC). Working with you pushed me to grow in new directions. In particular, thank you to the founding volunteers of GSGIC — Alexa VanHattum, Marianne Aubin Le Quéré, Tegan Wilson, Kate Donahue, Varsha Kishore, Claire Liang, Gregory Yauney, Andrea Cuadra, Griffin Berlstein, Sharifa Sultana, Adelaide Fuller, and Sachi Angle — who worked tirelessly throughout the pandemic to support their community.

Thank you to mentors in the past who took a chance on me early in my career. Thank you to Jane Oliensis, Emily Bender, Gina-Anne Levow, Fei Xia, Courtney Corley, Eric Bell, Jason Mackay, and many other academic and industry mentors who shared their time and advice with me. Thank you in particular to Women in Machine Learning (WiML), whose activities opened my eyes to research opportunities at Cornell University.

Thank you to Cornell faculty members who provided mentorship and guidance, including Matthew Wilkens, Marten van Schijndel, Steve Jackson, Yoav Artzi, Mor Naaman, Emma Pierson, and many others. Thank you also to the Cornell NLP group for six years of discussions, debates, and learning.
Thank you to staff in Cornell Information Science, especially Barbara Woske, Janeen Orr, Eileen Grabosky, Penny Stewart, and Lou DiPietro, who helped me many times, both academically and in my efforts to start community initiatives.

Finally, thank you endlessly to my family, including my father, Dr. Zenen Antoniak, and my mother, Sherry Dempsey Antoniak, my brilliant brothers and sister, and my many aunts, uncles, and cousins who supported me on this ambitious path.

TABLE OF CONTENTS

Biographical Sketch
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Background
  2.1 Measurement Tools for Unlabeled, Socially-Specific Data
    2.1.1 Topic Modeling for Cultural Analysis
    2.1.2 Word Embeddings
  2.2 Self-Disclosures in Online Communities
  2.3 Personal Narratives
  2.4 Sensemaking through Self-Disclosure and Storytelling
3 Data Practices
  3.1 Upstream and Downstream Research
  3.2 Handling Data with Care
    3.2.1 Examples of Data Handling Strategies
4 Instability of Unsupervised, Distributional Models
  4.1 Case Study: Word Embedding Instabilities
    4.1.1 Models and Methods
    4.1.2 Results Across Models and Settings
    4.1.3 Discussion and Impact
  4.2 Case Study: Comparative Measurements of Bias
    4.2.1 Framework of Seed Sources
    4.2.2 Bias Measurement Methods
    4.2.3 Seed Features Impact Bias Measurements
    4.2.4 Discussion and Impact: Biases All the Way Down
5 Personal Healthcare Experiences
  5.1 Healthcare Datasets for Natural Language Processing
  5.2 Case Study: Online Childbirth Narratives
    5.2.1 Data Curation
    5.2.2 Narrative Analysis
    5.2.3 Framing of Power
    5.2.4 Ethical Considerations
    5.2.5 Discussion and Impact
6 Personal Reading Experiences
  6.1 Literary Reception and Online Book Reviews
  6.2 Collaborative Tagging Systems
  6.3 Literary Genres
  6.4 Case Study: Mapping Literary Genres on LibraryThing
    6.4.1 Data from LibraryThing
    6.4.2 Ethical Considerations
    6.4.3 Mapping Methods
    6.4.4 Results
    6.4.5 Discussion and Impact
7 Conclusion

LIST OF TABLES

4.1 The three settings that manipulate the document order and presence in each corpus.

4.2 The most similar words with their means and standard deviations for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting). Embeddings are learned from documents segmented by sentence.

4.3 The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

4.4 Examples of real seed terms used in recent work to measure biases in corpora.

4.5 Overview of the surveyed seed sources.

4.6 When two seed sets are more semantically distinct they are more distinguishable in the resulting geometric subspace. The top table shows pairs of artificially generated seed sets, ranked by their coherence for WEAT in the NYT dataset. The bottom table shows pairs of seed sets gathered from published papers, ranked by their coherence for WEAT in the WikiText dataset. Scores are averaged across 20 bootstrapped samples of the training data, and values are rounded; no coherence scores are exactly 1.0. Higher coherence scores indicate that the seed pairs were projected farther apart in the bias subspace.

5.1 The bigrams drawn from the post titles associated with the most and least probable stories. Probabilities represent the means of the summed log probabilities of the last ten topic transitions in a story. Lower scores indicate stories with more unusual topic transitions (sequences of events). Results are averaged (mean) across bootstrapped samples of the stories.

5.2 Personas identified in the birth stories collection and the n-grams used to classify the personas.

6.1 Example classifications and surprisal scores. Excerpts are selected from the last 100 words of the reviews. Higher surprisal indicates greater confidence in the incorrect label.

LIST OF FIGURES

4.1 The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora using the whole documents. Larger variations indicate less stable embeddings.

4.2 The mean Jaccard similarities across settings and algorithms for the top 2 and 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top N membership. Results are shown for the sentence document length.

4.3 Bias measurements depend on seeds. We calculate the cosine similarities between different seed sets representing women and an averaged unpleasantness vector from two embedding models. Results are consistent across seeds for romance review embeddings, but vary widely between sets for history and biography. We find similar variation even for a pretrained Google News model.

4.4 Ranking word vectors by cosine similarity with the top principal component vector for the original gender seed pairs (a) appears to identify words representing men and women much better than random (b). But shuffling the pairing of seed words (c) maintains correlation with gender but to a less clear degree. Results are shown for the NYT corpus with a frequency threshold of 100 and bootstrap resampling.

4.5 Identifying bias is less effective when set pairs are similar. Generated seeds are frequency-controlled nouns from the WikiText dataset. We highlight two sets of gathered seeds; both target similar racial categories but the name-based sets are more similar and explain less variance. We find similar trends for WEAT, coherence, and the other corpora and POS.

5.1 A selection of topics over time. Plots are labeled with the five highest probability words for each topic. Results show the probability for each topic at 10% intervals of story time, averaged across all stories. Error bars show standard deviation across bootstrapped samples of stories.

5.2 Histograms showing the frequencies of persona mentions over story time. Some entities (e.g., author) are consistently more frequent than rare entities (e.g., doula). Some frequency patterns are expected while others are surprising (e.g., frequency of we decreases near the middle of the stories).

5.3 Flowchart of the most probable topic transitions (above 0.2%). We removed one orphan node without a parent path leading to the beginning of story (BOS) state.

5.4 Most frequent verbs from the power lexicon associated with each persona in the birth stories corpus. Green indicates a positive power contribution, while pink indicates a negative power contribution. The cell values indicate the proportion of persona mentions with the given verb and power relationship.

5.5 (a) Power scores for each persona. Error bars show standard deviation over 20 bootstrap samples of the collection. (b) Estimated power of personas (rows) over other personas (columns). The NURSE is consistently framed as more powerful than the other personas, except for the DOULA.

6.1 A mapping of LibraryThing genres. User overlap between genre pairs correlates with book overlap, but there are outliers. Each point represents two genres, and the axes represent the rank of the genre pair, where lower numbers indicate higher ranks and therefore higher overlap. For example, the genre pair classics + animals has a mid-range user overlap rank and a high book overlap rank, indicating that these genres share surprisingly few users given how many books are shared. Pearson correlation between book and user overlap is significant (r = 0.68, p < 0.05).

6.2 The number of overlapping books and the number of genre misclassifications of user reviews for each pair of genres. Each point represents a pair of genres in which one is the true tag applied to the review text and one is the predicted tag from our model. As expected, we find a significant relationship using Pearson correlation (r = 0.65, p < 0.05) between the book overlap and misclassification count, but we highlight outlier genre pairs, e.g., animals and psychology have an unusually high misclassification count given their very low book overlap.

6.3

6.4

6.5 Are tighter communities easier to predict? Are tighter communities more critical? Figure 6.3 shows the target genres plotted along surprisal (the ability of a classifier to predict the genre of a review) and community homogeneity (averaged cosine similarities between reviewers’ tagsets). Figure 6.4 shows the target genres plotted along rating and community homogeneity. Genres whose reviewers have more similar reading habits tend to also have higher ratings according to a Pearson correlation test (r = -0.60, p < 0.05).

CHAPTER 1
INTRODUCTION

Sharing personal stories and experiences can be transformative, persuasive, and powerful, for both the narrators and their audiences.
Social movements like #MeToo amass large numbers of personal stories to build solidarity and community and to make sense of collective experiences. The founder of #MeToo, Tarana Burke, wrote that “We call [this idea] ‘empowerment through empathy’ ... to not only show the world how widespread and pervasive sexual violence is, but also to let other survivors know they are not alone” [Burke, October 15, 2017a,O]. Sharing personal stories can have physical, emotional, and social health benefits [Pennebaker and Beall, 1986, Pennebaker, 1997, Merz et al., 2014, Oh and Kim, 2016] and can help communities make sense of difficult experiences together [Tangherlini, 2000, Mamykina et al., 2015].

Personal stories are also used by journalists and politicians as a rhetorical device to connect with their audiences and sway public opinion. For example, as part of a larger ProPublica project focusing on the maternal mortality crisis in the U.S., an article focusing on the story of a single neonatal intensive care nurse received wide attention and hundreds of comments [Martin et al., 2017a]. Another article in the series, which gathered stories of 16 women, received an award from the National Association of Black Journalists [Martin et al., 2017b]. Publishing these stories led to an outpouring of 5,000 more personal stories [Gallardo, 2018] and awoke a public consciousness of the crisis that previously published statistics had not.

Online, people are eager to read and share their personal experiences with products, books, healthcare, and other topics via reviews, blogs, and social media posts. Online healthcare support communities abound in personal stories and questions, even when these are often repetitions of other posts.
These communities provide opportunities for people to share their stories, emotions, and reflections with a sympathetic audience while making sense of these experiences as a group [Patel et al., 2019, Genuis and Bronstein, 2017, Young and Miller, 2019]. Sometimes these communities allow people to self-disclose [Jourard and Lasakow, 1958, Jourard, 1971], or share information about themselves, that they cannot share in other settings [Yang et al., 2019b].

Computational analysis of these stories can help researchers learn from large amounts of online disclosures and narratives. For example, Gallagher et al. [2019] used text classification and network analysis methods to find that disclosing can encourage others to share more details when disclosing, while Yang et al. [2019b] used machine learning techniques to determine that online health support community members make more negative disclosures in private settings. Unsupervised distributional models for text, like topic modeling [Blei et al., 2003] or word embeddings [Mikolov et al., 2013a], can reveal shared structure and biases across large datasets without high up-front costs.

The voices in these communities deserve to be heard, and precisely because of their importance, they deserve to be heard reliably. Written communications about personal experiences can be statistically difficult to model, and there might not be a single correct interpretation of these communications. Analyzing the sentiment or framing disclosed in personal stories is challenging, and extracting structure from narratives is a complex task involving many subtasks, including entity recognition and linking, event detection and next-event prediction, sentiment analysis, etc. Methods addressing these challenges often rely on labeled datasets which are costly to create and do not necessarily transfer well across specific subdomains.
Datasets of personal experiences are often socially specific, with their own context and communication norms, and these datasets can be much smaller than the typical machine learning training dataset. These are fragile samples prone to noise and misreading — not generalizable training datasets or complete populations — and they need to be treated with care, both ethically and statistically.

In this thesis, I explore these themes of personal stories, reliability, and online communities through a series of case studies. Methodologically, these studies rely on unsupervised distributional models from natural language processing (NLP), representing complex personal experiences and self-disclosures via lexical patterns. These studies focus on stories people tell about themselves and the sharing of those stories in online communities, and they also re-examine computational models for biases and instabilities. The goal of these studies is to reliably represent individual experiences within their social contexts, while modeling interpretive dimensions that illuminate both patterns and outliers. For personal stories in particular, it is not sufficient to discuss this data only in terms of summary statistics; it is also necessary to zoom in on individual stories, e.g., through close reading, a method from literary criticism that focuses sustained attention on a brief selection of a written work.

While NLP researchers often focus on the downstream applications of NLP models, cultural researchers often focus on what might be called upstream applications — that is, on what the training data can reveal about situated cultural questions. My work has considered both of these perspectives and translated between these approaches. How can we use NLP methods reliably and ethically to identify patterns while not losing sight of outliers and uncertainty?

Within the field of NLP, I have critically re-examined machine learning measurement methods.
I probed models designed for traditional NLP tasks involving large, generic datasets by exploring their results on small, socially-specific datasets that are common in social science and humanities research. My work has shown that popular NLP methods ported to new domains can result in surprising instabilities and biases [Antoniak and Mimno, 2018, 2021]. For example, training a word embedding model on a target corpus and comparing cosine distances between word vectors can lead to unpredictable results, unless researchers measure the stability of these distances, e.g., by bootstrapping the training corpus. Similarly, bias measurements that rely on lexicons of seed terms can produce significantly different results depending on various features of the seed sets.

Within computational social science and the digital humanities, I use these same NLP methods for the study of structured personal experiences shared in online communities [Antoniak et al., 2019, 2021, Walsh and Antoniak, 2021]. I identified two main sites for this research: online communities grounded in cultural experiences (books, games) and online communities grounded in healthcare experiences (childbirth, contraception, pain management). These communities situate personal opinions and stories in social contexts of reception, expectation, and judgment, and the shared grounding in each community allows clearer illumination of patterns and outliers. My research has highlighted how community members use shared experiences to reframe and redefine established narratives and hierarchies, revealing lessons for healthcare systems, readership and reception, and careful research practices.

In work focused on personal healthcare experiences, for example, I have shown how postpartum people in a specific online community share stories about their birth experiences [Antoniak et al., 2019].
This work shed light on the framing of outlier event sequences and the perceived power hierarchy of medical professionals, family and support people, and the narrator themselves. By taking advantage of the biological and medical structure of these stories, we were able to use simple unsupervised models and lexical methods to extract narrative pathways and measure power relationships.

In work focused on online reading communities, my work has highlighted the literary significance of new research resources like Goodreads and LibraryThing [Antoniak et al., 2021, Walsh and Antoniak, 2021]. Before these communities existed, scholars interested in literary reception had to rely mostly on professional reviews, publishing records, and sales data to try to reconstruct public perceptions of books. My work examined the collaborative tagging system that exists on both of these websites, where users can categorize books into genres, subjects, or any categorical system the user wishes, with downstream consequences for physical libraries and bookstores. These user-based tagging systems capture popular perceptions and can remap the literary landscape, but they can also entrench long-standing social hierarchies and modern algorithmic design choices.

In this thesis, I will discuss a selection of case studies covering the topics discussed above. First, I will highlight the instabilities in two distributional methods that are commonly used for research applications in cultural analytics, where the reliability of fine-grained measurements is of utmost importance. These methods include ranking of word vectors by cosine similarities after training an embedding model; and bias measurement methods, which rely on sets of seed terms. Then I will move on to two applied case studies, where I employ similar distributional methods to study particular online communities.
In the first case study, I will discuss my work on an online birth stories community, and in the second case study, I will discuss my work on online reading communities. Finally, I will conclude with open questions for future work.

CHAPTER 2
BACKGROUND

2.1 Measurement Tools for Unlabeled, Socially-Specific Data

Unsupervised methods allow researchers to explore their datasets without high up-front costs. In particular, distributional semantic models — which rely on comparisons of word co-occurrences — can provide fast ways for researchers to explore themes, biases, and other patterns in text data. These models include word embeddings and topic modeling, two of the central methods discussed in this thesis.

However, these methods also carry some risks, particularly in settings with small and constrained datasets. In these settings, it becomes even more important to test the reliability and significance of results, as the presence or absence of individual documents can more significantly alter distributional patterns learned from the dataset. Without proper evaluation, the lack of ground truth can result in spurious “story-telling” in which the researcher retroactively fits their theory to the model.

These constraints are common in both the humanities and healthcare settings discussed later in this thesis. In healthcare, privacy restrictions make it very difficult to create shared datasets, while in the digital humanities, copyright restrictions are supplemented by the natural boundaries of the object of study. We cannot augment an author’s oeuvre and are constrained to the data that has survived through curation. Existing datasets are often small and labels are limited; it might be expensive or impossible to create reliable gold labels (e.g., there is no single correct summarization of a narrative). Even for tasks where large, labeled datasets do exist, pretrained models have limited ability to generalize to small, domain-focused datasets.
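The concern raised in this section, that the presence or absence of individual documents can alter distributional patterns in a small dataset, can be made concrete by resampling documents with replacement and recomputing a statistic on each resample. The following is a minimal sketch and not the procedure used in the case studies: the toy corpus `DOCS`, the helper names `cooccurrence_rate` and `bootstrap_rates`, and the choice of statistic (a document-level co-occurrence rate) are all assumptions made for illustration.

```python
import random
import statistics

# Hypothetical toy corpus: each document is a list of tokens.
DOCS = [
    ["birth", "hospital", "nurse", "pain"],
    ["birth", "home", "midwife", "doula"],
    ["hospital", "nurse", "doctor", "pain"],
    ["book", "review", "genre", "romance"],
    ["birth", "hospital", "doctor", "epidural"],
    ["book", "genre", "history", "review"],
]


def cooccurrence_rate(docs, word_a, word_b):
    """Fraction of documents containing word_a that also contain word_b."""
    with_a = [doc for doc in docs if word_a in doc]
    if not with_a:
        return 0.0  # word_a is absent from this sample
    return sum(word_b in doc for doc in with_a) / len(with_a)


def bootstrap_rates(docs, word_a, word_b, n_samples=200, seed=0):
    """Recompute the statistic on corpora resampled with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rates = []
    for _ in range(n_samples):
        sample = [rng.choice(docs) for _ in docs]
        rates.append(cooccurrence_rate(sample, word_a, word_b))
    return rates


rates = bootstrap_rates(DOCS, "birth", "hospital")
print(f"full corpus: {cooccurrence_rate(DOCS, 'birth', 'hospital'):.2f}")
print(f"bootstrap mean: {statistics.mean(rates):.2f}, "
      f"std: {statistics.pstdev(rates):.2f}")
```

A large spread across resamples relative to the full-corpus value would suggest that the measurement depends heavily on which documents happen to be present, which is the situation that calls for reporting variability alongside any single point estimate.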
In this section, I provide an overview of these methods and situate them within cultural analytics and NLP research. I begin by describing two computational methods, topic modeling and word embeddings, which are popular techniques in both fields.

2.1.1 Topic Modeling for Cultural Analysis

Topic modeling is an unsupervised NLP method that automatically discovers topics or themes in a set of texts and can be used as an exploratory technique leading a researcher to different views of a dataset [Baumer et al., 2017]. Latent Dirichlet allocation (LDA) is a popular topic model [Blei et al., 2003] that has been used in a wide range of applications in cultural analytics as various as measuring the relationship between ideas over time [Prabhakaran et al., 2016, Tan et al., 2017], exploring the types of disclosures shared during the #MeToo social movement [Mueller et al., 2021], examining gender stereotypes in automatically produced stories [Lucy and Bamman, 2021], and tracking changes in societal concerns and views related to COVID-19 [Zamani et al., 2020].

LDA is a generative probabilistic model where each topic in a set of K topics is represented by a probability distribution over the full vocabulary and each document is represented by a probability distribution over the topics. The researcher can select K as well as set the Dirichlet priors informing the sparsity of topics for each document and the vocabulary for each topic. This model makes a series of simplifying assumptions, including a bag-of-words assumption (word order does not matter), a sparsity assumption (each document is best represented by a few topics), and a mixture assumption (each document is represented by a mixture of topics).

Producing coherent topics often requires careful preprocessing of the dataset and consideration of the dataset features and training parameters. I detail some of the key steps in this process below.

Training algorithm, document size, and number of topics.
Producing coherent topics requires careful consideration of the training dataset. For small datasets, more coherent results can be found using Gibbs sampling rather than other training procedures. The training document size can also affect results, with smaller documents (e.g., tweets) often producing less coherent results than longer documents (e.g., paragraphs, posts). Selecting the number of topics K requires experimentation with each new dataset; different numbers of topics can reveal different types of topics (e.g., a smaller number can produce higher level themes), and there is no one “correct” number of topics, as this depends on the researcher’s goals.

Stemming and stop word removal. Stemming (removing all but the morphological root from each token) before training a topic model is mostly redundant and can be harmful [Schofield and Mimno, 2016]. Removing stop words can be done pre- or post-training; the benefit is mostly an aesthetic improvement to the lists of most probable words that are commonly used to interpret the topics after training [Schofield et al., 2017a]. Most importantly, large numbers of duplicates can affect the resulting model. While sometimes the duplicates will be sequestered into a single topic, which can be ignored, at other times the duplicates can overwhelm the model and should be removed before training [Schofield et al., 2017b].

Shifting away from known metadata. Sometimes, LDA will produce topics that redundantly mirror known categories that might already be encoded in a corpus’s metadata. For example, a topic model trained on a dataset of novels might learn topics that correspond to each of the most popular authors. This output is correct — the model is successfully identifying meaningful patterns in the dataset — but it is probably not useful to a researcher who already has author data for each of the novels.
To overcome this challenge and bias the model away from known metadata, Thompson and Mimno [2018a] provide a tool that probabilistically subsamples words that appear more frequently with the undesired categories before training. Preprocessing the training data in this way can produce topics that are more useful for exploration and interpretation.

Dirichlet priors. Two Dirichlet priors control the sparsity of the topic-word and document-topic distributions. Wallach et al. [2009] found that tuning the document-topic prior improves the resulting topics, while tuning the topic-word prior is not helpful.

Topic model evaluation. Topic model evaluation is difficult, as there is no single best set of topics for a dataset. Automatic metrics like coherence [Mimno et al., 2011] can provide a general rule-of-thumb but do not always correlate well with human judgments [Hoyle et al., 2021]. Instead, it is best to use manual evaluation, such as “intruder” tests, in which a human annotator tries to find the random intruder word among a set of true highest probability words for a specific topic (or an intruder document among the set of documents with highest probability for this topic) [Hoyle et al., 2021]. These tests measure the success of the model in producing themes that are legible to the researcher.

2.1.2 Word Embeddings

Word embedding models map words in a vocabulary to low-dimensional vectors. These word vectors can then be used as input to classifiers or compared to one another using different distance or similarity metrics, like cosine distance. These metrics can then be used as a proxy for the semantic similarity (or rather, the distributional similarity) of the words in the training corpus. In NLP, word embeddings are often used as features for downstream tasks.
Dependency parsing [Chen and Manning, 2014], named entity recognition [Turian et al., 2010, Cherry and Guo, 2015], and bilingual lexicon induction [Vulić and Moens, 2015] are just a few examples where the use of embeddings as features has increased performance in recent years.

Word embeddings are often used as evidence in studies of language and culture. For example, Hamilton et al. [2016] train separate embeddings on temporal segments of a corpus and then analyze changes in the similarity of words to measure semantic shifts, and Heuser [2016] uses embeddings to characterize discourse about virtues in 18th Century English text. Other studies use cosine similarities between embeddings to measure the variation of language across geographical areas [Kulkarni et al., 2016, Phillips et al., 2017] and time [Kim et al., 2014]. Researchers have used these similarity scores to measure the biases encoded in the embedding model and to make claims about biases in the training dataset and its authors [Caliskan et al., 2017]. Each of these studies seeks to reconstruct the mental model of authors based on documents.

Word embedding models are often described as measuring the semantic similarity of words. In truth, they measure the distributional similarity of words, which may or may not correlate with semantic similarity. This discrepancy becomes more apparent when considering various weaknesses of these models, where semantic similarity tests would fail. As with other distributional methods, word embedding models are prone to errors where words are related in context but not in meaning. In this vector space, antonyms will often appear close together, as antonyms are usually very similar to each other in all ways except one and are used in similar contexts [Cruse, 1986]. Word sets (like days of the week or months of the year) will also often appear as close to one another as synonyms, as they are usually used in nearly identical contexts.
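To make the distinction between distributional and semantic similarity concrete, the following minimal sketch compares hand-made context-count vectors. The counts, the context words, and the function name `cosine_similarity` are hypothetical and chosen only for illustration; no trained embedding model is involved.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical counts of how often each word appears near each context word.
contexts = ["weather", "water", "very", "novel", "library"]
counts = {
    "hot":  [8, 6, 9, 0, 0],
    "cold": [9, 5, 8, 0, 0],
    "book": [0, 0, 1, 7, 6],
}

# Antonyms that share contexts versus words that rarely share contexts.
sim_antonyms = cosine_similarity(counts["hot"], counts["cold"])
sim_unrelated = cosine_similarity(counts["hot"], counts["book"])

print(f"hot vs. cold: {sim_antonyms:.3f}")
print(f"hot vs. book: {sim_unrelated:.3f}")
```

Because the antonyms hot and cold share nearly all of their contexts in this toy table, their vectors are far closer to each other than either is to book, even though their meanings are opposed.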
Strange errors can also emerge through reporting bias: when the frequency of a descriptive phrase does not match the real-world frequency of the things signified by the phrase [Gordon and Van Durme, 2013]. For example, according to a word embedding model, the vector representing sheep might be much closer by cosine distance to the vector for black than to the vector for white, because sheep are assumed to be white and their color is usually mentioned only when this is not the case. Word embedding models can also encode human biases, including harmful stereotypes [Bolukbasi et al., 2016a, Caliskan et al., 2017] (partly driven by the reporting biases described above).

Many different word embedding models and training algorithms have been proposed. In static models, each word is represented by a single vector, regardless of the word’s context and how many meanings the word might have. For example, a word like bar will be represented by the same vector in every situation, whether the word is referring to a bar of soap or taking the bar exam. To address this issue, word vectors can instead be extracted from contextualized models like BERT [Devlin et al., 2019], which represent each word’s usage in a specific context.

In this dissertation, I focus on static embedding models as these are more commonly used in cultural analytics research. I highlight three of the most popular static embedding models below.

LSA. Latent semantic analysis (LSA), an early form of a word embedding model, factorizes a sparse term-document matrix $X$ [Deerwester et al., 1990, Landauer and Dumais, 1997]. $X$ is factored using singular value decomposition (SVD), retaining $K$ singular values such that

$$X \approx X_K = U_K \Sigma_K V_K^T.$$

The elements of the term-document matrix are weighted, often with TF-IDF, which measures the importance of a word to a document in a corpus.
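The factorization above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than any particular LSA implementation: the toy count matrix, the term list, the helper names (`tfidf_weight`, `truncated_svd`, `cosine`), and the specific TF-IDF variant (log inverse document frequency plus one) are all hypothetical choices.

```python
import numpy as np

def tfidf_weight(counts):
    """Apply one common TF-IDF variant to a terms-by-documents count matrix.

    Assumes every term occurs in at least one document.
    """
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)   # documents containing each term
    idf = np.log(n_docs / doc_freq) + 1.0
    return counts * idf[:, np.newaxis]

def truncated_svd(X, k):
    """Return the factors (U_k, s_k, Vt_k) for the k largest singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["birth", "labor", "novel", "chapter"]
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

X = tfidf_weight(counts)
U_k, s_k, Vt_k = truncated_svd(X, k=2)
X_k = U_k @ np.diag(s_k) @ Vt_k   # dense rank-2 approximation of X

# Compare rows (terms) of the reduced matrix.
sim_related = cosine(X_k[0], X_k[1])     # birth vs. labor
sim_unrelated = cosine(X_k[0], X_k[2])   # birth vs. novel
print(f"birth vs. labor: {sim_related:.3f}")
print(f"birth vs. novel: {sim_unrelated:.3f}")
```

In this toy example, terms that occur in the same documents end up with nearly parallel rows in the rank-2 matrix, while terms from disjoint documents stay orthogonal.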
The dense, low-rank approximation of the term-document matrix, $X_K$, can be used to measure the relatedness of terms by calculating the cosine similarity of the relevant rows of the reduced matrix.

Word2Vec. A popular model, word2vec, is trained via the skip-gram with negative sampling (SGNS) algorithm [Mikolov et al., 2013b], an online algorithm that uses randomized updates to predict words based on their context. In each iteration, the algorithm proceeds through the original documents and, at each word token, updates model parameters based on gradients calculated from the current model parameters. This process maximizes the likelihood of observed word-context pairs and minimizes the likelihood of negative samples.

Word2vec can also be approximated through a method similar to LSA. This method starts from the positive pointwise mutual information (PPMI) matrix, whose cells represent the PPMI of each pair of words and contexts, where PMI and PPMI are defined as

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}; \qquad \mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0).$$

The PPMI matrix is factored using singular value decomposition (SVD) and results in low-dimensional embeddings that perform similarly to GloVe and SGNS [Levy and Goldberg, 2014].

GloVe. Finally, Global Vectors for Word Representation (GloVe) uses stochastic gradient updates but operates on a “global” representation of word co-occurrence that is calculated once at the beginning of the algorithm [Pennington et al., 2014]. Words and contexts are associated with bias parameters, $b_w$ and $b_c$, where $w$ is a word and $c$ is a context, learned by minimizing the cost function

$$L = \sum_{w,c} f(x_{wc}) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log(x_{wc}) \right)^2,$$

where $x_{wc}$ is the co-occurrence count of word $w$ and context $c$ and $f$ is a weighting function.

2.2 Self-Disclosures in Online Communities

Self-disclosure is the “process of making the self known to others” [Jourard and Lasakow, 1958].
This can include the sharing of personal opinions, beliefs and values, sentiment and emotion, personal stories, and aspects of one’s identity such as gender, age, or nationality [Montgomery, 1981, Wang et al., 2016, Ravichander and Black, 2018, Altman and Taylor, 1973, Bak et al., 2014, Barak and Gluck-Ofri, 2007, Balani and De Choudhury, 2015].

Self-disclosure and trust are tightly intertwined. On one hand, trust is an important prerequisite for self-disclosure, but on the other hand, trust can also be fostered through self-disclosure [Joinson and Paine, 2007]. According to social penetration theory, different types of self-disclosures are typical of different relationship stages; as social bonds grow and become more intimate, more self-disclosures are made [Altman and Taylor, 1973]. Self-disclosures can strengthen social bonds and foster relationship formation [Oswald et al., 2004, Ma et al., 2017b], while group-level disclosures can enhance the trust between group members and their sense of group identity [Galegher et al., 1998, Yang et al., 2019b, Ma et al., 2017b, Joinson et al., 2010].

However, the decision to self-disclose is contextual. For example, Yang et al. [2019b] find that health-related self-disclosure decisions depend on the privacy of a conversation, while prior work has found that gender differences in self-disclosure vary by context [Hill and Stull, 1987] and that men and those wanting to manage impressions self-disclose less [Wang et al., 2016]. Others have also found self-disclosure to be affected by culture and community norms [Zhao et al., 2012, Li et al., 2018]. In online gaming, in-group behaviors like frequency of playing together have been found to positively relate to increased self-disclosure [Reer and Krämer, 2014], while in virtual reality (VR) games, context, levels of anonymity, and user control were found to be important factors in deciding to self-disclose to other players [Maloney et al., 2020].
Many different automatic measurements of self-disclosure have been proposed. Capitalizing on linguistic patterns associated with self-disclosure, prior work has experimented with emotion lexicons [Bak et al., 2012, Ravichander and Black, 2018], topic modeling [Bak et al., 2012], and training supervised models on coded conversational or social media datasets [Bak et al., 2012, Wang et al., 2016, Balani and De Choudhury, 2015]. While prior work has also attempted to build labeled training sets for self-disclosure detection [Balani and De Choudhury, 2015], the context-dependent nature of many self-disclosures often complicates their use.

While pronoun rates alone will not capture specific types of self-disclosure (e.g., sharing emotions vs. opinions), the rate of first person singular pronouns has been found to be a reliable proxy for self-disclosure [Joinson, 2001, Barak and Gluck-Ofri, 2007]. Jaidka et al. [2018] found that first person pronoun use reflects mentions of self, while Ravichander and Black [2018] found that first person singular pronouns help identify self-disclosures with higher precision (and low recall). Explicit statements about one’s identity (e.g., gender, race, relationships) have previously been measured using lexicons of demographic terms and pattern-matching.

2.3 Personal Narratives

Among the many forms of self-disclosure is telling stories about oneself. Narratives are powerful modes of expression, with physical, emotional, and social benefits for both the narrator and the audience [Pennebaker and Beall, 1986, Pennebaker, 1997, Merz et al., 2014, Oh and Kim, 2016, Tangherlini, 2000]. Narratives can also be useful methods for understanding human behavior and beliefs [Golsteijn and Wright, 2013]. Personal narratives can be rhetorically persuasive and difficult to moderate, and like other types of self-disclosure, they can facilitate community trust and sensemaking.
However, narratives are a challenging test case for NLP tools, as their modeling needs to represent complex and extended interactions between people, objects, environments, affects, and beliefs. Book-length texts pose challenges for a field that often considers dependencies longer than one paragraph “long range.”

In a recent overview of NLP approaches to narrative understanding, Piper et al. [2021] formulate narrativity as a scalar construct rather than a binary class; texts can include some or all narrative features (e.g., narrator, audience, sequential actions). Most NLP narrative tasks focus on building abstractions from narratives by extracting these features and measuring relationships between them. These tasks include extracting narrative structure, like scripts, plot units, or narrative arcs [Lehnert, 1981, Schank and Abelson, 1977, Chambers and Jurafsky, 2008, Goyal et al., 2010, Chambers and Jurafsky, 2009, Pichotta and Mooney, 2016, Reagan et al., 2016, Flor and Somasundaran, 2017]; discovering connections between characters [Bamman et al., 2013, Iyyer et al., 2016, Lukin et al., 2016]; generating new stories or summaries [Goldfarb-Tarrant et al., 2020, Guan et al., 2020]; and identifying a correct story ending [Mostafazadeh et al., 2016].

As in other areas of NLP, much narrative research tends to fall into shared tasks, where artificial story datasets are used to test a particular technical ability of a system. In NLP, these datasets are sometimes created and often labeled by crowdworkers. For example, story completion is the goal of the Story Cloze Task, in which the model is given all but the ending of a story and asked to generate or select the appropriate ending [Chambers and Jurafsky, 2008, Mostafazadeh et al., 2016]. These datasets describe brief scenarios intended to isolate a phenomenon from more complicated social contexts.
Outside of shared tasks, narrative research in NLP also includes corpus-based studies, where researchers use narrative models to learn about a particular dataset and its authors. Underlying corpus-based studies are often curated datasets that range widely and can include fictional works (e.g., novels, fairytales) [Jans et al., 2012, Iyyer et al., 2016], news stories [Chambers and Jurafsky, 2008], biographies [Bamman and Smith, 2014], and personal stories shared orally or on social media [Gordon and Swanson, 2009, Ouyang and McKeown, 2014, Antoniak et al., 2019]. These curated datasets were authored in social contexts separate from the NLP research study, and they are gathered passively (e.g., scraped without the knowledge or consent of the authors); this opportunistic use case constrains the possible research questions.

As with other NLP tasks, many modern narrative methods rely on large, pretrained models. These models in turn rely on massive and undocumented datasets, often scrapes like the aptly-named Pile [Gao et al., 2020], that contain a mixture of documents from unrelated domains, in the hopes that the quantity and diversity of data will generalize to a variety of other domains and tasks during the fine-tuning process. These pretraining datasets are too large for full datasheet descriptions to be created [Gebru et al., 2018]; even quantifying the number of duplicate documents in these datasets is challenging [Lee et al., 2021], and these datasets can encode human biases [Bender et al., 2021].

While many technical challenges lie ahead in modeling narratives, NLP also faces challenges in dataset design and curation and in moving between averaged patterns from models and individual examples and experiences. The different types of datasets noted above each bring their own challenges. Collected datasets are often used out-of-context; this constrains the possible research questions.
Massive and undocumented datasets can contain unmeasured human biases and stereotypes as well as unmeasured domain biases (e.g., if more of the pretraining data is from a particular domain). Short and artificial datasets lack the social grounding of collected datasets (these datasets were not constructed out of personal motivations) and do not necessarily generalize to longer, more complex stories. In all of these cases, biases from the datasets can also influence NLP models and cause undesired outcomes.

2.4 Sensemaking through Self-Disclosure and Storytelling

How do people use self-disclosures and storytelling to make sense of their experiences as a community? While many definitions of sensemaking have been posited, it is often described as “processes through which people interpret and give meaning to their experiences” [Lam et al., 2016]. It necessarily involves communication and is an “activity that talks events and organizations into existence,” in part through retrospective storytelling [Weick et al., 2005]. As a process that creates organization [Weick et al., 2005] and relies on collaborative problem solving [Pirolli and Card, 2005], the analysis of sensemaking within online communities falls within the study of computer-supported cooperative work (CSCW), which examines how people use technical tools and spaces to collaborate.

In online healthcare support communities, the individual works to make sense of their healthcare experience, and the community collectively gathers information, compares stories, and makes sense of a shared experience [Mamykina et al., 2015]. Prior work has described how individual users can join a community and move through a transformative process; the collection and organization of information at the center of this process is also how the community builds its own sense of the experience [Genuis and Bronstein, 2017, Patel et al., 2019].
Not only do members of cancer support groups discernibly transition between stages of treatment [Jha and Elhadad, 2010, Wen and Rosé, 2012] but they can also be observed transitioning between community roles [Maloney-Krichmar and Preece, 2002, Yang et al., 2019a]. Users usually begin by seeking information and eventually transition to providing emotional support [Wang et al., 2017, Yang et al., 2019a]. However, these processes are contextual, and different communities might rely on some processes more than others [Young and Miller, 2019].

Storytelling and expressive writing about traumatic experiences can have physical, emotional, and social health benefits [Pennebaker and Beall, 1986, Pennebaker, 1997, Merz et al., 2014, Oh and Kim, 2016] and have been studied for their role in therapy and the establishment of social norms. In one example, Arigo and Smyth [2012] find that expressive writing about eating patterns and personal appearance by college-age women either improves a range of outcomes or reduces the risk of further decline. de Moor et al. [2008], on the other hand, find no benefits in breast cancer survivors. Tangherlini [2000] argues that the storytelling of paramedics forms the culture and organizational structure of their jobs. Through these stories, the paramedics both work through the trauma of difficult experiences and negotiate their place in a hierarchy of workers (doctors, nurses, police officers, other paramedics). Likewise, patients tell stories about their experiences that cast medical professionals into sets of roles, which physicians need to understand to interact with the patient effectively [Suss, 2014].

In both the humanities and healthcare, individuals and communities try to make sense of very complicated processes. Storytelling can help narrators assimilate traumatizing experiences [Tangherlini, 2000] and can transfer important information to others without firsthand experience [Bietti et al., 2019].
On the one hand are bureaucratic healthcare systems and the physical and psychological costs and traumas of experiencing health issues. In these medical settings, sensemaking is “distributed across the healthcare system” and decisions about individual cases are processes of coordination and information distribution [Weick et al., 2005]. On the other hand are novels, poems, games, and other cultural objects, each of which might require community sensemaking even as each is also a piece of a larger sensemaking puzzle, which allows its audience to more easily manipulate the ideas contained in the object and draw connections between them. The digital humanities itself can be viewed as “interpretive endeavours... which seek not to constrain meaning, but to guarantee its multiplicity” and to reveal “the hidden details of pattern formation” [Ramsay, 2011].

CHAPTER 3
DATA PRACTICES

3.1 Upstream and Downstream Research

Work in NLP tends to fall into two large categories of research goals: downstream and upstream research.

In applied machine learning and industry settings, the downstream setting is standard. In this setting, some set of authors write down a set of texts, which may or may not encode their internal biases. These texts are curated and gathered into a corpus, which might be missing documents or altered by other authors or curators. This dataset is then used by NLP researchers to train a model for some downstream task, e.g., predicting which advertisement to display to a user based on the text contained in the user’s emails. Since the focus is on the downstream task, the training corpus is only of interest insofar as it generalizes to the downstream task.

The downstream setting has become particularly prominent in the modern pre-training and fine-tuning workflow, where a model is first trained on a very large dataset and then fine-tuned on whichever specific dataset is of interest downstream.
The pre-training dataset is usually composed of a smorgasbord of internet and book data, with Wikipedia, Reddit, and data from other online communities mashed together into a single, super-sized corpus [Gao et al., 2020, Bandy and Vincent, 2021]. This approach, using pre-trained models with the transformer architecture, yields state-of-the-art results for the majority of traditional NLP tasks like parsing, named entity recognition, and classification [Devlin et al., 2019]. However, the size of these datasets can obstruct attempts to document them, and researchers are left with unmeasured risks of memorization, encoding biases into their models, and other possible harms.

The upstream, corpus-focused approach is distinct from these traditional NLP settings. When we move from this downstream focus to a scenario in which the training corpus itself becomes the object of study (when researchers want to learn about the authors of a specific corpus), traditional NLP practices are “flipped.” This use case might require re-analysis of methods that were designed for downstream, rather than upstream, settings. In particular, the extension of traditional machine learning techniques to small, topic-specific corpora requires new understandings of the limitations of these methods. The corpus is at once both the central object of study and a small and unreliable sample whose weaknesses must be evaluated.

The upstream setting is common for researchers working in the digital humanities, computational social science, and cultural analytics. These disciplines have strong traditions of curation, but these critical methods have not always been applied to the datasets underlying many NLP tools and methods.
For example, researchers increasingly use embeddings [Garg et al., 2018, Knoche et al., 2019, Hoyle et al., 2019] and other lexicon-based methods [Saez-Trumper et al., 2013, Fast et al., 2016, Rudinger et al., 2017] to provide quantitative answers to otherwise elusive political and social questions about the biases in a corpus and its authors. This work often involves comparing bias measurements across different corpora, which requires reliable, fine-grained measurements.

An example that highlights the contrast between the downstream-centered and corpus-centered perspectives is the exploration of implicit bias in word embeddings. Researchers have observed that embedding-based word similarities reflect cultural stereotypes, such as associations between occupations and genders [Bolukbasi et al., 2016a]. From a downstream-centered perspective, these stereotypical associations represent bias that should be filtered out before using the embeddings as features. In contrast, from a corpus-centered perspective, implicit bias in embeddings is not a problem that must be fixed but rather a means of measurement, providing quantitative evidence of bias in the training corpus.

While models trained on a dataset may appear to measure properties of language, they in fact measure only properties of the training corpus, which could suffer from several problems. Sources could be missing or over-represented, typos and other lexical variations could be present, and, as noted by Goodfellow et al. [2016], “Many datasets are most naturally arranged in a way where successive examples are highly correlated.” Smaller corpora can result in more instability in measurements (as I show in Chapter 4), but small corpora are common in cultural analytics. Often, it is impossible to supplement the corpus (e.g., we cannot create more 18th Century English books or change their topical focus). In summary, the training corpus is merely a sample of the authors’ language model [Shazeer et al., 2016]; it is a fragile artifact, whose study should take into account the instability of measurements deriving from its documents.

Access to healthcare data is limited by both privacy and proprietary constraints. Records of doctor-patient interactions in the form of medical notes are particularly difficult to access. Medical language models usually rely on PubMed, the Unified Medical Language System (UMLS) [Bodenreider, 2004], and similar resources composed of biomedical publications [Pyysalo et al., 2013, Chiu et al., 2016]. Models trained on these resources better represent academic language than language used by patients, but the more informal conversations between doctors and patients are also not easily modeled by standard conversational tools because of the specialized vocabulary and formalized structure of these conversations. Research is then constrained to whatever limited resource is available, often without control for demographic or other variables if the data comes from online sources. Likewise in the humanities, corpora are often fixed, small, and specialized. Literary vocabulary, sometimes from different eras, does not necessarily map to popular models trained on contemporary news and internet data. Online datasets, like book reviews, are not always accessible, nor is it often possible to control for demographics. In both cases, while we would like to ask cultural questions about the authors of the data, we also have to recognize that the data is an incomplete and fragile sample.

3.2 Handling Data with Care

Applications of machine learning to the humanities and healthcare reveal different ends of the ethics and privacy spectrum. On the healthcare side, the huge risk of patient harm requires that we prioritize privacy, carefully removing all identifying information and not sharing data when anonymization is impossible.
Medicine and related fields with direct user studies have a long history of participant abuse, which has led to formalized frameworks to protect participants, such as the framework presented in the Belmont Report [National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1978]. On the humanities side, the same concerns can lead to very different ethical practices. Often, the people being studied are content creators, whether as novelists, poets, writers of serialized fan fiction, or online book reviewers. In these cases, privacy concerns are still valid (for example, authors of fan fiction might reveal intimate details about themselves), but it is often just as important to consider the artistic rights of the authors, respecting their contributions by giving them credit for their creations [Bruckman, 2002].

These tensions between protecting the authors and giving credit to the authors of a dataset lead to a series of choices for any cultural analytics project. These decisions are contextual, and researchers can weigh competing concerns by using frameworks like the Belmont Report, with its three guiding principles (respect for persons, beneficence, and justice) [National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1978]. There are also guided activities like Real ML [Smith et al., 2022] that can aid researchers in making these decisions for a particular dataset and use case.

3.2.1 Examples of Data Handling Strategies

The following suggestions are not meant to be a comprehensive guide to data handling in cultural analytics research but rather examples of how the Belmont Report principles can influence specific data handling decisions. These particular examples are drawn from decisions made during the work described in Chapters 4 and 5 in this thesis. For an extended case study in this kind of decision-making process, I recommend the descriptions in Lundberg et al.
[2019].

Model and dataset documentation. Sharing a complete dataset of the collected data would allow other researchers to directly reproduce and evaluate research results as well as train their own computational models. Reproduction is particularly important for cultural analytics studies as the self-selection of the authors can contribute to biases in the dataset [Janssens and Kraft, 2012]. The lack of documentation for large models and datasets in NLP has led to a series of frameworks to encourage better documentation practices, including Datasheets [Gebru et al., 2018] and Model Cards [Mitchell et al., 2019].

However, copying and storing data removes it from the control of its authors, and this can undermine a study’s respect for persons. One possible method to support both reproducibility and privacy is to release the URLs of the data points, rather than the content (as in the Twitter API terms, which allow sharing of Tweet IDs but not the Tweet content). This maintains the user’s ability to delete the content at any time and remove it from the dataset, while also allowing researchers to directly compare their results. However, in cases of possible severe harms, even releasing the URLs might not be desirable.

Publishing quotations. In NLP and cultural analytics research, it is common to provide quotations from the target dataset, to describe the dataset or as examples of a particular phenomenon. Quotations are useful for readers and particularly important when describing results of unsupervised methods, where the researchers use qualitative evaluations to make modeling decisions (e.g., examining the documents with highest topic probability to determine a topic model’s coherence). However, quotations can carry significant risks to the dataset’s authors if re-identified [Dym and Fiesler, 2020].
Both personal preferences (some authors prefer to be named as authors, while others prefer anonymity) and contextual concerns for a specific project and dataset need to be considered before providing quotations.

One tactic that allows for data analysis without compromising the privacy of the authors is to only share paraphrased examples from the dataset [Bruckman, 2002, Yang et al., 2019b]. This both conceals private and/or identifying information and prevents the audience from searching for exact string matches in order to identify the source story.

Another option is to contact each author whom the researchers would like to quote and ask each author for their quotation preferences (e.g., no quotation, quotation with attribution, quotation without attribution). Contacting the authors before publication honors the authors’ wishes with regard to privacy and also grants them agency in how their creative work is presented in published work. This option respects the contributions of the authors as “amateur artists” who should be given credit for their creative work when desired [Bruckman, 2002].

Sharing findings with the community. To support the Belmont Report’s principle of justice, researchers should consider sharing their findings directly with the studied community. Studies of Twitter users [Fiesler and Proferes, 2018] and online fandom participants [Dym and Fiesler, 2020] have found that users have varying levels of comfort with researchers using their data and are often unaware of how and why their data is being used for research. Research that ostensibly is conducted to support a specific community should communicate those benefits to the community, e.g., through public-facing blog posts and outreach to the community.

CHAPTER 4
INSTABILITY OF UNSUPERVISED, DISTRIBUTIONAL MODELS

Unsupervised methods allow researchers to answer questions about their datasets without high up-front costs.
In cultural analytics, datasets are often constrained in size, and labeling can be expensive and difficult, particularly for humanities and healthcare data. However, unsupervised models trained on such small datasets can be brittle, and this brittleness requires additional reliability tests to avoid “story-telling” where researchers retroactively fit theories to models. In this upstream approach, the small, topic-specific corpus is at once both the central object of study and a small and unreliable sample whose weaknesses must be evaluated, particularly when using measurement tools designed for industry questions and large datasets, not cultural research questions and small datasets.

In the two studies presented in this chapter, I re-examine unsupervised, distributional models that have been ported from the NLP community to other fields. Ranked lists of words based on vector similarity are frequently used as evidence in digital humanities and computational social science, but I find that embedding-based word similarities are unreliable, especially on small datasets [Antoniak and Mimno, 2018]. Properties of the training corpus, such as the presence or absence of specific documents, can alter both similarities and rankings, and training on small and idiosyncratic corpora significantly exacerbates these problems. Considering embedding-based bias measurement methods, I find that the lexicons used for bias measurement can themselves encode linguistic and social biases [Antoniak and Mimno, 2021]. Bias measurement methods, originally developed to examine models used for downstream NLP tasks, are increasingly being used by scholars in the humanities and social sciences to explore the biases of the training dataset’s authors.
By probing sets of artificial seeds, I demonstrate that the order and similarity of these terms can significantly affect results, and through a survey of prior work and critical reading of gathered seed terms, I present a framework of common seed sources and potential biases, finding that popularly reused seeds often contain the very stereotypes they seek to measure.

4.1 Case Study: Word Embedding Instabilities

NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task. For example, word embeddings are often used as the bottom layer in neural network architectures for NLP [Bengio et al., 2003, Goldberg, 2017]. In contrast, other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus [Bolukbasi et al., 2016a, Hamilton et al., 2016, Heuser, 2016]. Unlike the downstream-centered approach, the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study. Disentangling these separate use cases is vital in determining the proper use and evaluation of word embedding models.

Variables such as the stochastic nature of the training algorithm, random initialization of the word vectors, sub-sampling of frequent tokens, and properties of the corpus itself could cause these cosine similarities to be unreliable. We hypothesized that training on small and potentially idiosyncratic corpora (which are common in computational social science) can exacerbate these problems and lead to highly variable estimates of word similarity.
4.1.1 Models and Methods

To test this hypothesis, we trained sets of 50 word embedding models for each of four training algorithms, three datasets, two document sizes (sentence and full document), and three corpus settings. Algorithms included latent semantic analysis (LSA) [Deerwester et al., 1990, Landauer and Dumais, 1997], global vectors (GloVe) [Pennington et al., 2014], skip-gram with negative sampling (SGNS) [Mikolov et al., 2013b], and positive pointwise mutual information (PPMI) [Levy and Goldberg, 2014]. We curated three datasets from The New York Times Music and Sports sections, Reddit AskScience and AskHistorians, and the 4th and 9th U.S. Federal Courts of Appeals. Most importantly, we manipulated the corpus in three ways: FIXED (same documents in a constant order), SHUFFLED (same documents in random order), and BOOTSTRAP (documents sampled with replacement).

Setting    Method                              Tests variability due to...  Run 1  Run 2  Run 3
Fixed      Documents in consistent order       algorithm (baseline)         A B C  A B C  A B C
Shuffled   Documents in random order           document order               A C B  B A C  C B A
Bootstrap  Documents sampled with replacement  document presence            B A A  C A B  B B B

Table 4.1: The three settings that manipulate the document order and presence in each corpus.

For each corpus, we select a set of 20 relevant query words from the high-probability words of an LDA topic model [Blei et al., 2003] trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models. From the lists of queries and cosine similarities, we select the 20 words most closely related to the set of query words and compare the mean and standard deviation of those pairs across settings.
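The corpus settings and per-pair statistics described above, together with the Jaccard comparison of top-N lists used alongside them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the toy documents, the function names, and the stand-in trainer (an SVD of document-level co-occurrence counts, used here only so the example is self-contained and deterministic) are all hypothetical, and a real experiment would substitute LSA, SGNS, GloVe, or PPMI training.

```python
import random
import numpy as np

# Toy documents and vocabulary (hypothetical, for illustration only).
DOCS = [
    "marijuana cocaine heroin",
    "marijuana alcohol nicotine",
    "cocaine heroin crack",
    "alcohol nicotine",
    "marijuana heroin",
    "marijuana alcohol",
]
VOCAB = sorted({tok for doc in DOCS for tok in doc.split()})


def make_corpus(docs, setting, rng):
    """Build the training corpus for one run under one of the three settings."""
    if setting == "fixed":
        return list(docs)                        # same documents, same order
    if setting == "shuffled":
        return rng.sample(docs, len(docs))       # same documents, random order
    if setting == "bootstrap":
        return [rng.choice(docs) for _ in docs]  # sampled with replacement
    raise ValueError(f"unknown setting: {setting}")


def train_toy_embeddings(docs, vocab, dim=3):
    """Stand-in trainer (assumption): co-occurrence counts followed by SVD."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        tokens = [t for t in doc.split() if t in index]
        for a in tokens:
            for b in tokens:
                if a != b:
                    counts[index[a], index[b]] += 1
    u, s, _ = np.linalg.svd(counts)
    return u[:, :dim] * s[:dim]


def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def run_setting(docs, vocab, query, setting, n_runs=50, seed=0):
    """Return an (n_runs, |vocab|) array of cosine similarities to the query."""
    rng = random.Random(seed)
    q = vocab.index(query)
    sims = np.zeros((n_runs, len(vocab)))
    for r in range(n_runs):
        emb = train_toy_embeddings(make_corpus(docs, setting, rng), vocab)
        sims[r] = [cosine(emb[q], emb[j]) for j in range(len(vocab))]
    return sims


def top_n(sims_row, vocab, query, n):
    """The n words most similar to the query in a single run."""
    order = np.argsort(-sims_row)
    return [vocab[j] for j in order if vocab[j] != query][:n]


def jaccard(a, b):
    """Jaccard similarity between two top-N lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


if __name__ == "__main__":
    for setting in ("fixed", "shuffled", "bootstrap"):
        sims = run_setting(DOCS, VOCAB, "marijuana", setting, n_runs=20)
        # Mean and standard deviation per (query, word) pair across runs.
        print(setting, sims.mean(axis=0).round(2), sims.std(axis=0).round(2))
```

Because the stand-in trainer has no random component and ignores document order, only the bootstrap setting produces variation in this toy example; with stochastic trainers such as SGNS, the fixed setting would also vary.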
We calculate the Jaccard similarity between top-N lists to compare membership change in the lists of most closely related words, and we find average changes in rank within those lists. We examine these metrics across different algorithms and corpus parameters.

4.1.2 Results Across Models and Settings

We begin with a case study of the framing around the query term marijuana. One might hypothesize that the authors of various corpora (e.g., judges of the 4th Circuit, journalists at the NYT, and users on Reddit) have different perceptions of this drug and that their language might reflect those differences. Indeed, after qualitatively examining the lists of most similar terms (see Table 4.2), we might come to the conclusion that the allegedly conservative 4th Circuit judges view marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit users view marijuana as closer to legal substances such as nicotine and alcohol.

However, we observe patterns that cause us to lower our confidence in such conclusions. Table 4.2 shows that the cosine similarities can vary significantly.
We see that the top ranked words (chosen according to their mean cosine similarity across runs of the FIXED setting) can have widely different mean similarities and standard deviations depending on the algorithm and the three training settings, FIXED, SHUFFLED, and BOOTSTRAP.

[Table 4.2 panels: for each corpus (4th Circuit, NYT Sports) and each algorithm (LSA, SGNS, GloVe, PPMI), the 10 most similar words (y-axis) are plotted against cosine similarity (x-axis) for the fixed, shuffled, and bootstrap settings. The word lists shown in each panel are:]

4th Circuit, LSA:    heroin, cocaine, distribution, crack, methamphetamine, powder, distributing, oxycodone, manufacture, distribute
4th Circuit, SGNS:   heroin, oxycodone, cocaine, methamphetamine, pills, drugs, crack, narcotics, powder, cigarettes
4th Circuit, GloVe:  cocaine, heroin, kilograms, crack, distribute, drugs, grams, smoked, growing, possession
4th Circuit, PPMI:   cocaine, heroin, powder, crack, grams, kilogram, paraphernalia, kilograms, hydrochloride, methamphetamine
NYT Sports, LSA:     criticized, tested, testing, cocaine, several, violent, involving, substance, reserved, steroids
NYT Sports, SGNS:    cocaine, drug, nandrolone, steroid, counseling, alcohol, urine, substance, abuse, testing
NYT Sports, GloVe:   cocaine, procedures, smoking, testing, addiction, purposes, steroid, blaming, suspensions, positive
NYT Sports, PPMI:    cocaine, drug, alcohol, tested, positive, substance, drugs, crack, testing, steroid

Table 4.2: The most similar words with their means and standard deviations for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting). Embeddings are learned from documents segmented by sentence.
As expected, each algorithm has a small variation in the FIXED setting. For example, we can see the effect of the random SVD solver for LSA and the effect of random subsampling for PPMI. We do not observe a consistent effect for document order in the SHUFFLED setting.

Most importantly, these figures reveal that the BOOTSTRAP setting causes large increases in variation across all algorithms (with a weaker effect for PPMI) and corpora, with large standard deviations across word rankings. This indicates that the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors.

GloVe produced very similar embeddings in both the FIXED and SHUFFLED settings, with similar means and small standard deviations, which indicates that GloVe is not sensitive to document order. However, the BOOTSTRAP setting caused a reduction in the mean and widened the standard deviation, indicating that GloVe is sensitive to the presence of specific documents.

[Figure 4.1 panels: “Standard Deviation in the 9th Circuit Corpus” and “Standard Deviation in the NYT Music Corpus”; x-axis: algorithm (LSA, SGNS, GloVe, PPMI); y-axis: standard deviation; bars for the fixed, shuffled, and bootstrap settings.]

Figure 4.1: The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora using the whole documents. Larger variations indicate less stable embeddings.

These patterns of larger or smaller variations are generalized in Figure 4.1, which shows the mean standard deviation for different algorithms and settings.
We calculated the standard deviation across the 50 runs for each query word in each corpus, and then we averaged over these standard deviations. The results show the average levels of variation for each algorithm and corpus. We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings. The presence of specific documents has a significant effect on all four algorithms (lesser for PPMI), consistently increasing the standard deviations.

Run 1        Run 2         Run 3       Run 4            Run 5          Run 6
viability    fetus         trimester   surgery          trimester      pregnancies
pregnancies  pregnancies   surgery     visit            surgery        occupation
abortion     gestation     visit       therapy          incarceration  viability
abortions    kindergarten  tenure      pain             visit          abortion
fetus        viability     workday     hospitalization  arrival        tenure
gestation    headaches     abortions   neck             pain           visit
surgery      pregnant      hernia      headaches        headaches      abortions
expiration   abortion      summer      trimester        birthday       pregnant
sudden       pain          suicide     experiencing     neck           birthday
fetal        bladder       abortion    medications      tenure         fetus

Table 4.3: The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

We turn to the question of how this variation in standard deviation affects the lists of most similar words. Are the top-N words simply re-ordered, or do the words present in the list substantially change? Table 4.3 shows an example of the top-N word lists for the query word pregnancy in the 9th Circuit corpus.
Observing Run 1, we might believe that judges of the 9th Circuit associate pregnancy most with questions of viability and abortion, while observing Run 5, we might believe that pregnancy is most associated with questions of prisons and family visits. Although the lists in this table are all produced from the same corpus and document size, the membership of the lists changes substantially between runs of the BOOTSTRAP setting.

[Figure 4.2 panels: “Variation in Top 2 Words” and “Variation in Top 10 Words”; x-axis: algorithm (LSA, SGNS, GloVe, PPMI); y-axis: Jaccard similarity; bars for the fixed, shuffled, and bootstrap settings.]

Figure 4.2: The mean Jaccard similarities across settings and algorithms for the top 2 and 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top-N membership. Results are shown for the sentence document length.

These changes in top-N rank are shown in Figure 4.2. For each query word for the AskHistorians corpus, we find the N most similar words using SGNS. We generate new top-N lists for each of the 50 models trained in the BOOTSTRAP setting, and we use Jaccard similarity to compare the 50 lists. We observe patterns similar to the changes in standard deviation shown in Figure 4.1; PPMI displays the lowest Jaccard similarity across settings, while the other algorithms have higher similarities in the FIXED and SHUFFLED settings but much lower similarities in the BOOTSTRAP setting. We display results for both N = 2 and N = 10, emphasizing that even very highly ranked words often drop out of the top-N list.

4.1.3 Discussion and Impact

The most obvious result of our experiments is to emphasize that embeddings are not even a single objective view of a corpus, much less an objective view of language.
The corpus is itself only a sample, and we have shown that the curation of this sample (its size, document length, and inclusion of specific documents) can cause significant variability in the embeddings. Happily, this variability can be quantified by averaging results over multiple bootstrap samples.

We can make several specific observations about algorithm sensitivities. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated. This is surprising, as we had expected SGNS to be sensitive to document order, and anecdotally, we had observed cases where the embeddings were affected by groups of documents (e.g., in a different language) at the beginning of training. However, all four algorithms are sensitive to the presence of specific documents, though this effect is weaker for PPMI.

This sensitivity to the presence of specific documents leads us to a more nuanced understanding of the training corpus as a fragile artifact. The corpus has been curated over time and might be biased in the representation of its documents; documents might be missing or over-represented, and they might be ordered according to some scheme, perhaps chronologically or thematically [Goodfellow et al., 2016]. While word embeddings may appear to measure properties of language, they in fact only measure properties of a curated corpus, and the training corpus is merely a sample of the authors’ language model [Shazeer et al., 2016].

Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling. This sampling method introduces variation into the source corpus that appears to be comparable to a bootstrap resampling method.
Sampling in PPMI is inspired by a similar method in the word2vec implementation of SGNS [Levy et al., 2015]. It is therefore surprising that SGNS shows noticeable differentiation between the BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on the other.

The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer documents are more susceptible to variation in the cosine similarities between embeddings. When studying the top-N most similar words to a query, it is important to account for variation in these lists, as both rank and membership can significantly change across runs. Therefore, we emphasize that with smaller corpora comes greater variability, and we recommend that practitioners use bootstrap sampling to generate an ensemble of word embeddings for each sub-corpus and present both the mean and variability of any summary statistics such as ordered word similarities.

We leave for future work a full hyperparameter sweep for these algorithms. While these hyperparameters can substantially impact performance, our goal with this work was not to achieve high performance but to examine how the algorithms respond to changes in the corpus. We make no claim that one algorithm is better than another.

4.2 Case Study: Comparative Measurements of Bias

There has been increasing concern in the NLP community over bias and stereotypes contained in models and how these biases can trickle downstream to practical applications, such as serving job advertisements. In particular, there has been much recent scrutiny of word representations, with many studies finding harmful associations encoded in embedding models.
Combating such biases requires measuring the bias encoded in a model so that researchers can establish improvements, and many variants of embedding-based measurement techniques have been proposed [Bolukbasi et al., 2016b, Caliskan et al., 2017, Manzini et al., 2019].

These measurements have had the additional upstream benefit of providing computational social science and digital humanities scholars with a new means of quantifying bias in datasets of social, political, or literary interest. Researchers increasingly use embeddings [Garg et al., 2018, Knoche et al., 2019, Hoyle et al., 2019] and other lexicon-based methods [Saez-Trumper et al., 2013, Fast et al., 2016, Rudinger et al., 2017] to provide quantitative answers to otherwise elusive political and social questions about the biases in a corpus and its authors. This work often involves comparing bias measurements across different corpora, which requires reliable, fine-grained measurements.

Target Concept    Highlighted Seeds
Unpleasant        divorce, jail, poverty, cancer, ...
African American  Tanisha, Tia, Lakisha, Latoya, ...
Domestic Work     mom, mum, ...
Ugliness          fat, chubby, obese, fatty, overweight, disformed, disfigured, wrinkle, wrinkled, ...

Table 4.4: Examples of real seed terms used in recent work to measure biases in corpora.

Despite the diversity of bias measurement methods, they all rely on lexicons of seed terms to specify stereotypes or dimensions of interest. These lexicons are short lists of terms meant to represent the groups being studied. Most often, they are curated sets that represent pairs of demographic groups and attributes; for example, popularly used sets for gender include men = [he, man, father, ...] and women = [she, woman, mother, ...], and researchers search for associations between these gender seeds and attribute seed sets such as occupations or positive and negative adjectives.
If researchers can establish a stronger relationship between one set of gender seeds and one set of occupations, this is used as evidence of gender bias in the training corpus. Binary gender pairs are relatively easy to construct in English because of gender-marked pronouns, but in the case of other groups, such as racial and class-based groups, constructing sets of seed terms is more challenging. Prior work has used crowd-sourcing or predefined lists from the social sciences, but these lists can contain their own biases and their coverage of the target dataset is not reliable. For example, when measuring levels of racial bias in a corpus, prior work often uses manually created sets of “European-American” and “African-American” first names. I show examples of these seeds in Table 4.4.

In prior work, these seed sets are often not clearly documented. Their sources, their validation, and sometimes even their contents are left to the reader’s imagination. The rationale for choosing specific seeds is often unclear; sometimes seeds are crowd-sourced, sometimes hand-selected by researchers, and sometimes drawn from prior work in the social sciences. The impact of the seeds is not well understood, and many previous seed sets have serious limitations. As shown in Table 4.4, the seeds used for bias measurement can themselves exhibit several types of cultural and cognitive biases (e.g., reductive definitions), and in addition, linguistic features of the seeds (e.g., frequency) can affect bias measurements [Ethayarajh et al., 2019]. Though they are often re-used, the suitability of these seeds to novel corpora is uncertain, and while evaluations sometimes include permutation tests, distinct sets of seeds are rarely compared.

Word frequency and distribution can directly affect results of both count-based methods [Gordon and Van Durme, 2013, Kuang, 2016] and embedding-based methods [Ethayarajh et al., 2019].
Many methods rely on cosine similarities between word vectors, which can be affected by factors such as part of speech and training domain match [Antoniak and Mimno, 2018, Wendlandt et al., 2018]. These problems are not hypothetical: many of the seed sets and tools discussed here are actively used in industry and research, and the stakes are high. Accurate bias measurements can help to improve the fairness of applications built on NLP models, and inaccurate bias measurements can provide fodder for critics seeking to deny any evidence of prejudices.

In particular, we focus on the stability of bias measurements based on seed terms in the upstream use case, when these measurements are used to make comparisons across corpora. Although in the NLP community it is common to rely on a single pre-trained source model as a starting point, many researchers outside NLP use the same unsupervised methods as a means of extracting bias information from one or more specific collections of interest to answer specific social and humanist questions. Seeds developed in one context can be, and are, easily re-used in other contexts, but evaluation and validation remain necessary precursors to relying on seeds for sensitive measurements.

[Figure 4.3 plot: x-axis: Similarity to Unpleasantness Vector (0.0 to 0.4); y-axis: seed sets; colors: romance vs. history + biography. Seed sets shown:]

woman, women, she, her, her, ... (Kozlowski et al. 2019)
sister, female, woman, girl, daughter, ... (Caliskan et al. 2017)
woman, girl, she, mother, gal, ... (Bolukbasi et al. 2016)
woman, girl, mother, daughter, sister, ... (Hoyle et al. 2019)
lady, nun, heroine, actress, businesswoman, ... (Zhao et al. 2018)
baker, counselor, nanny, librarians, socialite, ... (Zhao et al. 2018)

Figure 4.3: Bias measurements depend on seeds. We calculate the cosine similarities between different seed sets representing women and an averaged unpleasantness vector from two embedding models.
Results are consistent across seeds for romance review embeddings, but vary widely between sets for history and biography. We find similar variation even for a pretrained Google News model.

Figure 4.3 presents a motivating example, showing the instability of measurements using seeds. In this example, we imagine a digital humanities scholar interested in measuring whether women are portrayed more negatively in different genres of book reviews. Perhaps this scholar hypothesizes that women are portrayed more negatively in history and biography reviews than in romance reviews, since more romance reviewers are women. As in the WEAT test, each seed is plotted according to its cosine similarity to an averaged unpleasantness vector [Caliskan et al., 2017]. Green indicates a subspace trained on romance book reviews, while purple indicates a subspace defined for history book reviews; our imaginary scholar is interested in whether these boxes overlap or are separated from each other. Following prior work, our imaginary scholar experiments with different seed sets that represent women, all of which appear reasonable at first glance for the target book reviews dataset. However, the scholar finds that for some sets of seeds representing women, no significant difference is visible, while for other sets, there are much larger differences. Depending on which set the researcher chose, they would draw significantly different conclusions when comparing biases across datasets. And perhaps even worse, if the scholar had relied on only one set and not compared many distinct sets, they would not have realized the differences in results.

4.2.1 Framework of Seed Sources

I gather and document seed sets from recent papers, creating a framework of common seed sources. Table 4.5 shows the number of papers in our survey that used seeds from each of the sources in our framework. Each paper (and each individual seed set) can draw from more than one source.
The most commonly used sources are seeds derived from the target corpus, seeds re-used from prior bias measurement research, and seeds borrowed from the social sciences.

Seed Source                     Papers
Corpus-Derived                  7/18
Re-Used                         7/18
Borrowed from Social Sciences   6/18
Curated                         5/18
Adapted from Lexical Resources  3/18
Crowd-Sourced                   2/18
Population-Derived              2/18

Table 4.5: Overview of the surveyed seed sources.

Borrowed from Social Sciences. One of the most common seed sources is prior work in psychology and other social sciences. Researchers often borrow these in an effort to either replicate results or build confidence from previously validated work. For example, Caliskan et al. [2017] validate prompts from the Implicit Association Test [Greenwald et al., 1998], while Garg et al. [2018] and Hoyle et al. [2019] use personality traits from Williams and Bennett [1975], Williams and Best [1977, 1990]. In another example, Bhatia et al. [2018] use subsets of personality traits from Goodwin et al. [2014] to measure stereotyping of political candidates. Sometimes the seeds appeal for validity via highly cited resources, like LIWC [Pennebaker et al., 2001], despite critiques about unreliability [Panger, 2016, Forscher et al.]. In some cases, relying on prior work has the benefit of human validation. However, validation in one setting does not guarantee validation in another; biases can be context-dependent. Borrowing seeds does not absolve researchers from examining and validating seeds.

Crowd-Sourced. Researchers can also use the crowd to generate and annotate seed sets. For example, Fast et al. [2016] use Mechanical Turk to validate the inclusion of terms in their seed sets; the final terms are then included in packaged code for researchers and practitioners. Kozlowski et al. [2019] use Mechanical Turk to gather ratings of items scaled along gender, race, and class. Crowd-sourcing can aid in gathering contemporary associations and stereotypes.
However, controlling for crowd demographics can be difficult, and crowd-sourcing can result in alarming errors, in which popular stereotypes are hard-coded into the seeds (as in Table 4.4).

Population-Derived. Seed sets can be derived from government-collected population datasets. Popular sources include U.S. census data [Bolukbasi et al., 2016b, Caliskan et al., 2017], the U.S. Bureau of Labor Statistics [Caliskan et al., 2017], and the U.S. Social Security Administration [Garg et al., 2018]. These sources are usually used to gather names and occupations common to certain demographic groups (e.g., to gather lists of “European American” and “African American” names). These sources tend to be U.S.-centric, though the training data for the embedding does not always match (e.g., large Wikipedia datasets are not guaranteed to have U.S. authors). Reliance on these sources is particularly vulnerable to reductive definitions of the target concepts (e.g., gender [Keyes, 2017]) and assumes a level of trust and representation in the data collection that might not exist evenly across groups.

Adapted from Lexical Resources. Some seed sets are drawn from existing dictionaries, lexicons, and other public resources, such as SemEval tasks [Zhao et al., 2018] and ConceptNet [Fast et al., 2016]. Pre-packaged sentiment lexicons are a popular source [Saez-Trumper et al., 2013, Sweeney and Najafian, 2019]; these lexicons include the Affective Norms for English Words (ANEW) [Bradley and Lang, 1999] and negative/positive sentiment words from Hu and Liu [2004]. These seeds have the advantage of previous rounds of validation, but this does not guarantee validity for new domains.

Corpus-Derived. Given a corpus of interest, quantitative methods can be used to extract seed terms from the corpus. For example, Saez-Trumper et al. [2013] use sorted lists of named entities extracted from a target dataset to create seed sets for personas of interest.
Similarly, Sweeney and Najafian [2019] extract high frequency identity terms from a Wikipedia corpus. These methods have the advantage of ensuring high frequency terms in the target dataset. However, they pose similar risks to crowd-sourcing; unless an extra round of cleaning and curation is completed by the researchers, terms with unintended effects can be included in the seed sets.

Curated. Seed sets are sometimes hand-selected by the authors, usually after close reading of the corpus of interest. For example, Rudinger et al. [2017] hand-select a set of seed terms that correspond to a set of demographic categories of interest, and Joseph et al. [2017] hand-select a set of identity seeds based on their frequency in a Twitter dataset. Often, even when papers rely on other seed sources, manual curation is included as a step in the seed creation process. Hand-curation can result in high precision seeds, but this method relies on the authors’ correction for their own social biases.

Re-Used. Finally, many papers rely on prior bias measurement research for seed terms. The most popular sources in our survey include early papers on bias in embeddings such as Bolukbasi et al. [2016b] and Caliskan et al. [2017]. This repetition means that the seeds are tested on many different datasets, but they should not be trusted without validation; there can be mismatches in frequency and contextual meaning between datasets.

4.2.2 Bias Measurement Methods

Word Embedding Association Test. Given a set of embedding vectors w, the Word Embedding Association Test (WEAT) [Caliskan et al., 2017] defines a vector based on the difference between the mean vector of the two target sets, and then measures the cosine similarity of a set of attribute words to that vector.
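The association score that WEAT computes, which the text formalizes next, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the reference implementation of Caliskan et al. [2017]: the function names are hypothetical, the inputs are NumPy arrays whose rows are already-looked-up word vectors, and the permutation test shown is a simple one-sided Monte Carlo approximation over random re-partitions of the pooled target sets.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def s_word(w, A, B):
    """Mean cosine similarity of w to attribute set A minus that to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])


def weat_statistic(X, Y, A, B):
    """Summed association of target set X minus summed association of Y."""
    return sum(s_word(x, A, B) for x in X) - sum(s_word(y, A, B) for y in Y)


def permutation_p_value(X, Y, A, B, n_perm=1000, seed=0):
    """Share of random re-partitions of X and Y with a larger statistic.

    Assumption: a one-sided Monte Carlo test with equal-sized partitions.
    """
    rng = np.random.default_rng(seed)
    observed = weat_statistic(X, Y, A, B)
    pooled = np.vstack([X, Y])
    larger = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        Xp, Yp = pooled[perm[: len(X)]], pooled[perm[len(X):]]
        if weat_statistic(Xp, Yp, A, B) > observed:
            larger += 1
    return larger / n_perm


# Toy two-dimensional vectors (hypothetical): X leans toward A, Y toward B.
A_VECS = np.array([[1.0, 0.0], [0.9, 0.1]])
B_VECS = np.array([[0.0, 1.0], [0.1, 0.9]])
X_VECS = np.array([[1.0, 0.2], [0.8, 0.1]])
Y_VECS = np.array([[0.2, 1.0], [0.1, 0.8]])

if __name__ == "__main__":
    print(weat_statistic(X_VECS, Y_VECS, A_VECS, B_VECS))
    print(permutation_p_value(X_VECS, Y_VECS, A_VECS, B_VECS, n_perm=200))
```

Swapping in a different seed set for X or Y changes only the rows passed to `weat_statistic`, which makes it straightforward to compare several seed sets on the same embedding.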
The strength of the association between the target sets X and Y and the sets of attributes A and B is given by

\[
s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)
\]

where s(w, A, B) is the difference between the average cosine similarity of a query w to each term in A and the average cosine similarity of w to each term in B. To test whether the resulting difference s(X, Y, A, B) is significant, this result is compared to the same function applied to randomly permuted sets drawn from X and Y. Caliskan et al. [2017] use WEAT to measure stereotypical associations between sets of targets and attributes, where, for example, the target terms might be arts and science terms, and the attribute terms might be terms representing men and women.

Principal Component Analysis. The principal component analysis (PCA) method tests how much variability there is in the difference vectors between pairs of word vectors [Bolukbasi et al., 2016b]. If the vector difference between pairs of seed terms can be approximated well by a single constant vector c, then this vector represents a bias subspace. In this case, the subspace is simply a one-dimensional vector, though this process could be extended to more dimensions. For each pair of embedding vectors corresponding to one seed word from set X and one from set Y, Bolukbasi et al. [2016b] calculate the mean vector of those two vectors and then include the two resulting half vectors from that mean to the two seed vectors as columns in the input matrix.

4.2.3 Seed Features Impact Bias Measurements

Reductive Definitions. The seeds can be reductive and essentializing, codifying life experiences into traditional categories. Using names as placeholders for concepts like race [Nguyen et al., 2014, Sen and Wasow, 2016] or reducing gender to a binary with two extremes [Bolukbasi et al., 2016b, Caliskan et al., 2017] can create a distorted view of the source data.
Sometimes these are simplifying assumptions, made in an effort to measure biases that would otherwise go unexamined. However, these decisions run the risk of further entrenching these category definitions (e.g., see discussions in Keyes [2017], Larson [2017] for the mistakes and harms that can be caused by mapping names to genders), and these trade-offs should be evaluated and documented. More broadly, recent work has critiqued NLP and ML bias research for not successfully connecting with the literature in sociology and critical race studies [Hanna et al., 2020, Blodgett et al., 2020]. Engaging with this literature would provide a better foundation for decision-making about seed sets and provide context for future researchers.

Imprecise Definitions. If the target concept is not well-defined, the resulting seed terms can be too broad and include multiple concepts, risking the creation of confounded or circular arguments. Similarly, the unexamined use of pre-existing sets and over-reliance on the category labels from prior work can result in a series of errors. The seeds can contain confounding terms (e.g., in Table 4.4, unpleasant contains “cancer” which in some datasets might be more prevalent for certain demographic groups) or terms from the target group (e.g., domestic work includes the gendered terms “mom” and “mum”). Similarly, the seeds can manifest cultural stigmas: for example, including “fat” and “wrinkled” in an ugliness category [Fast et al., 2016] results in a seed set that itself contains stereotypes. These stigmas are harmful and can interact with other demographic features like gender or age [Puhl and Heuer, 2009], and unless their inclusion is intentional, they can accidentally inflate measurements towards certain groups. Rather than probing for an ugliness subspace, the social stereotypes could force an unintended age-based comparison.
Predicting all such errors is impossible, and there can be cases where researchers intentionally include such terms (e.g., to capture a particular stereotype), but this should be a conscious decision by each researcher using the seeds, and at a minimum, researchers should clearly define their target concepts.

Lexical Factors. Prior work examining seeds has shown that the frequency and part of speech of seeds can affect the resulting bias measurements. Ethayarajh et al. [2019] show that the WEAT test requires that the paired seeds occur at similar frequencies and that seed sets can be manipulated to produce certain measurements. Brunet et al. [2019] explore the effects of perturbing the training corpus, finding that (1) second-order neighbors to the seeds can have a strong impact on the bias measurement effect size and (2) effects are stronger for rarer words. Using contextual embeddings, Sedoc and Ungar [2019] show that different classes of words (e.g., names vs. pronouns) can result in different bias subspaces and that sometimes these subspaces represent an unintended dimension (e.g., age instead of gender). Despite these documented sources of variation, few seed sets are evaluated for frequency or word class.

Set Size and Alignment. The number of seeds included in each set can affect the resulting bias subspace; Kozlowski et al. [2019] find small increases in performance when using more seed pairs. The alignment of the seeds in matched sets (i.e., the ordering or pairing of seeds in one set with seeds in another set) can also affect the bias subspace. In the PCA method, each term in one seed set is explicitly linked to a single term in the other seed set. The specific alignment between paired words matters; altering the pairing can result in dramatically different results, even for cases like gender, which is marked in English.
However, we observe conscious pairings of seeds only for obvious cases, and sometimes “obvious” pairings produce subspaces that explain less variance.

(a)                      (b)                       (c)
herself        0.50      likelihood      0.36      outcomes       0.26
ms             0.49      eurozone        0.34      son            0.26
her            0.49      incentive       0.34      father         0.26
she            0.41      downturn        0.31      mother         0.26
pregnant       0.40      setback         0.30      aunt           0.25
pitching      -0.36      photographed   -0.39      potentially   -0.19
baseball      -0.36      tales          -0.41      male          -0.19
syndergaard   -0.38      hood           -0.42      hood          -0.29
himself       -0.39      garcia         -0.45      garcia        -0.29
his           -0.42      danced         -0.59      md            -0.39

Figure 4.4: Ranking word vectors by cosine similarity with the top principal component vector for the original gender seed pairs (a) appears to identify words representing men and women much better than random (b). But shuffling the pairing of seed words (c) maintains correlation with gender but to a less clear degree. Results are shown for the NYT corpus with a frequency threshold of 100 and bootstrap resampling.

Figure 4.4 shows that when we used the ordered gender pairs, the ranked words roughly divide into groups correlated with gender, while if we use shuffled pairs, the lists of high and low ranked words are not as easily distinguishable as masculine or feminine. We find an opposite effect for social class pairs [Kozlowski et al., 2019]; when we shuffle, we find a subspace that explains more variance than the explicitly ordered pairs (e.g., “richest”-“poorest”). We find similar differences when testing some seed sets that lack intuitive pairings, e.g., the matched pleasantness and unpleasantness seeds [Caliskan et al., 2017] and the matched Christianity and Islam seeds [Garg et al., 2018].

Order does not always affect the subspace (e.g., we found no significant difference when shuffling sets of names), but we have shown that it can affect the subspace, and so to build confidence in measurements, testing is required.
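The kind of pairing test recommended above can be sketched in a few lines of Python. The sketch builds the half vectors around each pair's mean, as in the PCA method described earlier, estimates the top principal component by power iteration, and reports the share of variance it explains for two different pairings of the same seed sets. The function names, the three-dimensional toy vectors, and the use of power iteration (rather than a full PCA routine) are assumptions made for illustration. On real embeddings the direction of the effect can go either way.

```python
import math


def half_vectors(pairs):
    """For each (x, y) pair, return the two vectors from the pair mean to x and to y."""
    rows = []
    for x, y in pairs:
        mean = [(a + b) / 2 for a, b in zip(x, y)]
        rows.append([a - m for a, m in zip(x, mean)])
        rows.append([b - m for b, m in zip(y, mean)])
    return rows


def top_component(rows, iters=200):
    """Power iteration on the scatter matrix; returns (direction, explained variance ratio).

    The half vectors of each pair sum to zero, so the rows are already centered.
    """
    dim = len(rows[0])
    scatter = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    # Non-uniform start vector; a start orthogonal to the top direction would fail.
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iters):
        w = [sum(scatter[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    eigenvalue = sum(v[i] * sum(scatter[i][j] * v[j] for j in range(dim)) for i in range(dim))
    total = sum(scatter[i][i] for i in range(dim))
    return v, eigenvalue / total


# Toy seed vectors (made up): the two sets differ mainly along the first axis.
set_x = [[1.0, 0.0, 0.0], [1.0, 0.2, 0.0], [1.0, 0.0, 0.2]]
set_y = [[-1.0, 0.0, 0.0], [-1.0, 0.2, 0.0], [-1.0, 0.0, 0.2]]

ordered_pairs = list(zip(set_x, set_y))
# One alternative pairing: rotate the second set by one position.
shuffled_pairs = list(zip(set_x, set_y[1:] + set_y[:1]))

_, ordered_ratio = top_component(half_vectors(ordered_pairs))
_, shuffled_ratio = top_component(half_vectors(shuffled_pairs))
print(ordered_ratio, shuffled_ratio)
```

Repeating the comparison over many random pairings, rather than the single rotation used here, would give a distribution of explained variance against which the chosen pairing can be judged.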
Coherence   Generated Seed Set A                                             Generated Seed Set B
1.000       distinctions, similarities, friction, parallels, similarity     murder, rape, manslaughter, felony, assault
1.000       mile, miles, yards, yard, feet                                   example, instance, purposes, explanation, shorthand
1.000       shop, restaurant, kitchen, cafe, store                           sports, soccer, football, competitions, basketball
...         ...                                                              ...
0.711       ambush, bombardment, escalation, altercation, militiamen        corruption, terrorism, graft, bribery, abuses
0.689       entrance, terrace, subway, cafe, lawn                            courtside, bamboo, freeway, shorts, sailboat
0.552       sticks, onions, tops, banana, mozzarella                         potatoes, onions, lemon, herbs, meats

Coherence   Gathered Seed Set A                                              Gathered Seed Set B
0.933       CAREER: executive, management, professional...                   FAMILY: home, parents, children, family, cousins...
0.910       ASIAN: asian, asian, asian, asia, china...                       CAUCASIAN: caucasian, caucasian, white, america...
0.909       FEMALE: sister, mother, aunt, grandmother...                     MALE: brother, father, uncle, grandfather, son...
...         ...                                                              ...
0.375       FEMALE: countrywoman, sororal, witches...                        MALE: countryman, fraternal, wizards, manservant...
0.110       NAMES ASIAN: cho, wong, tang, huang, chu...                      NAMES CHINESE: chung, liu, wong, huang, ng...
0.050       NAMES BLACK: harris, robinson, howard...                         NAMES WHITE: harris, nelson, robinson...

Table 4.6: When two seed sets are more semantically distinct they are more distinguishable in the resulting geometric subspace. The top table shows pairs of artificially generated seed sets, ranked by their coherence for WEAT in the NYT dataset. The bottom table shows pairs of seed sets gathered from published papers, ranked by their coherence for WEAT in the WikiText dataset. Scores are averaged across 20 bootstrapped samples of the training data, and values are rounded; no coherence scores are exactly 1.0. Higher coherence scores indicate that the seed pairs were projected farther apart in the bias subspace.

Set Similarity.
By sampling random seed sets we find that it is more difficult to represent the variance of seed sets that are too close together. Figure 4.5 shows that set similarity (cosine similarity between the set mean vectors) is significantly correlated with explained variance for generated sets (Pearson r = −0.67, p < 0.05). We highlight two comparisons between gathered sets intended to measure racial bias that explain different degrees of variance. Synthetic pairings generally explain more variance than pairings of gathered sets of equal similarity, although for gathered sets we cannot control for POS and frequency. Table 4.6 shows the generated seed sets ranked by coherence, where higher scores indicate that the bias subspace was able to separate the seed sets. Similar seed sets and sets with duplicates (e.g., the pairing in the table in which both generated sets contain food terms) have low coherence scores.

[Figure 4.5 plot: Explained Variance (vertical axis) against Set Similarity (horizontal axis) for generated and gathered seed sets, with the gathered pairings Black vs White Roles and Black vs White Names highlighted.]

Figure 4.5: Identifying bias is less effective when set pairs are similar. Generated seeds are frequency-controlled nouns from the WikiText dataset. We highlight two sets of gathered seeds; both target similar racial categories but the name-based sets are more similar and explain less variance. We find similar trends for WEAT, coherence, and the other corpora and POS.

4.2.4 Discussion and Impact: Biases All the Way Down

Almost all recent work on bias measurement relies on sets of seed terms to ground cultural concepts in language. These tools are often used to support urgent appeals for fairness and accountability in machine learning systems. If we do not pay attention to the seeds, these methods will lack foundation and the claims they support will be left open to criticism and dismissal. Seeds and their rationales need to be tested and documented, rather than hidden in code or copied without examination.
Some of the risks discussed here may seem obvious in retrospect, but our literature survey suggests there are widely varying levels of evaluation and documentation in recent published work. Rationales for picking sources or seeds are not always explained, or the reader is left to assume that prior work has adequately validated the seeds. Tests for frequency, semantic similarity, and other features are rare or non-existent, and clear definitions and discussion of limitations are often missing. Permutation tests are sometimes used (e.g., Caliskan et al. [2017]), but these do not account for seeds outside of those already selected. Significantly different results can be found using alternative seed sets for the same target concept, and fine-grained comparisons require validation on multiple sets.

We faced a number of challenges in gathering 178 seed sets from prior work. Sometimes seeds are shared online at an undocumented location and sometimes hard-coded into code repositories; this can significantly obscure the seeds from public view, which is troubling for tools intended for wide use on sensitive topics. Documentation is often scattered across locations, and in more than one case, we found contradictions between different sources for a single project. In one case, we were unable to find the full list of seeds used in the paper, and in several cases, it was unclear which seed sets were used for which experiments. While some authors went to commendable lengths to document their materials, there is a need for more consistent and transparent documentation.

We recommend that researchers carefully trace the origins of seed sets, with attention to the risks associated with the origin type. We also recommend that researchers examine seed features. POS, frequency, semantic similarity, and pairing order can significantly affect the results of bias measurements.
Seeds should be both examined manually and tested; importantly, they should be compared to alternative seeds with different attributes. To assist with this, we release a compilation of 178 seed sets from prior work. These tests are particularly important when comparing biases across datasets. Finally, researchers should document all seeds and the rationales underlying their design, including concept definitions. We add to recent calls for better documentation and problem specification in machine learning [Bender and Friedman, 2018, Gebru et al., 2018, Mitchell et al., 2019, Blodgett et al., 2020] and in studies of social biases in technology [Olteanu et al., 2019]. Specifically, when the seeds intentionally encode harmful stereotypes or slurs, it can be beneficial to include a trigger warning or not highlight the seeds in the paper; however, full seed lists should always be accessible, not hard-coded, with unique labels matched to experiments.

Ultimately, our goal is not to eliminate a problem but to illuminate it:1 to help practitioners think through the potential risks posed by seed sets used for bias detection. We encourage thoughtful, critical studies, but we observe a trend in which seed sets are used in new research and applications simply because they have been used in prior published work, without additional vetting. Research precedents can take on a life of their own and we have a responsibility to explore and document possible sources of error. We believe that seed sets can be useful and are probably unavoidable, but that no technical tool can absolve researchers from the duty to choose seeds carefully and intentionally.

1 “All problems can be illuminated; not all problems can be solved.” – Ursula Franklin (quoted by M. Meredith via Olteanu et al. [2019] in http://bb9.berlinbiennale.de/all-problems-can-be-illuminated-not-all-problems-can-be-solved/)

CHAPTER 5
PERSONAL HEALTHCARE EXPERIENCES

Healthcare systems in the U.S.
face a range of important challenges, including providing equitable care, addressing physician burnout, and discovering causes and cures for understudied conditions. NLP methods can contribute to addressing these challenges by measuring statistical patterns across large collections of text data, e.g., by extracting and linking medical entities like medication names or by generating or parsing EHR notes. Much healthcare research in NLP focuses on either EHR data or biomedical academic publications, leaving out patients’ direct narration of their own experiences (as well as the experiences of those who experience health issues but are not patients). If we could statistically harness the emotional and narrative dimensions of personal healthcare stories, then we could better identify benefits and harms to patients.

In a study of an online healthcare support community, I modeled narrative patterns and power hierarchies in birth stories [Antoniak et al., 2019]. These stories share interactions, emotions, decision-making, and deeply personal reflections and reframings of a medical experience that can sometimes be traumatizing. Using topic modeling to discover themes with probabilistic connections, I found diverging narrative pathways (medicalized and unmedicalized) as well as outlier stories, whose event sequences are unlikely in this community and which tend to be framed by the authors as “traumatic” and “unplanned” but with “happy endings.” By parsing the stories and using a lexicon of verbs annotated with directional power labels, I found that the authors frame themselves as having the least amount of power, except for the baby, and frame the midwife and doula, who often function as advocates for the pregnant person, as having very high levels of power. This highlights the complicated social history and legal standing of doulas in the U.S. healthcare system. I discuss this case study in more detail below.
5.1 Healthcare Datasets for Natural Language Processing

For healthcare research that relies on large quantities of data, a constant challenge is obtaining training and test data that matches the intended use case and domain, properly accounts for data ethics and privacy constraints, and does not incorrectly privilege one viewpoint (e.g., medical professionals) over another (e.g., patients).

Electronic health records (EHR). Much NLP research for healthcare focuses on electronic health records (EHR) or electronic medical records (EMR) data [Demner-Fushman et al., 2020, Rumshisky et al., 2020]. These patient-specific records include information about medications, billing, laboratory tests, vital signs, study reports, procedures, and free text notes describing a patient and/or appointment. While a few large EHR datasets, like MIMIC-III [Johnson et al., 2016], are accessible to researchers, these datasets represent snapshots of specific hospitals, locations, and people. They also lead to increased research attention for extracting structured data based on idiosyncrasies in the EHR format, e.g., extracting ICD codes [Zhang et al., 2020]. While these are important records of medical appointments and other interactions between clinicians and patients, they are told exclusively through the clinician’s point of view.

Biomedical research publications. Another popular healthcare data source is biomedical research publications [Demner-Fushman et al., 2020]. For example, the largest category of papers included in the Semantic Scholar Open Research Corpus (S2ORC) dataset are medical studies [Lo et al., 2020]; the Cord-19 dataset includes papers related to COVID-19 from PubMed, the World Health Organization, bioRxiv and medRxiv [Wang et al., 2020b]; and Percha and Altman [2018] annotate a dataset of Medline abstracts.
These academic datasets give rise to their own specific set of tasks, mostly aimed at building knowledge graphs from the included papers [Percha and Altman, 2018]. These datasets are particularly useful for drawing connections between seemingly unrelated research topics, but like EHR datasets, they leave out the patient’s voice.

Online healthcare support communities. Online health communities (e.g., forums focused on supporting people with a particular health condition) allow patients to give and receive emotional and informational support [Yang et al., 2019a,b] and to share expressive writing about sometimes traumatic experiences [Ma et al., 2017a]. These communities range in topic from men’s infertility [Patel et al., 2019] to cancer [Yang et al., 2019b] to mental health [Chancellor et al., 2019]. Compared to social networks or real-life support groups and conversations, disclosure on a public online forum can feel more private due to the anonymity of the users, physical privacy of the users (emotional reactions can be hidden), and a sympathetic, knowledgeable audience [Gold et al., 2012]. These communities allow people with similar healthcare concerns to exchange advice and research, share stories and experiences, and organize and advocate for themselves.

5.2 Case Study: Online Childbirth Narratives

Birth stories are narratives of real experiences giving birth, often written with great medical and emotional detail and (in recent years) publicly posted on forums, blogs, and video-sharing websites. Birth stories are interesting from both a computational and a healthcare perspective. On the computational side, birth stories are an ideal test set for narrative analysis, a task which frequently suffers from a lack of datasets that are both realistic and not overly challenging. While no two birth stories are the same, most stories include common sequences of events and common sets of personas, making them ideal testbeds for modeling narratives.
On the healthcare side, the motivations behind writing birth stories are complex. Possible motivators include writing as a form of self-tracking and monitoring [Epstein et al., 2017], asserting agency and disrupting cultural norms and society’s surveillance of pregnant people [Tangherlini, 2000], and distrust of a medical profession that is biased in responding to women’s self-reported pain [Hoffmann and Tarzian, 2001, Chen et al., 2008]. These suggest several lines of inquiry, including measurement of portrayed power of different actors in the birth stories.

5.2.1 Data Curation

We collect 2,847 birth stories from the social website Reddit. While birth stories exist in many venues and forms, we choose to focus on Reddit for its accessibility and well-studied communities. These stories were posted publicly from February 16th, 2011 to February 28th, 2018 on the subreddit (forum) r/BabyBumps (all data available up to the date of collection). r/BabyBumps is a forum intended to be a “place for pregnant redditors, those who have been pregnant, those who wish to be in the future, and anyone who supports them.” This community includes a wide range of posts related to pregnancy and birth, including humor, requests for advice, rants and venting, recommendations, and journaling posts (e.g., bump and ultrasound photos, summaries of doctor appointments, birth announcements and stories). The community rules explicitly instruct members to post detailed birth stories rather than only photos and one-line descriptions.

We perform two rounds of filtering: first, we select the posts that contain the n-gram “birth story” in the title, and second, we remove 348 posts that contain fewer than 500 words. We remove these short posts as a second step of data cleaning as many of these shorter posts are either not birth stories or are only parts of birth stories published in installments.
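The two filtering rounds described above can be sketched as follows. The post dictionaries with "title" and "body" fields, the function names, the whitespace-based word count, and the case-insensitive title match are all assumptions made for this illustration; the text specifies only the title n-gram and the 500-word cutoff.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of "words")."""
    return len(text.split())


def filter_birth_stories(posts, phrase="birth story", min_words=500):
    """Keep posts whose title contains the phrase and whose body meets the word cutoff.

    Returns the kept posts and the number of titled posts removed for being too short.
    """
    # Round 1: title must contain the n-gram (case-insensitive match is an assumption).
    titled = [p for p in posts if phrase in p["title"].lower()]
    # Round 2: drop posts with fewer than min_words words.
    kept = [p for p in titled if word_count(p["body"]) >= min_words]
    return kept, len(titled) - len(kept)


# Toy posts (made up): one long story, one short installment, one unrelated post.
posts = [
    {"title": "My Birth Story (long)", "body": "word " * 600},
    {"title": "Birth story part 1", "body": "short post"},
    {"title": "Bump photo", "body": "word " * 700},
]

kept, removed_short = filter_birth_stories(posts)
print(len(kept), removed_short)
```

Returning the count of removed short posts alongside the kept posts makes it easy to report how much data each filtering round discards.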
We do not include comments, upvotes, or other interactions in our analysis; only the parent posts, containing the stories, are included.

This set of stories constitutes a small sample of birth stories posted online, and an even smaller sample of all birth stories, told or untold. Because the stories were posted anonymously to a forum, we do not have demographic data for the authors of the stories. All stories are written in English, the majority appear to take place in western and developed countries, and the authors have access to the internet and to Reddit. As a first research foray into the computational study of birth stories, we expect not that the patterns observed in r/BabyBumps will generalize to all other birth stories, but that we will provide evidence of the value and research interest of all birth stories.

5.2.2 Narrative Analysis

The stories range in length from a minimum of 500 words (our selected cutoff) to a maximum of 6,057 words. Despite this difference in word count, the stories usually begin with the same events (arrival of the due date, water breaking, contractions starting) and end with the same events (birth and weighing of the baby, breastfeeding, leaving the hospital), though a few outliers break off in the middle of the story (e.g., sharing the story in installments). In order to compare these sequences of events across all the stories in the dataset, we divide each story into ten equal sections, and we then use these sections to calculate statistics of interest averaged over all the stories for the corresponding section. We refer to these sequences of normalized sections as story time.

We find that simple methods are sufficient to identify readily interpretable events and event sequences, or scripts. We train a latent Dirichlet allocation (LDA) [Blei et al., 2003] topic model with 50 topics on the birth stories collection, using 100-word chunks as the training documents.
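The two segmentation schemes used here (fixed 100-word chunks as topic model training documents, and ten equal sections per story for story time) can be sketched as below. The function names, the decision to keep a shorter final chunk, and the integer-division section boundaries are assumptions made for this illustration; topic model training itself is not shown.

```python
def chunk_words(tokens, size=100):
    """Split a token list into consecutive chunks of `size` words.

    Keeping the shorter final chunk is an assumption; the text does not say.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def story_time_sections(tokens, n_sections=10):
    """Divide a story into n_sections contiguous, near-equal sections."""
    n = len(tokens)
    bounds = [k * n // n_sections for k in range(n_sections + 1)]
    return [tokens[bounds[k]:bounds[k + 1]] for k in range(n_sections)]


def mean_counts_over_story_time(stories, target, n_sections=10):
    """Average count of `target` in each story-time section across all stories."""
    totals = [0] * n_sections
    for tokens in stories:
        for k, section in enumerate(story_time_sections(tokens, n_sections)):
            totals[k] += section.count(target)
    return [t / len(stories) for t in totals]


# Toy stories (made up): "water" appears early in one story and midway in the other.
stories = [
    ["water"] + ["x"] * 19,
    ["x"] * 10 + ["water"] + ["x"] * 9,
]

chunks = chunk_words(["w"] * 250)
profile = mean_counts_over_story_time(stories, "water")
print([len(c) for c in chunks], profile)
```

Because every story is mapped onto the same ten sections, statistics from stories of very different lengths can be averaged position by position.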
We then divide each story into 10 equal segments and plot the distributions of the topics over the resulting normalized story time. To establish additional validity, we can compare these topics to descriptions of the birth process from health care providers such as the Mayo Clinic.1 We calculate the probability of transitioning between topics by finding the most probable topics for each segment of text, counting the number of times each pair of topics occurs in neighboring text segments, and normalizing by the total number of transitions. We compare the learned topic transitions to the Mayo Clinic’s document, which describes a “normal” birth path with few deviations.

1 https://www.mayoclinic.org/healthy-lifestyle/labor-and-delivery/in-depth/stages-of-labor/art-20046545

[Figure 5.1 panels (Topic Probability against Story Time), labeled by top words: sleep night hours rest slept; water broke fluid break broken; hospital home car bag drive; pitocin contractions started hours start; epidural pain relief anesthesiologist meds; push pushing head pushed pushes; cord crying chest cry cut; breastfeeding day milk days feeding.]

Figure 5.1: A selection of topics over time. Plots are labeled with the five highest probability words for each topic. Results show the probability for each topic at 10% intervals of story time, averaged across all stories. Error bars show standard deviation across bootstrapped samples of stories.

[Figure 5.2 panels (Persona Frequency against Story Time): nurse, doula, midwife, doctor, baby, anesthesiologist, we, author.]

Figure 5.2: Histograms showing the frequencies of persona mentions over story time. Some entities (e.g., author) are consistently more frequent than rare entities (e.g., doula). Some frequency patterns are expected while others are surprising (e.g., frequency of we decreases near the middle of the stories).

Using the learned transition probabilities between topics, we identify outlier stories containing less probable transitions. We rank the stories according to the summed log probabilities of the transitions of the last five 100-word chunks (the same size used for the topic model training). We limit the number of transitions in each story to five to control for the varying lengths of the stories, and we choose the last five transitions under the hypothesis that the story endings are more likely to display variation in sequences of events than the story beginnings (e.g., stories that end in emergency surgeries or unexpected trips to the hospital). For each bigram in the post titles (where authors frequently include tags or labels that frame and situate the story within the community), we average the log probabilities for all stories that include that n-gram in the post title. This allows us to measure the correlation between the outlier stories and the authors’ framing of the stories.

Figure 5.3: Flowchart of the most probable topic transitions (above 0.2%). We removed one orphan node without a parent path leading to the beginning of story (BOS) state.
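The transition counting and outlier scoring described in this section can be sketched as follows. The sketch assumes each story has already been reduced to a sequence with one most probable topic label per chunk; the function names, the toy topic labels, and the small floor probability for unseen transitions are assumptions made for this illustration.

```python
import math
from collections import Counter


def transition_probabilities(topic_sequences):
    """Count neighboring topic pairs and normalize by the total number of transitions."""
    counts = Counter()
    for seq in topic_sequences:
        counts.update(zip(seq, seq[1:]))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def ending_log_probability(seq, probs, n_transitions=5, floor=1e-12):
    """Sum of log probabilities over the last n_transitions transitions of a story.

    Unseen transitions receive a small floor probability (an assumption).
    """
    pairs = list(zip(seq, seq[1:]))[-n_transitions:]
    return sum(math.log(probs.get(pair, floor)) for pair in pairs)


# Toy topic sequences (made up): two common paths and one rarer path.
sequences = [
    ["labor", "push", "birth"],
    ["labor", "push", "birth"],
    ["labor", "surgery", "birth"],
]

probs = transition_probabilities(sequences)
scores = [ending_log_probability(seq, probs) for seq in sequences]
# Lower (more negative) scores mark stories with less probable endings.
ranked = sorted(range(len(sequences)), key=lambda i: scores[i])
print(ranked[0], scores)
```

Averaging these scores over all stories whose titles share a bigram would then give per-bigram values of the kind used to relate outlier stories to the authors' framing.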
The discovered sequences of events, persona patterns, and story likelihoods describe the narrative norms and expectations of the r/BabyBumps community. These norms are informed by the biological process of birthing and the medical organization and standardizing of procedures; but the specific sequence of events that the authors in r/BabyBumps choose to highlight also arises from this online health community’s particular expectations and priorities.

By calculating the transition probabilities between topics, we construct a flowchart that maps the community’s understanding of event sequences. A topic about cutting the cord is often immediately followed by a topic about birth weight and length. A set of diverging pathways emerges in the flowchart, in which an unmedicalized set of events is mirrored by its medicalized version (e.g., instead of a topic about pushing, a topic about c-sections). While many of these events are familiar from official medical documentation, some of the events (e.g., packing and going to the hospital, a meta-topic about birth stories) are particular to the community and the authors’ points of view.

Finally, we use these common narrative arcs to identify stories that break the mold with less probable topic sequences. These stories tend to be labeled by the authors as “traumatic” and “unplanned” yet are also often labeled with the term “happy ending.”

Highest                             Lowest
Probability   Bigram                Probability   Bigram
-34.19        positive medicated    -36.05        story warning
-34.27        positive hospital     -36.11        unplanned c
-34.30        med free              -36.13        slightly traumatic
-34.52        positive induction    -36.27        natural birth
-34.53        story ftm             -36.40        belated birth
-34.73        vaginal delivery      -36.42        positive unmedicated
-34.77        story hospital        -36.42        emergency c
-34.83        weeks pp              -36.53        trigger warning
-34.85        hour labor            -36.60        induction epidural
-34.88        super long            -36.84        happy ending

Table 5.1: The bigrams drawn from the post titles associated with the most and least probable stories.
Probabilities represent the means of the summed log probabilities of the last ten topic transitions in a story. Lower scores indicate stories with more unusual topic transitions (sequences of events). Results are averaged (mean) across bootstrapped samples of the stories.

The strengths of individual topics over time track with our expectations of the stories, confirming that the dataset contains enough shared structure to be easily detectable. For example, we can see in Figure 5.1 that water breaking happens near the beginning of the story; a topic about contractions starting and/or pitocin (a medication that induces labor) being administered peaks around 30% into the story; there is little sleep in the middle of the story; photos are shared sometimes at the beginning of the story but usually near the end of the story. We note that the model was trained with no information about word position within stories, but topics nevertheless exhibit strong temporal clustering.

5.2.3 Framing of Power

To examine the framing of power hierarchies in these stories, we employ a lexicon of verbs annotated with directional power labels [Sap et al., 2017]. By parsing the stories, we can identify each persona as the subject or object of these annotated verbs and calculate an average power score both for individuals and for dyads.

Persona            N-Grams                                               Total Mentions   Stories Containing Mentions   Average Mentions Per Story
AUTHOR             I, me, myself                                         210,795          2,846                         74.0
WE                 we, us, ourselves                                     24,757           2,764                         8.7
BABY               baby, son, daughter                                   14,309           2,668                         5.0
DOCTOR             doctor, dr, doc, ob, obgyn, gynecologist, physician   10,025           2,262                         3.5
PARTNER            partner, husband, wife                                8,998            2,006                         3.2
NURSE              nurse                                                 7,080            2,012                         2.5
MIDWIFE            midwife                                               4,069            886                           1.4
FAMILY             mom, dad, mother, father, brother, sister             3,490            1,365                         1.2
ANESTHESIOLOGIST   anesthesiologist                                      1,398            876                           0.5
DOULA              doula                                                 896              256                           0.3

Table 5.2: Personas identified in the birth stories collection and the n-grams used to classify the personas.
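One way the average power score for individual personas might be computed is sketched below. It assumes the stories have already been parsed into (persona, verb, role) triples and that each lexicon verb is labeled with whether its agent or its theme holds power (or whether they are equal), in the style of Sap et al. [2017]. The function name, the +1/-1/0 scoring, and the toy lexicon entries are assumptions made for this illustration, not the scoring used to produce the reported results.

```python
from collections import defaultdict


def power_scores(mentions, lexicon):
    """Average power per persona from (persona, verb, role) triples.

    lexicon maps verb -> "agent", "theme", or "equal", naming who holds power.
    role is "subj" or "obj". A persona gains +1 when it fills the powerful slot,
    -1 when it fills the other slot, and 0 for "equal" verbs (assumed scoring).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for persona, verb, role in mentions:
        label = lexicon.get(verb)
        if label is None:
            continue  # verb not in the lexicon: skip the mention
        if label == "equal":
            delta = 0.0
        elif (label == "agent") == (role == "subj"):
            delta = 1.0
        else:
            delta = -1.0
        totals[persona] += delta
        counts[persona] += 1
    return {p: totals[p] / counts[p] for p in counts}


# Toy lexicon and parsed mentions (made up for illustration).
lexicon = {"check": "agent", "ask": "theme", "hold": "equal"}
mentions = [
    ("NURSE", "check", "subj"),
    ("AUTHOR", "check", "obj"),
    ("AUTHOR", "ask", "subj"),
    ("NURSE", "ask", "obj"),
    ("BABY", "hold", "obj"),
    ("AUTHOR", "smile", "subj"),  # not in the lexicon, ignored
]

scores = power_scores(mentions, lexicon)
print(scores)
```

Scores for dyads could be obtained the same way by restricting the triples to verbs whose subject and object are both labeled personas.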
As expected, the baby is very often framed as the least powerful person in these stories while the medical professionals (nurse, doctor, anesthesiologist) are framed as powerful. More surprisingly, the authors frame themselves as having the least amount of power, except for the baby, and frame the midwife and doula, who often function as advocates for the author, as having very high levels of power. Examining pairs of personas, we find that the doula is the only persona portrayed with more power when paired with the nurse, highlighting the complicated social history and legal standing of doulas in the U.S. healthcare system.

Author                   Baby                     Doctor                   Nurse                    Doula
get_subj        0.0404   do_subj         0.0129   call_obj        0.0760   check_subj      0.0499   call_obj         0.1412
know_subj       0.0219   want_obj        0.0113   check_subj      0.0271   call_obj        0.0382   suggest_subj     0.0357
start_subj      0.0215   get_subj        0.0110   decide_subj     0.0221   ask_obj         0.0272   get_subj         0.0306
do_subj         0.0199   start_subj      0.0081   get_subj        0.0159   give_subj       0.0171   hold_subj        0.0144
push_subj       0.0123   drop_subj       0.0072   do_subj         0.0145   get_subj        0.0149   keep_subj        0.0128
make_subj       0.0070   make_subj       0.0063   show_subj       0.0116   keep_subj       0.0147   show_subj        0.0122
wake_subj       0.0069   turn_subj       0.0055   break_subj      0.0105   hold_subj       0.0116   remind_subj      0.0117
keep_subj       0.0065   show_subj       0.0027   give_subj       0.0091   start_subj      0.0113   give_subj        0.0101
decide_subj     0.0059   enjoy_obj       0.0022   start_subj      0.0091   do_subj         0.0092   start_subj       0.0090
end_subj        0.0038   decide_subj     0.0021   make_subj       0.0088   bring_subj      0.0091   make_subj        0.0080
get_obj        -0.0028   give_obj       -0.0053   wait_subj      -0.0014   leave_obj      -0.0016   massage_subj    -0.0022
believe_subj   -0.0031   catch_obj      -0.0057   like_subj      -0.0014   believe_subj   -0.0018   cheer_subj      -0.0024
wait_subj      -0.0034   need_subj      -0.0072   mention_subj   -0.0026   bring_obj      -0.0018   fill_subj       -0.0025
lose_subj      -0.0035   feed_obj       -0.0076   reach_subj     -0.0029   mention_subj   -0.0023   recognize_subj  -0.0027
hear_subj      -0.0057   bring_obj      -0.0089   explain_subj   -0.0036   explain_subj   -0.0048   alarm_obj       -0.0028
call_subj      -0.0069   put_obj        -0.0113   offer_subj     -0.0039   offer_subj     -0.0052   warn_obj        -0.0028
ask_subj       -0.0097   deliver_obj    -0.0196   call_subj      -0.0089   want_subj      -0.0060   join_subj       -0.0057
need_subj      -0.0117   push_obj       -0.0240   ask_subj       -0.0133   call_subj      -0.0174   get_obj         -0.0141
check_obj      -0.0137   hold_obj       -0.0306   get_obj        -0.0136   get_obj        -0.0195   ask_subj        -0.0171
want_subj      -0.0266   get_obj        -0.0369   want_subj      -0.0152   ask_subj       -0.0238   hire_obj        -0.0438

Figure 5.4: Most frequent verbs from the power lexicon associated with each persona in the birth stories corpus. Green indicates a positive power contribution, while pink indicates a negative power contribution. The cell values indicate the proportion of persona mentions with the given verb and power relationship.

[Figure 5.5 plots: (a) Power (horizontal axis) for each persona: baby, author, partner, anesthes., nurse, family, doctor, we, doula, midwife; (b) persona-by-persona grid over the same personas, with a scale from −0.8 to 0.8.]

Figure 5.5: (a) Power scores for each persona. Error bars show standard deviation over 20 bootstrap samples of the collection. (b) Estimated power of personas (rows) over other personas (columns). The NURSE is consistently framed as more powerful than the other personas, except for the DOULA.

5.2.4 Ethical Considerations

The dataset of unsolicited birth stories is a valuable resource for analysis of an online health community, and we encourage further work on this and other medical narrative datasets which prioritize the patient’s voice. Many birth stories recount experiences in which the pregnant person felt that they were not empowered, and drawing attention to these voices can benefit the broader community of pregnant people. However, we caution practitioners to handle this data with care. Research value, even for the community of pregnant people, and reproducibility must be balanced against the specific ethical concerns surrounding publicly shared medical data [Janssens and Kraft, 2012, Vayena et al., 2015, Abbott et al., 2019].
We identify a series of tensions inspired by both prior work on ethical use of online medical data and the three guiding principles of the Belmont Report, as discussed in Chapter 3.

While the stories posted to Reddit are already public, copying that data prevents the authors from removing or editing their stories. Birth stories contain extremely personal medical, interpersonal, and emotional information not just about the author but also about the baby, who cannot consent to this public sharing. While the authors posted their stories to a public venue, our own exploration of their motivations (e.g., resistance against surveillance, regaining power) does not include providing material for researchers.

Due to the very sensitive nature of the dataset, we choose to not release either the dataset or the URLs and instead prioritize the authors’ protection. To support the replication of our results on other birth stories datasets, we release our labeling pipeline (e.g., n-grams used for labeling, pre-processing steps). We follow the recommendations in Bruckman [2002] and Yang et al. [2019b] to mask the stories by only providing paraphrases rather than exact text snippets in all of the examples highlighted in this paper, which minimizes the possible identification of and harm to the authors. These potential harms are weighed against our hope that the results of our study will aid in the understanding of modern experiences of pregnancy and birth and spur further research that centers the pregnant person’s voice. We also shared the final paper and a public-facing blogpost with the r/BabyBumps community to support the principle of respect for persons. We strongly recommend that researchers using this data follow these same privacy-motivated guidelines.

5.2.5 Discussion and Impact

We view this study as a “close reading” via a computational lens of a specific online community.
This computational reading allows us to discover patterns and outliers within our target community and also to prove the feasibility of similar research on other communities that share medical stories. We do not claim that the patterns discovered here will hold for all other collections of birth stories; instead, we claim that such analysis can a) provide specific, statistical evidence of patterns suggested or observed in prior work and b) prompt further research in similar communities.

Narrative analysis is a challenging task in natural language processing, partly because of the difficulty of creating datasets that are realistic but not too complex for current models. We suggest that birth stories hit a sweet spot between formulaic, artificial datasets and complex, organic datasets; they are organically created (written spontaneously by the authors), share narrative structure and are constrained by topic despite each story being unique, are plentiful enough to act as training data for machine learning models, and suggest real-world motivations for their analysis. We successfully identify sentiment, topic, and person-based patterns that demonstrate the recoverable narrative qualities of birth stories.

By uncovering this shared structure, we discovered events not described in medical literature that are nevertheless important to the authors in this community. We also use the learned topic patterns to discover outlier stories whose sequences of events do not match those of the community’s expectations, and we examine the framing of these stories. The authors of these stories emphasize unexpected events, negative or triggering experiences, and “happy endings” in their titling and framing of the posts. These results suggest that a lack of control is associated with negative emotions, as found in Bylund [2005], and that in this community, reframing of these unexpected events around “happy endings” is common.
Many of our methods rely on averaging across bootstrapped sets of stories in order to recover overarching themes in this community. We recognize that the “average” story is not representative of every story in this community. We explore the question of outlier stories when we use the learned pattern of events to identify stories whose sequences of events are unusual. The authors often label these stories as unexpected, which confirms our interpretation of these outliers as stories that fall outside the expectations of the community. In this sense, discovering archetypes through averaging is what allows us to discover, by comparison, stories in the minority.

For medical professionals, these results could help inform care decisions and priorities. While doctors, nurses, and other medical professionals observe hundreds of births every month, they do not observe these births through the pregnant person’s eyes, and their interactions with the pregnant person are mediated by their power differential (the medical professional’s technical knowledge, their place in the hierarchy of the hospital setting, their gender, race, age, and education level, etc.). Birth stories allow these professionals to view a routine procedure through the fresh eyes of a person who is experiencing pregnancy and birth for perhaps the first time and to discover areas that could be improved for the patient, like the events (e.g., traveling to and arriving at the hospital) that are prominent in birth stories but not usually highlighted in medical documentation of births. Postpartum depression can be alleviated through attention to the patient’s emotional needs and feelings of agency during the birth [Callister, 2004, Stewart and Vigod, 2016], and our work has highlighted that these needs sometimes go unmet within this community.

While the experiences of all pregnant people are valuable, we particularly highlight the importance of listening to underrepresented perspectives.
While we were not able to control for race, education level, or other demographic variables in this study, we hope that our results show that such computational analysis of birth stories is feasible and valuable. We have demonstrated that birth stories can highlight events and framings missing from the dominant medical narratives of birth.

CHAPTER 6
PERSONAL READING EXPERIENCES

Book reviewer values and audience expectations can be measured in online reading communities, highlighting tensions between subjective reading experiences, evolving community understandings, and societal judgments. In my work, I have used a variety of unsupervised methods (linguistic and user similarity metrics) to map an online social reading community’s understanding of genres based on the language expressed in reviews and the free-text tags creatively applied by individual community members.

Where prior genre-mapping methods relied on academics, critics, publishers, and library catalogs to assign genres, my work on online reading communities focuses on the common reader. These readers and their tags have effects in the offline world; for example, LibraryThing’s paid catalog, which is based on users’ tag assignments, is used to organize physical bookstores and libraries. Understanding the relationships between these users and tags will aid in understanding modern literary reception and popular views of genre.

6.1 Literary Reception and Online Book Reviews

The internet and social media have greatly increased the amount of available evidence about readers and reading communities. Earlier research about readers relied on sources such as archival materials (e.g., personal diaries), ethnographies, and surveys [Radway, 1991, National Endowment for the Arts, 2004]. These sources typically offer rich data about a small number of readers or more cursory data about a large number of readers, with little in between. Online social reading websites such as LibraryThing and Goodreads, where readers publish records of their thoughts in their own words and form social bonds with other readers, offer invaluable resources for the study of readers and reading communities.

Researchers in the fields of digital humanities and cultural analytics have started to take advantage of online book ratings and reviews to study readers, though they have mostly focused on Goodreads data. For example, the Stanford Literary Lab uses Goodreads ratings as metrics for general book popularity among readers [Porter, 2018]. Bourrier and Thelwall similarly use Goodreads ratings and reviews to understand the contemporary reception of 19th-century literature [Bourrier and Thelwall, 2020], while English et al. [2022] explore the overlap between Goodreads users who read “popular” books and users who read “prestigious” books.

A variety of predictive tasks have also been studied in the context of book reviews. These include popularity prediction [Maity et al., 2018] and automatic recommendation systems that incorporate user specialties [Wang et al., 2020a]. Resources such as the UCSD Book Graph, a dataset of scraped and labeled Goodreads reviews and user data, are intended for the tasks of item recommendation [Wan and McAuley, 2018] and spoiler detection [Wan et al., 2019]. Unlike these works, we use prediction only as a lens and not a tool to enforce an ontology.

Closest to this study of literary genres in online communities is Hegel [2018], which explored the influence of online reading communities like Goodreads on popular perceptions of genres by comparing these perceptions to those of professional literary critics. This analysis relied on classifying the themes in reviews via a supervised classifier and comparing those themes across a small set of genres, as well as comparing the collocations in which genre names were used in reviews.
Hegel [2018] found that “amateur” reviewers on Goodreads tend to organize genres into more fine-grained categories than professional reviewers, and they also tend to use more personal and evaluative language and language of legitimacy (“award-winning,” “greatest”) in their reviews, in contrast to professional reviewers, who focus more on publishing conventions and providing context and background (“debut,” “ya”). Hegel [2018] also created a map of the genres on Goodreads by clustering a set of genre tags that have been applied to a random subset of books and a publicly accessible subset of those books’ reviews. The resulting clusters are then used as labels for a prediction task using the texts of reviews, revealing that more established genres are easier to predict than subgenres or broad categories (e.g., “fiction”).

6.2 Collaborative Tagging Systems

A collaborative tagging system allows multiple users in a community to tag the same object, and aggregations of these tags are then shown as features of the object [Smith, 2007]. These tagging systems are also referred to as folksonomies, a neologism for “folk taxonomy” [Vander Wal, 2005, Weber, 2006, Vander Wal, 2007]. Crucially, collaborative tagging systems and folksonomies rely on uncontrolled vocabularies rather than pre-defined hierarchies and taxonomies and include interacting levels of personal and community tagging.

Why do users choose to participate in collaborative tagging systems? Motivations can include organization of one’s personal data as well as social recognition from other users [Wash and Rader, 2007] and identification of functions of the object (e.g., what the object is, who owns it) [Golder and Huberman, 2006]. Through a set of surveys, Bartley [2009] finds that LibraryThing users usually add tags for collection management, to add factual information, and to help others find books.
Tagging systems can also be seen as collaborative sensemaking, “orienteering”, or information foraging [Markus, 2001, Teevan et al., 2004, Pirolli, 2005]. Individual tagging decisions are sometimes influenced by other users [Sen et al., 2006, Golder and Huberman, 2006], indicating that users learn from other users and make sense of the tagging space together. These motivations and habits will likely vary depending on the design and functionality of the website that houses a given tagging system.

Several attempts have been made to categorize the tags used in folksonomies. For example, Golder and Huberman [2006] propose three tag classes (factual, subjective, personal), which are used in later work by Sen et al. [2006] to categorize movie tags. Heymann et al. [2010] divide the tags into six types (objective and content-based, opinion, personal, physical, acronym, and junk) and find that the majority of LibraryThing and Goodreads tags are objective and content-based, while Goodreads has more personal tags than LibraryThing.

Other work seeks to categorize the taggers themselves. For example, Körner et al. [2010] divide users into categorizers, who use a small set of hierarchical tags, and describers, who use many creative and non-hierarchical tags. They explore the emergent semantics in collaborative tagging systems and find that describers contribute more than their more rigid counterparts. Some work has found that users tag independently of other users [Rader and Wash, 2008], while Zubiaga et al. [2011] find that certain groups of users assign higher quality tags that are more useful for tag prediction systems.

Much prior work has focused on tagging systems as problems to be solved. If the tags are to be used as input for the creation of canonical systems and hierarchies, then the tags should be normalized.
Hypothesized synonyms should be conflated and ambiguities should be resolved to enhance information retrieval, recommendation, automatic tagging, and ontology construction [Landauer, 1984, Zubiaga et al., 2011, Kar et al., 2018]. For example, Heymann et al. [2010] emphasize three qualities of collaborative tagging systems (consistency, quality, and completeness) and compare to systems designed by experts.

Rather than learning a hierarchy, we want to use the associations of users to learn nuances about their understanding of genre and usage of tags. The point of collaborative tagging is to escape the hierarchical view of data and instead favor an inclusive, flexible structure [Golder and Huberman, 2006]. The non-hierarchical tagging system allows each object to be about several things simultaneously [Golder and Huberman, 2006]; this quality is exactly what allows genres on LibraryThing to overlap and intertwine. This overlap allows us to learn about cooccurrences, correlations, and relationships between genres according to a community in ways previously not possible.

6.3 Literary Genres

Many readers understand “genre” as a way of classifying literary works based on shared textual characteristics, such as similar plot structures, character types, or settings. This conception of genre is reflected in Wikipedia descriptions of genres, unsurprising given its descriptive goals as a popular encyclopedia. However, many literary scholars resist understanding genre as a neat classification system and instead emphasize genres as blurry, changing over time, and dependent on context [Pavel, 2003]. Underwood [2016] and Wilkens [2016] rely on computational classification precisely because other forms of classification fail. Genre, according to other scholars, is not something that books have, or something that can be found in the texts themselves.
Rosen and Pavel argue that genre is a tool that authors use to write books, akin to a “set of recipes” [Rosen, 2018, Pavel, 2003]. From another angle, Radway and others have explored genre as a product of the publishing industry, as categories that are used to market and sell books [Radway, 1991].

Within the fields of natural language processing and computational social science, research has focused on learning fixed genre categories from texts. A variety of approaches have been proposed for automatic genre identification [Biber, 1986, Kessler et al., 1997, Stamatatos et al., 2000, Worsham and Kalita, 2018], most focusing on book-length texts as training data. These works raise the question of what genre is: is it a set of surface level facets [Kessler et al., 1997] or is abstraction required [Worsham and Kalita, 2018]? Genre has also been successfully incorporated into book recommendation systems [Maharjan et al., 2018] and used for analysis of emotional and narrative arcs [Kim et al., 2017]. While we are similarly focused on genre definitions, similarities, and boundaries, we focus not on the book texts but on user reviews and tags; our goal is not to predict the “correct” genre label but to learn from users about how genres are understood and used in the LibraryThing community.

6.4 Case Study: Mapping Literary Genres on LibraryThing

6.4.1 Data from LibraryThing

LibraryThing is a social reading website, similar to Goodreads in much of its functionality but more transparent in its data. LibraryThing contains an enormous number of books and reviews. After years of user input, it also contains an enormous number of tags: over 167 million. We find that these tags form a “long tail” in which the majority of the tags are applied to a very small number of books. We cannot analyze all of these tags, both because of lack of space and because many of the tags are not associated with enough reviews to make reliable comparisons with other tags.
Our methods require that we control for several review characteristics (including rating polarity, review length, and book title), and most tags do not have enough data to properly control for all of these features. Restricting and holding constant our target tags also allows us to more easily make comparisons across different metrics. While the unconstrained, creative use of tags is part of what makes LibraryThing genres so interesting, we must find ways to scope down the tags for analysis.

Therefore, we manually identify a set of 20 target genres (shown in Table ??) by examining the most frequent 75 tags on LibraryThing. We discard tags that are too broad (e.g., fiction, to-read) or that are near duplicates of other tags (e.g., classic and classics).[1] Prior to our collection, LibraryThing already combined some synonymous tags, e.g., the fantasy tag includes Fantasy, fantasia, fantasía, and FANTASY. We choose these target genres rather than more creative or user-specific tags, because we are interested in how LibraryThing users re-imagine more conventional literary genres. While this decision leaves unexplored many areas of LibraryThing, and perhaps could be read as re-imposing traditional genres on the collaborative tagging system, we see these target genres as both touchstones and starting points. But there are also surprising and unconventional genres even in the most frequent 75 tags, like vampires, family, and animals, which do not fit traditional or scholarly conceptions of genre.

[1] There are cases where tags that are close in name operate very differently; e.g., books tagged french are usually books written in French while books tagged france are usually books set in France.

We scrape metadata for the 1,000 top books for each of these 20 target genres, where “top” books are those that have most often received the target tag.
Scraped book metadata includes the title, author, rating distribution, publication date, and tag cloud (counts for all the tags that all users have applied to the book). We scrape the full set of public reviews (review text, user ID, date, star rating) for each book, and for each reviewer, we scrape their public tag cloud (the tags they have personally applied). This results in a total of 17,440 books, 319,850 reviews, and 33,849 users.

The top books for each tag are not mutually exclusive. For example, a top book for the tag fantasy might also be a top book for the tag science-fiction. Even if the book is tagged science-fiction more often than fantasy, we will still add the book to the fantasy genre if its fantasy ranking is in the top 1,000. In other words, the top books are the most popular books for the tag, not the books most specific to the tag.

We find significant differences between the target genres, including mean review length, vocabulary size, mean star rating, and mean number of ratings. For example, picture books have a very high mean star rating and a very low mean number of ratings, while horror has a higher number of ratings but a much lower mean star rating. Users infrequently review picture books, but when they do, they rate them very positively, whereas users tend to be more critical of horror books, even though they review them more overall. For most genres, the vocabulary size is correlated with the mean length of the reviews, but outliers include vampires and young adult, which have small vocabulary sizes given their mean review lengths. These outliers suggest that reviews of vampires and young adult books tend to discuss more similar subjects in similar ways.

In order to compare the genre features of research interest, we use the following sampling sequence to control for features like review length which are not of interest. This method also controls for the influence of extremely popular books such as the Twilight series.
We remove reviews without ratings, reviews not written in English (using a simple filter requiring at least five English stopwords and fewer than five Spanish stopwords), duplicate reviews (where duplicates require identical review IDs, user IDs, and book IDs), and reviews with fewer than 100 words. To control for polarity, we randomly sample two positive and two negative reviews for each book. We define negative reviews as those with ratings between 0.5 and 3.5 stars and positive reviews as those with ratings between 4 and 5 stars. We choose a higher cut-off for negative reviews, rather than choosing the midpoint 2.5, because there is a strong skew across the book reviews towards positive ratings, and qualitatively, we find that a rating of 3.5 stars usually indicates serious criticisms of the book.

To control for the review length, which can vary significantly by genre, we retain only the last 100 words of each review text. This is a common preprocessing step in NLP analyses of texts with variable lengths; for example, see the discussion and sampling decisions in Danescu-Niculescu-Mizil et al. [2013]. Controlling for review length is particularly important in our analyses of diversity of themes present in reviews (e.g., our use of topic entropy), as longer reviews could by nature of their length contain more diverse language. In the case of online book reviews, we observe that reviewers are more likely to begin reviews with meta-content (e.g., where they read the book, personal stories unrelated to the book) while they are more likely to end the reviews with summaries of their thoughts, re-stating the different themes mentioned earlier. We use the last 100 words because our analysis is focused on the reviewer’s judgements of the books.

Books that do not meet all of these filtering requirements are discarded. Of the remaining books, we randomly sample 300 books per genre.
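The per-book filtering and polarity sampling described above can be sketched as follows. This is an illustrative sketch, not the released pipeline: the review dictionary keys, the function names, and the very short stopword lists are assumptions made for the example, and the stopword test counts token occurrences, which is one possible reading of "at least five English stopwords."

```python
import random

# Tiny illustrative stopword lists (assumption); a real filter would use fuller lists.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "it", "was", "this"}
SPANISH_STOPWORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "es"}
MIN_WORDS = 100           # minimum review length, and number of final words kept
REVIEWS_PER_POLARITY = 2  # reviews sampled per polarity per book


def looks_english(words):
    """At least five English stopword tokens and fewer than five Spanish ones."""
    tokens = [w.lower().strip(".,;:!?\"'()") for w in words]
    english = sum(t in ENGLISH_STOPWORDS for t in tokens)
    spanish = sum(t in SPANISH_STOPWORDS for t in tokens)
    return english >= 5 and spanish < 5


def polarity(rating):
    """Ratings of 4 to 5 stars are positive; 0.5 to 3.5 stars are negative."""
    return "positive" if rating >= 4.0 else "negative"


def sample_book_reviews(reviews, rng):
    """Filter one book's reviews, then sample two per polarity.

    Each review is a dict with hypothetical keys 'review_id', 'user_id',
    'book_id', 'text', and 'rating'. Returns None when the book cannot
    supply enough reviews of both polarities, so the caller can discard it.
    """
    buckets = {"positive": [], "negative": []}
    seen = set()
    for review in reviews:
        key = (review["review_id"], review["user_id"], review["book_id"])
        if key in seen or review["rating"] is None:
            continue  # drop duplicates and unrated reviews
        seen.add(key)
        words = review["text"].split()
        if len(words) < MIN_WORDS or not looks_english(words):
            continue
        # Keep only the last 100 words to control for review length.
        buckets[polarity(review["rating"])].append(words[-MIN_WORDS:])
    if any(len(bucket) < REVIEWS_PER_POLARITY for bucket in buckets.values()):
        return None
    return {label: rng.sample(bucket, REVIEWS_PER_POLARITY)
            for label, bucket in buckets.items()}
```

Passing a seeded generator such as `random.Random(0)` keeps the sample reproducible. Sampling 300 books per genre would then be a second `rng.sample` call over the books for which this function returns a result.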
We allow books to appear in multiple categories, as this reflects the reality of genre-crossing books, and we allow multiple books from the same author, as this reflects the outsized influence of prolific authors. Our sampling results in a total of 4,934 unique books (100 words per review, 2 reviews per polarity per book, 300 books per genre).

6.4.2 Ethical Considerations

As discussed in Chapter 3, online book reviews pose challenges for ethical data science, especially with regard to citation and quotation. On the one hand, LibraryThing reviews are public and usually intended to be read by a wide audience of other book lovers. Many reviewers clearly take pride in their reviews and tags, as evidenced by their profiles full of badges, descriptions of their reading habits, and interactions with other reviewers. Reviewers often use their real names or include information in their public profile (e.g., location, age, profession, photos) that makes them easily identifiable. Some reviewers are compensated for their reviews by authors or publishers, or they receive free books in exchange for reviews. All of this suggests that LibraryThing users and their labor deserve credit. On the other hand, book reviews represent personal opinions on a wide variety of sensitive topics, and this information could be harmful if revealed in a new context or to an unexpected audience. We can view the reviewers as writers who deserve credit for their work, or we can view them as people who might not want or expect their work to appear beyond LibraryThing. Most likely, different reviewers will have different perspectives on these questions.

Our study was considered exempt from our institution’s IRB.
Given the tensions discussed in Chapter 3, we do not release review texts or any data that is not easily viewable on the LibraryThing Zeitgeist web page.[2] Instead, we release the names of the 20 target genres as well as the 300 book IDs for each genre.[3] This maintains the review authors’ abilities to edit and delete their reviews, while still giving credit to the creative work that enabled this study [Bruckman, 2002]. For the reviews that we directly quote in this article, we contacted the authors, disclosed our identities and publication intentions, and asked permission for use of their creative work and whether they would like their username credited. If the authors did not want to be included or did not respond, we replaced these quotations with reviews written by authors who have given consent.

[2] https://www.librarything.com/zeitgeist
[3] https://github.com/maria-antoniak/librarything-genres

6.4.3 Mapping Methods

Book and User Overlap

We measure genre similarity using two metrics. First, using our sampled book sets, we measure the book overlap between each pair of genres, that is, how many books have been tagged as both one genre and another genre. Second, we measure the reviewer overlap between each pair of genres, that is, how many reviewers have tagged a book in one genre and a book in another genre. We convert both measurements into ranks, where the genre pair with the greatest book overlap has Rank 0.

We expect the user overlap rankings to largely mirror the book overlap rankings. If two genres share many books in common, it follows that they would also share many reviewers in common. First, the shared books will necessarily include shared reviewers and, second, the high overlap in books implies that the genres are thematically related. If a user finds one of the genres appealing, they are likely to also find the other genre appealing. We quantify these patterns by taking the difference between the user and book overlap rankings.

    overlap difference = user overlap rank − book overlap rank        (6.1)

Genre pairs with very high or very low scores are outlier pairs, which deviate from the expectation that book overlap rank should match user overlap rank.

Quantifying Lexical Fit: “Mistakes” and Surprises

We can classify the genre of a book being reviewed based on the text of the review alone (without the book’s own text, title, or author) because users focus on different aspects (e.g., characters, plot, suspense) for certain genres in their reviews. We follow similar work that has sought to predict genres from texts [Underwood, 2016, Kar et al., 2018], but our training set is reviews, rather than book texts, so that we can focus on the reception of a book rather than its content. Note that although we are training a classifier to quantify the association between words and labels, we are not running a predictive experiment with held-out testing data, but rather an evaluation on the full data set, more like a standard linear regression. Our goal is not to maximize predictive performance, but rather simply to computationally represent ambiguity and similarity between genres. As a result, our results should be interpreted as an upper bound for predictive accuracy, and not as a measure of generalization. This approach allows us to analyze the collection in two ways: first, if reviews for two genres cannot be easily distinguished even when the labels are available at training time, that is evidence that they serve the same values and expectations, and second, if a review is “surprising,” it may describe a setting in which a reader has a unique or idiosyncratic experience of a book.

We train a supervised classifier on the review texts, using as labels the genre of each review in our sampled data. We use a logistic regression (one-vs-all) model with TF-IDF weighted unigram features, using the last 100 words of each review (as described in our sampling procedure).
Genre labels are the most frequent tags assigned to the book, after filtering out a small set of high level tags that do not resemble genres.[4] We measure the surprisal of the review text given the genre using the probability of the true label: surprisal = 1 − P(true label). High surprisal scores indicate that the predicted probability of the true label was low and that the review was difficult to classify as its target genre. Low surprisal scores indicate that the predicted probability of the true label was high and that the classifier was able to predict the review’s target genre. This method generally works well at identifying reviews for books that blend different genres. When averaged, these scores can tell us which genres blend more with other genres.

Values and Expectations: Measuring Thematic Signatures

So far, we have summarized genres as one- or two-dimensional scores. This has allowed us to map genres onto an interpretable space where we can compare genres, measure their similarity, and identify outliers. However, while user and book overlap, predictive surprise, and community density are strong signals of genre similarity, they do not tell us why these genres are or are not similar.

Learning review aspects might help answer this question. Aspects are themes of a review, usually focused on features of the product being reviewed; in the case of books, these might include plot, characters, and writing style. Reviews are generally written to explain a rating, not a genre tag, but by measuring the amount users choose to write about particular aspects and averaging over reviews for a specific genre, we hope to approximate which aspects are most significant for that genre. Measuring which aspects users focus on for each genre will teach us the expectations and values that the LibraryThing community attaches to each genre.
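The averaging step described above can be illustrated with a small sketch. It assumes that each review has already been assigned a vector of aspect proportions by some topic model; the function names, aspect labels, and numbers below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def genre_signatures(review_aspects):
    """Average aspect-proportion vectors over all reviews of each genre.

    review_aspects: iterable of (genre, proportions) pairs, where proportions
    is a list of per-aspect weights for one review (assumed to share one length).
    """
    sums = {}
    counts = defaultdict(int)
    for genre, proportions in review_aspects:
        if genre not in sums:
            sums[genre] = [0.0] * len(proportions)
        for i, weight in enumerate(proportions):
            sums[genre][i] += weight
        counts[genre] += 1
    return {genre: [total / counts[genre] for total in sums[genre]]
            for genre in sums}


def top_aspects(signature, labels, k=1):
    """Return the k aspect labels with the highest average proportion."""
    ranked = sorted(zip(signature, labels), reverse=True)
    return [label for _, label in ranked[:k]]


# Toy aspect labels and per-review proportions (illustrative values only).
labels = ["plot", "characters", "writing style"]
review_aspects = [
    ("mystery", [0.6, 0.3, 0.1]),
    ("mystery", [0.8, 0.1, 0.1]),
    ("romance", [0.2, 0.7, 0.1]),
]
signatures = genre_signatures(review_aspects)
```

Comparing the resulting per-genre vectors, for example by listing each genre's highest-weighted aspects with `top_aspects`, gives a rough thematic signature for the genre.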
⁴ [fiction, non-fiction, to-read, ebook, kindle, literature, unread, own, hardcover, wishlist]

To answer these questions, we measure the thematic similarity of the review texts for our target genres. For our purposes, we take a relatively simple unsupervised approach, as we would like to discover themes rather than ordain them. We train a latent Dirichlet allocation (LDA) model [Blei et al., 2003] on the full set of scraped reviews, removing duplicate texts. Before training, we probabilistically downsample words associated with specific genres by using the Authorless Topic Models package [Thompson and Mimno, 2018b].⁵ This downsampling reduces the incidence of genre-specific topics, as we are more interested in cross-cutting themes that could be important for more than one genre (e.g., a Harry Potter topic would not be useful outside a narrow band of genres).

We experiment with different numbers of topics and find that 30 topics produce interpretable and not overly broad or narrow topics. For readability, we remove a set of common stopwords from the topic keywords, and we assign labels to each topic through manual examination of each topic’s most probable words and highest ranked documents.

⁵ https://github.com/laurejt/authorless-tms

Mapping Genres by Community Homogeneity

Do users “specialize” in specific genres, that is, often tag books in the same genre or write reviews that are lexically similar to other reviews in the genre? If so, how can we best measure this specialization, and what can we learn from this specialization about tagging and genre on LibraryThing? We hypothesize that there are different kinds of genre specialization. (1) A reviewer could be well-read in a particular genre and write reviews that are lexically similar to other reviews for books in that genre. (2) A reviewer could fit a genre lexically but only read one or two books in that genre.
(3) A reviewer could be well-read in a particular genre but their reviews might be lexical outliers, indicating that they apply a different framework to these books from other reviewers. We explore these possibilities through measures of lexical homogeneity and community homogeneity for each genre.

For lexical homogeneity, we rely on the surprisal scores learned previously. For community homogeneity, we use the personal tag cloud associated with each user, which represents all the tags they have assigned to books. We filter tags that occur in fewer than 20 tag clouds, and we find the cosine similarity between each pair of normalized vectors, where each vector represents the tags used by a user who has reviewed in that genre. A high cosine similarity indicates a high degree of similarity between the reviewers. This tagging similarity could be interpreted as a similarity in reading habits.

Relying on a user’s tagging history comes with some limitations. Users often tag books that they have not read, either for personal reasons (e.g., to mark the book for future reading) or as volunteer labor for the community (e.g., to add missing metadata for unpopular books). Users also employ tags for different functions, including personal cataloging (using idiosyncratic tags) and community contribution, and it’s possible that these preferences align with different communities. However, by limiting our comparison sets to those who have written at least one review for the target genre, we enforce a lower bound on user-genre relatedness.

6.4.4 Results

Measuring reader expectations and how these expectations are shaped by online communities can improve recommendations and increase engagement. Using the classics as a touchstone, we find that affordances of the Goodreads website can shape the discourse around these books.
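For reference, the community homogeneity measure from the preceding subsection (cosine similarity between the tag vectors of users who reviewed in a genre, after dropping rare tags) can be sketched as follows. Several details here are assumptions rather than facts from the study: tag weights are taken to be raw usage counts, the rare-tag filter is computed over whatever tag clouds are passed in, and the per-genre score is taken to be the mean over all unordered reviewer pairs. All names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def filter_rare_tags(tag_clouds, min_clouds=20):
    """Drop tags that appear in fewer than min_clouds users' tag clouds.

    tag_clouds maps user -> {tag: count}.
    """
    clouds_per_tag = Counter(tag for cloud in tag_clouds.values() for tag in cloud)
    return {
        user: {tag: n for tag, n in cloud.items() if clouds_per_tag[tag] >= min_clouds}
        for user, cloud in tag_clouds.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse tag vectors (dicts)."""
    dot = sum(weight * v[tag] for tag, weight in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty tag cloud is similar to nothing
    return dot / (norm_u * norm_v)


def community_homogeneity(tag_clouds, reviewers, min_clouds=20):
    """Mean pairwise cosine similarity among a genre's reviewers."""
    filtered = filter_rare_tags(tag_clouds, min_clouds)
    pairs = list(combinations(sorted(reviewers), 2))
    if not pairs:
        return 0.0
    return sum(cosine(filtered[a], filtered[b]) for a, b in pairs) / len(pairs)


# Hypothetical tag clouds, invented purely for illustration.
toy_clouds = {
    "u1": {"fantasy": 3, "dragons": 1},
    "u2": {"fantasy": 1, "magic": 2},
    "u3": {"crime": 4},
}
score = community_homogeneity(toy_clouds, ["u1", "u2", "u3"], min_clouds=2)
```

With the threshold lowered to 2 for this tiny example, only the fantasy tag survives filtering, so u1 and u2 become identical in direction while u3 has no remaining tags, giving a score of 1/3. Dividing by the vector norms inside the cosine plays the role of the normalization mentioned in the text.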
In the larger LibraryThing study, I find that reviewer expertise, distribution of review discourses, and sharing of minority and majority opinions can be genre-dependent, which leads to insights into how genres are re-defined outside of an academic setting. This suggests a model in which future reading habits and shifts in tastes are predicted by the similarities between a review and its neighboring reviews.

Book and User Overlap

Some genre pairs share no book overlap (e.g., politics and mystery; classics and graphic novel) while others share many books (e.g., children and animals; memoir and biography). 24% of the genre pairs have no book overlap. However, some outliers emerge. For example, in Figure 6.1, we notice that classics and animals have higher book overlap than we would expect given their user overlap. In contrast, classics and politics have higher user overlap than we would expect given their book overlap.

[Figure 6.1 image not reproduced: scatter plot of labeled genre pairs. x-axis: User Rank (from higher to lower user overlap). y-axis: Book Rank (from higher to lower book overlap).]

Figure 6.1: A mapping of LibraryThing genres. User overlap between genre pairs correlates with book overlap, but there are outliers. Each point represents two genres, and the axes represent the rank of the genre pair, where lower numbers indicate higher ranks and therefore higher overlap.
For example, the genre pair classics + animals has a mid-range user overlap rank and a high book overlap rank, indicating that these genres share surprisingly few users given how many books are shared. Pearson correlation between book and user overlap is significant (r = 0.68, p < 0.05).

For example, given the low number of books that have been tagged as both graphic novel and classics, it is surprising to see how many users read within both of these genres. This high user overlap could be explained by a tendency of users who review within the graphic novel tag to also review within the classics tag, or it could be explained by a tendency of users who review classics to read widely across many genres, including graphic novels. On the other hand, humor and picture book have relatively high book overlap but relatively low user overlap. Perhaps picture book reviewers read more frequently in other genres and only occasionally review a picture book, e.g., when they give a book as a gift or when reading a book to a child.

Lexical Surprisal

Figure 6.2 shows the relationship between misclassification counts and book overlap for each pair of genres. The high number of misclassifications between memoir and biography seems to conform to our expectations, as both genres are stories of a person’s life. Similarly, the frequent misclassifications of reviews for the animals, picture book, and children genres point to their commonality, as these sets of genres have high book and reviewer overlap. By comparing to book overlap, we can identify pairs with unusually high or low numbers of misclassifications given their similarity. For example, romance and horror have an unusually low number of misclassifications given their high book overlap, while animals and psychology have an unusually high number of misclassifications given that they share no books.
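Two of the quantities used in these results are simple to compute once labels and overlap counts are in hand. The sketch below tallies misclassifications per (true genre, predicted genre) pair and computes the rank difference of Equation 6.1 for each genre pair. It is a hypothetical illustration, not the study's code: the function names are invented, the overlap numbers are made up, rank 1 is assumed to denote the highest overlap (following the Figure 6.1 caption, where lower numbers mean higher overlap), and ties are assumed to be broken alphabetically.

```python
from collections import Counter


def misclassification_counts(true_labels, predicted_labels):
    """Count reviews per (true genre, predicted genre) pair, errors only."""
    return Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )


def ranks(scores):
    """Rank keys by descending score; rank 1 is the highest score.

    Ties are broken by the key itself (an assumption for this sketch).
    """
    ordered = sorted(scores, key=lambda key: (-scores[key], key))
    return {key: position + 1 for position, key in enumerate(ordered)}


def overlap_rank_difference(user_overlap, book_overlap):
    """Equation 6.1: user overlap rank minus book overlap rank, per pair."""
    user_rank = ranks(user_overlap)
    book_rank = ranks(book_overlap)
    return {pair: user_rank[pair] - book_rank[pair] for pair in user_overlap}


# Hypothetical labels and overlap counts, invented for illustration.
true_genres = ["memoir", "memoir", "animals", "romance"]
predicted_genres = ["biography", "memoir", "picture book", "psychology"]
errors = misclassification_counts(true_genres, predicted_genres)

user_overlap = {
    ("classics", "animals"): 10,
    ("classics", "politics"): 40,
    ("memoir", "biography"): 90,
}
book_overlap = {
    ("classics", "animals"): 30,
    ("classics", "politics"): 2,
    ("memoir", "biography"): 50,
}
differences = overlap_rank_difference(user_overlap, book_overlap)
```

Under these conventions a positive difference marks a pair whose user overlap is lower than its book overlap would suggest (classics + animals in the toy data), a negative difference marks the reverse (classics + politics), and zero means the two rankings agree.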
[Figure 6.2 image not reproduced: scatter plot of labeled (true genre, predicted genre) pairs. x-axis: Misclassification Count (Log). y-axis: Book Overlap (Log).]

Figure 6.2: The number of overlapping books and the number of genre misclassifications of user reviews for each pair of genres. Each point represents a pair of genres in which one is the true tag applied to the review text and one is the predicted tag from our model. As expected, we find a significant relationship using Pearson correlation (r = 0.65, p < 0.05) between the book overlap and misclassification count, but we highlight outlier genre pairs, e.g., animals and psychology have an unusually high misclassification count given their very low book overlap.

Often, the classifier’s mistakes indicate similarities and overlaps between genres. But on other occasions, the classifier’s mistakes indicate a mismatch between the reviewer’s priorities and the typical priorities for that genre. For example, the following review of Anne Brontë’s novel The Tenant of Wildfell Hall (1848) was misclassified as psychology when the book was actually tagged as romance:

    I was in awe of Anne Bronte’s ability to tell such a relevant story in 1848. There are so many women who find themselves in the same situation today. She was young and naïve when she married Arthur Huntingdon and by the time she learned his true character it was too late. The writing is wonderful and for me that story pulled me in completely.
    The author tells the story from Gilbert’s point-of-view at times and from Helen’s at other times. The changing narrative flowed well and never rang false. Bronte covers some intense subjects in the book. In addition to infidelity and alcoholism, she makes some disturbing observations about women’s rights during this time period. Sometimes it’s easy to forget how far we’ve come in the last few years.
    – bookworm12

The reviewer focuses on elements of The Tenant of Wildfell Hall that pertain to the characters’ psychological states and mental and physical health, as well as how these conditions relate to the broader society of the 19th century and of the present. A review that conformed more easily to the romance genre might have discussed the ending, the romantic plot, or the attractiveness of the characters. But these are not the elements that this particular reviewer discussed. The surprisal scores thus help us better understand the elements that readers care about or gravitate toward in a particular genre, as well as identify and interpret outlier reviews for the genre.

We show examples of the classifier output and surprisal scores in Table 6.1. For example, we show a review excerpt of Anna Sewell’s Black Beauty. The most popular tag for this book was animals but our model misclassified this review as classics with high confidence, resulting in a high surprisal score. The reviewer writes about the book’s popularity and compares its sales to well-known classics. The misclassification, in other words, flags the distinctiveness of both the book and the review, suggesting that what shapes genre perception is likely more than the text itself, which is discussed in the context of other books tagged as classics.
True Genre: romance
Predicted Genre: romance
Surprisal: 0.00
Example Review: I am not normally a fan of romance novels as I find them too mushy and cutesy, but this one had a sense of humor about it that I really enjoyed...The heroine was very independent and snarky and the main romance was full of comedic situations with a smattering of seriousness that made it seem fairly realistic for the genre. It was a book that was absorbing and fast to read. – Arualanne (The Perfect Rake)

True Genre: romance
Predicted Genre: historical fiction
Surprisal: 0.03
Example Review: There were some moments though where I had to wonder about the historical accuracy of some of the attitudes and that broke the reading spell for me. Pretty predictable but I enjoyed the ride. Almost a 4 read for me but not quite. – wyvernfriend (Simply Unforgettable)

True Genre: animals
Predicted Genre: classics
Surprisal: 0.20
Example Review: ...it’s no wonder it’s been so popular since it was first published. I was surprised to learn that Black Beauty is one of the top thirty best-selling books of all time in the English language, selling over 50 million copies– more than The Odyssey, To Kill a Mockingbird, Pride and Prejudice, and Gone with the Wind... – nsenger (Black Beauty)

Table 6.1: Example classifications and surprisal scores. Excerpts are selected from the last 100 words of the reviews. Higher surprisal indicates greater confidence in the incorrect label.

We can also arrive at a single surprisal score for each genre by taking the mean of the surprisal scores for the reviews assigned to that genre. The most surprising genres include young adult, family, classics, children, and fantasy. Genres that have higher mean surprisal scores are harder to classify; these genres are “fuzzier” and the language used in the reviews for these genres is more wide-ranging. These genres are often mistaken for similar genres, but it could also be the case that these genres are simply broader. For example, the classics genre contains a wide range of themes and discourses, in both its books and reviews.
The high surprisal scores emphasize the view of genres as fuzzy, overlapping tags, rather than the rigid hierarchy sought by Wikipedia editors.

Community Homogeneity

[Figures 6.3 and 6.4 images not reproduced: scatter plots of labeled genres. Figure 6.3: Review Surprisal (y-axis) against Community Homogeneity (x-axis). Figure 6.4: Rating (y-axis) against Community Homogeneity (x-axis).]

Figure 6.5: Are tighter communities easier to predict? Are tighter communities more critical? Figure 6.3 shows the target genres plotted along surprisal (the ability of a classifier to predict the genre of a review) and community homogeneity (averaged cosine similarities between reviewers’ tagsets). Figure 6.4 shows the target genres plotted along rating and community homogeneity. Community homogeneity and rating are negatively correlated according to a Pearson correlation test (r = -0.60, p < 0.05): genres whose reviewers have more similar reading habits tend to have lower ratings.

6.4.5 Discussion and Impact

There is not a single right way to map tags and genres in the LibraryThing community. Different maps reveal different outliers, pairings, and patterns. Reducing the rich tags to two dimensions will not answer all of our questions, but creating multiple mappings and comparing them has allowed us to tease apart some of the ways in which LibraryThing users see genre. Unlike much prior work, we do not seek to normalize the tags.
While the unconstrained vocabulary of tags on LibraryThing means that “errors” like synonyms, typos, and overly personalized tags do exist, we take advantage of this information and use it to discover what is new, rather than force it to fit a traditional structure.

By exploring thematic signatures of LibraryThing genres, we learn which aspects of the reading experience are valued by LibraryThing reviewers and how these values vary depending on the genre of the book being reviewed. We discover strange similarities (e.g., the resemblance between young adult and more “adult” genres like horror), and we also find peculiarities in the topic signatures of strongly related genres, as in the case of memoir and biography. These patterns connect to a broader set of themes which we discuss below.

There are many parallels between our work and recent digital humanities studies of genre. For example, our use of classification to measure an aspect of literary style is similar to Underwood [2016] and our attempts to map genres are similar to the book clusters in Wilkens [2016]. Our approach is founded in this tradition, which often uses computational tools in a non-normative way to explore ambiguity and to find outliers and “misclassifications” rather than to make “accurate” predictions. However, much prior computational work on genre in the digital humanities has focused not on reception but on book texts [Underwood, 2016], whereas we focus on reception via online book reviews.

Reception scholars such as Fish [1982] have argued that readers’ experiences of texts are strongly shaped by their “interpretive communities,” that is, groups that share common strategies for interpreting texts (e.g., a group of professional literary critics). We find that LibraryThing users’ reviews and tagging behaviors similarly correspond to their audiences on the site, with reviewers for certain genres writing more about certain aspects than others (§??).
The shared norms in this tagging community might be driven not only by personal tagging motivations (tagging and curating one’s own library) but by communal and performative ones, too (publishing reviews, ratings, and tags).

LibraryThing is not just a virtual meeting place for book lovers; it also provides a cataloging service, TinyCat, to physical lending libraries around the world. TinyCat allows librarians to input their own metadata, but it also provides genre labels for books, which saves these librarians additional work. The process for genre assignment is not publicly explained but presumably relies to some extent on the tags provided by users on LibraryThing. Our exploration of how genre is defined on LibraryThing thus has implications for small libraries in addition to online communities. The non-conventional genres of LibraryThing may be shaping how today’s library patrons discover books. It could be the case that “non-prestigious” genres are shaping our libraries and that patrons will be able to find books categorized by vampires enthusiasts on LibraryThing.

CHAPTER 7
CONCLUSION

My dissertation work has combined unsupervised computational methods with sets of personal narratives and experiences. The disclosures made in online communities are rich sources of emotional reactions, stories grounded in healthcare experiences, and personal relationships with cultural objects, but this data also requires careful handling and attention to biases and instabilities. This work opens up new questions into both probing distributional models and using those models for cultural analysis.

My work has questioned the evaluation of statistical results, highlighting their instability and examining the conditions under which such NLP methods can provide robust results to social and humanistic research questions.
Treating the training corpus as the central object of study, as is common in the digital humanities and computational social science, necessitates increased attention to errors and robustness, but it also creates a natural pathway from computational results back to a close reading of the data, a practice that future work will continue to refine.

There is a tension throughout this dissertation between the “spurious storytelling” critiqued in Chapter 4 and the “computational reading” demonstrated in Chapters 5 and 6. When using computational tools for the study of social and cultural questions, there are multiple possible views or interpretative lenses of the data, each of which might be useful for researchers in different contexts. Community members themselves might not agree on a single interpretation of their community. In these settings, stability and coherence of the computational results and comparison of the computational results with qualitative methods (e.g., close reading) are more important than determining a single “correct” model of the data. Careful data controls, as in Chapter 6, and comparing results across bootstrapped samples of the documents, as in Chapter 5, can ground the reliability of the results.

More work is needed to continue incorporating lessons from (and contributing to) the rich literature on bias and ethics in data science, with particular attention to tensions between averaged patterns and outlier voices. As models get bigger, questions increase about how large pretrained models can be used for small, socially-specific datasets in unique domains. A promising future direction would include probing these models using comparison sets across different health and/or literary genre subdomains, measuring discrepancies in each model’s coverage and performance.
Research is needed that measures both dangers (e.g., domain mismatches, unintended biases) and opportunities (e.g., model errors as an analytic lens on the training corpus), as well as research that translates between these models, combining computational methods with natural language data in specific social contexts.

BIBLIOGRAPHY

Jacob Abbott, Haley MacLeod, Novia Nurain, Gustave Ekobe, and Sameer Patil. Local standards for anonymization practices in health, wellness, accessibility, and aging research at CHI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 462:1–462:14, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300692. URL http://doi.acm.org/10.1145/3290605.3300692.

Irwin Altman and Dalmas A. Taylor. Social penetration: The development of interpersonal relationships. Holt, Rinehart & Winston, 1973.

Maria Antoniak and David Mimno. Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119, 2018. doi: 10.1162/tacl_a_00008. URL https://aclanthology.org/Q18-1008.

Maria Antoniak and David Mimno. Bad seeds: Evaluating lexical methods for bias measurement. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1889–1904, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.148. URL https://aclanthology.org/2021.acl-long.148.

Maria Antoniak, David Mimno, and Karen Levy. Narrative paths and negotiation of power in birth stories. Proc. ACM Hum.-Comput. Interact., 3(CSCW), November 2019. doi: 10.1145/3359190. URL https://doi.org/10.1145/3359190.

Maria Antoniak, Melanie Walsh, and David Mimno. Tags, borders, and catalogs: Social re-working of genre on LibraryThing. Proc. ACM Hum.-Comput. Interact., 5(CSCW1), April 2021.
doi: 10.1145/3449103. URL https://doi.org/10.1145/3449103.

Danielle Arigo and Joshua M. Smyth. The benefits of expressive writing on sleep difficulty and appearance concerns for college women. Psychology & Health, 27(2):210–226, 2012.

JinYeong Bak, Suin Kim, and Alice Oh. Self-disclosure and relationship strength in Twitter conversations. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 60–64, Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL https://aclanthology.org/P12-2012.

JinYeong Bak, Chin-Yew Lin, and Alice Oh. Self-disclosure topic model for classifying and analyzing Twitter conversations. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1986–1996, 2014.

Sairam Balani and Munmun De Choudhury. Detecting and characterizing mental health related self-disclosure in social media. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, pages 1373–1378, 2015.

David Bamman and Noah A. Smith. Unsupervised discovery of biographical structure from text. Transactions of the Association for Computational Linguistics, 2:363–376, 2014. doi: 10.1162/tacl_a_00189. URL https://www.aclweb.org/anthology/Q14-1029.

David Bamman, Brendan O’Connor, and Noah A. Smith. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P13-1035.

Jack Bandy and Nicholas Vincent. Addressing “documentation debt” in machine learning research: A retrospective datasheet for BookCorpus. arXiv preprint arXiv:2105.05241, 2021. URL https://arxiv.org/abs/2105.05241.

Azy Barak and Orit Gluck-Ofri.
Degree and reciprocity of self-disclosure in online forums. CyberPsychology & Behavior, 10(3):407–417, 2007.

Peishan Bartley. Book tagging on LibraryThing: How, why, and what are in the tags? Proceedings of the American Society for Information Science and Technology, 46(1):1–22, 2009.

Eric P.S. Baumer, David Mimno, Shion Guha, Emily Quan, and Geri K. Gay. Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence? Journal of the Association for Information Science and Technology, 68(6):1397–1410, 2017.

Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604, 2018. doi: 10.1162/tacl_a_00041. URL https://www.aclweb.org/anthology/Q18-1041.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

Sudeep Bhatia, Geoffrey P. Goodwin, and Lukasz Walasek. Trait associations for Hillary Clinton and Donald Trump in news media: A computational analysis. Social Psychological and Personality Science, 9(2):123–130, 2018.

Douglas Biber. Spoken and written textual dimensions in English: Resolving the contradictory findings. Language, pages 384–414, 1986.

Lucas M. Bietti, Ottilie Tilston, and Adrian Bangerter. Storytelling as adaptive collective sensemaking. Topics in Cognitive Science, 11:710–732, 2019.

David M. Blei, Andrew Y. Ng, and Michael I.
Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://www.aclweb.org/anthology/2020.acl-main.485.

Olivier Bodenreider. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270, 2004.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. pages 4349–4357, 2016a.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016b.

Karen Bourrier and Mike Thelwall. The Social Lives of Books: Reading Victorian Literature on Goodreads. Journal of Cultural Analytics, page 12049, February 2020. doi: 10.22148/001c.12049.

Margaret M. Bradley and Peter J. Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical report, 1999. The Center for Research in Psychophysiology, University of Florida.

Amy Bruckman. Studying the amateur artist: A perspective on disguising data collected in human subjects research on the internet. Ethics and Information Technology, 4(3):217–231, 2002.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. Understanding the origins of bias in word embeddings. In International Conference on Machine Learning, pages 803–811, 2019.

Tarana Burke.
“It made my heart swell to see women using this idea - one that we call ‘empowerment through empathy’ #metoo”. Twitter, October 15, 2017a.

Tarana Burke. “to not only show the world how widespread and pervasive sexual violence is, but also to let other survivors know they are not alone. #metoo”. Twitter, October 15, 2017b.

Carma L. Bylund. Mothers’ involvement in decision making during the birthing process: A quantitative analysis of women’s online birth stories. Health Communication, 18(1):23–39, 2005.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

Lynn Clark Callister. Making meaning: Women’s birth narratives. Journal of Obstetric, Gynecologic, & Neonatal Nursing, 33(4):508–518, 2004.

Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio, June 2008. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P08-1090.

Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore, August 2009. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P09-1068.

Stevie Chancellor, Michael L. Birnbaum, Eric D. Caine, Vincent M. B. Silenzio, and Munmun De Choudhury. A taxonomy of ethical tensions in inferring mental health states from social media. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, pages 79–88, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287587. URL https://doi.org/10.1145/3287560.3287587.

Danqi Chen and Christopher Manning.
A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1082. URL https://aclanthology.org/D14-1082.

Esther H. Chen, Frances S. Shofer, Anthony J. Dean, Judd E. Hollander, William G. Baxt, Jennifer L. Robey, Keara L. Sease, and Angela M. Mills. Gender disparity in analgesic treatment of emergency department patients with acute abdominal pain. Academic Emergency Medicine, 15(5):414–418, 2008.

Colin Cherry and Hongyu Guo. The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado, May–June 2015. Association for Computational Linguistics. doi: 10.3115/v1/N15-1075. URL https://aclanthology.org/N15-1075.

Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 166–174, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2922. URL https://www.aclweb.org/anthology/W16-2922.

David Alan Cruse. Lexical Semantics. Cambridge University Press, 1986.

Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 307–318, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450320351. doi: 10.1145/2488388.2488416. URL https://doi.org/10.1145/2488388.2488416.

Janet S. de Moor, Lemuel Moyé, David Low, Edgardo Rivera, S. Eva Singletary, Rachel T.
Fouladi, and Lorenzo Cohen. Expressive writing as a presurgical stress management intervention for breast cancer patients. Journal of the Soci- ety for Integrative Oncology, 6(2), 2008. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990. Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, and Ju- nichi Tsujii, editors. Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, July 2020. Association for Computational Lin- guistics. URL https://aclanthology.org/2020.bionlp-1.0. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423. 107 Brianna Dym and Casey Fiesler. Ethical and privacy considerations for research using online fandom data. Transformative Works and Cultures, 33, 2020. James F. English, Scott Enderle, and Rahul Dhakecha. Bad habits on Goodreads? In In preparation for James F. English and Heather Love, eds., Literary Studies and Human Flourishing. Oxford UP, 2022. Daniel A. Epstein, Nicole B. Lee, Jennifer H. Kang, Elena Agapie, Jessica Schroeder, Laura R. Pina, James Fogarty, Julie A. Kientz, and Sean Munson. Examining menstrual tracking to inform the design of personal informatics tools. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, page 6876–6888, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450346559. doi: 10.1145/3025453.3025635. 
URL https://doi.org/10.1145/3025453.3025635. Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. Understanding unde- sirable word embedding associations. In Proceedings of the 57th Annual Meet- ing of the Association for Computational Linguistics, pages 1696–1705, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/ P19-1166. URL https://www.aclweb.org/anthology/P19-1166. Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, page 4647–4657, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450333627. doi: 10.1145/2858036.2858535. URL https://doi.org/10.1145/2858036. 2858535. Casey Fiesler and Nicholas Proferes. ”Participant” perceptions of Twitter re- search ethics. Social Media+Society, 4(1):2056305118763366, 2018. 108 Stanley Fish. Is There a Text in This Class? The Authority of Interpretive Communi- ties. Harvard University Press, Cambridge Mass., June 1982. ISBN 978-0-674- 46726-2. Michael Flor and Swapna Somasundaran. Sentiment analysis and lexical cohe- sion for the story cloze task. In Proceedings of the 2nd Workshop on Linking Mod- els of Lexical, Sentential and Discourse-level Semantics, pages 62–67, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/ v1/W17-0909. URL https://www.aclweb.org/anthology/W17-0909. Patrick S. Forscher, Calvin K. Lai, Jordan R. Axt, Charles R. Ebersole, Michelle Herman, Patricia G. Devine, and Brian A Nosek. A meta-analysis of change in implicit bias. Jolene Galegher, Lee Sproull, and Sara Kiesler. Legitimacy, authority, and com- munity in electronic support groups. Written Communication, 15(4):493–530, 1998. Ryan J. Gallagher, Elizabeth Stowell, Andrea G. Parker, and Brooke Fou- cault Welles. Reclaiming stigmatized narratives: The networked disclosure landscape of metoo. Proc. ACM Hum.-Comput. 
Interact., 3(CSCW), nov 2019. doi: 10.1145/3359198. URL https://doi.org/10.1145/3359198. Adriana Gallardo. How we collected nearly 5,000 stories of maternal harm. ProPublica, 2018. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. 109 Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. Word embed- dings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644, 2018. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, PMLR, 2018. Shelagh K. Genuis and Jenny Bronstein. Looking for “normal”: Sense making in the context of health disruption. Journal of the Association for Information Science and Technology, 68(3):750–761, 2017. Katherine J. Gold, Martha E. Boggs, Emeline Mugisha, and Christie Lancaster Palladino. Internet message boards for pregnancy loss: Who’s on-line and why? Women’s health issues : official publication of the Jacobs Institute of Women’s Health, 22 1:e67–72, 2012. Yoav Goldberg. Neural Network Methods for Natural Language Processing. Synthe- sis Lectures on Human Language Technologies. Morgan & Claypool Publish- ers, 2017. Scott A. Golder and Bernardo A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006. Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. Content planning for neural story generation with Aristotelian rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4319–4338, Online, November 2020. 
As- 110 sociation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main. 351. URL https://aclanthology.org/2020.emnlp-main.351. Connie Golsteijn and Serena Wright. Using narrative research and portraiture to inform design research. In IFIP Conference on Human-Computer Interaction, pages 298–315. Springer, 2013. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. Geoffrey P Goodwin, Jared Piazza, and Paul Rozin. Moral character predom- inates in person perception and evaluation. Journal of Personality and Social Psychology, 106(1):148, 2014. Andrew Gordon and Reid Swanson. Identifying personal stories in millions of weblog entries. In Third International Conference on Weblogs and Social Media, Data Challenge Workshop, San Jose, CA, volume 46, 2009. Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge extraction. 2013. Amit Goyal, Ellen Riloff, and Hal Daumé III. Automatically producing plot unit representations for narrative text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 77–86, Cambridge, MA, October 2010. Association for Computational Linguistics. URL https: //www.aclweb.org/anthology/D10-1008. Anthony G. Greenwald, Debbie E. McGhee, and Jordan L.K. Schwartz. Measur- ing individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6):1464, 1998. 111 Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108, 2020. doi: 10.1162/tacl a 00302. URL https://aclanthology.org/2020. tacl-1.7. William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word em- beddings reveal statistical laws of semantic change. 
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1141. URL https: //aclanthology.org/P16-1141. Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, page 501–512, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi: 10.1145/3351095.3372826. URL https://doi.org/ 10.1145/3351095.3372826. Allison Hegel. Social Reading in the Digital Age. University of California, Los Angeles, 2018. Ryan Heuser. Word vectors in the eighteenth-century. In IPAM Workshop: Cul- tural Analytics, 2016. Paul Heymann, Andreas Paepcke, and Hector Garcia-Molina. Tagging human knowledge. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, page 51–60, New York, NY, USA, 2010. 112 Association for Computing Machinery. ISBN 9781605588896. doi: 10.1145/ 1718487.1718495. URL https://doi.org/10.1145/1718487.1718495. Charles T. Hill and Donald E. Stull. Gender and self-disclosure. In Self- Disclosure, pages 81–100. Springer, 1987. Diane E. Hoffmann and Anita J. Tarzian. The girl who cried pain: A bias against women in the treatment of pain. The Journal of Law, Medicine & Ethics, 28 (4 suppl):13–27, 2001. Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, and Philip Resnik. Is automated topic model evaluation bro- ken? The incoherence of coherence. Advances in Neural Information Processing Systems, 34, 2021. Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Au- genstein, and Ryan Cotterell. Unsupervised discovery of gendered language through latent-variable modeling. 
In Proceedings of the 57th Annual Meet- ing of the Association for Computational Linguistics, pages 1706–1716, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/ P19-1167. URL https://www.aclweb.org/anthology/P19-1167. Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Pro- ceedings of the Tenth ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining, KDD ’04, page 168–177, New York, NY, USA, 2004. As- sociation for Computing Machinery. ISBN 1581138881. doi: 10.1145/1014052. 1014073. URL https://doi.org/10.1145/1014052.1014073. Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. Feuding families and former Friends: Unsupervised learning for 113 dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544, San Diego, California, June 2016. As- sociation for Computational Linguistics. doi: 10.18653/v1/N16-1180. URL https://www.aclweb.org/anthology/N16-1180. Kokil Jaidka, Sharath Chandra Guntuku, Anneke Buffone, H Andrew Schwartz, and Lyle H Ungar. Facebook vs. Twitter: Cross-platform differences in self- disclosure and trait prediction. In Proceedings of the Twelfth International AAAI Conference on Web and Social Media, pages 141–150, 2018. Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. Skip n-grams and ranking functions for predicting script events. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 336–344, Avignon, France, April 2012. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/E12-1034. A. Cecile J.W. Janssens and Peter Kraft. Research conducted using data ob- tained through online communities: ethical implications of methodological limitations. PLoS Medicine, 9(10):e1001328, 2012. 
Mukund Jha and Noémie Elhadad. Cancer stage prediction based on patient online discourse. In Proceedings of the 2010 Workshop on Biomedical Natu- ral Language Processing, pages 64–71, Uppsala, Sweden, July 2010. Associa- tion for Computational Linguistics. URL https://aclanthology.org/ W10-1908. Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony 114 Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016. Adam N. Joinson. Knowing me, knowing you: Reciprocal self-disclosure in internet-based surveys. CyberPsychology & Behavior, 4(5):587–591, 2001. Adam N. Joinson and Carina B. Paine. Self-disclosure, privacy and the internet. The Oxford handbook of Internet psychology, 2374252, 2007. Adam N. Joinson, Ulf-Dietrich Reips, Tom Buchanan, and Carina B. Paine Schofield. Privacy, trust, and self-disclosure online. Human–Computer Interac- tion, 25(1):1–24, 2010. Kenneth Joseph, Wei Wei, and Kathleen M Carley. Girls rule, boys drool: Ex- tracting semantic and affective stereotypes from twitter. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Com- puting, pages 1362–1374. ACM, 2017. Sidney M. Jourard. The transparent self. Van Nostrand Reinhold Company, 1971. Sidney M. Jourard and Paul Lasakow. Some factors in self-disclosure. The Jour- nal of Abnormal and Social Psychology, 56(1):91, 1958. Sudipta Kar, Suraj Maharjan, A. Pastor López-Monroy, and Thamar Solorio. MPST: A corpus of movie plot synopses with tags. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Associa- tion (ELRA). URL https://www.aclweb.org/anthology/L18-1274. Brett Kessler, Geoffrey Nunberg, and Hinrich Schutze. Automatic detection of text genre. 
In 35th Annual Meeting of the Association for Computational Lin- 115 guistics and 8th Conference of the European Chapter of the Association for Com- putational Linguistics, pages 32–38, Madrid, Spain, July 1997. Association for Computational Linguistics. doi: 10.3115/976909.979622. URL https: //www.aclweb.org/anthology/P97-1005. Os Keyes. Stop mapping names to gender. https://ironholds.org/ names-gender/, 2017. Accessed: 2021-05-26. Evgeny Kim, Sebastian Padó, and Roman Klinger. Investigating the relation- ship between literary genres and emotional plot development. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 17–26, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/ W17-2203. URL https://www.aclweb.org/anthology/W17-2203. Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Tempo- ral analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 61–65, Baltimore, MD, USA, June 2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-2517. URL https://aclanthology. org/W14-2517. Markus Knoche, Radomir Popović, Florian Lemmerich, and Markus Strohmaier. Identifying biases in politically biased wikis through word em- beddings. In Proceedings of the 30th ACM Conference on Hypertext and Social Media, HT ’19, pages 253–257, New York, NY, USA, 2019. ACM. ISBN 978-1- 4503-6885-8. doi: 10.1145/3342220.3343658. URL http://doi.acm.org/ 10.1145/3342220.3343658. Christian Körner, Dominik Benz, Andreas Hotho, Markus Strohmaier, and Gerd 116 Stumme. Stop thinking, start tagging: Tag semantics emerge from collabora- tive verbosity. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, page 521–530, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605587998. 
doi: 10.1145/1772690.1772744. URL https://doi.org/10.1145/1772690.1772744. Austin C. Kozlowski, Matt Taddy, and James A. Evans. The geometry of cul- ture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5):905–949, 2019. Sicong Kuang. Semantic and context-aware linguistic model for bias detection. 2016. Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. Freshman or fresher? Quan- tifying the geographic variation of language in online social media. In ICWSM, pages 615–618, 2016. Mei Chun Louisa Lam, Christine Urquhart, and Dervin L. Brenda. Sense- Making/Sensemaking. Oxford University Press, United Kingdom, June 2016. doi: 10.1093/obo/9780199756841-0112. Thomas K. Landauer. Statistical semantics-analysis of the potential performance of keyword information-systems, and a cure for an ancient problem. In Journal of Psycholinguistic Research, volume 13, pages 495–496, 1984. Thomas K. Landauer and Susan T. Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211, 1997. Brian Larson. Gender as a variable in natural-language processing: Ethical considerations. In Proceedings of the First ACL Workshop on Ethics in Nat- 117 ural Language Processing, pages 1–11, Valencia, Spain, April 2017. Associ- ation for Computational Linguistics. doi: 10.18653/v1/W17-1601. URL https://www.aclweb.org/anthology/W17-1601. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021. Wendy G. Lehnert. Plot units and narrative summarization. Cognitive Science, 5: 293–331, 1981. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems, 27, 2014. Omer Levy, Yoav Goldberg, and Ido Dagan. 
Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015. doi: 10.1162/tacl a 00134. URL https://aclanthology.org/Q15-1016. Yao Li, Yubo Kou, Je Seok Lee, and Alfred Kobsa. Tell me before you stream me: Managing information disclosure in video game live streaming. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–18, 2018. Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. URL https://aclanthology.org/ 2020.acl-main.447. 118 Li Lucy and David Bamman. Gender and representation bias in GPT-3 gen- erated stories. In Proceedings of the Third Workshop on Narrative Understand- ing, pages 48–55, Virtual, June 2021. Association for Computational Linguis- tics. doi: 10.18653/v1/2021.nuse-1.5. URL https://aclanthology.org/ 2021.nuse-1.5. Stephanie Lukin, Kevin Bowden, Casey Barackman, and Marilyn Walker. PersonaBank: A corpus of personal narratives and their story intention graphs. In Proceedings of the Tenth International Conference on Language Re- sources and Evaluation (LREC’16), pages 1026–1033, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https:// aclanthology.org/L16-1163. Ian Lundberg, Arvind Narayanan, Karen Levy, and Matthew J Salganik. Pri- vacy, ethics, and data access: A case study of the fragile families challenge. Socius, 5:2378023118813023, 2019. Haiwei Ma, C. Estelle Smith, Lu He, Saumik Narayanan, Robert A. Giaquinto, Roni Evans, Linda Hanson, and Svetlana Yarosh. Write for life: Persisting in online health communities through expressive writing and social support. Proc. ACM Hum.-Comput. Interact., 1(CSCW), dec 2017a. doi: 10.1145/3134708. 
URL https://doi.org/10.1145/3134708. Xiao Ma, Jeffery T Hancock, Kenneth Lim Mingjie, and Mor Naaman. Self- disclosure and perceived trustworthiness of airbnb host profiles. In Proceed- ings of the 2017 ACM conference on computer supported cooperative work and social computing, pages 2397–2409, 2017b. Suraj Maharjan, Manuel Montes, Fabio A. González, and Thamar Solorio. A genre-aware attention model to improve the likability prediction of books. 119 In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3381–3391, Brussels, Belgium, October-November 2018. As- sociation for Computational Linguistics. doi: 10.18653/v1/D18-1375. URL https://www.aclweb.org/anthology/D18-1375. Suman Kalyan Maity, Ayush Kumar, Ankan Mullick, Vishnu Choudhary, and Animesh Mukherjee. Understanding book popularity on Goodreads. In Pro- ceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP ’18, page 117–121, New York, NY, USA, 2018. Association for Computing Ma- chinery. ISBN 9781450355629. doi: 10.1145/3148330.3154512. URL https: //doi.org/10.1145/3148330.3154512. Divine Maloney, Samaneh Zamanifard, and Guo Freeman. Anonymity vs. fa- miliarity: Self-disclosure and privacy in social virtual reality. In 26th ACM Symposium on Virtual Reality Software and Technology, pages 1–9, 2020. Diane Maloney-Krichmar and Jennifer Preece. The meaning of an online health community in the lives of its members: Roles, relationships and group dy- namics. In IEEE 2002 International Symposium on Technology and Society (IS- TAS’02). Social Implications of Information and Communication Technology. Pro- ceedings (Cat. No. 02CH37293), pages 20–27. Ieee, 2002. Lena Mamykina, Drashko Nakikj, and Noemie Elhadad. Collective sensemak- ing in online health forums. In Proceedings of the 33rd Annual ACM Con- ference on Human Factors in Computing Systems, CHI ’15, page 3217–3226, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450331456. 
doi: 10.1145/2702123.2702566. URL https://doi.org/ 10.1145/2702123.2702566. Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to 120 criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Volume 1 (Long and Short Papers), pages 615–621, Minneapolis, Min- nesota, June 2019. Association for Computational Linguistics. doi: 10.18653/ v1/N19-1062. URL https://www.aclweb.org/anthology/N19-1062. Lynne M. Markus. Toward a theory of knowledge reuse: Types of knowledge reuse situations and factors in reuse success. Journal of Management Information Systems, 18(1):57–93, 2001. Nina Martin, Emma Cillekens, and Alessandra Freitas. The last person you’d expect to die in childbirth. ProPublica, 2017a. Nina Martin, Emma Cillekens, and Alessandra Freitas. Lost mothers. ProPublica, 2017b. Erin L. Merz, Rina S. Fox, and Vanessa L. Malcarne. Expressive writing inter- ventions in cancer patients: A systematic review. Health Psychology Review, 8 (3):339–361, 2014. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013a. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013b. 121 Association for Computational Linguistics. URL https://aclanthology. org/N13-1090. David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. 
In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272, Edinburgh, Scotland, UK., July 2011. Association for Computational Linguistics. URL https://aclanthology.org/D11-1024. Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasser- man, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Ge- bru. Model cards for model reporting. In Proceedings of the Conference on Fair- ness, Accountability, and Transparency, FAT* ’19, page 220–229, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.3287596. URL https://doi.org/10.1145/3287560. 3287596. Barbara M. Montgomery. Verbal immediacy as a behavioral indicator of open communication content. Communication Quarterly, 30(1), 1981. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Pro- ceedings of the 2016 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pages 839– 849, San Diego, California, June 2016. Association for Computational Lin- guistics. doi: 10.18653/v1/N16-1098. URL https://www.aclweb.org/ anthology/N16-1098. Aaron Mueller, Zach Wood-Doughty, Silvio Amir, Mark Dredze, and Ali- 122 cia Lynn Nobles. Demographic representation and collective storytelling in the me too twitter hashtag activism movement. Proc. ACM Hum.-Comput. In- teract., 5(CSCW1), apr 2021. doi: 10.1145/3449181. URL https://doi.org/ 10.1145/3449181. Md National Commission for the Proptection of Human Subjects of Biomedi- caland Behavioral Research, Bethesda. The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. Superintendent of Documents, 1978. National Endowment for the Arts. 
Reading at Risk: A Survey of Literary Read- ing in America, 2004. Dong Nguyen, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariët The- une, Theo Meder, and Franciska de Jong. Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment. In Pro- ceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1950–1961, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C14-1184. Pok-Ja Oh and Soo Hyun Kim. The effects of expressive writing interventions for patients with cancer: A meta-analysis. In Oncology Nursing Forum, vol- ume 43, 2016. Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2:13, 2019. Debra L Oswald, Eddie M Clark, and Cheryl M Kelly. Friendship maintenance: 123 An analysis of individual and dyad behaviors. Journal of Social and Clinical Psychology, 23(3):413–441, 2004. Jessica Ouyang and Kathy McKeown. Towards automatic detection of nar- rative structure. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 4624–4631, Reyk- javik, Iceland, May 2014. European Languages Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/ pdf/1154_Paper.pdf. Galen Panger. Reassessing the facebook experiment: critical thinking about the validity of big data research. Information, Communication & Society, 19(8):1108– 1126, 2016. Dilisha Patel, Ann Blandford, Mark Warner, Jill Shawe, and Judith Stephen- son. ”I feel like only half a man”: Online forums as a resource for find- ing a ”new normal” for men experiencing fertility issues. Proc. ACM Hum.- Comput. Interact., 3(CSCW), nov 2019. doi: 10.1145/3359184. URL https: //doi.org/10.1145/3359184. Thomas Pavel. Literary Genres as Norms and Good Habits. 
New Literary His- tory, 34(2):201–210, 2003. ISSN 0028-6087. URL https://www.jstor.org/ stable/20057776. James W. Pennebaker. Writing about emotional experiences as a therapeutic process. Psychological Science, 8(3):162–166, 1997. James W. Pennebaker and Sandra K. Beall. Confronting a traumatic event: To- ward an understanding of inhibition and disease. Journal of Abnormal Psychol- ogy, 95(3):274, 1986. 124 James W. Pennebaker, Martha E. Francis, and Roger J. Booth. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71(2001): 2001, 2001. Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Em- pirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162. Bethany Percha and Russ B. Altman. A global network of biomedical relation- ships derived from text. Bioinformatics, 34(15):2614–2624, 2018. Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana Volkova. Intrinsic and extrinsic evaluation of spatiotemporal text represen- tations in Twitter streams. In Proceedings of the 2nd Workshop on Representa- tion Learning for NLP, pages 201–210, Vancouver, Canada, August 2017. As- sociation for Computational Linguistics. doi: 10.18653/v1/W17-2624. URL https://aclanthology.org/W17-2624. Karl Pichotta and Raymond J. Mooney. Learning statistical scripts with lstm re- current neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. Andrew Piper, Richard Jean So, and David Bamman. Narrative theory for computational narrative understanding. In Proceedings of the 2021 Confer- ence on Empirical Methods in Natural Language Processing, pages 298–311, On- line and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. 
doi: 10.18653/v1/2021.emnlp-main.26. URL https://aclanthology.org/2021.emnlp-main.26. 125 Peter Pirolli. Rational analyses of information foraging on the web. Cognitive Science, 29(3):343–373, 2005. Peter Pirolli and Stuart Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceed- ings of International Conference on Intelligence Analysis, volume 5, pages 2–4. McLean, VA, USA, 2005. J. D. Porter. Popularity/Prestige. Stanford Literary Lab, Pamphlet 17, 2018. Vinodkumar Prabhakaran, William L. Hamilton, Dan McFarland, and Dan Ju- rafsky. Predicting the rise and fall of scientific topics from trends in their rhetorical framing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1170–1180, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1111. URL https://aclanthology.org/P16-1111. Rebecca M. Puhl and Chelsea A. Heuer. The stigma of obesity: a review and update. Obesity, 17(5):941–964, 2009. Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ana- niadou. Distributional semantics resources for biomedical text processing. Languages in Biology and Medicine, pages 39–44, 2013. Emilee Rader and Rick Wash. Influences on tag choices in del.icio.us. In Pro- ceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, CSCW ’08, page 239–248, New York, NY, USA, 2008. Association for Com- puting Machinery. ISBN 9781605580074. doi: 10.1145/1460563.1460601. URL https://doi.org/10.1145/1460563.1460601. 126 Janice A. Radway. Reading the Romance: Women, Patriarchy, and Popular Literature. The University of North Carolina Press, Chapel Hill, November 1991. ISBN 978-0-8078-4349-9. Stephen Ramsay. Reading Machines: Toward and Algorithmic Criticism. University of Illinois Press, 2011. Abhilasha Ravichander and A. Black. 
An empirical study of self-disclosure in spoken dialogue systems. In SIGDIAL Conference, 2018. Andrew J. Reagan, Lewis Mitchell, Dilan Kiley, Christopher M. Danforth, and Peter Sheridan Dodds. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science, 5:1–12, 2016. Felix Reer and Nicole C. Krämer. Underlying factors of social capital acquisition in the context of online-gaming: Comparing world of warcraft and counter- strike. Computers in Human Behavior, 36:179–189, 2014. Jeremy Rosen. Literary Fiction and the Genres of Genre Fiction. Post45: Peer- Reviewed, August 2018. Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 74–79, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1609. URL https://www.aclweb.org/anthology/W17-1609. Anna Rumshisky, Kirk Roberts, Steven Bethard, and Tristan Naumann, editors. Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online, November 2020. Association for Computational Linguistics. URL https: //aclanthology.org/2020.clinicalnlp-1.0. 127 Diego Saez-Trumper, Carlos Castillo, and Mounia Lalmas. Social media news communities: gatekeeping, coverage, and statement bias. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1679–1684. ACM, 2013. Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. Connotation frames of power and agency in modern films. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2329–2334, Copenhagen, Denmark, September 2017. As- sociation for Computational Linguistics. doi: 10.18653/v1/D17-1247. URL https://aclanthology.org/D17-1247. Roger C. Schank and Robert P. Abelson. Scripts, plans, goals, and understand- ing: An inquiry into human knowledge structures. 
Psychology Press, 1977.

Alexandra Schofield and David Mimno. Comparing apples to apple: The effects of stemmers on topic models. Transactions of the Association for Computational Linguistics, 4:287–300, 2016. doi: 10.1162/tacl_a_00099. URL https://aclanthology.org/Q16-1021.

Alexandra Schofield, Mans Magnusson, and David Mimno. Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 432–436, Valencia, Spain, April 2017a. Association for Computational Linguistics. URL https://aclanthology.org/E17-2069.

Alexandra Schofield, Laure Thompson, and David Mimno. Quantifying the effects of text duplication on semantic models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2737–2747, Copenhagen, Denmark, September 2017b. Association for Computational Linguistics. doi: 10.18653/v1/D17-1290. URL https://aclanthology.org/D17-1290.

João Sedoc and Lyle Ungar. The role of protected class word lists in bias identification of contextualized word representations. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 55–61, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3808. URL https://www.aclweb.org/anthology/W19-3808.

Maya Sen and Omar Wasow. Race as a bundle of sticks: Designs that estimate effects of seemingly immutable characteristics. Annual Review of Political Science, 19, 2016.

Shilad Sen, Shyong K. Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski, Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. Tagging, communities, vocabulary, evolution. In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work, CSCW ’06, page 181–190, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595932496.
doi: 10.1145/1180875.1180904. URL https://doi.org/10.1145/1180875.1180904.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. Swivel: Improving Embeddings by Noticing What’s Missing. arXiv:1602.02215, 2016. URL http://arxiv.org/abs/1602.02215.

Gene Smith. Tagging: people-powered metadata for the social web. New Riders, 2007.

Jessie J. Smith, Saleema Amershi, Solon Barocas, Hanna Wallach, and Jennifer Wortman Vaughan. REAL ML: Recognizing, exploring, and articulating limitations of machine learning research. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, New York, NY, USA, 2022. Association for Computing Machinery.

Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4):471–495, 2000. URL https://www.aclweb.org/anthology/J00-4001.

Donna E. Stewart and Simone Vigod. Postpartum depression. New England Journal of Medicine, 375(22):2177–2186, December 2016.

Roger Suss. The hero with a thousand faces visits the doctor. Canadian Family Physician (Medecin de famille canadien), 60(7):656, 2014.

Chris Sweeney and Maryam Najafian. A transparent framework for evaluating unintended demographic bias in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1662–1667, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1162. URL https://www.aclweb.org/anthology/P19-1162.

Chenhao Tan, Dallas Card, and Noah A. Smith. Friendships, rivalries, and trysts: Characterizing relations between ideas in texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 773–783, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1072. URL https://aclanthology.org/P17-1072.

Timothy R. Tangherlini.
Heroes and lies: Storytelling tactics among paramedics. Folklore, 111(1):43–66, 2000.

Jaime Teevan, Christine Alvarado, Mark S. Ackerman, and David R. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’04, page 415–422, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 1581137028. doi: 10.1145/985692.985745. URL https://doi.org/10.1145/985692.985745.

Laure Thompson and David Mimno. Authorless topic models: Biasing models away from known structure. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA, August 2018a. Association for Computational Linguistics. URL https://aclanthology.org/C18-1329.

Laure Thompson and David Mimno. Authorless topic models: Biasing models away from known structure. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA, August 2018b. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1329.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL https://aclanthology.org/P10-1040.

William E. Underwood. The Life Cycles of Genres. Journal of Cultural Analytics, page 11061, May 2016. URL https://culturalanalytics.org/article/11061-the-life-cycles-of-genres.

Thomas Vander Wal. Folksonomy definition and Wikipedia, 2005. URL http://www.vanderwal.net/random/entrysel.php?blog=1750.

Thomas Vander Wal. Folksonomy, 2007. URL https://www.vanderwal.net/essays/051130/folksonomy.pdf.

Effy Vayena, Marcel Salathé, Lawrence C. Madoff, and John S. Brownstein.
Ethical challenges of big data in public health. Public Library of Science, 2015.

Ivan Vulić and Marie-Francine Moens. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 719–725, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-2118. URL https://aclanthology.org/P15-2118.

Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why priors matter. Advances in Neural Information Processing Systems, 22, 2009.

Melanie Walsh and Maria Antoniak. The Goodreads ‘classics’: A computational study of readers, Amazon, and crowdsourced amateur criticism. Journal of Cultural Analytics, 4:243–287, 2021.

Mengting Wan and Julian J. McAuley. Item recommendation on monotonic behavior chains. In Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan, editors, Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, pages 86–94. ACM, 2018. doi: 10.1145/3240323.3240369. URL https://doi.org/10.1145/3240323.3240369.

Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian McAuley. Fine-grained spoiler detection from large-scale review corpora. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2605–2610, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1248. URL https://www.aclweb.org/anthology/P19-1248.

Jianling Wang, Ziwei Zhu, and James Caverlee. User recommendation in content curation platforms. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 627–635, 2020a.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. CORD-19: The COVID-19 open research dataset. ArXiv, 2020b.

Xi Wang, Kang Zhao, and Nick Street. Analyzing and predicting user participations in online health communities: A social support perspective. Journal of Medical Internet Research, 19(4):e6834, 2017.

Yi-Chia Wang, Moira Burke, and Robert Kraut. Modeling self-disclosure in social networking sites. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW ’16, page 74–85, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450335928. doi: 10.1145/2818048.2820010. URL https://doi.org/10.1145/2818048.2820010.

Rick Wash and Emilee Rader. Public bookmarks and private benefits: An analysis of incentives in social computing. Proceedings of the American Society for Information Science and Technology, 44(1):1–13, 2007.

Jonathan Weber. Folksonomy and controlled vocabulary in LibraryThing. Unpublished Final Project, University of Pittsburgh, pages 5–6, 2006.

Karl E. Weick, Kathleen M. Sutcliffe, and David Obstfeld. Organizing and the process of sensemaking. Organization Science, 16(4):409–421, 2005.

Miaomiao Wen and Carolyn Penstein Rosé. Understanding participant behavior trajectories in online health support groups using automatic extraction methods. In Proceedings of the 17th ACM International Conference on Supporting Group Work, pages 179–188, 2012.

Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1190.
URL https://www.aclweb.org/anthology/N18-1190.

Matthew Wilkens. Genre, computation, and the varieties of twentieth-century U.S. fiction. Journal of Cultural Analytics, November 2016.

John E. Williams and Susan M. Bennett. The definition of sex stereotypes via the adjective check list. Sex Roles, 1(4), 1975.

John E. Williams and Deborah L. Best. Sex stereotypes and trait favorability on the adjective check list. Educational and Psychological Measurement, 1977.

John E. Williams and Deborah L. Best. Measuring sex stereotypes: A multination study, Rev. Sage Publications, Inc, 1990.

Joseph Worsham and Jugal Kalita. Genre identification and the compositional effect of genre in literature. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1963–1973, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1167.

Diyi Yang, Robert E. Kraut, Tenbroeck Smith, Elijah Mayfield, and Dan Jurafsky. Seekers, providers, welcomers, and storytellers: Modeling social roles in online health communities. In CHI, CHI ’19, pages 344:1–344:14, New York, NY, USA, 2019a. ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300574. URL http://doi.acm.org/10.1145/3290605.3300574.

Diyi Yang, Zheng Yao, Joseph Seering, and Robert Kraut. The channel matters: Self-disclosure, reciprocity and social support in online cancer support groups. In CHI, CHI ’19, pages 31:1–31:15, New York, NY, USA, 2019b. ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300261. URL http://doi.acm.org/10.1145/3290605.3300261.

Alyson L. Young and Andrew D. Miller. “This girl is on fire”: Sensemaking in an online health community for vulvodynia. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, page 1–13, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450359702. doi: 10.1145/3290605.3300359. URL https://doi.org/10.1145/3290605.3300359.
Mohammadzaman Zamani, H. Andrew Schwartz, Johannes Eichstaedt, Sharath Chandra Guntuku, Adithya Virinchipuram Ganesan, Sean Clouston, and Salvatore Giorgi. Understanding weekly COVID-19 concerns through dynamic content-specific LDA topic modeling. In Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science, pages 193–198, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.nlpcss-1.21. URL https://aclanthology.org/2020.nlpcss-1.21.

Zachariah Zhang, Jingshu Liu, and Narges Razavian. BERT-XML: Large scale automated ICD coding using BERT pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 24–34, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.clinicalnlp-1.3. URL https://aclanthology.org/2020.clinicalnlp-1.3.

Chen Zhao, Pamela Hinds, and Ge Gao. How and to whom people share: The role of culture in self-disclosure in online communities. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW ’12, page 67–76, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450310864. doi: 10.1145/2145204.2145219. URL https://doi.org/10.1145/2145204.2145219.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1521. URL https://www.aclweb.org/anthology/D18-1521.

Arkaitz Zubiaga, Christian Körner, and Markus Strohmaier. Tags vs shelves: From social tagging to social classification. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, HT ’11, page 93–102, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450302562. doi: 10.1145/1995966.1995981.
URL https://doi.org/10.1145/1995966.1995981.