MODELING PERSONAL EXPERIENCES SHARED
IN ONLINE COMMUNITIES
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Maria Alexandra Antoniak
August 2022
© 2022 Maria Alexandra Antoniak
ALL RIGHTS RESERVED
MODELING PERSONAL EXPERIENCES SHARED IN ONLINE
COMMUNITIES
Maria Alexandra Antoniak, Ph.D.
Cornell University 2022
Written communications about personal experiences, such as giving birth or
reading a book, can be both rhetorically powerful and statistically difficult to
model. My research explores unsupervised natural language processing (NLP)
models to represent complex personal experiences and self-disclosures commu-
nicated in online communities, while also re-examining these models for biases
and instabilities. I seek to reliably represent individual experiences within their
social contexts and model interpretive dimensions that illuminate both patterns
and outliers, while addressing social and humanistic questions. Through this
work, I develop a data science practice that emphasizes cross-disciplinary collaborations
and care for datasets and their authors. In this dissertation, I will share
case studies that highlight both the opportunities and the risks in reusing NLP
models for context-specific research questions.
BIOGRAPHICAL SKETCH
Maria Antoniak was born and raised in Richland, Washington. Prior to her
Ph.D., Maria received her B.A. from the University of Notre Dame in the Pro-
gram of Liberal Studies, with a minor in the Glynn Family Honors Program.
Maria spent one year teaching English at the Ukrainian Catholic University in
Lviv, Ukraine, and then completed her M.S. in Computational Linguistics at the
University of Washington, where she collaborated with and was supported in
her research by the Pacific Northwest National Laboratory. Maria worked as
a data scientist for the startup Maana for one year before beginning her Ph.D.
in Information Science at Cornell University, where she was advised by David
Mimno. During her time as a Ph.D. student, she completed research internships
at Microsoft Research, Facebook Core Data Science, and Twitter Cortex. In
Fall 2022, Maria will join the Allen Institute for Artificial Intelligence in Seat-
tle, Washington, as a Young Investigator.
ACKNOWLEDGEMENTS
Thank you to my advisor, David Mimno. I have been incredibly fortunate to
work with you, and I am not only a better scientist but a better person for having
followed in your footsteps for these six years. At every twist and turn in this
journey, you supported me without hesitation, and I will always be grateful for
your wisdom, generosity, expertise, and kindness.
Thank you to my committee: Lillian Lee, Jeff Rzeszotarski, and Richard Jean
So. You inspired me, you gave me room to explore, and you championed me at
key moments when I needed your support. Your advice, feedback, and atten-
tion were an honor to receive.
Thank you to my collaborators, including Karen Levy, Melanie Walsh,
LeAnn McDowall, A. Feder Cooper, and Robert Griffin. In particular, thank you
to Karen Levy for honest conversations and thoughtful teaching that showed
me new perspectives; your brilliance has been a guiding light. Thank you to
Melanie Walsh for your creativity and rigor as a coauthor and for your steady
friendship.
Thank you to my labmates: Moontae Lee, Jack Hessel, Alexandra Schofield,
Laure Thompson, Gregory Yauney, Rosamond Thalken, Katherine Lee, and Fed-
erica Bologna. I have learned so much from each one of you and was lucky to
work, read, and discuss (and enjoy brunch) together. Thank you especially to
Alexandra Schofield and Jack Hessel for your mentorship and friendship and
for always blazing the trail ahead of me. Thank you also to my extended lab-
mates, with whom I shared an office space and long conversations: Justine
Zhang, Liye Fu, and Jonathan Chang.
Thank you to my many friends made at Cornell: Lauren Kilgour, Sharifa
Sultana, Jen Liu, Briana Vecchione, Emily Tseng, Bradi Heaberlin, Malcolm Bare,
Natalie Tong, Anthony Poon, and many others who journeyed with me through
dark winters, late nights, and moments of joy and celebration. I am sure we
have many more adventures ahead of us. Thank you also to friends and col-
leagues in my “extended cohort” across different institutions, whom I met at
conferences, at internships, and on social media, who grew up academically
with me and supported me from afar.
Thank you to my allies, teachers, and friends in Graduate Women in Science
(GWiS) and Graduate Students for Gender Inclusion in Computing (GSGIC).
Working with you pushed me to grow in new directions. In particular, thank
you to the founding volunteers of GSGIC — Alexa VanHattum, Marianne Aubin
Le Quéré, Tegan Wilson, Kate Donahue, Varsha Kishore, Claire Liang, Gregory
Yauney, Andrea Cuadra, Griffin Berlstein, Sharifa Sultana, Adelaide Fuller, and
Sachi Angle — who worked tirelessly throughout the pandemic to support their
community.
Thank you to mentors in the past who took a chance on me early in my
career. Thank you to Jane Oliensis, Emily Bender, Gina-Anne Levow, Fei Xia,
Courtney Corley, Eric Bell, Jason Mackay, and many other academic and indus-
try mentors who shared their time and advice with me. Thank you in particular
to Women in Machine Learning (WiML), whose activities opened my eyes to
research opportunities at Cornell University.
Thank you to Cornell faculty members who provided mentorship and guid-
ance, including Matthew Wilkens, Marten van Schijndel, Steve Jackson, Yoav
Artzi, Mor Naaman, Emma Pierson, and many others. Thank you also to the
Cornell NLP group for six years of discussions, debates, and learning.
Thank you to staff in Cornell Information Science, especially Barbara Woske,
Janeen Orr, Eileen Grabosky, Penny Stewart, and Lou DiPietro, who helped me
many times, both academically and in my efforts to start community initiatives.
Finally, thank you endlessly to my family, including my father, Dr. Zenen
Antoniak, and my mother, Sherry Dempsey Antoniak, my brilliant brothers and
sister, and my many aunts, uncles, and cousins who supported me on this am-
bitious path.
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
1 Introduction 1
2 Background 7
2.1 Measurement Tools for Unlabeled, Socially-Specific Data . . . . . 7
2.1.1 Topic Modeling for Cultural Analysis . . . . . . . . . . . . 8
2.1.2 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Self-Disclosures in Online Communities . . . . . . . . . . . . . . . 15
2.3 Personal Narratives . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Sensemaking through Self-Disclosure and Storytelling . . . . . . . 19
3 Data Practices 22
3.1 Upstream and Downstream Research . . . . . . . . . . . . . . . . 22
3.2 Handling Data with Care . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Examples of Data Handling Strategies . . . . . . . . . . . . 26
4 Instability of Unsupervised, Distributional Models 29
4.1 Case Study: Word Embedding Instabilities . . . . . . . . . . . . . 30
4.1.1 Models and Methods . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Results Across Models and Settings . . . . . . . . . . . . . 32
4.1.3 Discussion and Impact . . . . . . . . . . . . . . . . . . . . . 37
4.2 Case Study: Comparative Measurements of Bias . . . . . . . . . . 39
4.2.1 Framework of Seed Sources . . . . . . . . . . . . . . . . . . 43
4.2.2 Bias Measurement Methods . . . . . . . . . . . . . . . . . . 46
4.2.3 Seed Features Impact Bias Measurements . . . . . . . . . . 47
4.2.4 Discussion and Impact: Biases All the Way Down . . . . . 52
5 Personal Healthcare Experiences 55
5.1 Healthcare Datasets for Natural Language Processing . . . . . . . 56
5.2 Case Study: Online Childbirth Narratives . . . . . . . . . . . . . . 58
5.2.1 Data Curation . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.2 Narrative Analysis . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.3 Framing of Power . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.4 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . 67
5.2.5 Discussion and Impact . . . . . . . . . . . . . . . . . . . . . 69
6 Personal Reading Experiences 72
6.1 Literary Reception and Online Book Reviews . . . . . . . . . . . . 72
6.2 Collaborative Tagging Systems . . . . . . . . . . . . . . . . . . . . 74
6.3 Literary Genres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4 Case Study: Mapping Literary Genres on LibraryThing . . . . . . 78
6.4.1 Data from LibraryThing . . . . . . . . . . . . . . . . . . . . 78
6.4.2 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . 81
6.4.3 Mapping Methods . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.5 Discussion and Impact . . . . . . . . . . . . . . . . . . . . . 96
7 Conclusion 98
LIST OF TABLES
4.1 The three settings that manipulate the document order and pres-
ence in each corpus. . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 The most similar words with their means and standard deviations
for the cosine similarities between the query word marijuana
and its 10 nearest neighbors (highest mean cosine similarity
in the FIXED setting). Embeddings are learned from documents
segmented by sentence. . . . . . . . . . . . . . . . . . . . . 33
4.3 The 10 closest words to the query term pregnancy are highly vari-
able. None of the words shown appear in every run. Results are
shown across runs of the BOOTSTRAP setting for the full corpus
of the 9th Circuit, the whole document size, and the SGNS model. 35
4.4 Examples of real seed terms used in recent work to measure bi-
ases in corpora. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Overview of the surveyed seed sources. . . . . . . . . . . . . . . . 43
4.6 When two seed sets are more semantically distinct they are more
distinguishable in the resulting geometric subspace. The top ta-
ble shows pairs of artificially generated seed sets, ranked by their
coherence for WEAT in the NYT dataset. The bottom table shows
pairs of seed sets gathered from published papers, ranked by
their coherence for WEAT in the WikiText dataset. Scores are av-
eraged across 20 bootstrapped samples of the training data, and
values are rounded; no coherence scores are exactly 1.0. Higher
coherence scores indicate that the seed pairs were projected farther
apart in the bias subspace. . . . . . . . . . . . . . . . . . . . . 51
5.1 The bigrams drawn from the post titles associated with the most
and least probable stories. Probabilities represent the means of
the summed log probabilities of the last ten topic transitions in
a story. Lower scores indicate stories with more unusual topic
transitions (sequences of events). Results are averaged (mean)
across bootstrapped samples of the stories. . . . . . . . . . . . . . 64
5.2 Personas identified in the birth stories collection and the n-grams
used to classify the personas. . . . . . . . . . . . . . . . . . . . . . 65
6.1 Example classifications and surprisal scores. Excerpts are selected
from the last 100 words of the reviews. Higher surprisal
indicates greater confidence in the incorrect label. . . . . . . . . . 93
LIST OF FIGURES
4.1 The mean standard deviations across settings and algorithms for
the 10 closest words to the query words in the 9th Circuit and
NYT Music corpora using the whole documents. Larger varia-
tions indicate less stable embeddings. . . . . . . . . . . . . . . . . 34
4.2 The mean Jaccard similarities across settings and algorithms for
the top 2 and 10 closest words to the query words in the AskHis-
torians corpus. Larger Jaccard similarity indicates more consis-
tency in top N membership. Results are shown for the sentence
document length. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Bias measurements depend on seeds. We calculate the cosine
similarities between different seed sets representing women and
an averaged unpleasantness vector from two embedding models.
Results are consistent across seeds for romance review embed-
dings, but vary widely between sets for history and biography.
We find similar variation even for a pretrained Google News
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Ranking word vectors by cosine similarity with the top principal
component vector for the original gender seed pairs (a) appears
to identify words representing men and women much better
than random (b). But shuffling the pairing of seed words (c)
maintains correlation with gender but to a less clear degree. Re-
sults are shown for the NYT corpus with a frequency threshold
of 100 and bootstrap resampling. . . . . . . . . . . . . . . . . . . . 50
4.5 Identifying bias is less effective when set pairs are similar. Gen-
erated seeds are frequency-controlled nouns from the WikiText
dataset. We highlight two sets of gathered seeds; both target
similar racial categories but the name-based sets are more simi-
lar and explain less variance. We find similar trends for WEAT,
coherence, and the other corpora and POS. . . . . . . . . . . . . . 52
5.1 A selection of topics over time. Plots are labeled with the five
highest probability words for each topic. Results show the prob-
ability for each topic at 10% intervals of story time, averaged
across all stories. Error bars show standard deviation across
bootstrapped samples of stories. . . . . . . . . . . . . . . . . . . . 61
5.2 Histograms showing the frequencies of persona mentions over
story time. Some entities (e.g., author) are consistently more fre-
quent than rare entities (e.g., doula). Some frequency patterns
are expected while others are surprising (e.g., frequency of we
decreases near the middle of the stories). . . . . . . . . . . . . . . 61
5.3 Flowchart of the most probable topic transitions (above 0.2%).
We removed one orphan node without a parent path leading to
the beginning of story (BOS) state. . . . . . . . . . . . . . . . . . . 63
5.4 Most frequent verbs from the power lexicon associated with each
persona in the birth stories corpus. Green indicates a positive
power contribution, while pink indicates a negative power con-
tribution. The cell values indicate the proportion of persona
mentions with the given verb and power relationship. . . . . . . 66
5.5 (a) Power scores for each persona. Error bars show standard de-
viation over 20 bootstrap samples of the collection. (b) Estimated
power of personas (rows) over other personas (columns). The
NURSE is consistently framed as more powerful than the other
personas, except for the DOULA. . . . . . . . . . . . . . . . . . . . 67
6.1 A mapping of LibraryThing genres. User overlap between genre
pairs correlates with book overlap, but there are outliers. Each
point represents two genres, and the axes represent the rank of
the genre pair, where lower numbers indicate higher ranks and
therefore higher overlap. For example, the genre pair classics
+ animals has a mid-range user overlap rank and a high book
overlap rank, indicating that these genres share surprisingly few
users given how many books are shared. Pearson correlation between
book and user overlap is significant (r = 0.68, p < 0.05). . . 89
6.2 The number of overlapping books and the number of genre mis-
classifications of user reviews for each pair of genres. Each point
represents a pair of genres in which one is the true tag applied to
the review text and one is the predicted tag from our model. As
expected, we find a significant relationship using Pearson corre-
lation (r = 0.65, p < 0.05) between the book overlap and misclas-
sification count, but we highlight outlier genre pairs, e.g., animals
and psychology have an unusually high misclassification count
given their very low book overlap. . . . . . . . . . . . . . . . . . . 91
6.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.5 Are tighter communities easier to predict? Are tighter commu-
nities more critical? Figure 6.3 shows the target genres plotted
along surprisal (the ability of a classifier to predict the genre of
a review) and community homogeneity (averaged cosine simi-
larities between reviewers’ tagsets). Figure 6.4 shows the target
genres plotted along rating and community homogeneity. Gen-
res whose reviewers have more similar reading habits tend to
also have higher ratings according to a Pearson correlation test
(r = -0.60, p < 0.05). . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
CHAPTER 1
INTRODUCTION
Sharing personal stories and experiences can be transformative, persuasive,
and powerful, for both the narrators and their audiences. Social movements like
#MeToo amass large numbers of personal stories to build solidarity and commu-
nity and to make sense of collective experiences. The founder of #MeToo, Tarana
Burke, wrote that “We call [this idea] ‘empowerment through empathy’ ... to not
only show the world how widespread and pervasive sexual violence is, but also
to let other survivors know they are not alone” [Burke, October 15, 2017a,O].
Sharing personal stories can have physical, emotional, and social health benefits
[Pennebaker and Beall, 1986, Pennebaker, 1997, Merz et al., 2014, Oh and
Kim, 2016] and can help communities make sense of difficult experiences together
[Tangherlini, 2000, Mamykina et al., 2015].
Personal stories are also used by journalists and politicians as a rhetorical
device to connect with their audiences and sway public opinion. For example,
as part of a larger ProPublica project focusing on the maternal mortality crisis
in the U.S., an article focusing on the story of a single neonatal intensive care
nurse received wide attention and hundreds of comments [Martin et al., 2017a].
Another article in the series, which gathered stories of 16 women, received an
award from the National Association of Black Journalists [Martin et al., 2017b].
Publishing these stories led to an outpouring of 5,000 more personal stories
[Gallardo, 2018] and awoke a public consciousness of the crisis that previously
published statistics had not.
Online, people are eager to read and share their personal experiences with
products, books, healthcare, and other topics via reviews, blogs, and social me-
dia posts. Online healthcare support communities abound in personal stories
and questions, even when these are often repetitions of other posts. These
communities provide opportunities for people to share their stories, emotions,
and reflections with a sympathetic audience while making sense of these ex-
periences as a group [Patel et al., 2019, Genuis and Bronstein, 2017, Young
and Miller, 2019]. Sometimes these communities allow people to self-disclose
[Jourard and Lasakow, 1958, Jourard, 1971], or share information about them-
selves, that they cannot share in other settings [Yang et al., 2019b].
Computational analysis of these stories can help researchers learn from large
amounts of online disclosures and narratives. For example, Gallagher et al.
[2019] used text classification and network analysis methods to find that dis-
closing can encourage others to share more details when disclosing, while Yang
et al. [2019b] used machine learning techniques to determine that online health
support community members make more negative disclosures in private set-
tings. Unsupervised distributional models for text, like topic modeling [Blei
et al., 2003] or word embeddings [Mikolov et al., 2013a], can reveal shared struc-
ture and biases across large datasets without high up-front costs.
The voices in these communities deserve to be heard, and precisely because
of their importance, they deserve to be heard reliably. Written communica-
tions about personal experiences can be statistically difficult to model, and there
might not be a single correct interpretation of these communications. Analyz-
ing the sentiment or framing disclosed in personal stories is challenging, and
extracting structure from narratives is a complex task involving many subtasks,
including entity recognition and linking, event detection and next-event prediction,
sentiment analysis, etc. Methods addressing these challenges often rely on
labeled datasets which are costly to create and do not necessarily transfer well
across specific subdomains. Datasets of personal experiences are often socially
specific, with their own context and communication norms, and these datasets
can be much smaller than the typical machine learning training dataset. These
are fragile samples prone to noise and misreading — not generalizable training
datasets or complete populations — and they need to be treated with care, both
ethically and statistically.
In this thesis, I explore these themes of personal stories, reliability, and online
communities through a series of case studies. Methodologically, these studies
rely on unsupervised distributional models from natural language processing
(NLP), representing complex personal experiences and self-disclosures via lex-
ical patterns. These studies focus on stories people tell about themselves and
the sharing of those stories in online communities, and they also re-examine
computational models for biases and instabilities. The goal of these studies is
to reliably represent individual experiences within their social contexts, while
modeling interpretive dimensions that illuminate both patterns and outliers. For
personal stories in particular, it is not sufficient to discuss the data only in terms
of summary statistics; it is also necessary to zoom in on individual stories, e.g.,
through close reading, a method from literary criticism that focuses sustained
attention on a brief selection of a written work.
While NLP researchers often focus on the downstream applications of NLP
models, cultural researchers often focus on what might be called upstream ap-
plications — that is, on what the training data can reveal about situated cultural
questions. My work has considered both of these perspectives and translated
between these approaches. How can we use NLP methods reliably and ethically
to identify patterns while not losing sight of outliers and uncertainty? Within
the field of NLP, I have critically re-examined machine learning measurement
methods. I probed models designed for traditional NLP tasks involving large,
generic datasets by exploring their results on small, socially-specific datasets
that are common in social science and humanities research.
My work has shown that popular NLP methods ported to new domains can
result in surprising instabilities and biases [Antoniak and Mimno, 2018, 2021].
For example, training a word embedding model on a target corpus and com-
paring cosine distances between word vectors can lead to unpredictable results,
unless researchers measure the stability of these distances, e.g., by bootstrap-
ping the training corpus. Similarly, bias measurements that rely on lexicons
of seed terms can produce significantly different results depending on various
features of the seed sets.
Within computational social science and the digital humanities, I use these
same NLP methods for the study of structured personal experiences shared in
online communities [Antoniak et al., 2019, 2021, Walsh and Antoniak, 2021]. I
identified two main sites for this research: online communities grounded in cul-
tural experiences (books, games) and online communities grounded in health-
care experiences (childbirth, contraception, pain management). These com-
munities situate personal opinions and stories in social contexts of reception,
expectation, and judgment, and the shared grounding in each community al-
lows clearer illumination of patterns and outliers. My research has highlighted
how community members use shared experiences to reframe and redefine es-
tablished narratives and hierarchies, revealing lessons for healthcare systems,
readership and reception, and careful research practices.
In work focused on personal healthcare experiences, for example, I have
shown how postpartum people in a specific online community share stories
about their birth experiences [Antoniak et al., 2019]. This work shed light on
the framing of outlier event sequences and the perceived power hierarchy of
medical professionals, family and support people, and the narrator themselves.
By taking advantage of the biological and medical structure of these stories, we
were able to use simple unsupervised models and lexical methods to extract
narrative pathways and measure power relationships.
In work focused on online reading communities, I have highlighted
the literary significance of new research resources like Goodreads and Library-
Thing [Antoniak et al., 2021, Walsh and Antoniak, 2021]. Before these communi-
ties existed, scholars interested in literary reception had to rely mostly on pro-
fessional reviews, publishing records, and sales data to try to reconstruct public
perceptions of books. My work examined the collaborative tagging system
that exists on both of these websites, where users can categorize books into
genres, subjects, or any categorical system the user wishes, with downstream
consequences for physical libraries and bookstores. These user-based tagging
systems capture popular perceptions and can remap the literary landscape, but
they can also entrench long-standing social hierarchies and modern algorithmic
design choices.
In this thesis, I will discuss a selection of case studies covering the topics
discussed above. First, I will highlight the instabilities in two distributional
methods that are commonly used for research applications in cultural analytics,
where the reliability of fine-grained measurements is of utmost importance.
These methods include ranking of word vectors by cosine similarities after train-
ing an embedding model; and bias measurement methods, which rely on sets
of seed terms. Then I will move on to two applied case studies, where I employ
similar distributional methods to study particular online communities. In the
first case study, I will discuss my work on an online birth stories community,
and in the second case study, I will discuss my work on online reading commu-
nities. Finally, I will conclude with open questions for future work.
CHAPTER 2
BACKGROUND
2.1 Measurement Tools for Unlabeled, Socially-Specific Data
Unsupervised methods allow researchers to explore their datasets without high
up-front costs. In particular, distributional semantic models — which rely on com-
parisons of word co-occurrences — can provide fast ways for researchers to
explore themes, biases, and other patterns in text data. These models include
word embeddings and topic modeling, two of the central methods discussed in
this thesis.
However, these methods also carry some risks, particularly in settings with
small and constrained datasets. In these settings, it becomes even more im-
portant to test the reliability and significance of results, as the presence or ab-
sence of individual documents can more significantly alter distributional pat-
terns learned from the dataset. Without proper evaluation, the lack of ground-
truth can result in spurious “story-telling” in which the researcher retroactively
fits their theory to the model.
These constraints are common in both the humanities and healthcare set-
tings discussed later in this thesis. In healthcare, privacy restrictions make it
very difficult to create shared datasets, while in the digital humanities, copy-
right restrictions are supplemented by the natural boundaries of the object of
study. We cannot augment an author’s oeuvre and are constrained to the data
that has survived through curation. Existing datasets are often small and labels
are limited; it might be expensive or impossible to create reliable gold labels
(e.g., there is no single correct summarization of a narrative). Even for tasks
where large, labeled datasets do exist, pretrained models have limited ability to
generalize to small, domain-focused datasets.
In this section, I provide an overview of these methods and situate them
within cultural analytics and NLP research. I begin by describing two com-
putational methods, topic modeling and word embeddings, which are popular
techniques in both fields.
2.1.1 Topic Modeling for Cultural Analysis
Topic modeling is an unsupervised NLP method that automatically discovers
topics or themes in a set of texts and can be used as an exploratory technique
leading a researcher to different views of a dataset [Baumer et al., 2017]. La-
tent Dirichlet allocation (LDA) is a popular topic model [Blei et al., 2003] that
has been used in a wide range of applications in cultural analytics as varied as
measuring the relationship between ideas over time [Prabhakaran et al., 2016,
Tan et al., 2017], exploring the types of disclosures shared during the #MeToo
social movement [Mueller et al., 2021], examining gender stereotypes in auto-
matically produced stories [Lucy and Bamman, 2021], and tracking changes in
societal concerns and views related to COVID-19 [Zamani et al., 2020].
LDA is a generative probabilistic model where each topic in a set of K topics
is represented by a probability distribution over the full vocabulary and each
document is represented by a probability distribution over the topics. The re-
searcher can select K as well as set the Dirichlet priors informing the sparsity
of topics for each document and the vocabulary for each topic. This model
makes a series of simplifying assumptions, including a bag-of-words assump-
tion (word order does not matter), a sparsity assumption (each document is best
represented by a few topics), and a mixture assumption (each document is rep-
resented by a mixture of topics).
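The generative story described above can be sketched in a few lines of Python. This is a minimal illustration rather than a training procedure: it only samples synthetic documents from the model. The function names, the toy vocabulary, and the default prior values (alpha for document-topic, beta for topic-word) are assumptions made for the example and do not come from any particular LDA implementation.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow with very small concentrations.
        return [1.0 / size] * size
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, num_docs, doc_len,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample documents from the LDA generative process (illustrative only)."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the full vocabulary.
    topics = [sample_symmetric_dirichlet(beta, len(vocab), rng)
              for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Each document is a probability distribution over the K topics.
        theta = sample_symmetric_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]          # pick a word
            words.append(w)
        docs.append(words)  # word order carries no information (bag of words)
    return topics, docs


toy_vocab = ["birth", "labor", "nurse", "book", "novel", "review"]
toy_topics, toy_docs = generate_corpus(toy_vocab, num_topics=2,
                                       num_docs=3, doc_len=8)
for doc in toy_docs:
    print(" ".join(doc))
```

Inference in practice runs in the opposite direction: given only the documents, a training algorithm estimates the topic and document distributions that plausibly generated them.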
Producing coherent topics often requires careful preprocessing of the dataset
and consideration of the dataset features and training parameters. I detail some
of the key steps in this process below.
Training algorithm, document size, and number of topics. Producing co-
herent topics requires careful consideration of the training dataset. For small
datasets, more coherent results can be found using Gibbs sampling rather than
other training procedures. The training document size can also affect results,
with smaller documents (e.g., tweets) often producing less coherent results than
longer documents (e.g., paragraphs, posts). Selecting the number of topics K re-
quires experimentation with each new dataset; different numbers of topics can
reveal different types of topics (e.g., a smaller number can produce higher level
themes), and there is no one “correct” number of topics, as this depends on the
researcher’s goals.
Stemming and stop word removal. Stemming (removing all but the morpho-
logical root from each token) before training a topic model is mostly redundant
and can be harmful [Schofield and Mimno, 2016]. Removing stop words can
be done pre- or post-training; the benefit is mostly an aesthetic improvement
to the lists of most probable words that are commonly used to interpret the
topics after training [Schofield et al., 2017a]. Most importantly, large numbers
of duplicates can affect the resulting model. While sometimes the duplicates
will be sequestered into a single topic, which can be ignored, at other times the
duplicates can overwhelm the model and should be removed before training
[Schofield et al., 2017b].
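As a rough illustration of removing duplicates before training, the sketch below drops documents whose text is identical after lowercasing and whitespace normalization. The normalization rule and the function names are assumptions made for this example; near-duplicates (e.g., reposts with small edits) would require a fuzzier matching strategy that is not shown here.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(documents):
    """Keep the first occurrence of each document, preserving original order."""
    seen = set()
    kept = []
    for doc in documents:
        key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


posts = ["My birth story", "my  birth   story", "A different post"]
print(drop_exact_duplicates(posts))
```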
Shifting away from known metadata. Sometimes, LDA will produce topics
that redundantly mirror known categories that might already be encoded in a
corpus’s metadata. For example, a topic model trained on a dataset of novels
might learn topics that correspond to each of the most popular authors. This
output is correct — the model is successfully identifying meaningful patterns
in the dataset — but it is probably not useful to a researcher who already has
author data for each of the novels. To overcome this challenge and bias the
model away from known metadata, Thompson and Mimno [2018a] provide a
tool that probabilistically subsamples words that appear more frequently with
the undesired categories before training. Preprocessing the training data in this
way can produce topics that are more useful for exploration and interpretation.
Dirichlet priors. Two Dirichlet priors control the sparsity of the topic-word
and document-topic distributions. Wallach et al. [2009] found that tuning the
document-topic prior improves the resulting topics, while tuning the topic-
word prior is not helpful.
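To see how a Dirichlet prior controls sparsity, the sketch below draws samples from a symmetric Dirichlet distribution (by normalizing gamma draws) and reports the average share of the largest component. The function names, the dimension, and the two concentration values are assumptions for illustration; small concentration values put most of the mass on a few components, while large values spread it more evenly.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of the given dimension."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [x / total for x in draws]


def mean_largest_share(alpha, dim=10, trials=2000, seed=0):
    """Average size of the largest component across many draws."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, dim, rng)) for _ in range(trials)) / trials


sparse = mean_largest_share(0.1)    # small prior: a few topics dominate
smooth = mean_largest_share(10.0)   # large prior: mass is spread across topics
```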
Topic model evaluation. Topic model evaluation is difficult, as there is no sin-
gle best set of topics for a dataset. Automatic metrics like coherence [Mimno
et al., 2011] can provide a general rule-of-thumb but do not always correlate
well with human judgments [Hoyle et al., 2021]. Instead, it is best to use man-
ual evaluation, such as “intruder” tests, in which a human annotator tries to
find the random intruder word among a set of true highest probability words
for a specific topic (or an intruder document among the set of documents with
highest probability for this topic) [Hoyle et al., 2021]. These tests measure the
success of the model in producing themes that are legible to the researcher.
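A word intruder task of the kind described above could be assembled roughly as follows. This is a sketch under stated assumptions: topics are represented as dictionaries mapping words to probabilities, the intruder is drawn from another topic's top words that rank in the bottom half of (or are absent from) the target topic, and all names are hypothetical.

```python
import random


def build_intruder_task(topic_word_probs, topic, n_top=5, rng=None):
    """Return (shuffled options, intruder) for one topic.

    topic_word_probs: list of dicts mapping word -> probability, one per topic.
    """
    rng = rng or random.Random(0)
    probs = topic_word_probs[topic]
    ranked = sorted(probs, key=probs.get, reverse=True)
    top = ranked[:n_top]
    rank = {w: i for i, w in enumerate(ranked)}
    cutoff = len(ranked) // 2

    # Candidate intruders: prominent in another topic, unlikely in this one.
    candidates = set()
    for other, other_probs in enumerate(topic_word_probs):
        if other == topic:
            continue
        other_top = sorted(other_probs, key=other_probs.get, reverse=True)[:n_top]
        for w in other_top:
            if w not in top and rank.get(w, len(ranked)) >= cutoff:
                candidates.add(w)
    if not candidates:
        raise ValueError("no suitable intruder word found")

    intruder = rng.choice(sorted(candidates))
    options = top + [intruder]
    rng.shuffle(options)
    return options, intruder


def intruder_accuracy(answers, intruders):
    """Fraction of tasks where the annotator picked the true intruder."""
    return sum(a == b for a, b in zip(answers, intruders)) / len(intruders)


toy_topics = [
    {"birth": 0.30, "baby": 0.25, "hospital": 0.20, "labor": 0.15,
     "novel": 0.05, "author": 0.03, "plot": 0.02},
    {"novel": 0.30, "author": 0.25, "plot": 0.20, "chapter": 0.15,
     "baby": 0.05, "birth": 0.03, "labor": 0.02},
]
options, intruder = build_intruder_task(toy_topics, topic=0, n_top=3)
```

An annotator would be shown only the shuffled options; high accuracy across many such tasks suggests that the topics are legible to human readers.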
2.1.2 Word Embeddings
Word embedding models map words in a vocabulary to low-dimensional vectors.
These word vectors can then be used as input to classifiers or compared to
one another using different distance or similarity metrics, like cosine distance.
These metrics can then be used as a proxy for semantic similarity — or rather,
distributional similarity — of the words in this training corpus. In NLP, word em-
beddings are often used as features for downstream tasks. Dependency parsing
[Chen and Manning, 2014], named entity recognition [Turian et al., 2010, Cherry
and Guo, 2015], and bilingual lexicon induction [Vulić and Moens, 2015] are just
a few examples where the use of embeddings as features has increased perfor-
mance in recent years.
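The comparison of word vectors by cosine similarity can be sketched in a few lines. The two-dimensional toy vectors and the function names below are assumptions for illustration only; real embeddings typically have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, n=3):
    """Rank all other words by cosine similarity to the query word."""
    scores = {
        word: cosine_similarity(vectors[query], vec)
        for word, vec in vectors.items()
        if word != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


toy_vectors = {
    "monday": [1.0, 0.1],
    "tuesday": [0.9, 0.2],
    "novel": [0.1, 1.0],
}
nearest = most_similar("monday", toy_vectors, n=1)
```

Cosine distance is one minus this similarity, so ranking by highest similarity and ranking by lowest distance give the same order.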
Word embeddings are often used as evidence in studies of language and
culture. For example, Hamilton et al. [2016] train separate embeddings on tem-
poral segments of a corpus and then analyze changes in the similarity of words
to measure semantic shifts, and Heuser [2016] uses embeddings to characterize
discourse about virtues in 18th Century English text. Other studies use cosine
similarities between embeddings to measure the variation of language across
geographical areas [Kulkarni et al., 2016, Phillips et al., 2017] and time [Kim
et al., 2014]. Researchers have used these similarity scores to measure the biases
encoded in the embedding model and to make claims about biases in the train-
ing dataset and its authors [Caliskan et al., 2017]. Each of these studies seeks to
reconstruct the mental model of authors based on documents.
Word embedding models are often described as measuring the semantic sim-
ilarity of words. In truth, they measure the distributional similarity of words,
which may or may not correlate with semantic similarity. This discrepancy
becomes more apparent when considering various weaknesses of these models,
where semantic similarity tests would fail.
As with other distributional methods, word embedding models are prone
to errors where words are related in context but not in meaning. In this vector
space, antonyms will often appear close together, as antonyms are usually very
similar to each other in all ways except one and are used in similar contexts
[Cruse, 1986]. Word sets (like days of the week or months of the year) will also
often appear as close to one another as synonyms, as they are usually used in
near identical contexts. Strange errors can also emerge through reporting bias:
when the frequency of a descriptive phrase does not match the real-world fre-
quency of the things signified by the phrase [Gordon and Van Durme, 2013]. For
example, according to a word embedding model, the vector representing sheep
might be much closer by cosine distance to the vector for black than to the
vector for white, because sheep are assumed to be white and their color is
usually noted only when this is not the case. Word embedding models can also encode human
biases, including harmful stereotypes [Bolukbasi et al., 2016a, Caliskan et al.,
2017] (partly driven by the reporting biases described above).
Many different word embedding models and training algorithms have been
proposed. In static models, each word is represented by a single vector, regard-
less of the word’s context and how many meanings the word might have. For
example, a word like bar will be represented by the same vector in every situ-
ation, whether the word is referring to a bar of soap or taking the bar exam. To
address this issue, word vectors can instead be extracted from contextualized
models like BERT [Devlin et al., 2019] which represent each word’s usage in a
specific context.
In this dissertation, I focus on static embedding models as these are more
commonly used in cultural analytics research. I highlight three of the most pop-
ular static embedding models below.
LSA. Latent semantic analysis (LSA), an early form of a word embedding
model, factorizes a sparse term-document matrix X [Deerwester et al., 1990,
Landauer and Dumais, 1997]. X is factored using singular value decomposition
(SVD), retaining K singular values such that
$$X \approx X_K = U_K \Sigma_K V_K^{T}.$$
The elements of the term-document matrix are weighted, often with TF-IDF,
which measures the importance of a word to a document in a corpus. The dense,
low-rank approximation of the term-document matrix, $X_K$, can be used to
measure the relatedness of terms by calculating the cosine similarity of the relevant
rows of the reduced matrix.
Word2Vec. A popular model, word2vec, is trained via the skip-gram with negative
sampling (SGNS) algorithm [Mikolov et al., 2013b], an online algorithm
that uses randomized updates to predict words based on their context. In each
iteration, the algorithm proceeds through the original documents and, at each
word token, updates model parameters based on gradients calculated from the
current model parameters. This process maximizes the likelihood of observed
word-context pairs and minimizes the likelihood of negative samples.
Word2vec can also be approximated through a method similar to LSA. This
method builds a positive pointwise mutual information (PPMI) matrix, whose
cells represent the PPMI of each pair of words and contexts, where
$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)};$$
$$\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0).$$
The PPMI matrix is factored using singular value decomposition (SVD) and
results in low-dimensional embeddings that perform similarly to GloVe and
SGNS [Levy and Goldberg, 2014].
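The PPMI computation can be sketched directly from co-occurrence counts. In this illustration, the probabilities are estimated from counts within a symmetric context window; the window size, the toy documents, and the function names are assumptions, and the factorization step (SVD) is omitted.

```python
import math
from collections import Counter


def cooccurrence_counts(docs, window=2):
    """Count (word, context) pairs within a symmetric window around each token."""
    counts = Counter()
    for tokens in docs:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Map each observed (word, context) pair to max(PMI, 0)."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in counts.items():
        p_wc = n / total
        p_w = word_totals[w] / total
        p_c = context_totals[c] / total
        scores[(w, c)] = max(math.log(p_wc / (p_w * p_c)), 0.0)
    return scores


toy_docs = [["a", "b"], ["a", "b"], ["c", "d"]]
toy_scores = ppmi(cooccurrence_counts(toy_docs, window=1))
```

Pairs that never co-occur are simply absent from the resulting dictionary, which corresponds to a zero cell in the sparse PPMI matrix.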
GloVe. Finally, Global Vectors for Word Representation (GloVe) uses stochastic
gradient updates but operates on a “global” representation of word
co-occurrence that is calculated once at the beginning of the algorithm
[Pennington et al., 2014]. Words and contexts are associated with bias
parameters, $b_w$ and $b_c$, where w is a word and c is a context, learned by
minimizing the cost function
$$L = \sum_{w,c} f(x_{wc}) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log(x_{wc}) \right)^2,$$
where $x_{wc}$ is the co-occurrence count of w and c and f is a weighting function.
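As a worked illustration, the GloVe objective as given by Pennington et al. is a weighted least-squares loss over observed co-occurrence counts, and it can be evaluated for fixed parameters as below. This sketch only computes the loss and performs no training. The function names and toy values are assumptions, and the weighting function uses the defaults reported by Pennington et al. (a cap of 100 and an exponent of 0.75).

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, word_vecs, ctx_vecs, word_bias, ctx_bias):
    """Weighted squared error between model scores and log co-occurrence counts.

    cooc maps (word, context) pairs to positive co-occurrence counts.
    """
    loss = 0.0
    for (w, c), x in cooc.items():
        dot = sum(a * b for a, b in zip(word_vecs[w], ctx_vecs[c]))
        residual = dot + word_bias[w] + ctx_bias[c] - math.log(x)
        loss += glove_weight(x) * residual ** 2
    return loss


# Toy parameters (hypothetical) that fit a single observed pair exactly.
perfect_fit = glove_loss(
    {("sheep", "black"): 100.0},
    word_vecs={"sheep": [1.0, 0.0]},
    ctx_vecs={"black": [math.log(100.0), 0.0]},
    word_bias={"sheep": 0.0},
    ctx_bias={"black": 0.0},
)
```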
2.2 Self-Disclosures in Online Communities
Self-disclosure is the “process of making the self known to others” [Jourard and
Lasakow, 1958]. This can include the sharing of personal opinions, beliefs and
values, sentiment and emotion, personal stories, and aspects of one’s iden-
tity such as gender, age, or nationality [Montgomery, 1981, Wang et al., 2016,
Ravichander and Black, 2018, Altman and Taylor, 1973, Bak et al., 2014, Barak
and Gluck-Ofri, 2007, Balani and De Choudhury, 2015].
Self-disclosure and trust are tightly intertwined. On one hand, trust is an
important prerequisite for self-disclosure, but on the other hand, trust can also
be fostered through self-disclosure [Joinson and Paine, 2007]. According to
social penetration theory, different types of self-disclosures are typical of dif-
ferent relationship stages; as social bonds grow and become more intimate,
more self-disclosures are made [Altman and Taylor, 1973]. Self-disclosures can
strengthen social bonds and foster relationship formation [Oswald et al., 2004,
Ma et al., 2017b], while group-level disclosures can enhance the trust between
group members and their sense of group identity [Galegher et al., 1998, Yang
et al., 2019b, Ma et al., 2017b, Joinson et al., 2010].
However, the decision to self-disclose is contextual. For example, Yang
et al. [2019b] find that health-related self-disclosure decisions depend on the
privacy of a conversation, while prior work has found that gender differences in
self-disclosure vary by context [Hill and Stull, 1987] and that men and those
wanting to manage impressions self-disclose less [Wang et al., 2016]. Others
have also found self-disclosure to be affected by culture and community
norms [Zhao et al., 2012, Li et al., 2018]. In online gaming, in-group behav-
iors like frequency of playing together have been found to positively relate to
increased self-disclosure [Reer and Krämer, 2014], while in virtual reality (VR)
games, context, levels of anonymity, and user control were found to be impor-
tant factors in deciding to self-disclose to other players [Maloney et al., 2020].
Many different automatic measurements of self-disclosure have been pro-
posed. Capitalizing on linguistic patterns associated with self-disclosure, prior
work has experimented with emotion lexicons [Bak et al., 2012, Ravichander
and Black, 2018], topic modeling [Bak et al., 2012], and training supervised mod-
els on coded conversational or social media datasets [Bak et al., 2012, Wang
et al., 2016, Balani and De Choudhury, 2015]. While prior work has also at-
tempted to build labeled training sets for self-disclosure detection [Balani and
De Choudhury, 2015], the context-dependent nature of many self-disclosures
often complicates their use.
While pronoun rates alone will not capture specific types of self-disclosure
(e.g., sharing emotions vs. opinions), the rate of first person singular pronouns
has been found to be a reliable proxy for self-disclosure [Joinson, 2001, Barak
and Gluck-Ofri, 2007]. Jaidka et al. [2018] found that first person pronoun use
reflects mentions of self, while Ravichander and Black [2018] found that first
person singular pronouns help identify self-disclosures with higher precision
(and low recall). Explicit statements about one’s identity (e.g., gender, race,
relationships) have previously been measured using lexicons of demographic
terms and pattern-matching.
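The rate of first person singular pronouns can be computed with a small lexicon and a simple tokenizer, as in the sketch below. The pronoun list, the tokenization pattern, and the function name are assumptions for illustration and do not reproduce the exact measurement used in the cited studies.

```python
import re

# Assumed lexicon of English first person singular pronouns.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def first_person_rate(text):
    """Fraction of tokens that are first person singular pronouns."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    # Contractions such as "I'm" count through the part before the apostrophe.
    hits = sum(1 for t in tokens if t.split("'")[0] in FIRST_PERSON_SINGULAR)
    return hits / len(tokens)


rate = first_person_rate("I told my doctor that I'm nervous.")
```

Because this measure is a rate, it can be compared across posts of different lengths, although very short posts will produce noisy values.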
2.3 Personal Narratives
Among the many forms of self-disclosure is telling stories about oneself. Nar-
ratives are powerful modes of expression, with physical, emotional, and social
benefits for both the narrator and the audience [Pennebaker and Beall, 1986,
Pennebaker, 1997, Merz et al., 2014, Oh and Kim, 2016, Tangherlini, 2000]. Nar-
ratives can also be useful methods for understanding human behavior and be-
liefs [Golsteijn and Wright, 2013]. Personal narratives can be rhetorically per-
suasive and difficult to moderate, and like other types of self-disclosure, they
can facilitate community trust and sensemaking. However, narratives are a
challenging test case for NLP tools, as their modeling needs to represent com-
plex and extended interactions between people, objects, environments, affects,
and beliefs. Book-length texts pose challenges for a field that often considers
dependencies longer than one paragraph “long range.”
In a recent overview of NLP approaches to narrative understanding, Piper
et al. [2021] formulate narrativity as a scalar construct rather than a binary class;
texts can include some or all narrative features (e.g., narrator, audience, sequen-
tial actions). Most NLP narrative tasks focus on building abstractions from nar-
ratives by extracting these features and measuring relationships between them.
These tasks include extracting narrative structure, like scripts, plot units, or nar-
rative arcs [Lehnert, 1981, Schank and Abelson, 1977, Chambers and Jurafsky,
2008, Goyal et al., 2010, Chambers and Jurafsky, 2009, Pichotta and Mooney,
2016, Reagan et al., 2016, Flor and Somasundaran, 2017]; discovering connec-
tions between characters [Bamman et al., 2013, Iyyer et al., 2016, Lukin et al.,
2016]; generating new stories or summaries [Goldfarb-Tarrant et al., 2020, Guan
et al., 2020]; and identifying a correct story ending [Mostafazadeh et al., 2016].
As in other areas of NLP, much narrative research tends to fall into shared
tasks, where artificial story datasets are used to test a particular technical abil-
ity of a system. In NLP, these datasets are sometimes created and often la-
beled by crowdworkers. For example, story completion is the goal of the Story
Cloze Task, in which the model is given all but the ending of a story and asked
to generate or select the appropriate ending [Chambers and Jurafsky, 2008,
Mostafazadeh et al., 2016]. These datasets describe brief scenarios intended to
isolate a phenomenon from more complicated social contexts.
Outside of shared tasks, narrative research in NLP also includes corpus-
based studies, where researchers use narrative models to learn about a partic-
ular dataset and its authors. Underlying corpus-based studies are often curated
datasets that range widely and can include fictional works (e.g., novels, fairy-
tales) [Jans et al., 2012, Iyyer et al., 2016], news stories [Chambers and Juraf-
sky, 2008], biographies [Bamman and Smith, 2014], and personal stories shared
orally or on social media [Gordon and Swanson, 2009, Ouyang and McKeown,
2014, Antoniak et al., 2019]. These curated datasets were authored in social con-
texts separate from the NLP research study, and they are gathered passively
(e.g., scraped without the knowledge or consent of the authors); this oppor-
tunistic use case constrains the possible research questions.
As with other NLP tasks, many modern narrative methods rely on large,
pretrained models. These models in turn rely on massive and undocumented
datasets, often scrapes like the aptly-named Pile [Gao et al., 2020], that contain
a mixture of documents from unrelated domains, in the hopes that the quantity
and diversity of data will generalize to a variety of other domains and tasks
during the fine-tuning process. These pretraining datasets are too large to create
full datasheet descriptions [Gebru et al., 2018]; even quantifying the number of
duplicate documents in these datasets is challenging [Lee et al., 2021], and these
datasets can encode human biases [Bender et al., 2021].
While many technical challenges lie ahead in modeling narratives, NLP also
faces challenges in dataset design and curation and moving between averaged
patterns from models and individual examples and experiences. The differ-
ent types of datasets noted above each bring their own challenges. Collected
datasets are often used out-of-context; this constrains the possible research
questions. Massive and undocumented datasets can contain unmeasured hu-
man biases and stereotypes as well as unmeasured domain biases (e.g., if more
of the pretraining data is from a particular domain). Short and artificial datasets
lack the social grounding of collected datasets — these datasets were not con-
structed out of personal motivations — and do not necessarily generalize to
longer, more complex stories. In all of these cases, biases from the datasets can
also influence NLP models and cause undesired outcomes.
2.4 Sensemaking through Self-Disclosure and Storytelling
How do people use self-disclosures and storytelling to make sense of their ex-
periences as a community? While many definitions of sensemaking have been
posited, it is often described as “processes through which people interpret and
give meaning to their experiences” [Lam et al., 2016]. It necessarily involves
communication and is an “activity that talks events and organizations into ex-
istence,” in part through retrospective storytelling [Weick et al., 2005]. As a
process that creates organization [Weick et al., 2005] and relies on collaborative
problem solving [Pirolli and Card, 2005], the analysis of sensemaking within
online communities falls within the study of computer-supported cooperative
work (CSCW), which examines how people use technical tools and spaces to
collaborate.
In online healthcare support communities, the individual works to make
sense of their healthcare experience, and the community collectively gathers in-
formation, compares stories, and makes sense of a shared experience [Mamyk-
ina et al., 2015]. Prior work has described how individual users can join a com-
munity and move through a transformative process; the collection and organi-
zation of information at the center of this process is also how the community
builds its own sense of the experience [Genuis and Bronstein, 2017, Patel et al.,
2019]. Not only do members of cancer support groups discernibly transition
between stages of treatment [Jha and Elhadad, 2010, Wen and Rosé, 2012] but
they can also be observed transitioning between community roles [Maloney-
Krichmar and Preece, 2002, Yang et al., 2019a]. Users usually begin by seeking
information and eventually transition to providing emotional support [Wang
et al., 2017, Yang et al., 2019a]. However, these processes are contextual, and dif-
ferent communities might rely on different processes more than others [Young
and Miller, 2019].
Storytelling and expressive writing about traumatic experiences can have
physical, emotional, and social health benefits [Pennebaker and Beall, 1986,
Pennebaker, 1997, Merz et al., 2014, Oh and Kim, 2016] and have been studied
for their role in therapy and the establishment of social norms. In one exam-
ple, Arigo and Smyth [2012] find that expressive writing about eating patterns
and personal appearance by college-age women either improves a range of out-
comes or reduces the risk of further decline. de Moor et al. [2008], on the other
hand, find no benefits in breast cancer survivors. Tangherlini [2000] argues that
the storytelling of paramedics forms the culture and organizational structure of
their jobs. Through these stories, the paramedics both work through the trauma of
difficult experiences and negotiate their place in a hierarchy of workers (doctors,
nurses, police officers, other paramedics). Likewise, patients tell stories about
their experiences that cast medical professionals into sets of roles, which physi-
cians need to understand to interact with the patient effectively [Suss, 2014].
In both the humanities and healthcare, individuals and communities try to
make sense of very complicated processes. Storytelling can help narrators as-
similate traumatizing experiences [Tangherlini, 2000] and can transfer impor-
tant information to others without firsthand experience [Bietti et al., 2019]. On
the one hand are bureaucratic healthcare systems and the physical and psycho-
logical costs and traumas of experiencing health issues. In these medical set-
tings, sensemaking is “distributed across the healthcare system” and decisions
about individual cases are processes of coordination and information distribu-
tion [Weick et al., 2005]. On the other hand are novels, poems, games, and other
cultural objects, each of which might require community sensemaking even as
each is also a piece of a larger sensemaking puzzle, which allows its audience
to more easily manipulate the ideas contained in the object and draw connec-
tions between them. The digital humanities itself can be viewed as “interpretive
endeavours... which seek not to constrain meaning, but to guarantee its multi-
plicity” and to reveal “the hidden details of pattern formation” [Ramsay, 2011].
CHAPTER 3
DATA PRACTICES
3.1 Upstream and Downstream Research
Work in NLP tends to fall into two large categories of research goals: downstream
and upstream research.
In applied machine learning and industry settings, the downstream setting is
standard. In this setting, some set of authors write down a set of texts, which
may or may not encode their internal biases. These texts are curated and
gathered into a corpus, which might be missing documents or altered by other
authors or curators. This dataset is then used by NLP researchers to train a model
for some downstream task, e.g., predicting which advertisement to display to a
user based on the text contained in the user’s emails. Since the focus is on the
downstream task, the training corpus is only of interest insofar as it generalizes
to the downstream task.
The downstream setting has become particularly prominent in the modern
pre-training and fine-tuning workflow, where a model is first trained on a very
large dataset and then fine-tuned on whichever specific dataset is of interest
downstream. The pre-training dataset is usually composed of a smorgasbord
of internet and book data, with Wikipedia, Reddit, and data from other on-
line communities mashed together into a single, super-sized corpus [Gao et al.,
2020, Bandy and Vincent, 2021]. This approach, using pre-trained models with
the transformer architecture, yields state-of-the-art results for the majority of
traditional NLP tasks like parsing, named entity recognition, and classification
[Devlin et al., 2019]. However, the size of these datasets can obstruct attempts to
document them, and researchers are left with unmeasured risks of memoriza-
tion, encoding biases into their models, and other possible harms.
The upstream, corpus-focused approach is distinct from these traditional
NLP settings. When we move from this downstream focus to a scenario in
which the training corpus itself becomes the object of study — when researchers
want to learn about the authors of a specific corpus — traditional NLP practices
are “flipped.” This use case might require re-analysis of methods that were de-
signed for downstream, rather than upstream, settings. In particular, the exten-
sion of traditional machine learning techniques to small, topic-specific corpora
requires new understandings of the limitations of these methods. The corpus
is at once both the central object of study and a small and unreliable sample whose
weaknesses must be evaluated.
The upstream setting is common for researchers working in the digital hu-
manities, computational social science, and cultural analytics. These disciplines
have strong traditions of curation, but these critical methods have not always
been applied to the datasets underlying many NLP tools and methods. For ex-
ample, researchers increasingly use embeddings [Garg et al., 2018, Knoche et al.,
2019, Hoyle et al., 2019] and other lexicon-based methods [Saez-Trumper et al.,
2013, Fast et al., 2016, Rudinger et al., 2017] to provide quantitative answers
to otherwise elusive political and social questions about the biases in a corpus
and its authors. This work often involves comparing bias measurements across
different corpora, which requires reliable, fine-grained measurements.
An example that highlights the contrast between the downstream-centered
and corpus-centered perspectives is the exploration of implicit bias in word em-
beddings. Researchers have observed that embedding-based word similarities
reflect cultural stereotypes, such as associations between occupations and
genders [Bolukbasi et al., 2016a]. From a downstream-centered perspective, these
stereotypical associations represent bias that should be filtered out before using
the embeddings as features. In contrast, from a corpus-centered perspective,
implicit bias in embeddings is not a problem that must be fixed but rather a
means of measurement, providing quantitative evidence of bias in the training
corpus.
Although models trained on a dataset may appear to measure properties of language,
they in fact only measure properties of the training corpus, which could suffer
from several problems. Sources could be missing or over-represented, typos
and other lexical variations could be present, and, as noted by Goodfellow et al.
[2016], “Many datasets are most naturally arranged in a way where successive
examples are highly correlated.” Smaller corpora can result in more instability
in measurements (as I show in Chapter 4), but small corpora are common in
cultural analytics. Often, it is impossible to supplement the corpus (e.g., we
cannot create more 18th Century English books or change their topical focus).
In summary, the training corpus is merely a sample of the authors’ language
model [Shazeer et al., 2016]; it is a fragile artifact, whose study should take into
account the instability of measurements deriving from its documents.
Access to healthcare data is limited by both privacy and proprietary con-
straints. Records of doctor-patient interactions in the form of medical notes
are particularly difficult to access. Medical language models usually rely on
PubMed, the Unified Medical Language System (UMLS) [Bodenreider, 2004],
and similar resources composed of biomedical publications [Pyysalo et al., 2013,
Chiu et al., 2016]. Models trained on these resources better represent academic
language than language used by patients, but the more informal conversations
between doctors and patients are also not easily modeled by standard conver-
sational tools because of the specialized vocabulary and formalized structure of
these conversations. Research is then constrained to whatever limited resource
is available, often without control for demographic or other variables if the data
comes from online sources. Likewise in the humanities, corpora are often fixed,
small, and specialized. Literary vocabulary, sometimes from different eras, does
not necessarily map to popular models trained on contemporary news and in-
ternet data. Online datasets, like book reviews, are not always accessible nor is
it often possible to control for demographics. In both cases, while we would like
to ask cultural questions about the authors of the data, we also have to recognize
the data is an incomplete and fragile sample.
3.2 Handling Data with Care
Applications of machine learning to the humanities and healthcare reveal dif-
ferent ends of the ethics and privacy spectrum. On the healthcare side, the huge
risk of patient harm requires that we prioritize privacy, carefully removing all
identifying information and not sharing data when anonymization is impossi-
ble. Medicine and related fields with direct user studies have a long history
of participant abuse, which has led to formalized frameworks to protect par-
ticipants, such as the framework presented in the Belmont Report [National
Commission for the Protection of Human Subjects of Biomedical and Behavioral
Research, 1978]. On the humanities side, the same concerns can lead to
very different ethical practices. Often, the people being studied are content cre-
ators, whether as novelists, poets, writers of serialized fan fiction, or online book
reviewers. In these cases, privacy concerns are still valid — for example, au-
thors of fan fiction might reveal intimate details about themselves — but it is
often just as important to consider the artistic rights of the authors, respecting
their contributions by giving them credit for their creations [Bruckman, 2002].
These tensions between protecting the authors and giving credit to the au-
thors of a dataset lead to a series of choices for any cultural analytics project.
These decisions are contextual, and researchers can weigh competing concerns
by using frameworks like the Belmont Report, with its three guiding principles
(respect for persons, beneficence, and justice) [National Commission for the
Protection of Human Subjects of Biomedical and Behavioral Research, 1978]. There
are also guided activities like Real ML [Smith et al., 2022] that can aid researchers
in making these decisions for a particular dataset and use case.
3.2.1 Examples of Data Handling Strategies
The following suggestions are not meant to be a comprehensive guide to data
handling in cultural analytics research but rather examples of how the Belmont
Report principles can influence specific data handling decisions. These
particular examples are drawn from decisions made during the work described
in Chapters 4 and 5 in this thesis. For an extended case study in this kind
of decision-making process, I recommend the descriptions in Lundberg et al.
[2019].
Model and dataset documentation. Sharing a complete dataset of the col-
lected data would allow other researchers to directly reproduce and evaluate
research results as well as train their own computational models. Reproduction
is particularly important for cultural analytics studies as the self-selection of the
authors can contribute to biases in the dataset [Janssens and Kraft, 2012]. The
lack of documentation for large models and datasets in NLP has led to a series of
frameworks to encourage better documentation practices, including Datasheets
[Gebru et al., 2018] and Model Cards [Mitchell et al., 2019].
However, copying and storing data removes it from the control of its au-
thors, and this can undermine a study’s respect for persons. One possible method
to support both reproducibility and privacy is to release the URLs of the data
points, rather than the content (as in the Twitter API terms, which allow shar-
ing of Tweet IDs but not the Tweet content). This maintains the user’s ability to
delete the content at any time and remove it from the dataset, while also allow-
ing researchers to directly compare their results. However, in cases of possible
severe harms, even releasing the URLs might not be desirable.
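One way to release references to data points rather than their content is sketched below. The record layout (dictionaries with "url" and "text" fields), the example URL, and the function name are hypothetical; the point is only that the exported file carries identifiers and never the authors' text.

```python
import csv
import io


def export_references(records, id_field="url"):
    """Write only the identifier column to CSV text, dropping all content fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_field])
    for record in records:
        writer.writerow([record[id_field]])
    return buffer.getvalue()


# Hypothetical collected records; the text field must not leave the research team.
collected = [
    {"url": "https://example.org/post/1", "text": "a private story"},
    {"url": "https://example.org/post/2", "text": "another private story"},
]
released = export_references(collected)
```

Researchers who later resolve these identifiers will only retrieve posts that their authors have chosen to keep online.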
Publishing quotations. In NLP and cultural analytics research, it is common
to provide quotations from the target dataset, to describe the dataset or as exam-
ples of a particular phenomenon. Quotations are useful for readers and partic-
ularly important when describing results of unsupervised methods, where the
researchers use qualitative evaluations to make modeling decisions (e.g., exam-
ining the documents with highest topic probability to determine a topic model’s
coherence). However, quotations can carry significant risks to the dataset’s au-
thors if re-identified [Dym and Fiesler, 2020]. Both personal preferences (some
authors prefer to be named as authors, while others prefer anonymity) and con-
textual concerns for a specific project and dataset need to be considered before
providing quotations.
One tactic that allows for data analysis without compromising the privacy of
the authors is to only share paraphrased examples from the dataset [Bruckman,
2002, Yang et al., 2019b]. This both conceals private and/or identifying informa-
tion and prevents the audience from searching for exact string matches in order
to identify the source story.
Another option is to contact each author whom the researchers would like
to quote and ask each author for their quotation preferences (e.g., no quotation,
quotation with attribution, quotation without attribution). Contacting the
authors before publication honors the authors’ wishes with regard to privacy and
also grants them agency in how their creative work is presented in published
work. This option respects the contributions of the authors as “amateur artists”
who should be given credit for their creative work when desired [Bruckman,
2002].
Sharing findings with the community. To support the Belmont Report’s prin-
ciple of justice, researchers should consider sharing their findings directly with
the studied community. Studies of Twitter users [Fiesler and Proferes, 2018]
and online fandom participants [Dym and Fiesler, 2020] have found that users
have varying levels of comfort with researchers using their data and are often
unaware of how and why their data is being used for research. Research that
is ostensibly conducted to support a specific community should communicate
its benefits to that community, e.g., through public-facing blog posts and
direct outreach.
CHAPTER 4
INSTABILITY OF UNSUPERVISED, DISTRIBUTIONAL MODELS
Unsupervised methods allow researchers to answer questions about their
datasets without high up-front costs. In cultural analytics, datasets are often
constrained in size, and labeling can be expensive and difficult, particularly for
humanities and healthcare data. However, unsupervised models can be brittle, and
this brittleness requires additional reliability tests to avoid “story-telling,” where
researchers retroactively fit theories to models. In this upstream approach, the
small, topic-specific corpus is at once both
the central object of study and a small and unreliable sample whose weaknesses
must be evaluated, particularly when using measurement tools designed for in-
dustry questions and large datasets, not cultural research questions and small
datasets.
In the two studies presented in this chapter, I re-examine unsupervised, dis-
tributional models which have been ported from the NLP community to other
fields. Ranked lists of words based on vector similarity are frequently used as
evidence in digital humanities and computational social science, but I find that
embedding-based word similarities are unreliable, especially on small datasets
[Antoniak and Mimno, 2018]. Properties of the training corpus, such as the pres-
ence or absence of specific documents can alter both similarities and rankings,
and training on small and idiosyncratic corpora significantly exacerbates these
problems. Considering embedding-based bias measurement methods, I find
that the lexicons used for bias measurement can themselves encode linguistic
and social biases [Antoniak and Mimno, 2021]. Bias measurement methods,
originally developed to examine models used for downstream NLP tasks, are
increasingly being used by scholars in the humanities and social sciences to ex-
plore the biases of the training dataset’s authors. By probing sets of artificial
seeds, I demonstrate that the order and similarity of these terms can signifi-
cantly affect results, and through a survey of prior work and critical reading of
gathered seed terms, I present a framework of common seed sources and po-
tential biases, finding that popularly reused seeds often contain the very stereo-
types they seek to measure.
4.1 Case Study: Word Embedding Instabilities
NLP research in word embeddings has so far focused on a downstream-centered
use case, where the end goal is not the embeddings themselves but perfor-
mance on a more complicated task. For example, word embeddings are of-
ten used as the bottom layer in neural network architectures for NLP [Bengio
et al., 2003, Goldberg, 2017]. In contrast, other researchers take a corpus-centered
approach and use relationships between embeddings as direct evidence about
the language and culture of the authors of a training corpus [Bolukbasi et al.,
2016a, Hamilton et al., 2016, Heuser, 2016]. Unlike the downstream-centered
approach, the corpus-centered approach is based on direct human analysis of
nearest neighbors to embedding vectors, and the training corpus is not simply
an off-the-shelf convenience but rather the central object of study. Disentangling
these separate use cases is vital in determining the proper use and evaluation of
word embedding models.
Variables such as the stochastic nature of the training algorithm, random ini-
tialization of the word vectors, sub-sampling of frequent tokens, and properties
of the corpus itself could cause these cosine similarities to be unreliable. We hy-
pothesized that training on small and potentially idiosyncratic corpora (which
are common in computational social science) can exacerbate these problems and
lead to highly variable estimates of word similarity.
4.1.1 Models and Methods
To test this hypothesis, we trained sets of 50 word embedding models for each
of four training algorithms, three datasets, two document sizes (sentence and
full document), and three corpus settings. Algorithms included latent semantic
analysis (LSA) [Deerwester et al., 1990, Landauer and Dumais, 1997], global
vectors (GloVe) [Pennington et al., 2014], skip-gram with negative sampling
(SGNS) [Mikolov et al., 2013b], and positive pointwise mutual information
(PPMI) [Levy and Goldberg, 2014]. We curated three datasets from The New
York Times Music and Sports sections, Reddit AskScience and AskHistorians, and
the 4th and 9th U.S. Federal Courts of Appeals. Most importantly, we manipulated
the corpus in three ways: FIXED (same documents in a constant order),
SHUFFLED (same documents in random order), and BOOTSTRAP (documents
sampled with replacement).
Setting    Method                              Tests variability due to...  Run 1  Run 2  Run 3
Fixed      Documents in consistent order       algorithm (baseline)         A B C  A B C  A B C
Shuffled   Documents in random order           document order               A C B  B A C  C B A
Bootstrap  Documents sampled with replacement  document presence            B A A  C A B  B B B
Table 4.1: The three settings that manipulate the document order and presence
in each corpus.
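The three settings can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name make_corpus, the toy documents A, B, C (taken from Table 4.1), and the fixed random seed are all assumptions made for the example.

```python
import random

def make_corpus(documents, setting, rng):
    """Return the training corpus for one run under the given setting."""
    if setting == "fixed":
        # Same documents, same order: only algorithmic randomness remains.
        return list(documents)
    if setting == "shuffled":
        # Same documents, random order (a permutation of the corpus).
        return rng.sample(documents, k=len(documents))
    if setting == "bootstrap":
        # Sample documents with replacement, keeping the corpus size constant.
        return rng.choices(documents, k=len(documents))
    raise ValueError(f"unknown setting: {setting}")

documents = ["A", "B", "C"]   # toy stand-ins for real documents
rng = random.Random(0)        # hypothetical seed, for repeatability only
runs = {
    setting: [make_corpus(documents, setting, rng) for _ in range(3)]
    for setting in ("fixed", "shuffled", "bootstrap")
}
```

Each list in runs would then be passed to the embedding trainer, so that differences between runs can be attributed to the algorithm, to document order, or to document presence.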
For each corpus, we select a set of 20 relevant query words from high prob-
ability words from an LDA topic model [Blei et al., 2003] trained on that corpus
with 200 topics. We calculate the cosine similarity of each query word to the
other words in the vocabulary, creating a similarity ranking of all the words in
the vocabulary. We calculate the mean and standard deviation of the cosine sim-
ilarities for each pair of query word and vocabulary word across each set of 50
models.
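As a rough sketch of this computation, assume each trained model is available as a NumPy matrix with one row per vocabulary word and the same row order in every model. The helper names and the random stand-in matrices below are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def cosine_to_vocab(embeddings, query_index):
    """Cosine similarity between the query row and every row of the matrix."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit[query_index]

def similarity_stats(models, query_index):
    """Mean and standard deviation of each (query, word) similarity across models."""
    sims = np.stack([cosine_to_vocab(m, query_index) for m in models])
    return sims.mean(axis=0), sims.std(axis=0)

rng = np.random.default_rng(0)
# Stand-in for a set of 50 trained models: random matrices (vocabulary 100, dimension 20).
models = [rng.normal(size=(100, 20)) for _ in range(50)]
mean_sim, std_sim = similarity_stats(models, query_index=0)

# Rank the vocabulary by mean similarity, skipping the query word itself.
ranking = [i for i in np.argsort(-mean_sim) if i != 0]
top_10 = ranking[:10]
```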
From the lists of queries and cosine similarities, we select the 20 words most
closely related to the set of query words and compare the mean and standard
deviation of those pairs across settings. We calculate the Jaccard similarity be-
tween top-N lists to compare membership change in the lists of most closely
related words, and we find average changes in rank within those lists. We ex-
amine these metrics across different algorithms and corpus parameters.
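The Jaccard comparison of top-N lists can be illustrated as follows. Averaging over all pairs of runs is one plausible way to aggregate the comparison and is an assumption of this sketch; the example lists are the top two words from the first three runs reported in Table 4.3.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(top_n_lists):
    """Average Jaccard similarity over all pairs of top-N lists."""
    pairs = list(combinations(top_n_lists, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Top-2 neighbors of "pregnancy" in Runs 1-3 of Table 4.3.
runs = [
    ["viability", "pregnancies"],
    ["fetus", "pregnancies"],
    ["trimester", "surgery"],
]
score = mean_pairwise_jaccard(runs)
```

A score near 1 means the same words appear in the list in every run, while a score near 0 means the membership of the list changes almost completely between runs.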
4.1.2 Results Across Models and Settings
We begin with a case study of the framing around the query term marijuana. One
might hypothesize that the authors of various corpora (e.g. judges of the 4th
Circuit, journalists at the NYT, and users on Reddit) have different perceptions
of this drug and that their language might reflect those differences. Indeed, after
qualitatively examining the lists of most similar terms (see Table 4.2), we might
come to the conclusion that the allegedly conservative 4th Circuit judges view
marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit
users view marijuana as closer to legal substances such as nicotine and alcohol.
However, we observe patterns that cause us to lower our confidence in such
conclusions. Table 4.2 shows that the cosine similarities can vary significantly.
We see that the top ranked words (chosen according to their mean cosine similarity
across runs of the FIXED setting) can have widely different mean similarities
and standard deviations depending on the algorithm and the three training
settings, FIXED, SHUFFLED, and BOOTSTRAP.

[Plot panels, one per algorithm and corpus. y-axis: Most Similar Words; x-axis: Cosine Similarity; legend: fixed, shuffled, bootstrap. Words shown in each panel:]

LSA
  4th Circuit: heroin, cocaine, distribution, crack, methamphetamine, powder, distributing, oxycodone, manufacture, distribute
  NYT Sports:  criticized, tested, testing, cocaine, several, violent, involving, substance, reserved, steroids
SGNS
  4th Circuit: heroin, oxycodone, cocaine, methamphetamine, pills, drugs, crack, narcotics, powder, cigarettes
  NYT Sports:  cocaine, drug, nandrolone, steroid, counseling, alcohol, urine, substance, abuse, testing
GloVe
  4th Circuit: cocaine, heroin, kilograms, crack, distribute, drugs, grams, smoked, growing, possession
  NYT Sports:  cocaine, procedures, smoking, testing, addiction, purposes, steroid, blaming, suspensions, positive
PPMI
  4th Circuit: cocaine, heroin, powder, crack, grams, kilogram, paraphernalia, kilograms, hydrochloride, methamphetamine
  NYT Sports:  cocaine, drug, alcohol, tested, positive, substance, drugs, crack, testing, steroid

Table 4.2: The most similar words with their means and standard deviations
for the cosine similarities between the query word marijuana and its 10 nearest
neighbors (highest mean cosine similarity in the FIXED setting). Embeddings are
learned from documents segmented by sentence.
As expected, each algorithm has a small variation in the FIXED setting. For
example, we can see the effect of the random SVD solver for LSA and the effect
of random subsampling for PPMI. We do not observe a consistent effect for
document order in the SHUFFLED setting.
Most importantly, these figures reveal that the BOOTSTRAP setting causes
large increases in variation across all algorithms (with a weaker effect for PPMI)
and corpora, with large standard deviations across word rankings. This indi-
cates that the presence of specific documents in the corpus can significantly af-
fect the cosine similarities between embedding vectors.
GloVe produced very similar embeddings in both the FIXED and SHUFFLED
settings, with similar means and small standard deviations, which indicates that
GloVe is not sensitive to document order. However, the BOOTSTRAP setting
caused a reduction in the mean and widened the standard deviation, indicating
that GloVe is sensitive to the presence of specific documents.
[Two plot panels: “Standard Deviation in the 9th Circuit Corpus” and “Standard Deviation in the NYT Music Corpus.” x-axis: ALGORITHM (LSA, SGNS, GloVe, PPMI); y-axis: STANDARD DEVIATION; legend: fixed, shuffled, bootstrap.]
Figure 4.1: The mean standard deviations across settings and algorithms for the
10 closest words to the query words in the 9th Circuit and NYT Music corpora
using the whole documents. Larger variations indicate less stable embeddings.
These patterns of larger or smaller variations are generalized in Figure 4.1,
which shows the mean standard deviation for different algorithms and settings.
We calculated the standard deviation across the 50 runs for each query word
in each corpus, and then we averaged over these standard deviations. The re-
sults show the average levels of variation for each algorithm and corpus. We
observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the
least variable cosine similarities, while PPMI produces the most variable cosine
similarities for all settings. The presence of specific documents has a signifi-
cant effect on all four algorithms (lesser for PPMI), consistently increasing the
standard deviations.
Run 1        Run 2         Run 3      Run 4            Run 5          Run 6
viability    fetus         trimester  surgery          trimester      pregnancies
pregnancies  pregnancies   surgery    visit            surgery        occupation
abortion     gestation     visit      therapy          incarceration  viability
abortions    kindergarten  tenure     pain             visit          abortion
fetus        viability     workday    hospitalization  arrival        tenure
gestation    headaches     abortions  neck             pain           visit
surgery      pregnant      hernia     headaches        headaches      abortions
expiration   abortion      summer     trimester        birthday       pregnant
sudden       pain          suicide    experiencing     neck           birthday
fetal        bladder       abortion   medications      tenure         fetus
Table 4.3: The 10 closest words to the query term pregnancy are highly variable.
None of the words shown appear in every run. Results are shown across runs of
the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document
size, and the SGNS model.
We turn to the question of how this variation in standard deviation affects
the lists of most similar words. Are the top-N words simply re-ordered, or do
the words present in the list substantially change? Table 4.3 shows an example
of the top-N word lists for the query word pregnancy in the 9th Circuit corpus.
Observing Run 1, we might believe that judges of the 9th Circuit associate preg-
nancy most with questions of viability and abortion, while observing Run 5, we
might believe that pregnancy is most associated with questions of prisons and
family visits. Although the lists in this table are all produced from the same
corpus and document size, the membership of the lists changes substantially
between runs of the BOOTSTRAP setting.
[Two plot panels: “Variation in Top 2 Words” and “Variation in Top 10 Words.” x-axis: ALGORITHM (LSA, SGNS, GloVe, PPMI); y-axis: JACCARD SIMILARITY; legend: fixed, shuffled, bootstrap.]
Figure 4.2: The mean Jaccard similarities across settings and algorithms for
the top 2 and 10 closest words to the query words in the AskHistorians corpus.
Larger Jaccard similarity indicates more consistency in top N membership. Re-
sults are shown for the sentence document length.
These changes in top-N rank are shown in Figure 4.2. For each query word
for the AskHistorians corpus, we find the N most similar words using SGNS. We
generate new top-N lists for each of the 50 models trained in the BOOTSTRAP
setting, and we use Jaccard similarity to compare the 50 lists. We observe patterns
similar to the changes in standard deviation shown in Figure 4.1; PPMI displays
the lowest Jaccard similarity across settings, while the other algorithms have
higher similarities in the FIXED and SHUFFLED settings but much lower similar-
ities in the BOOTSTRAP setting. We display results for both N = 2 and N = 10,
emphasizing that even very highly ranked words often drop out of the top-N
list.
4.1.3 Discussion and Impact
The most obvious result of our experiments is to emphasize that embeddings are
not even a single objective view of a corpus, much less an objective view of lan-
guage. The corpus is itself only a sample, and we have shown that the curation
of this sample (its size, document length, and inclusion of specific documents)
can cause significant variability in the embeddings. Happily, this variability can
be quantified by averaging results over multiple bootstrap samples.
We can make several specific observations about algorithm sensitivities. In
general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in
the collections we evaluated. This is surprising, as we had expected SGNS to be
sensitive to document order and anecdotally, we had observed cases where the
embeddings were affected by groups of documents (e.g. in a different language)
at the beginning of training. However, all four algorithms are sensitive to the
presence of specific documents, though this effect is weaker for PPMI.
This sensitivity to the presence of specific documents leads us to a more nu-
anced understanding of the training corpus as a fragile artifact. The corpus has
been curated over time and might be biased in the representation of its doc-
uments; documents might be missing or over-represented, and they might be
ordered according to some scheme, perhaps chronologically or thematically
[Goodfellow et al., 2016]. While word embeddings may appear to measure
properties of language, they in fact only measure properties of a curated cor-
pus, and the training corpus is merely a sample of the authors’ language model
[Shazeer et al., 2016].
Although PPMI appears deterministic (due to its pre-computed word-
context matrix), we find that this algorithm produced results under the FIXED
ordering whose variability was closest to the BOOTSTRAP setting. We attribute
this intrinsic variability to the use of token-level subsampling. This sampling
method introduces variation into the source corpus that appears to be compa-
rable to a bootstrap resampling method. Sampling in PPMI is inspired by a
similar method in the word2vec implementation of SGNS [Levy et al., 2015]. It
is therefore surprising that SGNS shows noticeable differentiation between the
BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on
the other.
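To see why PPMI appears deterministic, note that once the word-context counts are fixed, the transformation itself has no random component. The sketch below computes positive pointwise mutual information from a small count matrix; it deliberately omits the token-level subsampling step discussed above, and the function name ppmi and the toy counts are assumptions made for illustration.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: max(0, log(P(w, c) / (P(w) * P(c)))) for each cell."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # word totals
    col = counts.sum(axis=0, keepdims=True)   # context totals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts contribute no association
    return np.maximum(pmi, 0.0)

# Toy word-context counts: two words that never share a context.
toy_counts = [[2, 0],
              [0, 2]]
matrix = ppmi(toy_counts)
```

Any run-to-run variation in the FIXED setting therefore has to enter through the counts, which is consistent with attributing it to subsampling.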
The use of embeddings as sources of evidence needs to be tempered with the
understanding that fine-grained distinctions between cosine similarities are not
reliable and that smaller corpora and longer documents are more susceptible
to variation in the cosine similarities between embeddings. When studying the
top-N most similar words to a query, it is important to account for variation in
these lists, as both rank and membership can significantly change across runs.
Therefore, we emphasize that with smaller corpora comes greater variability,
and we recommend that practitioners use bootstrap sampling to generate an
ensemble of word embeddings for each sub-corpus and present both the mean
and variability of any summary statistics such as ordered word similarities.
We leave for future work a full hyperparameter sweep for the four algorithms.
While these hyperparameters can substantially impact performance,
our goal with this work was not to achieve high performance but to examine
how the algorithms respond to changes in the corpus. We make no claim that
one algorithm is better than another.
4.2 Case Study: Comparative Measurements of Bias
There has been increasing concern in the NLP community over bias and stereo-
types contained in models and how these biases can trickle downstream to prac-
tical applications, such as serving job advertisements. In particular, there has
been much recent scrutiny of word representations, with many studies find-
ing harmful associations encoded in embedding models. Combating such bi-
ases requires measuring the bias encoded in a model so that researchers can es-
tablish improvements, and many variants of embedding-based measurement
techniques have been proposed [Bolukbasi et al., 2016b, Caliskan et al., 2017,
Manzini et al., 2019].
These measurements have had the additional upstream benefit of providing
computational social science and digital humanities scholars with a new means
of quantifying bias in datasets of social, political, or literary interest. Researchers
increasingly use embeddings [Garg et al., 2018, Knoche et al., 2019, Hoyle et al.,
2019] and other lexicon-based methods [Saez-Trumper et al., 2013, Fast et al.,
2016, Rudinger et al., 2017] to provide quantitative answers to otherwise elu-
sive political and social questions about the biases in a corpus and its authors.
This work often involves comparing bias measurements across different cor-
pora, which requires reliable, fine-grained measurements.
Despite the diversity of bias measurement methods, they all rely on lexicons
of seed terms to specify stereotypes or dimensions of interest. These lexicons
are short lists of terms meant to represent the groups being studied. Most often,
they are curated sets that represent pairs of demographic groups and attributes; for
example, popularly used sets for gender include men = [he, man, father, ...] and
women = [she, woman, mother, ...], and researchers search for associations between
these gender seeds and attribute seed sets such as occupations or positive and
negative adjectives. If researchers can establish a stronger relationship between
one set of gender seeds and one set of occupations, this is used as evidence
of gender bias in the training corpus. Binary gender pairs are relatively easy
to construct in English because of gender-marked pronouns, but in the case of
other groups, such as racial and class-based groups, constructing sets of seed
terms is more challenging. Prior work has used crowd-sourcing or predefined
lists from the social sciences, but these lists can contain their own biases and
their coverage on the target dataset is not reliable. For example, when measuring
levels of racial bias in a corpus, prior work often uses manually created sets
of “European-American” and “African-American” first names. I show examples
of these seeds in Table 4.4.

Target Concept      Highlighted Seeds
Unpleasant          divorce, jail, poverty, cancer, ...
African American    Tanisha, Tia, Lakisha, Latoya, ...
Domestic Work       mom, mum, ...
Ugliness            fat, chubby, obese, fatty, overweight, disformed, disfigured, wrinkle, wrinkled, ...

Table 4.4: Examples of real seed terms used in recent work to measure biases in
corpora.
In prior work, these seed sets are often not clearly documented. Their
sources, their validation, and sometimes even their contents are left to the
reader’s imagination. The rationale for choosing specific seeds is often unclear;
sometimes seeds are crowd-sourced, sometimes hand-selected by researchers,
and sometimes drawn from prior work in the social sciences. The impact of
the seeds is not well-understood, and many previous seed sets have serious
limitations. As shown in Table 4.4, the seeds used for bias measurement can
themselves exhibit several types of cultural and cognitive biases (e.g., reductive
definitions), and in addition, linguistic features of the seeds (e.g., frequency)
can affect bias measurements [Ethayarajh et al., 2019]. Though they are often re-
used, the suitability of these seeds to novel corpora is uncertain, and while eval-
uations sometimes include permutation tests, distinct sets of seeds are rarely
compared.
Word frequency and distribution can directly affect results of both count-
based methods [Gordon and Van Durme, 2013, Kuang, 2016] and embedding-
based methods [Ethayarajh et al., 2019]. Many methods rely on cosine simi-
larities between word vectors, which can be affected by factors such as part
of speech and training domain match [Antoniak and Mimno, 2018, Wendlandt
et al., 2018]. These problems are not hypothetical — many of the seed sets and
tools discussed in this paper are actively used in industry and research — and
the stakes are high. Accurate bias measurements can help to improve the fair-
ness of applications built on NLP models, and inaccurate bias measurements
can provide fodder for critics seeking to deny any evidence of prejudices. In
particular, we focus on the stability of bias measurements based on seed terms
in the upstream use case, when these measurements are used to make compar-
isons across corpora. Although in the NLP community it is common to rely on
a single pre-trained source model as a starting point, many researchers outside
NLP use the same unsupervised methods as a means of extracting bias informa-
tion from one or more specific collections of interest to answer specific social and
humanist questions. Seeds developed in one context can be and are easily re-used
in other contexts, but evaluation and validation remain necessary precursors to
relying on seeds for sensitive measurements.
[Box plots. y-axis: Seeds; x-axis: Similarity to Unpleasantness Vector; legend: romance, history + biography. Seed sets shown:]
woman, women, she, her, her,... (Kozlowski et al 2019)
sister, female, woman, girl, daughter,... (Caliskan et al 2017)
woman, girl, she, mother, gal,... (Bolukbasi et al 2016)
woman, girl, mother, daughter, sister,... (Hoyle et al 2019)
lady, nun, heroine, actress, businesswoman,... (Zhao et al 2018)
baker, counselor, nanny, librarians, socialite,... (Zhao et al 2018)
Figure 4.3: Bias measurements depend on seeds. We calculate the cosine similarities
between different seed sets representing women and an averaged unpleasantness
vector from two embedding models. Results are consistent across seeds
for romance review embeddings, but vary widely between sets for history and
biography. We find similar variation even for a pretrained Google News model.
Figure 4.3 presents a motivating example, showing the instability of mea-
surements using seeds. In this example, we imagine a digital humanities scholar
interested in measuring whether women are portrayed more negatively in different
genres of book reviews. Perhaps this scholar hypothesizes that women
are portrayed more negatively in history and biography reviews than in ro-
mance reviews, since more romance reviewers are women. As in the WEAT test,
each seed is plotted according to its cosine similarity to an averaged unpleasant-
ness vector [Caliskan et al., 2017]. Green indicates a subspace trained on romance
book reviews, while purple indicates a subspace defined for history book re-
views; our imaginary scholar is interested in whether these boxes overlap or are
separated from each other. Following prior work, our imaginary scholar experiments
with different seed sets that represent women, all of which appear reasonable
at first glance for the target book reviews dataset. However, the scholar
finds that for some sets of seeds representing women, no significant difference
is visible, while for other sets, there are much larger differences. Depending on
which set the researcher chose, they would draw significantly different conclu-
sions when comparing biases across datasets. And perhaps even worse — if the
scholar had relied on only one set and not compared many distinct sets, they
would not have realized the differences in results.
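The comparison in this example can be sketched as below, assuming the embeddings are available as a dictionary from word to NumPy vector. The helper names and the two-dimensional toy vectors are hypothetical and chosen only to make the arithmetic easy to follow; the seed and attribute words come from Figure 4.3 and Table 4.4.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def seed_similarities(vectors, seeds, attribute_terms):
    """Cosine similarity of each in-vocabulary seed to the averaged attribute vector."""
    attribute_vector = np.mean(
        [vectors[t] for t in attribute_terms if t in vectors], axis=0
    )
    return {s: cosine(vectors[s], attribute_vector) for s in seeds if s in vectors}

# Toy embedding model; real vectors would come from a trained model.
vectors = {
    "woman": np.array([1.0, 0.0]),
    "girl": np.array([0.0, 1.0]),
    "divorce": np.array([1.0, 0.0]),
    "jail": np.array([1.0, 0.0]),
}
unpleasant = ["divorce", "jail"]
sims = seed_similarities(vectors, ["woman", "girl", "gal"], unpleasant)
```

Running the same function for several seed sets and several embedding models, and comparing the resulting distributions, reproduces the kind of side-by-side comparison the imaginary scholar performs. Seeds missing from the vocabulary (here, gal) are silently skipped, which is itself a coverage issue worth reporting.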
4.2.1 Framework of Seed Sources
I gather and document seed sets from recent papers, creating a framework of
common seed sources. Table 4.5 shows the number of papers in our survey that
used seeds from each of the sources in our framework. Each paper (and each
individual seed set) can draw from more than one source. The most commonly
used sources are seeds derived from the target corpus, seeds re-used from prior
bias measurement research, and seeds borrowed from the social sciences.
Corpus-Derived                   7/18 papers
Re-Used                          7/18 papers
Borrowed from Social Sciences    6/18 papers
Curated                          5/18 papers
Adapted from Lexical Resources   3/18 papers
Crowd-Sourced                    2/18 papers
Population-Derived               2/18 papers
Table 4.5: Overview of the surveyed seed sources.
Borrowed from Social Sciences. One of the most common seed sources is
prior work in psychology and other social sciences. Researchers often borrow
these in an effort to either replicate results or build confidence from previously
validated work. For example, Caliskan et al. [2017] validate prompts from
the Implicit Association Test [Greenwald et al., 1998], while Garg et al. [2018]
and Hoyle et al. [2019] use personality traits from Williams and Bennett [1975],
Williams and Best [1977, 1990]. In another example, Bhatia et al. [2018] use sub-
sets of personality traits from Goodwin et al. [2014] to measure stereotyping of
political candidates. Sometimes the seeds appeal for validity via highly cited
resources, like LIWC [Pennebaker et al., 2001], despite critiques about unrelia-
bility [Panger, 2016, Forscher et al.]. In some cases, relying on prior work has
the benefit of human validation. However, validation in one setting does not
guarantee validation in another; biases can be context-dependent. Borrowing
seeds does not absolve researchers from examining and validating seeds.
Crowd-Sourced. Researchers can also use the crowd to generate and annotate
seed sets. For example, Fast et al. [2016] use Mechanical Turk to validate the in-
clusion of terms in their seed sets; the final terms are then included in packaged
code for researchers and practitioners. Kozlowski et al. [2019] use Mechanical
Turk to gather ratings of items scaled along gender, race, and class. Crowd-
sourcing can aid in gathering contemporary associations and stereotypes. How-
ever, controlling for crowd demographics can be difficult, and crowd-sourcing
can result in alarming errors, in which popular stereotypes are hard-coded into
the seeds (as in Table 4.4).
Population-Derived. Seed sets can be derived from government-collected
population datasets. Popular sources include U.S. census data [Bolukbasi et al.,
2016b, Caliskan et al., 2017], the U.S. Bureau of Labor Statistics [Caliskan et al.,
2017], and the U.S. Social Security Administration [Garg et al., 2018]. These
sources are usually used to gather names and occupations common to certain
demographic groups (e.g., to gather lists of “European American” and “African
American” names). These sources tend to be U.S.-centric, though the training
data for the embedding does not always match (e.g., large Wikipedia datasets
are not guaranteed to have U.S. authors). Reliance on these sources is partic-
ularly vulnerable to reductive definitions of the target concepts—e.g., gender
[Keyes, 2017]—and assumes a level of trust and representation in the data col-
lection that might not exist evenly across groups.
Adapted from Lexical Resources. Some seed sets are drawn from existing dic-
tionaries, lexicons, and other public resources, such as SemEval tasks [Zhao
et al., 2018] and ConceptNet [Fast et al., 2016]. Pre-packaged sentiment lexi-
cons are a popular source [Saez-Trumper et al., 2013, Sweeney and Najafian,
2019]; these lexicons include the Affective Norms for English Words (ANEW)
[Bradley and Lang, 1999] and negative/positive sentiment words from Hu and
Liu [2004]. These seeds have the advantage of previous rounds of validation,
but this does not guarantee validity for new domains.
Corpus-Derived. Given a corpus of interest, quantitative methods can be used
to extract seed terms from the corpus. For example, Saez-Trumper et al. [2013]
use sorted lists of named entities extracted from a target dataset to create seed
sets for personas of interest. Similarly, Sweeney and Najafian [2019] extract high
frequency identity terms from a Wikipedia corpus. These methods have the
advantage of ensuring high frequency terms in the target dataset. However,
they pose similar risks to crowd-sourcing; unless an extra round of cleaning
and curation is completed by the researchers, terms with unintended effects can
be included in the seed sets.
Curated. Seed sets are sometimes hand-selected by the authors, usually after
close reading of the corpus of interest. For example, Rudinger et al. [2017] hand-
select a set of seed terms that correspond to a set of demographic categories of
interest, and Joseph et al. [2017] hand-select a set of identity seeds based on
their frequency in a Twitter dataset. Often, even when papers rely on other
seed sources, manual curation is included as a step in the seed creation process.
Hand-curation can result in high precision seeds, but this method relies on the
authors’ correction for their own social biases.
Re-Used. Finally, many papers rely on prior bias measurement research for
seed terms. The most popular sources in our survey include early papers on bias
in embeddings such as Bolukbasi et al. [2016b] and Caliskan et al. [2017]. This
repetition means that the seeds are tested on many different datasets, but they
should not be trusted without validation; there can be mismatches in frequency
and contextual meaning between datasets.
4.2.2 Bias Measurement Methods
Word Embedding Association Test. Given a set of embedding vectors w, the
Word Embedding Association Test (WEAT) [Caliskan et al., 2017] defines a vector
based on the difference between the mean vectors of the two target sets, and
then measures the cosine similarity of a set of attribute words to that vector.
The strength of the association between the target sets X and Y , and the sets of
attributes, A, and B, is given by
\[
s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)
\]
where s(w, A, B) is the difference between the average cosine similarity of a
query w to the terms in A and its average cosine similarity to the terms in B.
To test whether the resulting difference s(X, Y, A, B) is significant, this result
is compared to the same function applied to randomly permuted sets drawn from X
and Y. Caliskan et al. [2017] use WEAT to measure stereotypical associations
between sets of targets and attributes, where, for example, the target terms
might be arts and science terms, and the attribute terms might be terms
representing men and women.
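As a minimal sketch of the WEAT statistic and its permutation test as described above, the following uses plain Python lists as vectors. The function names, the toy two-dimensional vectors, the number of permutations, and the one-sided form of the p-value are illustrative assumptions, not the original implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, A, B):
    """s(w, A, B): mean similarity of w to A minus mean similarity of w to B."""
    mean_a = sum(cosine(w, a) for a in A) / len(A)
    mean_b = sum(cosine(w, b) for b in B) / len(B)
    return mean_a - mean_b


def weat_statistic(X, Y, A, B):
    """s(X, Y, A, B): summed associations of X minus summed associations of Y."""
    return sum(association(x, A, B) for x in X) - sum(association(y, A, B) for y in Y)


def permutation_p_value(X, Y, A, B, n_permutations=1000, seed=0):
    """Share of random re-partitions of the pooled targets whose statistic
    is at least as large as the observed one (one-sided; an assumption)."""
    rng = random.Random(seed)
    observed = weat_statistic(X, Y, A, B)
    pooled = list(X) + list(Y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if weat_statistic(pooled[:len(X)], pooled[len(X):], A, B) >= observed:
            hits += 1
    return hits / n_permutations


# Toy 2-d vectors (hypothetical, not taken from any trained embedding).
X = [(1.0, 0.1), (0.9, 0.2)]  # target set X
Y = [(0.1, 1.0), (0.2, 0.9)]  # target set Y
A = [(1.0, 0.0)]              # attribute set A
B = [(0.0, 1.0)]              # attribute set B

print(weat_statistic(X, Y, A, B))
print(permutation_p_value(X, Y, A, B))
```

With real embeddings, X, Y, A, and B would hold the vectors looked up for each seed term; seeds missing from the vocabulary would need to be handled before calling these functions.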
Principal Component Analysis. The principal component analysis (PCA)
method tests how much variability there is in the difference vectors between
pairs of word vectors [Bolukbasi et al., 2016b]. If the vector difference between
pairs of seed terms can be approximated well by a single constant vector c, then
this vector represents a bias subspace. In this case, the subspace is simply a one
dimensional vector, though this process could be extended to more dimensions.
For each pair of embedding vectors corresponding to one seed word from set X
and one from set Y , Bolukbasi et al. [2016b] calculate the mean vector of those
two vectors and then include the two resulting half vectors from that mean to
the two seed vectors as columns in the input matrix.
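The construction of the input matrix can be sketched as follows, assuming NumPy and two aligned arrays of seed vectors. The function name and toy vectors are hypothetical, and the sketch stacks the half vectors as rows rather than columns, which yields the same principal direction.

```python
import numpy as np


def bias_subspace(X, Y):
    """Return the top principal component and its explained variance ratio.

    X and Y have shape (n_pairs, dim); row i of X is paired with row i of Y.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    pair_means = (X + Y) / 2.0
    # Two half vectors per pair: from the pair mean to each seed vector.
    halves = np.vstack([X - pair_means, Y - pair_means])
    # Each pair contributes +d/2 and -d/2, so the matrix is already centered
    # and its SVD gives the principal components directly.
    _, singular_values, components = np.linalg.svd(halves, full_matrices=False)
    variances = singular_values ** 2
    return components[0], variances[0] / variances.sum()


# Toy seed vectors (hypothetical): every pair differs only along the first axis.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
Y = [[-1.0, 0.0, 0.0], [-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]

direction, ratio = bias_subspace(X, Y)
print(direction, ratio)
```

When the pair differences are well approximated by a single constant vector, as in this toy case, the explained variance ratio of the first component approaches 1.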
4.2.3 Seed Features Impact Bias Measurements
Reductive Definitions. The seeds can be reductive and essentializing, codi-
fying life experiences into traditional categories. Using names as placeholders
for concepts like race [Nguyen et al., 2014, Sen and Wasow, 2016] or reducing
gender to a binary with two extremes [Bolukbasi et al., 2016b, Caliskan et al.,
2017] can create a distorted view of the source data. Sometimes these are sim-
plifying assumptions, made in an effort to measure biases that would otherwise
go unexamined. However, these decisions run the risk of further entrenching
these category definitions—e.g., see discussions in Keyes [2017], Larson [2017]
for the mistakes and harms that can be caused by mapping names to genders—
and these trade-offs should be evaluated and documented. More broadly, re-
cent work has critiqued NLP and ML bias research for not successfully con-
necting with the literature in sociology and critical race studies [Hanna et al.,
2020, Blodgett et al., 2020]. Engaging with this literature would provide a better
foundation for decision-making about seed sets and provide context for future
researchers.
Imprecise Definitions. If the target concept is not well-defined, the resulting
seed terms can be too broad and include multiple concepts, risking the creation
of confounded or circular arguments. Similarly, the unexamined use of pre-
existing sets and over-reliance on the category labels from prior work can result
in a series of errors. The seeds can contain confounding terms (e.g., in Table 4.4,
unpleasant contains “cancer” which in some datasets might be more prevalent
for certain demographic groups) or terms from the target group (e.g., domestic
work includes the gendered terms “mom” and “mum”). Similarly, the seeds
can manifest cultural stigmas: for example, including “fat” and “wrinkled” in
an ugliness category [Fast et al., 2016] results in a seed set that itself contains
stereotypes.
These stigmas are harmful and can interact with other demographic fea-
tures like gender or age [Puhl and Heuer, 2009], and unless their inclusion is
intentional, they can accidentally inflate measurements towards certain groups.
Rather than probing for an ugliness subspace, the social stereotypes could force
an unintended age-based comparison. Predicting all such errors is impossible,
and there can be cases where researchers intentionally include such terms (e.g.,
to capture a particular stereotype)—but this should be a conscious decision by
each researcher using the seeds, and at a minimum, researchers should clearly
define their target concepts.
Lexical Factors. Prior work examining seeds has shown that the frequency
and part of speech of seeds can affect the resulting bias measurements. Etha-
yarajh et al. [2019] show that the WEAT test requires that the paired seeds occur
at similar frequencies and that seed sets can be manipulated to produce certain
measurements. Brunet et al. [2019] explore the effects of perturbing the training
corpus, finding that (1) second-order neighbors to the seeds can have a strong
impact on the bias measurement effect size and (2) effects are stronger for rarer
words. Using contextual embeddings, Sedoc and Ungar [2019] show that differ-
ent classes of words (e.g., names vs. pronouns) can result in different bias sub-
spaces and that sometimes these subspaces represent an unintended dimension
(e.g., age instead of gender). Despite these documented sources of variation,
few seed sets are evaluated for frequency or word class.
Set Size and Alignment. The number of seeds included in each set can affect
the resulting bias subspace; Kozlowski et al. [2019] find small increases in per-
formance when using more seed pairs. The alignment of the seeds in matched
sets (i.e., the ordering or pairing of seeds in one set with seeds in another set)
can also affect the bias subspace. In the PCA method, each term in one seed
set is explicitly linked to a single term in the other seed set. The specific
alignment between paired words matters; altering the pairing can result in
dramatically different results, even for cases like gender, which is marked in
English.
However, we observe conscious pairings of seeds only for obvious cases, and
sometimes “obvious” pairings produce subspaces that explain less variance.
(a) Original pairs      (b) Random              (c) Shuffled pairs
herself       0.50      likelihood     0.36     outcomes      0.26
ms            0.49      eurozone       0.34     son           0.26
her           0.49      incentive      0.34     father        0.26
she           0.41      downturn       0.31     mother        0.26
pregnant      0.40      setback        0.30     aunt          0.25
pitching     -0.36      photographed  -0.39     potentially  -0.19
baseball     -0.36      tales         -0.41     male         -0.19
syndergaard  -0.38      hood          -0.42     hood         -0.29
himself      -0.39      garcia        -0.45     garcia       -0.29
his          -0.42      danced        -0.59     md           -0.39
Figure 4.4: Ranking word vectors by cosine similarity with the top principal
component vector for the original gender seed pairs (a) appears to identify
words representing men and women much better than random (b). But shuf-
fling the pairing of seed words (c) maintains correlation with gender but to a less
clear degree. Results are shown for the NYT corpus with a frequency threshold
of 100 and bootstrap resampling.
Figure 4.4 shows that when we used the ordered gender pairs, the ranked
words roughly divide into groups correlated with gender, while if we use shuffled
pairs, the lists of high and low ranked words are not as easily distinguishable
as masculine or feminine. We find the opposite effect for social class pairs
[Kozlowski et al., 2019]; when we shuffle, we find a subspace that explains more
variance than the explicitly ordered pairs (e.g., “richest”-“poorest”). We find
similar differences when testing some seed sets that lack intuitive pairings, e.g.,
the matched pleasantness and unpleasantness seeds [Caliskan et al., 2017] and the
matched Christianity and Islam seeds [Garg et al., 2018].
Order does not always affect the subspace (e.g., we found no significant
difference when shuffling sets of names), but we have shown that it can affect
the subspace, and so to build confidence in measurements, testing is required.
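One way to run such a test is to compare the explained variance of the original pairing against a distribution of randomly shuffled pairings. The sketch below assumes NumPy; the function names, toy vectors, and number of shuffles are hypothetical choices for illustration, not the procedure used to produce Figure 4.4.

```python
import numpy as np


def explained_variance(X, Y):
    """Share of variance captured by the top component of the pair half vectors."""
    half_diffs = (np.asarray(X, dtype=float) - np.asarray(Y, dtype=float)) / 2.0
    # Each pair contributes +d/2 and -d/2, so the stacked matrix is centered.
    s = np.linalg.svd(np.vstack([half_diffs, -half_diffs]), compute_uv=False)
    return float(s[0] ** 2 / np.sum(s ** 2))


def shuffled_pairing_scores(X, Y, n_shuffles=100, seed=0):
    """Explained variance for random re-pairings of the rows of Y with X."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    return [explained_variance(X, Y[rng.permutation(len(Y))]) for _ in range(n_shuffles)]


# Toy seed vectors (hypothetical): aligned pairs differ only along the first axis.
X = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 0.0, 2.0]]
Y = [[-1.0, 0.0, 0.0], [-1.0, 2.0, 0.0], [-1.0, 0.0, 2.0]]

ordered = explained_variance(X, Y)
shuffled = shuffled_pairing_scores(X, Y)
print(ordered, min(shuffled), max(shuffled))
```

If the ordered score falls well inside the range of shuffled scores, the pairing carries little information; if it falls outside, in either direction, the alignment is shaping the subspace and should be documented.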
Coherence | Generated Seed Set A | Generated Seed Set B
1.000 | distinctions, similarities, friction, parallels, similarity | murder, rape, manslaughter, felony, assault
1.000 | mile, miles, yards, yard, feet | example, instance, purposes, explanation, shorthand
1.000 | shop, restaurant, kitchen, cafe, store | sports, soccer, football, competitions, basketball
... | ... | ...
0.711 | ambush, bombardment, escalation, altercation, militiamen | corruption, terrorism, graft, bribery, abuses
0.689 | entrance, terrace, subway, cafe, lawn | courtside, bamboo, freeway, shorts, sailboat
0.552 | sticks, onions, tops, banana, mozzarella | potatoes, onions, lemon, herbs, meats

Coherence | Gathered Seed Set A | Gathered Seed Set B
0.933 | CAREER: executive, management, professional... | FAMILY: home, parents, children, family, cousins...
0.910 | ASIAN: asian, asian, asian, asia, china... | CAUCASIAN: caucasian, caucasian, white, america...
0.909 | FEMALE: sister, mother, aunt, grandmother... | MALE: brother, father, uncle, grandfather, son...
... | ... | ...
0.375 | FEMALE: countrywoman, sororal, witches... | MALE: countryman, fraternal, wizards, manservant...
0.110 | NAMES ASIAN: cho, wong, tang, huang, chu... | NAMES CHINESE: chung, liu, wong, huang, ng...
0.050 | NAMES BLACK: harris, robinson, howard... | NAMES WHITE: harris, nelson, robinson...
Table 4.6: When two seed sets are more semantically distinct they are more dis-
tinguishable in the resulting geometric subspace. The top table shows pairs of
artificially generated seed sets, ranked by their coherence for WEAT in the NYT
dataset. The bottom table shows pairs of seed sets gathered from published pa-
pers, ranked by their coherence for WEAT in the WikiText dataset. Scores are
averaged across 20 bootstrapped samples of the training data, and values are
rounded; no coherence scores are exactly 1.0. Higher coherence scores indicate
that the seed pairs were projected farther apart in the bias subspace.
Set Similarity. By sampling random seed sets we find that it is more diffi-
cult to represent the variance of seed sets that are too close together. Figure
4.5 shows that set similarity (cosine similarity between the set mean vectors)
is significantly correlated with explained variance for generated sets (Pearson
r = −0.67, p < 0.05). We highlight two comparisons between gathered sets
intended to measure racial bias that explain different degrees of variance. Syn-
thetic pairings generally explain more variance than pairings of gathered sets
of equal similarity, although for gathered sets we cannot control for POS and
frequency. Table 4.6 shows the generated seed sets ranked by coherence, where
higher scores indicate that the bias subspace was able to separate the seed sets.
Similar seed sets and sets with duplicates (e.g., the pairing in the table in which
both generated sets contain food terms) have low coherence scores.
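The set similarity measure used here (cosine similarity between the set mean vectors) can be sketched in a few lines. The function names and toy vectors below are hypothetical and only illustrate the calculation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def set_similarity(set_a, set_b):
    """Cosine similarity between the mean vectors of two seed sets."""
    mean_a, mean_b = mean_vector(set_a), mean_vector(set_b)
    dot = sum(a * b for a, b in zip(mean_a, mean_b))
    norm_a = math.sqrt(sum(a * a for a in mean_a))
    norm_b = math.sqrt(sum(b * b for b in mean_b))
    return dot / (norm_a * norm_b)


# Toy seed vectors (hypothetical, not from a trained embedding).
set_a = [(1.0, 0.0), (1.0, 0.2)]
set_b = [(0.0, 1.0), (0.2, 1.0)]

print(set_similarity(set_a, set_b))
```

Computing this value for each candidate pair of seed sets before measuring bias gives an early warning that two sets may be too close together to separate.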
[Figure 4.5 plot: explained variance (y-axis) against set similarity (x-axis) for generated and gathered seed sets. Highlighted gathered pairs: Black vs White Roles and Black vs White Names.]
Figure 4.5: Identifying bias is less effective when set pairs are similar. Generated
seeds are frequency-controlled nouns from the WikiText dataset. We highlight
two sets of gathered seeds; both target similar racial categories but the name-
based sets are more similar and explain less variance. We find similar trends for
WEAT, coherence, and the other corpora and POS.
4.2.4 Discussion and Impact: Biases All the Way Down
Almost all recent work on bias measurement relies on sets of seed terms to
ground cultural concepts in language. These tools are often used to support
urgent appeals for fairness and accountability in machine learning systems. If
we do not pay attention to the seeds, these methods will lack foundation and
the claims they support will be left open to criticism and dismissal. Seeds and
their rationales need to be tested and documented, rather than hidden in code or copied
without examination.
Some of the risks discussed here may seem obvious in retrospect, but our lit-
erature survey suggests there are widely varying levels of evaluation and doc-
umentation in recent published work. Rationales for picking sources or seeds
are not always explained, or the reader is left to assume that prior work has ad-
equately validated the seeds. Tests for frequency, semantic similarity, and other
features are rare or non-existent, and clear definitions and discussion of limi-
tations are often missing. Permutation tests are sometimes used (e.g., Caliskan
et al. [2017]), but these do not account for seeds outside of those already
selected. Significantly different results can be found using alternative seed sets
for the same target concept, and fine-grained comparisons require validation on
multiple sets.
We faced a number of challenges in gathering 178 seed sets from prior work.
Sometimes seeds are shared online at an undocumented location and sometimes
hard-coded into code repositories; this can significantly obscure the seeds from
public view, which is troubling for tools intended for wide use on sensitive top-
ics. Documentation is often scattered across locations, and in more than one
case, we found contradictions between different sources for a single project. In
one case, we were unable to find the full list of seeds used in the paper, and in
several cases, it was unclear which seed sets were used for which experiments.
While some authors went to commendable lengths to document their materials,
there is a need for more consistent and transparent documentation.
We recommend that researchers carefully trace the origins of seed sets, with
attention to the risks associated with the origin type. We also recommend that
researchers examine seed features. POS, frequency, semantic similarity, and
pairing order can significantly affect the results of bias measurements. Seeds
should be both examined manually and tested; importantly, they should be
compared to alternative seeds with different attributes. To assist with this, we release
a compilation of 178 seed sets from prior work. These tests are particularly im-
portant when comparing biases across datasets. Finally, researchers should doc-
ument all seeds and the rationales underlying their design, including concept
definitions. We add to recent calls for better documentation and problem spec-
ification in machine learning [Bender and Friedman, 2018, Gebru et al., 2018,
Mitchell et al., 2019, Blodgett et al., 2020] and in studies of social biases in tech-
nology [Olteanu et al., 2019]. Specifically, when the seeds intentionally encode
harmful stereotypes or slurs, it can be beneficial to include a trigger warning or
not highlight the seeds in the paper; however, full seed lists should always be
accessible, not hard-coded, with unique labels matched to experiments.
Ultimately, our goal is not to eliminate a problem but to illuminate it:1 to help
practitioners think through the potential risks posed by seed sets used for bias
detection. We encourage thoughtful, critical studies, but we observe a trend in
which seed sets are used in new research and applications simply because they
have been used in prior published work, without additional vetting. Research
precedents can take on a life of their own and we have a responsibility to explore
and document possible sources of error. We believe that seed sets can be useful
and are probably unavoidable, but that no technical tool can absolve researchers
from the duty to choose seeds carefully and intentionally.
1“All problems can be illuminated; not all problems can be solved.” – Ursula Franklin
(quoted by M. Meredith via Olteanu et al. [2019] in http://bb9.berlinbiennale.de/
all-problems-can-be-illuminated-not-all-problems-can-be-solved/)
CHAPTER 5
PERSONAL HEALTHCARE EXPERIENCES
Healthcare systems in the U.S. face a range of important challenges, includ-
ing providing equitable care, addressing physician burnout, and discovering
causes and cures for understudied conditions. NLP methods can contribute to
addressing these challenges by measuring statistical patterns across large collec-
tions of text data, e.g., by extracting and linking medical entities like medication
names or by generating or parsing EHR notes. Much healthcare research in NLP
focuses on either EHR data or biomedical academic publications, leaving out
patients’ direct narration of their own experiences (as well as the experiences of
those who experience health issues but are not patients). If we could statistically
harness the emotional and narrative dimensions of personal healthcare stories,
then we could better identify benefits and harms to patients.
In a study of an online healthcare support community, I modeled narrative
patterns and power hierarchies in birth stories [Antoniak et al., 2019]. These
stories share interactions, emotions, decision-making, and deeply personal re-
flections and reframings of a medical experience that can sometimes be trauma-
tizing. Using topic modeling to discover themes with probabilistic connections,
I found diverging narrative pathways (medicalized and unmedicalized) as well
as outlier stories, whose event sequences are unlikely in this community and
which tend to be framed by the authors as “traumatic” and “unplanned” but
with “happy endings.” By parsing the stories and using a lexicon of verbs anno-
tated with directional power labels, I found that the authors frame themselves
as having the least amount of power, except for the baby, and frame the midwife
and doula, who often function as advocates for the pregnant person, as having
very high levels of power. This highlights the complicated social history and
legal standing of doulas in the U.S. healthcare system. I discuss this case study
in more detail below.
5.1 Healthcare Datasets for Natural Language Processing
For healthcare research that relies on large quantities of data, a constant chal-
lenge is obtaining training and test data that matches the intended use case and
domain, properly accounts for data ethics and privacy constraints, and does not
incorrectly privilege one viewpoint (e.g., medical professionals) over another
(e.g., patients).
Electronic health records (EHR). Much NLP research for healthcare focuses
on electronic health records (EHR) or electronic medical records (EMR) data
[Demner-Fushman et al., 2020, Rumshisky et al., 2020]. These patient-specific
records include information about medications, billing, laboratory tests, vital
signs, study reports, procedures, and free text notes describing a patient and/or
appointment. While a few large EHR datasets, like MIMIC-III [Johnson et al.,
2016], are accessible to researchers, these datasets represent snapshots of specific
hospitals, locations, and people. They also lead to increased research attention
for extracting structured data based on idiosyncrasies in the EHR format, e.g.,
extracting ICD codes [Zhang et al., 2020]. While these are important records of
medical appointments and other interactions between clinicians and patients,
they are told exclusively through the clinician’s point of view.
Biomedical research publications. Another popular healthcare data source is
biomedical research publications [Demner-Fushman et al., 2020]. For example,
the largest category of papers included in the Semantic Scholar Open Research
Corpus (S2ORC) dataset is medical studies [Lo et al., 2020]; the CORD-19 dataset
includes papers related to COVID-19 from PubMed, the World Health Orga-
nization, bioRxiv and medRxiv [Wang et al., 2020b]; and Percha and Altman
[2018] annotate a dataset of Medline abstracts. These academic datasets give rise
to their own specific set of tasks, mostly aimed at building knowledge graphs
from the included papers [Percha and Altman, 2018]. These datasets are par-
ticularly useful for drawing connections between seemingly unrelated research
topics, but like EHR datasets, they leave out the patient’s voice.
Online healthcare support communities. Online health communities (e.g., fo-
rums focused on supporting people with a particular health condition) allow
patients to give and receive emotional and informational support [Yang et al.,
2019a,b] and to share expressive writing about sometimes traumatic experiences
[Ma et al., 2017a]. These communities range in topic from men’s infertility
[Patel et al., 2019] to cancer [Yang et al., 2019b] to mental health [Chancellor
et al., 2019]. Compared to social networks or real-life support groups and con-
versations, disclosure on a public online forum can feel more private due to
the anonymity of the users, physical privacy of the users (emotional reactions
can be hidden), and a sympathetic, knowledgeable audience [Gold et al., 2012].
These communities allow people with similar healthcare concerns to exchange
advice and research, share stories and experiences, and organize and advocate
for themselves.
5.2 Case Study: Online Childbirth Narratives
Birth stories are narratives of real experiences giving birth, often written with
great medical and emotional detail and (in recent years) publicly posted on fo-
rums, blogs, and video-sharing websites. Birth stories are interesting from both
a computational and a healthcare perspective.
On the computational side, birth stories are an ideal test set for narrative
analysis, a task which frequently suffers from a lack of datasets that are both
realistic and not overly challenging. While no two birth stories are the same,
most stories include common sequences of events and common sets of personas,
making them ideal testbeds for modeling narratives.
On the healthcare side, the motivations behind writing birth stories are com-
plex. Possible motivators include writing as a form of self-tracking and moni-
toring [Epstein et al., 2017], asserting agency and disrupting cultural norms and
society’s surveillance of pregnant people [Tangherlini, 2000], and distrust of a
medical profession that is biased in responding to women’s self-reported pain
[Hoffmann and Tarzian, 2001, Chen et al., 2008]. These suggest several lines of
inquiry, including measurement of portrayed power of different actors in the
birth stories.
5.2.1 Data Curation
We collect 2,847 birth stories from the social website Reddit. While birth sto-
ries exist in many venues and forms, we choose to focus on Reddit for its
accessibility and well-studied communities. These stories were posted publicly
from February 16th, 2011 to February 28th, 2018 on the subreddit (forum)
r/BabyBumps (all data available up to the date of collection). r/BabyBumps is a fo-
rum intended to be a “place for pregnant redditors, those who have been preg-
nant, those who wish to be in the future, and anyone who supports them.” This
community includes a wide range of posts related to pregnancy and birth, in-
cluding humor, requests for advice, rants and venting, recommendations, and
journaling posts (e.g., bump and ultrasound photos, summaries of doctor ap-
pointments, birth announcements and stories). The community rules explicitly
instruct members to post detailed birth stories rather than only photos and one
line descriptions.
We perform two rounds of filtering: first, we select the posts that contain the
n-gram “birth story” in the title, and second, we remove 348 posts that contain
fewer than 500 words. We remove these short posts as a second step of data
cleaning as many of these shorter posts are either not birth stories or are only
parts of birth stories published in installments. We do not include comments,
upvotes, or other interactions in our analysis; only the parent posts, containing
the stories, are included.
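The two filtering rounds can be sketched as below. The post representation (dictionaries with `title` and `text` keys), the case-insensitive title match, and whitespace tokenization for the word count are assumptions made for illustration; the toy posts are hypothetical.

```python
def filter_posts(posts, min_words=500):
    """Keep posts titled as birth stories that meet the minimum word count."""
    # Round 1: the title must contain the n-gram "birth story".
    titled = [post for post in posts if "birth story" in post["title"].lower()]
    # Round 2: drop posts shorter than the word cutoff.
    return [post for post in titled if len(post["text"].split()) >= min_words]


# Toy posts (hypothetical): only the first passes both rounds.
posts = [
    {"title": "My birth story (long)", "text": "word " * 600},
    {"title": "Birth story part 1", "text": "word " * 120},
    {"title": "Bump photo", "text": "word " * 700},
]

kept = filter_posts(posts)
print([post["title"] for post in kept])
```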
This set of stories constitutes a small sample of birth stories posted online,
and an even smaller sample of all birth stories, told or untold. Because the sto-
ries were posted anonymously to a forum, we do not have demographic data for
the authors of the stories. All stories are written in English, the majority appear
to take place in western and developed countries, and the authors have access
to the internet and to Reddit. As a first research foray into the computational
study of birth stories, we expect not that the patterns observed in r/BabyBumps
will generalize to all other birth stories, but that we will provide evidence of the
value and research interest of all birth stories.
5.2.2 Narrative Analysis
The stories range in length from a minimum of 500 words (our selected cutoff)
to a maximum of 6,057 words. Despite this difference in word count, the stories
usually begin with the same events (arrival of the due date, water breaking,
contractions starting) and end with the same events (birth and weighing of the
baby, breastfeeding, leaving the hospital), though a few outliers break off in the
middle of the story (e.g., sharing the story in installments). In order to compare
these sequences of events across all the stories in the dataset, we divide each
story into ten equal sections, and we then use these sections to calculate statis-
tics of interest averaged over all the stories for the corresponding section. We
refer to these sequences of normalized sections as story time.
We find that simple methods are sufficient to identify readily interpretable
events and event sequences, or scripts. We train a latent Dirichlet allocation
(LDA) [Blei et al., 2003] topic model with 50 topics on the birth stories collection,
using 100-word chunks as the training documents. We then divide each story
into 10 equal segments and plot the distributions of the topics over the resulting
normalized story time.
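The normalization into story time can be sketched as follows. The helper names, whitespace tokenization, and toy stories are assumptions; the example averages a simple per-segment count (mentions of a persona n-gram) rather than topic probabilities, which would additionally require a trained topic model.

```python
def split_into_segments(tokens, n_segments=10):
    """Split a token list into n_segments contiguous, near-equal parts."""
    bounds = [i * len(tokens) // n_segments for i in range(n_segments + 1)]
    return [tokens[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def chunk_tokens(tokens, size=100):
    """Fixed-size chunks, e.g. 100-word training documents for a topic model."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def mean_mentions_by_story_time(stories, ngrams, n_segments=10):
    """Average number of matching tokens per story-time segment across stories."""
    totals = [0] * n_segments
    for story in stories:
        tokens = story.lower().split()
        for index, segment in enumerate(split_into_segments(tokens, n_segments)):
            totals[index] += sum(1 for token in segment if token in ngrams)
    return [total / len(stories) for total in totals]


# Toy stories (hypothetical): 100 tokens each, so every segment has 10 tokens.
stories = [
    " ".join(["nurse"] * 5 + ["filler"] * 95),
    " ".join(["filler"] * 90 + ["nurse"] * 10),
]

print(mean_mentions_by_story_time(stories, {"nurse"}))
```

Because every story is mapped onto the same ten segments, statistics from a 500-word story and a 6,000-word story can be averaged position by position.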
To establish additional validity, we can compare these topics to descriptions
of the birth process from health care providers such as the Mayo Clinic.1 We
calculate the probability of transitioning between topics by finding the most
probable topics for each segment of text, counting the number of times each
pair of topics occurs in neighboring text segments, and normalizing by the total
number of transitions. We compare the learned topic transitions to the Mayo
Clinic’s document, which describes a “normal” birth path with few deviations.
1https://www.mayoclinic.org/healthy-lifestyle/labor-and-delivery/in-depth/stages-of-labor/art-20046545
[Figure 5.1 plots: topic probability (y-axis) over story time (x-axis) for eight topics: “sleep night hours rest slept”; “water broke fluid break broken”; “hospital home car bag drive”; “pitocin contractions started hours start”; “epidural pain relief anesthesiologist meds”; “push pushing head pushed pushes”; “cord crying chest cry cut”; “breastfeeding day milk days feeding”.]
Figure 5.1: A selection of topics over time. Plots are labeled with the five
highest probability words for each topic. Results show the probability for each
topic at 10% intervals of story time, averaged across all stories. Error bars
show standard deviation across bootstrapped samples of stories.
[Figure 5.2 plots: persona frequency (y-axis) over story time (x-axis) for the personas nurse, doula, midwife, doctor, baby, anesthesiologist, we, and author.]
Figure 5.2: Histograms showing the frequencies of persona mentions over story
time. Some entities (e.g., author) are consistently more frequent than rare
entities (e.g., doula). Some frequency patterns are expected while others are
surprising (e.g., frequency of we decreases near the middle of the stories).
Using the learned transition probabilities between topics, we identify outlier
stories containing less probable transitions. We rank the stories according to the
summed log probabilities of the transitions of the last five 100 word chunks (the
same size used for the topic model training). We limit the number of transitions
in each story to five to control for the varying lengths of the stories, and we
choose the last five transitions under the hypothesis that the story endings are
more likely to display variation in sequences of events than the story beginnings
(e.g., stories that end in emergency surgeries or unexpected trips to the hospi-
tal). For each bigram in the post titles (where authors frequently include tags
or labels that frame and situate the story within the community), we average
the log probabilities for all stories that include that bigram in the post title. This
allows us to measure the correlation between the outlier stories and the authors’
framing of the stories.
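The transition counting and outlier scoring described above can be sketched as follows, assuming each story has already been reduced to a sequence of most probable topics, one per chunk. The function names, the toy topic labels and titles, and the small probability floor for unseen transitions are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def transition_probabilities(topic_sequences):
    """Count neighboring topic pairs and normalize by the total number of transitions."""
    counts = Counter()
    for sequence in topic_sequences:
        counts.update(zip(sequence, sequence[1:]))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def story_log_probability(sequence, probabilities, n_last=5, floor=1e-6):
    """Sum of log probabilities over the last n_last transitions of a story.

    The floor for unseen transitions is an assumption of this sketch.
    """
    transitions = list(zip(sequence, sequence[1:]))[-n_last:]
    return sum(math.log(probabilities.get(pair, floor)) for pair in transitions)


def mean_score_by_title_bigram(titles, scores):
    """Average story score for every bigram that appears in a post title."""
    grouped = defaultdict(list)
    for title, score in zip(titles, scores):
        words = title.lower().split()
        for bigram in set(zip(words, words[1:])):
            grouped[" ".join(bigram)].append(score)
    return {bigram: sum(values) / len(values) for bigram, values in grouped.items()}


# Toy topic sequences and titles (hypothetical).
sequences = [
    ["water", "contractions", "push", "cord"],
    ["water", "contractions", "push", "cord"],
    ["water", "contractions", "csection", "cord"],
]
titles = [
    "positive hospital birth story",
    "positive hospital birth story",
    "unplanned c section birth story",
]

probabilities = transition_probabilities(sequences)
scores = [story_log_probability(sequence, probabilities) for sequence in sequences]
by_bigram = mean_score_by_title_bigram(titles, scores)
print(by_bigram["positive hospital"], by_bigram["unplanned c"])
```

Stories with lower summed log probabilities contain rarer transitions, so title bigrams with low averages mark the framings that authors attach to unusual event sequences.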
Figure 5.3: Flowchart of the most probable topic transitions (above 0.2%). We
removed one orphan node without a parent path leading to the beginning of
story (BOS) state.
The discovered sequences of events, persona patterns, and story likelihoods
describe the narrative norms and expectations of the r/BabyBumps community.
These norms are informed by the biological process of birthing and the medical
organization and standardizing of procedures, but the specific sequences of
events that the authors in r/BabyBumps choose to highlight also arise from this
online health community’s particular expectations and priorities.
By calculating the transition probabilities between the topics, we construct
a flowchart that maps the community’s understanding of event sequences. A
topic about cutting the cord is often immediately followed by a topic about birth
weight and length. A set of diverging pathways emerges in the flowchart, in which
an unmedicalized set of events is mirrored by its medicalized version (e.g., in-
stead of a topic about pushing, a topic about c-sections). While many of these
events are familiar from official medical documentation, some of the events
(e.g., packing and going to the hospital, a meta-topic about birth stories) are
particular to the community and the authors’ points of view. Finally, we use
these common narrative arcs to identify stories that break the mold with less
probable topic sequences. These stories tend to be labeled by the authors as
“traumatic” and “unplanned” yet are also often labeled with the term “happy
ending.”
Highest Probability | Bigram | Lowest Probability | Bigram
-34.19 | positive medicated | -36.05 | story warning
-34.27 | positive hospital | -36.11 | unplanned c
-34.30 | med free | -36.13 | slightly traumatic
-34.52 | positive induction | -36.27 | natural birth
-34.53 | story ftm | -36.40 | belated birth
-34.73 | vaginal delivery | -36.42 | positive unmedicated
-34.77 | story hospital | -36.42 | emergency c
-34.83 | weeks pp | -36.53 | trigger warning
-34.85 | hour labor | -36.60 | induction epidural
-34.88 | super long | -36.84 | happy ending
Table 5.1: The bigrams drawn from the post titles associated with the most and
least probable stories. Probabilities represent the means of the summed log
probabilities of the last ten topic transitions in a story. Lower scores indicate
stories with more unusual topic transitions (sequences of events). Results are
averaged (mean) across bootstrapped samples of the stories.
The strengths of individual topics over time track with our expectations of
the stories, confirming that the dataset contains shared structure that is
easily detectable. For example, we can see in Figure 5.1 that water breaking
happens near the beginning of the story; a topic about contractions starting and/or
pitocin (a medication that induces labor) being administered peaks around 30%
into the story; there is little sleep in the middle of the story; photos are shared
sometimes at the beginning of the story but usually near the end of the story.
We note that the model was trained with no information about word position
within stories, but topics nevertheless exhibit strong temporal clustering.
5.2.3 Framing of Power
To examine the framing of power hierarchies in these stories, we employ a lexicon
of verbs annotated with directional power labels [Sap et al., 2017]. By
parsing the stories, we can identify each persona as the subject or object of these
annotated verbs and calculate an average power score both for individuals and
for dyads.
Persona            N-Grams                                                  Total Mentions   Stories Containing Mentions   Average Mentions Per Story
AUTHOR             I, me, myself                                            210,795          2,846                         74.0
WE                 we, us, ourselves                                        24,757           2,764                         8.7
BABY               baby, son, daughter                                      14,309           2,668                         5.0
DOCTOR             doctor, dr, doc, ob, obgyn, gynecologist, physician      10,025           2,262                         3.5
PARTNER            partner, husband, wife                                   8,998            2,006                         3.2
NURSE              nurse                                                    7,080            2,012                         2.5
MIDWIFE            midwife                                                  4,069            886                           1.4
FAMILY             mom, dad, mother, father, brother, sister                3,490            1,365                         1.2
ANESTHESIOLOGIST   anesthesiologist                                         1,398            876                           0.5
DOULA              doula                                                    896              256                           0.3
Table 5.2: Personas identified in the birth stories collection and the n-grams used
to classify the personas.
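The persona matching and power scoring described above can be sketched roughly as follows. This is a simplified illustration, not the pipeline used in the study: it assumes that a dependency parser has already produced (subject, verb lemma, object) triples, it scores each mention as +1 or -1, and it uses a tiny hypothetical lexicon (`TOY_POWER_LEXICON`) rather than the actual entries of the Sap et al. [2017] lexicon. The function names are likewise hypothetical.

```python
from collections import defaultdict

# Persona n-grams copied from Table 5.2 (a subset, for brevity).
PERSONA_NGRAMS = {
    "AUTHOR": {"i", "me", "myself"},
    "BABY": {"baby", "son", "daughter"},
    "DOCTOR": {"doctor", "dr", "doc", "ob", "obgyn", "gynecologist", "physician"},
    "NURSE": {"nurse"},
    "DOULA": {"doula"},
}

# Hypothetical toy lexicon: which grammatical role holds power for each verb.
# "agent" means the subject is framed as powerful; "theme" means the object is.
TOY_POWER_LEXICON = {
    "check": "agent",
    "decide": "agent",
    "ask": "theme",
    "need": "theme",
}


def persona_of(token):
    """Map a token to a persona label, or None if no n-gram matches."""
    word = token.lower()
    for persona, ngrams in PERSONA_NGRAMS.items():
        if word in ngrams:
            return persona
    return None


def power_scores(triples, lexicon=TOY_POWER_LEXICON):
    """Average the +1/-1 power contributions for each persona."""
    totals, counts = defaultdict(float), defaultdict(int)
    for subj, verb, obj in triples:
        holder = lexicon.get(verb)
        if holder is None:
            continue  # verb is not annotated in the lexicon
        for token, role in ((subj, "agent"), (obj, "theme")):
            persona = persona_of(token) if token else None
            if persona is None:
                continue
            totals[persona] += 1.0 if role == holder else -1.0
            counts[persona] += 1
    return {persona: totals[persona] / counts[persona] for persona in counts}


triples = [
    ("nurse", "check", "me"),    # nurse +1, author -1
    ("I", "ask", "nurse"),       # author -1, nurse +1
    ("doctor", "decide", None),  # doctor +1
    ("I", "need", "doula"),      # author -1, doula +1
]
scores = power_scores(triples)
```

Scores for pairs of personas could be computed the same way by restricting the triples to those in which both the subject and the object match a persona.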
As expected, the baby is very often framed as the least powerful person in
these stories while the medical professionals (nurse, doctor, anesthesiologist)
are framed as powerful. More surprisingly, the authors frame themselves as
having the least amount of power, except for the baby, and frame the midwife
and doula, who often function as advocates for the author, as having very high
levels of power. Examining pairs of personas, we find that the doula is the only
persona portrayed with more power when paired with the nurse, highlighting
the complicated social history and legal standing of doulas in the U.S. healthcare
system.
Author                   Baby                     Doctor                   Nurse                    Doula
get_subj        0.0404   do_subj         0.0129   call_obj        0.0760   check_subj      0.0499   call_obj        0.1412
know_subj       0.0219   want_obj        0.0113   check_subj      0.0271   call_obj        0.0382   suggest_subj    0.0357
start_subj      0.0215   get_subj        0.0110   decide_subj     0.0221   ask_obj         0.0272   get_subj        0.0306
do_subj         0.0199   start_subj      0.0081   get_subj        0.0159   give_subj       0.0171   hold_subj       0.0144
push_subj       0.0123   drop_subj       0.0072   do_subj         0.0145   get_subj        0.0149   keep_subj       0.0128
make_subj       0.0070   make_subj       0.0063   show_subj       0.0116   keep_subj       0.0147   show_subj       0.0122
wake_subj       0.0069   turn_subj       0.0055   break_subj      0.0105   hold_subj       0.0116   remind_subj     0.0117
keep_subj       0.0065   show_subj       0.0027   give_subj       0.0091   start_subj      0.0113   give_subj       0.0101
decide_subj     0.0059   enjoy_obj       0.0022   start_subj      0.0091   do_subj         0.0092   start_subj      0.0090
end_subj        0.0038   decide_subj     0.0021   make_subj       0.0088   bring_subj      0.0091   make_subj       0.0080
get_obj        -0.0028   give_obj       -0.0053   wait_subj      -0.0014   leave_obj      -0.0016   massage_subj   -0.0022
believe_subj   -0.0031   catch_obj      -0.0057   like_subj      -0.0014   believe_subj   -0.0018   cheer_subj     -0.0024
wait_subj      -0.0034   need_subj      -0.0072   mention_subj   -0.0026   bring_obj      -0.0018   fill_subj      -0.0025
lose_subj      -0.0035   feed_obj       -0.0076   reach_subj     -0.0029   mention_subj   -0.0023   recognize_subj -0.0027
hear_subj      -0.0057   bring_obj      -0.0089   explain_subj   -0.0036   explain_subj   -0.0048   alarm_obj      -0.0028
call_subj      -0.0069   put_obj        -0.0113   offer_subj     -0.0039   offer_subj     -0.0052   warn_obj       -0.0028
ask_subj       -0.0097   deliver_obj    -0.0196   call_subj      -0.0089   want_subj      -0.0060   join_subj      -0.0057
need_subj      -0.0117   push_obj       -0.0240   ask_subj       -0.0133   call_subj      -0.0174   get_obj        -0.0141
check_obj      -0.0137   hold_obj       -0.0306   get_obj        -0.0136   get_obj        -0.0195   ask_subj       -0.0171
want_subj      -0.0266   get_obj        -0.0369   want_subj      -0.0152   ask_subj       -0.0238   hire_obj       -0.0438
Figure 5.4: Most frequent verbs from the power lexicon associated with each
persona in the birth stories corpus. Green indicates a positive power contri-
bution, while pink indicates a negative power contribution. The cell values
indicate the proportion of persona mentions with the given verb and power
relationship.
[Figure 5.5 graphic: panel (a) plots Power (x-axis, ticks from −0.1 to 0.3) for each persona (baby, author, partner, anesthes., nurse, family, doctor, we, doula, midwife); panel (b) is a heatmap with the same personas as rows and columns and a color scale from −0.8 to 0.8.]
Figure 5.5: (a) Power scores for each persona. Error bars show standard devi-
ation over 20 bootstrap samples of the collection. (b) Estimated power of per-
sonas (rows) over other personas (columns). The NURSE is consistently framed
as more powerful than the other personas, except for the DOULA.
5.2.4 Ethical Considerations
The dataset of unsolicited birth stories is a valuable resource for analysis of an
online health community, and we encourage further work on this and other
medical narrative datasets which prioritize the patient’s voice. Many birth sto-
ries recount experiences in which the pregnant person felt that they were not
empowered, and drawing attention to these voices can benefit the broader com-
munity of pregnant people. However, we caution practitioners to handle this
data with care. Research value, even for the community of pregnant people,
and reproducibility must be balanced against the specific ethical concerns sur-
rounding publicly shared medical data [Janssens and Kraft, 2012, Vayena et al.,
2015, Abbott et al., 2019]. We identify a series of tensions inspired by both prior
work on ethical use of online medical data and the three guiding principles of
the Belmont Report, as discussed in Chapter 3.
While the stories posted to Reddit are already public, copying that data pre-
vents the authors from removing or editing their stories. Birth stories contain
extremely personal medical, interpersonal, and emotional information not just
about the author but also about the baby, who cannot consent to this public
sharing. While the authors posted their stories to a public venue, our own ex-
ploration of their motivations (e.g., resistance against surveillance, regaining
power) does not include providing material for researchers.
Due to the very sensitive nature of the dataset, we choose to not release ei-
ther the dataset or the URLs and instead prioritize the authors’ protection. To
support the replication of our results on other birth stories datasets, we release
our labeling pipeline (e.g., n-grams used for labeling, pre-processing steps). We
follow the recommendations in Bruckman [2002] and Yang et al. [2019b] to mask
the stories by only providing paraphrases rather than exact text snippets in all
of the examples highlighted in this paper, which minimizes the possible identi-
fication of and harm to the authors. These potential harms are weighed against
our hope that the results of our study will aid in the understanding of mod-
ern experiences of pregnancy and birth and spur further research that centers
the pregnant person’s voice. We also shared the final paper and a public-facing
blogpost with the r/BabyBumps community to support the principle of respect for
persons. We strongly recommend that researchers using this data follow these
same privacy-motivated guidelines.
5.2.5 Discussion and Impact
We view this study as a “close reading” via a computational lens of a specific
online community. This computational reading allows us to discover patterns
and outliers within our target community and also to prove the feasibility of
similar research on other communities that share medical stories. We do not
claim that the patterns discovered here will hold for all other collections of birth
stories; instead, we claim that such analysis can a) provide specific, statistical
evidence of patterns suggested or observed in prior work and b) prompt further
research in similar communities.
Narrative analysis is a challenging task in natural language processing,
partly because of the difficulty of creating datasets that are realistic but not too
complex for current models. We suggest that birth stories hit a sweet spot be-
tween formulaic, artificial datasets and complex, organic datasets; they are or-
ganically created (written spontaneously by the authors), share narrative struc-
ture and are constrained by topic despite each story being unique, are plentiful
enough to act as training data for machine learning models, and suggest real-
world motivations for their analysis. We successfully identify sentiment, topic
and person-based patterns that demonstrate the recoverable narrative qualities
of birth stories.
By uncovering this shared structure, we discovered events not described in
medical literature that are nevertheless important to the authors in this commu-
nity. We also use the learned topic patterns to discover outlier stories whose
sequences of events do not match those of the community’s expectations, and
we examine the framing of these stories. The authors of these stories emphasize
unexpected events, negative or triggering experiences, and “happy endings” in
their titling and framing of the posts. These results suggest that a lack of control
is associated with negative emotions, as found in Bylund [2005], and that in this
community, reframing of these unexpected events around “happy endings” is
common.
Many of our methods rely on averaging across bootstrapped sets of stories in
order to recover overarching themes in this community. We recognize that the
“average” story is not representative of every story in this community. We ex-
plore the question of outlier stories when we use the learned pattern of events
to identify stories whose sequences of events are unusual. The authors often
label these stories as unexpected, which confirms our interpretation of these out-
liers as stories that fall outside the expectations of the community. In this sense,
discovering archetypes through averaging is what allows us to discover by com-
parison stories in the minority.
For medical professionals, these results could help inform care decisions
and priorities. While doctors, nurses, and other medical professionals observe
hundreds of births every month, they do not observe these births through the
pregnant person’s eyes, and their interactions with the pregnant person are me-
diated by their power differential (the medical professional’s technical knowl-
edge, their place in the hierarchy of the hospital setting, their gender, race, age,
and education level, etc.). Birth stories allow these professionals to view a rou-
tine procedure through the fresh eyes of a person who is experiencing preg-
nancy and birth for perhaps the first time and to discover areas that could be
improved for the patient, like the events (e.g., traveling to and arriving at the
hospital) that are prominent in birth stories but not usually highlighted in medical
documentation of births. Postpartum depression can be alleviated through
attention to the patient’s emotional needs and feelings of agency during the
birth [Callister, 2004, Stewart and Vigod, 2016], and our work has highlighted
that these needs sometimes go unmet within this community.
While the experiences of all pregnant people are valuable, we particularly
highlight the importance of listening to underrepresented perspectives. While
we were not able to control for race, education level, or other demographic vari-
ables in this study, we hope that our results show that such computational anal-
ysis of birth stories is feasible and valuable. We have demonstrated that birth
stories can highlight events and framings missing from the dominant medical
narratives of birth.
CHAPTER 6
PERSONAL READING EXPERIENCES
Book reviewer values and audience expectations can be measured in online
reading communities, highlighting tensions between subjective reading expe-
riences, evolving community understandings, and societal judgments. In my
work, I have used a variety of unsupervised methods (linguistic and user similarity
metrics) to map an online social reading community’s understanding
of genres based on the language expressed in reviews and the free-text tags cre-
atively applied by individual community members.
Where prior genre-mapping methods relied on academics, critics, publish-
ers, and library catalogs to assign genres, my work on online reading commu-
nities focuses on the common reader. These readers and their tags have effects
in the offline world; for example, LibraryThing’s paid catalog, which is based on
users’ tag assignments, is used to organize physical bookstores and libraries.
Understanding the relationships between these users and tags will aid in un-
derstanding modern literary reception and popular views of genre.
6.1 Literary Reception and Online Book Reviews
The internet and social media have greatly increased the amount of available
evidence about readers and reading communities. Earlier research about read-
ers relied on sources such as archival materials (e.g., personal diaries), ethno-
graphies, and surveys [Radway, 1991, National Endowment for the Arts, 2004].
These sources typically offer rich data about a small number of readers or more
cursory data about a large number of readers, with little in between. Online social
reading websites such as LibraryThing and Goodreads, where readers publish
records of their thoughts in their own words and form social bonds with
other readers, offer invaluable resources for the study of readers and reading
communities.
Researchers in the fields of digital humanities and cultural analytics have
started to take advantage of online book ratings and reviews to study readers,
though they have mostly focused on Goodreads data. For example, the Stan-
ford Literary Lab uses Goodreads ratings as metrics for general book popularity
among readers [Porter, 2018]. Bourrier and Thelwall similarly use Goodreads
ratings and reviews to understand the contemporary reception of 19th-century
literature [Bourrier and Thelwall, 2020], while English et al. [2022] explore the
overlap between Goodreads users who read “popular” books and users who
read “prestigious” books.
A variety of predictive tasks have also been studied in the context of book
reviews. These include popularity prediction [Maity et al., 2018] and automatic
recommendation systems that incorporate user specialties [Wang et al., 2020a].
Resources such as the UCSD Book Graph, a dataset of scraped and labeled
Goodreads reviews and user data, are intended for the tasks of item recom-
mendation [Wan and McAuley, 2018] and spoiler detection [Wan et al., 2019].
Unlike these works, we use prediction only as a lens and not a tool to enforce
an ontology.
Closest to this study of literary genres in online communities is Hegel [2018],
which explored the influence of online reading communities like Goodreads on
popular perceptions of genres by comparing these perceptions to those of pro-
fessional literary critics. This analysis relied on classifying the themes in reviews
via a supervised classifier and comparing those themes across a small set of gen-
res, as well as comparing the collocations in which genre names were used in
reviews. Hegel [2018] found that “amateur” reviewers on Goodreads tend to organize
genres into more fine-grained categories than professional reviewers do, and
they also tend to use more personal and evaluative language and language of
legitimacy (“award-winning,” “greatest”) in their reviews — in contrast to pro-
fessional reviewers, who focus more on publishing conventions and providing
context and background (“debut,” “ya”). Hegel [2018] also created a map of the
genres on Goodreads by clustering a set of genre tags that have been applied to
a random subset of books and a publicly accessible subset of those books’ reviews.
The resulting clusters are then used as labels for a prediction task using
the texts of reviews, revealing that more established genres are easier to predict
than subgenres or broad categories (e.g., “fiction”).
6.2 Collaborative Tagging Systems
A collaborative tagging system allows multiple users in a community to tag the
same object, and aggregations of these tags are then shown as features of the
object [Smith, 2007]. These tagging systems are also referred to as folksonomies,
a neologism for “folk taxonomy” [Vander Wal, 2005, Weber, 2006, Vander Wal,
2007]. Crucially, collaborative tagging systems and folksonomies rely on un-
controlled vocabularies rather than pre-defined hierarchies and taxonomies and
include interacting levels of personal and community tagging.
Why do users choose to participate in collaborative tagging systems? Motivations
can include organization of one’s personal data as well as social recognition
from other users [Wash and Rader, 2007] and identification of functions of
the object (e.g., what the object is, who owns it) [Golder and Huberman, 2006].
Through a set of surveys, Bartley [2009] finds that LibraryThing users usually
add tags for collection management, to add factual information, and to help
others find books. Tagging systems can also be seen as collaborative sensemak-
ing, “orienteering”, or information foraging [Markus, 2001, Teevan et al., 2004,
Pirolli, 2005]. Individual tagging decisions are sometimes influenced by other
users [Sen et al., 2006, Golder and Huberman, 2006], indicating that users learn
from other users and make sense of the tagging space together. These motiva-
tions and habits will likely vary depending on the design and functionality of
the website that houses a given tagging system.
Several attempts have been made to categorize the tags used in folksonomies.
For example, Golder and Huberman [2006] propose three tag classes
— factual, subjective, personal — which are used in later work by Sen et al. [2006]
to categorize movie tags. Heymann et al. [2010] divide the tags into six types
— objective and content-based, opinion, personal, physical, acronym, and junk — and
find that the majority of LibraryThing and Goodreads tags are objective and
content-based, while Goodreads has more personal tags than LibraryThing.
Other work seeks to categorize the taggers themselves. For example, Körner
et al. [2010] divide users into categorizers, who use a small set of hierarchical tags,
and describers, who use many creative and non-hierarchical tags. They explore
the emergent semantics in collaborative tagging systems and find that describers
contribute more than their more rigid counterparts. Some work has found that
users tag independently of other users [Rader and Wash, 2008], while Zubiaga
et al. [2011] find that certain groups of users assign higher quality tags that are
more useful for tag prediction systems.
Much prior work has focused on tagging systems as problems to be solved.
If the tags are to be used as input for the creation of canonical systems and hier-
archies, then the tags should be normalized. Hypothesized synonyms should
be conflated and ambiguities should be resolved to enhance information re-
trieval, recommendation, automatic tagging, and ontology construction [Lan-
dauer, 1984, Zubiaga et al., 2011, Kar et al., 2018]. For example, Heymann et al.
[2010] emphasize three qualities of collaborative tagging systems (consistency,
quality, and completeness) and compare these systems to systems designed by experts.
Rather than learning a hierarchy, we want to use the associations of users
to learn nuances about their understanding of genre and usage of tags. The
point of collaborative tagging is to escape the hierarchical view of data and in-
stead favor an inclusive, flexible structure [Golder and Huberman, 2006]. The
non-hierarchical tagging system allows each object to be about several things
simultaneously [Golder and Huberman, 2006]; this quality is exactly what al-
lows genres on LibraryThing to overlap and intertwine. This overlap allows
us to learn about cooccurrences, correlations, and relationships between genres
according to a community in ways previously not possible.
6.3 Literary Genres
Many readers understand “genre” as a way of classifying literary works based
on shared textual characteristics, such as similar plot structures, character types,
or settings. This conception of genre is reflected in Wikipedia descriptions
of genres, which is unsurprising given Wikipedia’s descriptive goals as a popular encyclopedia.
However, many literary scholars resist understanding genre as a neat classifi-
cation system and instead emphasize genres as blurry, changing over time, and
dependent on context [Pavel, 2003]. Underwood [2016] and Wilkens [2016] rely on
computational classification precisely because other forms of classification fail.
Genre, according to other scholars, is not something that books have, or some-
thing that can be found in the texts themselves. Rosen and Pavel argue that
genre is a tool that authors use to write books, akin to a “set of recipes” [Rosen,
2018, Pavel, 2003]. From another angle, Radway and others have explored genre
as a product of the publishing industry, as categories that are used to market and
sell books [Radway, 1991].
Within the fields of natural language processing and computational social
science, research has focused on learning fixed genre categories from texts. A
variety of approaches have been proposed for automatic genre identification
[Biber, 1986, Kessler et al., 1997, Stamatatos et al., 2000, Worsham and Kalita,
2018], most focusing on book-length texts as training data. These works raise
the question of what genre is: is it a set of surface level facets [Kessler et al.,
1997] or is abstraction required [Worsham and Kalita, 2018]? Genre has also
been successfully incorporated into book recommendation systems [Maharjan
et al., 2018] and used for analysis of emotional and narrative arcs [Kim et al.,
2017]. While we are similarly focused on genre definitions, similarities, and
boundaries, we focus not on the book texts but on user reviews and tags; our
goal is not to predict the “correct” genre label but to learn from users about
how genres are understood and used in the LibraryThing community.
6.4 Case Study: Mapping Literary Genres on LibraryThing
6.4.1 Data from LibraryThing
LibraryThing is a social reading website, similar to Goodreads in much of its
functionality but more transparent in its data. LibraryThing contains an enor-
mous number of books and reviews. After years of user input, it also contains
an enormous number of tags: over 167 million. We find that these tags form a
“long tail” in which the majority of the tags are applied to a very small num-
ber of books. We cannot analyze all of these tags, both because of lack of space
and because many of the tags are not associated with enough reviews to make
reliable comparisons with other tags. Our methods require that we control for
several review characteristics — including rating polarity, review length, and
book title — and most tags do not have enough data to properly control for
all of these features. Restricting and holding constant our target tags also al-
lows us to more easily make comparisons across different metrics. While the
unconstrained, creative use of tags is part of what makes LibraryThing genres
so interesting, we must find ways to scope down the tags for analysis.
Therefore, we manually identify a set of 20 target genres (shown in Table ??)
by examining the most frequent 75 tags on LibraryThing. We discard tags that
are too broad (e.g., fiction, to-read) or that are near duplicates of other tags (e.g.,
classic and classics).1 Prior to our collection, LibraryThing already combined
some synonymous tags, e.g., the fantasy tag includes Fantasy, fantasia, fantasía,
and FANTASY. We choose these target genres rather than more creative or user-specific
tags, because we are interested in how LibraryThing users re-imagine
more conventional literary genres. While this decision leaves unexplored many
areas of LibraryThing, and perhaps could be read as re-imposing traditional
genres on the collaborative tagging system, we see these target genres as both
touchstones and starting points. But there are also surprising and unconventional
genres even in the most frequent 75 tags, like vampires, family, and animals,
which do not fit traditional or scholarly conceptions of genre.
1There are cases where tags that are close in name operate very differently; e.g., books tagged
french are usually books written in French while books tagged france are usually books set in
France.
We scrape metadata for the 1,000 top books for each of these 20 target gen-
res, where “top” books are those that have most often received the target tag.
Scraped book metadata includes the title, author, rating distribution, publica-
tion date, and tag cloud (counts for all the tags that all users have applied to the
book). We scrape the full set of public reviews (review text, user ID, date, star
rating) for each book, and for each reviewer, we scrape their public tag cloud
(the tags they have personally applied). This results in a total of 17,440 books,
319,850 reviews, and 33,849 users.
The top books for each tag are not mutually exclusive. For example, a top
book for the tag fantasy might also be a top book for the tag science-fiction. Even
if the book is tagged science-fiction more often than fantasy, we will still add the
book to the fantasy genre if its fantasy ranking is in the top 1,000. In other words,
the top books are the most popular books for the tag, not the books most specific
to the tag.
We find significant differences between the target genres, including mean
review length, vocabulary size, mean star rating, and mean number of ratings.
For example, picture books have a very high mean star rating and a very low
mean number of ratings, while horror has a higher number of ratings but a much
lower mean star rating. Users infrequently review picture books, but when they
do, they rate them very positively, whereas users tend to be more critical of
horror books, even though they review them more overall. For most genres, the
vocabulary size is correlated with the mean length of the reviews, but outliers
include vampires and young adult, which have small vocabulary sizes given their
mean review lengths. These outliers suggest that reviews of vampires and young
adult books tend to discuss more similar subjects in similar ways. In order to
compare the genre features of research interest, we use the following sampling
sequence to control for features like review length which are not of interest. This
method also controls for the influence of extremely popular books such as the
Twilight series.
We remove reviews without ratings, reviews not written in English (using a
simple filter requiring at least five English stopwords and fewer than five Span-
ish stopwords), duplicate reviews (where duplicates require identical review
IDs, user IDs, and book IDs), and reviews with fewer than 100 words. To con-
trol for polarity, we randomly sample two positive and two negative reviews
for each book. We define negative reviews as those with ratings between 0.5-3.5
stars and positive reviews as those with ratings between 4-5 stars. We choose a
higher cut-off for negative reviews, rather than choosing the midpoint 2.5, be-
cause there is a strong skew across the book reviews towards positive ratings,
and qualitatively, we find that a rating of 3.5 stars usually indicates serious crit-
icisms of the book.
To control for the review length, which can vary significantly by genre, we
retain only the last 100 words of each review text. This is a common preprocessing
step in NLP analyses of texts with variable lengths; for example, see
the discussion and sampling decisions in Danescu-Niculescu-Mizil et al. [2013].
Controlling for review length is particularly important in our analyses of di-
versity of themes present in reviews (e.g., our use of topic entropy), as longer
reviews could by nature of their length contain more diverse language. In the
case of online book reviews, we observe that reviewers are more likely to be-
gin reviews with meta-content (e.g., where they read the book, personal stories
unrelated to the book) while they are more likely to end the reviews with sum-
maries of their thoughts, re-stating the different themes mentioned earlier. We
use the last 100 words because our analysis is focused on the reviewer’s judge-
ments of the books.
Books that do not meet all of these filtering requirements are discarded.
Of the remaining books, we randomly sample 300 books per genre. We al-
low books to appear in multiple categories, as this reflects the reality of genre-
crossing books, and we allow multiple books from the same author, as this re-
flects the outsized influence of prolific authors. Our sampling results in a total
of 4,934 unique books (100 words per review, 2 reviews per polarity per book,
300 books per genre).
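The filtering and sampling steps above can be sketched for a single book as follows. This is a minimal sketch under several assumptions that the text does not specify: the stopword lists are tiny placeholders, stopwords are counted as tokens rather than distinct types, reviews are dictionaries with hypothetical keys `text` and `rating`, and duplicate removal and the per-genre sampling of 300 books are omitted.

```python
import random

# Tiny illustrative stopword lists; the actual lists are not given in the text.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "it", "that", "was"}
SPANISH_STOPWORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "es"}


def looks_english(words):
    """At least five English stopwords and fewer than five Spanish stopwords."""
    lowered = [w.lower() for w in words]
    english = sum(w in ENGLISH_STOPWORDS for w in lowered)
    spanish = sum(w in SPANISH_STOPWORDS for w in lowered)
    return english >= 5 and spanish < 5


def polarity(rating):
    """Negative: 0.5 to 3.5 stars; positive: 4 to 5 stars."""
    return "negative" if rating <= 3.5 else "positive"


def sample_book_reviews(reviews, per_polarity=2, min_words=100, keep_last=100, rng=None):
    """Filter one book's reviews, then sample reviews for each polarity.

    Returns None when the book lacks enough reviews of either polarity,
    in which case the book would be discarded.
    """
    rng = rng or random.Random(0)
    by_polarity = {"positive": [], "negative": []}
    for review in reviews:
        if review.get("rating") is None:
            continue  # reviews without ratings are removed
        words = review["text"].split()
        if len(words) < min_words or not looks_english(words):
            continue
        # Keep only the last `keep_last` words to control for review length.
        by_polarity[polarity(review["rating"])].append(words[-keep_last:])
    if any(len(group) < per_polarity for group in by_polarity.values()):
        return None
    return {label: rng.sample(group, per_polarity) for label, group in by_polarity.items()}


# Toy usage: six synthetic reviews of 150 words each for one book.
filler = "the book was good and it is a story that I liked in the end " * 10
reviews = [{"text": filler, "rating": r} for r in (5, 4.5, 4, 3.5, 2, None)]
sampled = sample_book_reviews(reviews)
```

Returning `None` for books that lack two reviews of each polarity mirrors the rule that books failing any filtering requirement are discarded.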
6.4.2 Ethical Considerations
As discussed in Chapter 3, online book reviews pose challenges for ethical data
science, especially with regard to citation and quotation. On the one hand, Li-
braryThing reviews are public and usually intended to be read by a wide audi-
ence of other book lovers. Many reviewers clearly take pride in their reviews
and tags, as evidenced by their profiles full of badges, descriptions of their reading
habits, and interactions with other reviewers. Reviewers often use their real
names or include information in their public profile (e.g., location, age, profession,
photos) that makes them easily identifiable. Some reviewers are compensated
for their reviews by authors or publishers, or they receive free books in
exchange for reviews. All of this suggests that LibraryThing users and their
labor deserve credit. On the other hand, book reviews represent personal opin-
ions on a wide variety of sensitive topics, and this information could be harmful
if revealed in a new context or to an unexpected audience. We can view the re-
viewers as writers who deserve credit for their work, or we can view them as
people who might not want or expect their work to appear beyond Library-
Thing. Most likely, different reviewers will have different perspectives on these
questions.
Our study was considered exempt from review by our institution’s IRB. Given the tensions
discussed in Chapter 3, we do not release review texts or any data that
is not easily viewable on the LibraryThing Zeitgeist web page.2 Instead, we
release the names of the 20 target genres as well as the 300 book IDs for each
genre.3 This maintains the review authors’ abilities to edit and delete their
reviews, while still giving credit to the creative work that enabled this study
[Bruckman, 2002]. For the reviews that we directly quote in this article, we
contacted the authors, disclosed our identities and publication intentions, and
asked permission for use of their creative work and whether they would like
their username credited. If the authors did not want to be included or did not
respond, we replaced these quotations with reviews written by authors who
have given consent.
2https://www.librarything.com/zeitgeist
3https://github.com/maria-antoniak/librarything-genres
6.4.3 Mapping Methods
Book and User Overlap
We measure genre similarity using two metrics. First, using our sampled book
sets, we measure the book overlap between each pair of genres — that is, how
many books have been tagged as both one genre and another genre. Second, we
measure the reviewer overlap between each pair of genres — that is, how many
reviewers have tagged a book in one genre and a book in another genre. We
convert both measurements into ranks, where the genre pair with the greatest
book overlap has Rank 0.
We expect the user overlap rankings to largely mirror the book overlap rank-
ings. If two genres share many books in common, it follows that they would also
share many reviewers in common. First, the shared books will necessarily in-
clude shared reviewers and, second, the high overlap in books implies that the
genres are thematically related. If a user finds one of the genres appealing, they
are likely to also find the other genre appealing.
We quantify these patterns by taking the difference between the user and
book overlap rankings.
\[
\text{overlap difference} = \text{user overlap rank} - \text{book overlap rank} \tag{6.1}
\]
Genre pairs with very high or very low scores are outlier pairs, which deviate
from the expectation that book overlap rank should match user overlap rank.
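The overlap ranking and the difference in Equation 6.1 can be sketched as follows. This is an illustrative sketch rather than the study's implementation: the function names and toy data are hypothetical, each genre is assumed to map to a set of book IDs or reviewer IDs, and ties are broken alphabetically by genre pair, which the text does not specify.

```python
from itertools import combinations


def overlap_ranks(members_by_genre):
    """Rank genre pairs by overlap size; rank 0 is the greatest overlap.

    `members_by_genre` maps each genre to a set of book IDs or reviewer IDs.
    Ties are broken by the alphabetical order of the pair (an assumption).
    """
    pairs = list(combinations(sorted(members_by_genre), 2))
    overlaps = {
        pair: len(members_by_genre[pair[0]] & members_by_genre[pair[1]])
        for pair in pairs
    }
    ordered = sorted(pairs, key=lambda pair: (-overlaps[pair], pair))
    return {pair: rank for rank, pair in enumerate(ordered)}


def overlap_difference(books_by_genre, users_by_genre):
    """Equation 6.1: user overlap rank minus book overlap rank, per genre pair."""
    book_rank = overlap_ranks(books_by_genre)
    user_rank = overlap_ranks(users_by_genre)
    return {pair: user_rank[pair] - book_rank[pair] for pair in book_rank}


# Hypothetical toy data for three genres.
books = {
    "fantasy": {1, 2, 3, 4},
    "horror": {4, 5, 6},
    "science-fiction": {2, 3, 7},
}
users = {
    "fantasy": {"a", "b", "c"},
    "horror": {"a", "b", "d"},
    "science-fiction": {"c", "e"},
}
diff = overlap_difference(books, users)
```

Because rank 0 marks the greatest overlap, a negative difference in this sketch means the pair ranks higher on reviewer overlap than on book overlap, and a positive difference means the reverse.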
Quantifying Lexical Fit: “Mistakes” and Surprises
We can classify the genre of a book being reviewed based on the text of the
review alone — without the book’s own text, title, or author — because users
focus on different aspects (e.g., characters, plot, suspense) for certain genres in
their reviews. We follow similar work that has sought to predict genres from
texts [Underwood, 2016, Kar et al., 2018], but our training set is reviews, rather
than book texts, so that we can focus on the reception of a book rather than its
content. Note that although we are training a classifier to quantify the associ-
ation between words and labels, we are not running a predictive experiment
with held-out testing data, but rather an evaluation on the full data set, more
like a standard linear regression. Our goal is not to maximize predictive performance,
but rather simply to computationally represent ambiguity and similarity
between genres. Our scores should therefore be interpreted as an upper
bound on predictive accuracy, and not as a measure of generalization. This approach
allows us to analyze the collection in two ways: first, if reviews for two
genres cannot be easily distinguished even when the labels are available at training
time, that is evidence that they serve the same values and expectations, and sec-
ond, if a review is “surprising,” it may describe a setting in which a reader has
a unique or idiosyncratic experience of a book.
We train a supervised classifier on the review texts, using as labels the genre
of each review in our sampled data. We use a logistic regression (one-vs-all)
model with TF-IDF weighted unigram features, using the last 100 words of each
review (as described in our sampling procedure). Genre labels are the most fre-
quent tags assigned to the book, after filtering out a small set of high level tags
that do not resemble genres.4 We measure the surprisal of the review text given
the genre using the probability of the true label: surprisal = 1 − P(true label).
High surprisal scores indicate that the predicted probability of the true label
was low and that the review was difficult to classify as its target genre. Low
surprisal scores indicate that the predicted probability of the true label was high
and that the classifier was able to predict the review’s target genre. This method
generally works well at identifying reviews for books that blend different gen-
res. When averaged, these scores can tell us which genres blend more with other
genres.
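The surprisal score and its per-genre average can be sketched as follows. The sketch assumes that per-review class probabilities are already available from some probabilistic classifier; it does not implement the TF-IDF features or the logistic regression model, and the function names and probability values are hypothetical.

```python
from collections import defaultdict


def surprisal(probabilities, true_label):
    """surprisal = 1 - P(true label), as defined in the text."""
    return 1.0 - probabilities.get(true_label, 0.0)


def mean_surprisal_by_genre(predictions):
    """Average the surprisal over all reviews assigned to each genre."""
    scores = defaultdict(list)
    for true_label, probabilities in predictions:
        scores[true_label].append(surprisal(probabilities, true_label))
    return {genre: sum(values) / len(values) for genre, values in scores.items()}


# Hypothetical classifier output: (true genre, predicted probabilities).
predictions = [
    ("romance", {"romance": 0.9, "historical fiction": 0.1}),
    ("romance", {"romance": 0.5, "historical fiction": 0.5}),
    ("animals", {"animals": 0.25, "classics": 0.75}),
]

genre_means = mean_surprisal_by_genre(predictions)
for genre, value in sorted(genre_means.items()):
    print(genre, round(value, 2))
```

A review whose true genre receives little probability mass gets a score near 1, and a genre whose reviews are often scored this way gets a high mean.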
Values and Expectations: Measuring Thematic Signatures
So far, we have summarized genres as one or two dimensional scores. This has
allowed us to map genres onto an interpretable space where we can compare
genres, measure their similarity, and identify outliers. However, while user and
book overlap, predictive surprise, and community density are strong signals
of genre similarity, they do not tell us why these genres are or are not similar.
Learning review aspects might help answer this question. Aspects are themes
of a review, usually focused on features of the product being reviewed; in the
case of books, these might include plot, characters, and writing style. Reviews
are generally written to explain a rating, not a genre tag, but by measuring the
amount users choose to write about particular aspects and averaging over re-
views for a specific genre, we hope to approximate which aspects are most sig-
nificant for that genre. Measuring which aspects users focus on for each genre
will teach us the expectations and values that the LibraryThing community at-
taches to each genre.
4[fiction, non-fiction, to-read, ebook, kindle, literature, unread, own, hardcover, wishlist]
To answer these questions, we measure the thematic similarity of the review
texts for our target genres. For our purposes, we take a relatively simple un-
supervised approach, as we would like to discover themes rather than ordain
them. We train a latent Dirichlet allocation (LDA) model [Blei et al., 2003] on the full set
of scraped reviews, removing duplicate texts. Before training, we probabilisti-
cally downsample words associated with specific genres by using the Author-
less Topic Models package [Thompson and Mimno, 2018b].5 This downsam-
pling reduces the incidence of genre-specific topics, as we are more interested
in cross-cutting themes that could be important for more than one genre (e.g., a
Harry Potter topic would not be useful outside a narrow band of genres).
We experiment with different numbers of topics and find that 30 topics pro-
duce interpretable and not overly broad or narrow topics. For readability, we
remove a set of common stopwords from the topic keywords, and we assign
labels to each topic through manual examination of each topic’s most probable
words and highest ranked documents.
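One simple way to turn per-review topic proportions into a per-genre thematic signature is to average them over the reviews of each genre. The sketch below assumes that document-topic proportions have already been inferred by a trained topic model; the function name, the three-topic toy vectors, and the plain unweighted mean are assumptions for illustration and may differ from the exact aggregation used in our analysis.

```python
def genre_topic_signature(doc_topics, doc_genres):
    """Mean topic proportions over the reviews of each genre."""
    sums, counts = {}, {}
    for proportions, genre in zip(doc_topics, doc_genres):
        if genre not in sums:
            sums[genre] = [0.0] * len(proportions)
            counts[genre] = 0
        sums[genre] = [total + p for total, p in zip(sums[genre], proportions)]
        counts[genre] += 1
    return {
        genre: [total / counts[genre] for total in sums[genre]]
        for genre in sums
    }


# Hypothetical document-topic proportions (rows sum to 1) and genre labels.
doc_topics = [
    [0.75, 0.25, 0.0],
    [0.25, 0.25, 0.5],
    [0.0, 0.5, 0.5],
]
doc_genres = ["romance", "romance", "horror"]

signatures = genre_topic_signature(doc_topics, doc_genres)
for genre, signature in sorted(signatures.items()):
    print(genre, signature)
```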
Mapping Genres by Community Homogeneity
Do users “specialize” in specific genres—that is, often tag books in the same
genre or write reviews that are lexically similar to other reviews in the genre?
If so, how can we best measure this specialization, and what can we learn from
this specialization about tagging and genre on LibraryThing? We hypothesize
that there are different kinds of genre specialization. (1) A reviewer could be
well-read in a particular genre and write reviews that are lexically similar to
other reviews for books in that genre. (2) A reviewer could fit a genre lexically
5https://github.com/laurejt/authorless-tms
but only read one or two books in that genre. (3) A reviewer could be well-read
in a particular genre but their reviews might be lexical outliers, indicating that
they apply a different framework to these books from other reviewers.
We explore these possibilities through measures of lexical homogeneity
and community homogeneity for each genre. For lexical homogeneity, we rely on
the surprisal scores learned previously. For community homogeneity, we use
the personal tag cloud associated with each user that represents all the tags they
have assigned to books. We filter tags that occur in fewer than 20 tag clouds, and
we find the cosine similarity between each pair of normalized vectors, where
each vector represents the tags used by a user who has reviewed in that genre.
A high cosine similarity indicates a high degree of similarity between the re-
viewers. This tagging similarity could be interpreted as a similarity in reading
habits.
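The community homogeneity measure can be sketched as a mean pairwise cosine similarity over reviewers' tag vectors. This is an illustrative sketch only: it assumes each tag cloud is a mapping from tag to usage count, and the function names, the toy data, and the lowered filtering threshold in the usage example are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def filter_rare_tags(tag_clouds, min_clouds=20):
    """Drop tags that occur in fewer than `min_clouds` users' tag clouds."""
    cloud_frequency = Counter(tag for cloud in tag_clouds.values() for tag in cloud)
    return {
        user: {tag: n for tag, n in cloud.items() if cloud_frequency[tag] >= min_clouds}
        for user, cloud in tag_clouds.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse tag vectors stored as dicts."""
    dot = sum(weight * b.get(tag, 0) for tag, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def community_homogeneity(tag_clouds, reviewers):
    """Mean pairwise cosine similarity among the reviewers of one genre."""
    sims = [
        cosine(tag_clouds[u], tag_clouds[v])
        for u, v in combinations(sorted(reviewers), 2)
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical tag clouds; the threshold is lowered to 2 for this tiny example.
tag_clouds = {
    "u1": {"fantasy": 3, "dragons": 4},
    "u2": {"fantasy": 3, "dragons": 4},
    "u3": {"fantasy": 1, "my-shelf": 5},
}
filtered = filter_rare_tags(tag_clouds, min_clouds=2)
homogeneity = community_homogeneity(filtered, ["u1", "u2", "u3"])
print(round(homogeneity, 3))
```

Because cosine similarity divides by the vector norms, explicit normalization of the vectors beforehand does not change the result.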
Relying on a user’s tagging history comes with some limitations. Users of-
ten tag books that they have not read, either for personal reasons (e.g., to mark
the book for future reading) or as volunteer labor for the community (e.g., to
add missing metadata for unpopular books). Users also employ tags for dif-
ferent functions, including personal cataloging (using idiosyncratic tags) and
community contribution, and it’s possible that these preferences align with dif-
ferent communities. However, by limiting our comparison sets to those who
have written at least one review for the target genre, we enforce a lower bound
on user-genre relatedness.
6.4.4 Results
Measuring reader expectations and how these expectations are shaped by online
communities can improve recommendations and increase engagement. Using
the classics as a touchstone, we find that affordances of the Goodreads website
can shape the discourse around these books. In the larger LibraryThing study,
I find that reviewer expertise, distribution of review discourses, and sharing of
minority and majority opinions can be genre-dependent, which leads to insights
into how genres are re-defined outside of an academic setting. This suggests a
model in which future reading habits and shifts in tastes are predicted by the
similarities between a review and its neighboring reviews.
Book and User Overlap
Some genre pairs share no book overlap (e.g., politics and mystery; classics and
graphic novel) while others share many books in common (e.g., children and ani-
mals; memoir and biography). 24% of the genre pairs have no book overlap. How-
ever, some outliers emerge. For example, in Figure 6.1, we notice that classics
and animals have higher book overlap than we would expect given their user
overlap. In contrast, classics and politics have higher user overlap than we would
expect given their book overlap.
[Figure 6.1: scatter plot of genre pairs. x-axis: User Rank (higher to lower user overlap); y-axis: Book Rank (higher to lower book overlap).]
Figure 6.1: A mapping of LibraryThing genres. User overlap between genre
pairs correlates with book overlap, but there are outliers. Each point represents
two genres, and the axes represent the rank of the genre pair, where lower num-
bers indicate higher ranks and therefore higher overlap. For example, the genre
pair classics + animals has a mid-range user overlap rank and a high book over-
lap rank, indicating that these genres share surprisingly few users given how
many books are shared. Pearson correlation between book and user overlap is
significant (r = 0.68, p < 0.05).
For example, given the low number of books that have been tagged as both
graphic novel and classics, it is surprising to see how many users read within both
of these genres. This high user overlap could be explained by a tendency of
users who review within the graphic novel tag to also review within the classics
tag—or it could be explained by a tendency of users who review classics to
read widely across many genres, including graphic novels. On the other hand,
humor and picture book have relatively high book overlap but relatively low user
overlap. Perhaps picture book reviewers read more frequently in other genres
and only occasionally review a picture book, e.g., when they give a book as a
gift or when reading a book to a child.
Lexical Surprisal
Figure 6.2 shows the relationship between misclassification counts and book
overlap for each pair of genres. The high number of misclassifications between
memoir and biography seems to conform to our expectations, as both genres are
stories of a person’s life. Similarly, the frequent misclassifications of reviews
for the animals, picture book, and children genres point to their commonality, as
these sets of genres have high book and reviewer overlap. By comparing to
book overlap, we can identify pairs with unusually high or low numbers of
misclassifications given their similarity. For example, romance and horror have
an unusually low number of misclassifications given their high book overlap,
while animals and psychology have an unusually high number of misclassifica-
tions given that they share no books in common.
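The comparison between misclassification counts and book overlap can be sketched as follows. This is a minimal illustration, not our analysis code: the labels and overlap counts are invented, the log(1 + x) transform is an assumption about how zero counts are handled on the log scale, and the sketch computes only the correlation coefficient, not its p-value.

```python
import math
from collections import Counter


def misclassification_counts(true_labels, predicted_labels):
    """Count reviews per (true genre, predicted genre) pair where the two differ."""
    return Counter(
        (true, predicted)
        for true, predicted in zip(true_labels, predicted_labels)
        if true != predicted
    )


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical classifier output and book overlap counts.
true_labels = ["memoir", "memoir", "memoir", "mystery", "mystery", "animals", "romance"]
predicted_labels = ["biography", "biography", "biography", "crime", "crime", "psychology", "romance"]
book_overlap = {
    frozenset(["memoir", "biography"]): 30,
    frozenset(["mystery", "crime"]): 20,
    frozenset(["animals", "psychology"]): 0,
}

counts = misclassification_counts(true_labels, predicted_labels)
pairs = sorted(counts)
log_counts = [math.log1p(counts[pair]) for pair in pairs]
log_overlap = [math.log1p(book_overlap[frozenset(pair)]) for pair in pairs]
r = pearson_r(log_counts, log_overlap)
print(round(r, 2))
```

Pairs that sit far from the overall trend, such as a pair with many misclassifications but no shared books, are the outliers of interest.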
[Figure 6.2: scatter plot of (true genre, predicted genre) pairs. x-axis: Misclassification Count (Log); y-axis: Book Overlap (Log).]
Figure 6.2: The number of overlapping books and the number of genre misclas-
sifications of user reviews for each pair of genres. Each point represents a pair
of genres in which one is the true tag applied to the review text and one is the
predicted tag from our model. As expected, we find a significant relationship
using Pearson correlation (r = 0.65, p < 0.05) between the book overlap and
misclassification count, but we highlight outlier genre pairs, e.g., animals and
psychology have an unusually high misclassification count given their very low
book overlap.
Often, the classifier’s mistakes indicate similarities and overlaps between
genres. But on other occasions, the classifier’s mistakes indicate a mismatch
between the reviewer’s priorities and the typical priorities for that genre. For
example, the following review of Anne Brontë’s novel The Tenant of Wildfell Hall
(1848) was misclassified as psychology when the book was actually tagged as
romance:
I was in awe of Anne Bronte’s ability to tell such a relevant story in 1848. There are
so many women who find themselves in the same situation today. She was young
and naïve when she married Arthur Huntingdon and by the time she learned his
true character it was too late. The writing is wonderful and for me that story
pulled me in completely. The author tells the story from Gilbert’s point-of-view at
times and from Helen’s at other times. The changing narrative flowed well and
never rang false.Bronte covers some intense subjects in the book. In addition to
infidelity and alcoholism, she makes some disturbing observations about women’s
rights during this time period. Sometimes it’s easy to forget how far we’ve come in
the last few years.
—bookworm12
The reviewer focuses on elements of The Tenant of Wildfell Hall that pertain to the
characters’ psychological states and mental and physical health, as well as how
these conditions relate to broader society of the 19th century and of the present.
A review that more easily conformed to the romance genre might have discussed
the ending, the romantic plot, or the attractiveness of the characters. But these
are not the elements that this particular reviewer discussed. The surprisal scores
thus help us better understand the elements that readers seem to really care
about or gravitate toward in a particular genre, as well as identify and interpret
outlier reviews for the genre.
We show examples of the classifier output and surprisal scores in Table 6.1.
For example, we show a review excerpt of Anna Sewell’s Black Beauty. The most
popular tag for this book was animals but our model misclassified this review as
classics with high confidence, resulting in a high surprisal score. The reviewer
writes about the book’s popularity and compares its sales rates to well-known
classics. The misclassification, in other words, flags the distinctiveness of both
the book and the review, suggesting that what shapes genre perception is likely
more than the text itself, which is discussed in the context of other books tagged
as classics.
True Genre | Predicted Genre | Surprisal | Example Misclassified Reviews
romance | romance | 0.00 | I am not normally a fan of romance novels as I find them too mushy and cutesy, but this one had a sense of humor about it that I really enjoyed...The heroine was very independent and snarky and the main romance was full of comedic situations with a smattering of seriousness that made it seem fairly realistic for the genre. It was a book that was absorbing and fast to read. —Arualanne (The Perfect Rake)
romance | historical fiction | 0.03 | There were some moments though where I had to wonder about the historical accuracy of some of the attitudes and that broke the reading spell for me.Pretty predictable but I enjoyed the ride. Almost a 4 read for me but not quite. —wyvernfriend (Simply Unforgettable)
animals | classics | 0.20 | ...it’s no wonder it’s been so popular since it was first published. I was surprised to learn that Black Beauty is one of the top thirty best-selling books of all time in the English language, selling over 50 million copies–more than The Odyssey, To Kill a Mockingbird, Pride and Prejudice, and Gone with the Wind... —nsenger (Black Beauty)
Table 6.1: Example classifications and surprisal scores. Excerpts are selected
from the last 100 words of the reviews. Higher surprisal indicates greater confi-
dence in the incorrect label.
We can also arrive at a single surprisal score for each genre by taking the mean
of the surprisal scores for the reviews assigned to that genre. The most sur-
prising genres include young adult, family, classics, children, and fantasy. Genres
that have higher mean surprisal scores are harder to classify; these genres are
“fuzzier” and the language used in the reviews for these genres is more wide-
ranging. These genres are often mistaken for similar genres, but it could also
be the case that these genres are simply broader. For example, the classics genre
contains a wide range of themes and discourses, in both its books and reviews.
The high surprisal scores emphasize the view of genres as fuzzy, overlapping
tags, rather than the rigid hierarchy sought by Wikipedia editors.
Community Homogeneity
[Figure 6.3: scatter plot of target genres. x-axis: Community Homogeneity; y-axis: Review Surprisal.]
[Figure 6.4: scatter plot of target genres. x-axis: Community Homogeneity; y-axis: Rating.]
Figure 6.5: Are tighter communities easier to predict? Are tighter communi-
ties more critical? Figure 6.3 shows the target genres plotted along surprisal
(the ability of a classifier to predict the genre of a review) and community ho-
mogeneity (averaged cosine similarities between reviewers’ tagsets). Figure 6.4
shows the target genres plotted along rating and community homogeneity. Genres
whose reviewers have more similar reading habits tend to also have lower
ratings according to a Pearson correlation test (r = -0.60, p < 0.05).
6.4.5 Discussion and Impact
There is not a single right way to map tags and genres in the LibraryThing com-
munity. Different maps reveal different outliers, pairings, and patterns. Reduc-
ing the rich tags to two dimensions will not answer all of our questions, but
creating multiple mappings and comparing them has allowed us to tease apart
some of the ways in which LibraryThing users see genre. Unlike much prior
work, we do not seek to normalize the tags. While the unconstrained vocab-
ulary of tags on LibraryThing means that “errors” like synonyms, typos, and
overly personalized tags do exist, we take advantage of this information and
use it to discover what is new, rather than force it to fit a traditional structure.
By exploring thematic signatures of LibraryThing genres, we learn which
aspects of the reading experience are valued by LibraryThing reviewers and
how these values vary depending on the genre of the book being reviewed. We
discover strange similarities — e.g., the resemblance between young adult and
more “adult” genres like horror — and we also find peculiarities in the topic
signatures of strongly related genres, as in the case of memoir and biography.
These patterns connect to a broader set of themes which we discuss below.
There are many parallels between our work and recent digital humanities
studies of genre. For example, our use of classification to measure an aspect of
literary style is similar to Underwood [2016] and our attempts to map genres
are similar to the book clusters in Wilkens [2016]. Our approach is founded in
this tradition, which often uses computational tools in a non-normative way
to explore ambiguity and to find outliers and “misclassifications” rather than
to make “accurate” predictions. However, much prior computational work on
genre in the digital humanities has focused not on reception but on book texts
[Underwood, 2016], whereas we focus on reception via online book reviews.
Reception scholars such as Fish [1982] have argued that readers’ experiences
of texts are strongly shaped by their “interpretive communities” — groups that
share common strategies for interpreting texts (e.g., a group of professional lit-
erary critics). We find that LibraryThing users’ reviews and tagging behaviors
similarly correspond to their audiences on the site, with reviewers for certain
genres writing more about certain aspects than others (§??). The shared norms
in this tagging community might be driven not only by personal tagging moti-
vations (tagging and curating one’s own library) but by communal and perfor-
mative ones, too (publishing reviews, ratings, and tags).
LibraryThing is not just a virtual meeting place for book lovers; it also pro-
vides a cataloging service, TinyCat, to physical lending libraries around the
world. TinyCat allows librarians to input their own metadata, but it also pro-
vides genre labels for books, which saves these librarians additional work. The
process for genre assignment is not publicly explained but presumably relies to
some extent on the tags provided by users on LibraryThing. Our exploration of
how genre is defined on LibraryThing thus has implications for small libraries
in addition to online communities. The non-conventional genres of Library-
Thing may be shaping how today’s library patrons discover books. It could be
the case that “non-prestigious” genres are shaping our libraries and that patrons
will be able to find books categorized by vampires enthusiasts on LibraryThing.
CHAPTER 7
CONCLUSION
My dissertation work has combined unsupervised computational methods
with sets of personal narratives and experiences. The disclosures made in on-
line communities are rich sources of emotional reactions, stories grounded in
healthcare experiences, and personal relationships with cultural objects, but this
data also requires careful handling and attention to biases and instabilities. This
work opens up new questions into both probing distributional models and us-
ing those models for cultural analysis.
My work has questioned the evaluation of statistical results, highlighting
their instability and examining the conditions under which such NLP methods
can provide robust results to social and humanistic research questions. Treating
the training corpus as the central object of study, as is common in the digital
humanities and computational social science, necessitates increased attention
to errors and robustness, but it also creates a natural pathway from computa-
tional results back to a close reading of the data, a practice that future work will
continue to refine.
There is a tension throughout this dissertation between the “spurious story-
telling” critiqued in Chapter 4 and the “computational reading” demonstrated
in Chapters 5 and 6. When using computational tools for the study of social and
cultural questions, there are multiple possible views or interpretative lenses of
the data, each of which might be useful for researchers in different contexts.
Community members themselves might not agree on a single interpretation of
their community. In these settings, stability and coherence of the computational
results and comparison of the computational results with qualitative methods
(e.g., close reading) are more important than determining a single “correct”
model of the data. Careful data controls, as in Chapter 6, and comparing results
across bootstrapped samples of the documents as in Chapter 5, can ground the
reliability of the results.
More work is needed to continue incorporating lessons from (and contribut-
ing to) the rich literature on bias and ethics in data science, with particular
attention to tensions between averaged patterns and outlier voices. As mod-
els get bigger, questions increase about how large pretrained models can be
used for small, socially-specific datasets in unique domains. A promising fu-
ture direction would include probing these models using comparison sets across
different health and/or literary genre subdomains, measuring discrepancies in
each model’s coverage and performance. Research is needed that measures both
dangers (e.g., domain mismatches, unintended biases) and opportunities (e.g.,
model errors as an analytic lens on the training corpus), as well as research that
translates between these models, combining computational methods with nat-
ural language data in specific social contexts.
BIBLIOGRAPHY
Jacob Abbott, Haley MacLeod, Novia Nurain, Gustave Ekobe, and Sameer Patil.
Local standards for anonymization practices in health, wellness, accessibility,
and aging research at chi. In Proceedings of the 2019 CHI Conference on Human
Factors in Computing Systems, CHI ’19, pages 462:1–462:14, New York, NY,
USA, 2019. ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300692.
URL http://doi.acm.org/10.1145/3290605.3300692.
Irwin Altman and Dalmas A. Taylor. Social penetration: The development of inter-
personal relationships. Holt, Rinehart & Winston, 1973.
Maria Antoniak and David Mimno. Evaluating the stability of embedding-
based word similarities. Transactions of the Association for Computational
Linguistics, 6:107–119, 2018. doi: 10.1162/tacl_a_00008. URL https://
aclanthology.org/Q18-1008.
Maria Antoniak and David Mimno. Bad seeds: Evaluating lexical methods for
bias measurement. In Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers), pages 1889–1904, Online, Au-
gust 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
acl-long.148. URL https://aclanthology.org/2021.acl-long.148.
Maria Antoniak, David Mimno, and Karen Levy. Narrative paths and ne-
gotiation of power in birth stories. Proc. ACM Hum.-Comput. Interact., 3
(CSCW), nov 2019. doi: 10.1145/3359190. URL https://doi.org/10.
1145/3359190.
Maria Antoniak, Melanie Walsh, and David Mimno. Tags, borders, and catalogs:
Social re-working of genre on librarything. Proc. ACM Hum.-Comput. Interact.,
5(CSCW1), apr 2021. doi: 10.1145/3449103. URL https://doi.org/10.
1145/3449103.
Danielle Arigo and Joshua M. Smyth. The benefits of expressive writing on sleep
difficulty and appearance concerns for college women. Psychology & Health,
27(2):210–226, 2012.
JinYeong Bak, Suin Kim, and Alice Oh. Self-disclosure and relationship strength
in Twitter conversations. In Proceedings of the 50th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 2: Short Papers), pages 60–64,
Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL
https://aclanthology.org/P12-2012.
JinYeong Bak, Chin-Yew Lin, and Alice Oh. Self-disclosure topic model for clas-
sifying and analyzing twitter conversations. In Proceedings of the 2014 Con-
ference on Empirical Methods in Natural Language Processing (EMNLP), pages
1986–1996, 2014.
Sairam Balani and Munmun De Choudhury. Detecting and characterizing men-
tal health related self-disclosure in social media. In Proceedings of the 33rd
Annual ACM Conference Extended Abstracts on Human Factors in Computing Sys-
tems, pages 1373–1378, 2015.
David Bamman and Noah A. Smith. Unsupervised discovery of biographical
structure from text. Transactions of the Association for Computational Linguistics,
2:363–376, 2014. doi: 10.1162/tacl_a_00189. URL https://www.aclweb.
org/anthology/Q14-1029.
David Bamman, Brendan O’Connor, and Noah A. Smith. Learning latent per-
sonas of film characters. In Proceedings of the 51st Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Long Papers), pages 352–361,
Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/P13-1035.
Jack Bandy and Nicholas Vincent. Addressing “documentation debt” in machine
learning research: A retrospective datasheet for BookCorpus. arXiv
preprint arXiv:2105.05241, 2021. URL https://arxiv.org/abs/2105.
05241.
Azy Barak and Orit Gluck-Ofri. Degree and reciprocity of self-disclosure in
online forums. CyberPsychology & Behavior, 10(3):407–417, 2007.
Peishan Bartley. Book tagging on LibraryThing: How, why, and what are in the
tags? Proceedings of the American Society for Information Science and Technology,
46(1):1–22, 2009.
Eric P.S. Baumer, David Mimno, Shion Guha, Emily Quan, and Geri K. Gay.
Comparing grounded theory and topic modeling: Extreme divergence or un-
likely convergence? Journal of the Association for Information Science and Tech-
nology, 68(6):1397–1410, 2017.
Emily M. Bender and Batya Friedman. Data statements for natural language
processing: Toward mitigating system bias and enabling better science.
Transactions of the Association for Computational Linguistics, 6:587–604, 2018.
doi: 10.1162/tacl_a_00041. URL https://www.aclweb.org/anthology/
Q18-1041.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret
Shmitchell. On the dangers of stochastic parrots: Can language models be
too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’21, page 610–623, New York, NY, USA, 2021. Associ-
ation for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.
3445922. URL https://doi.org/10.1145/3442188.3445922.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neu-
ral probabilistic language model. Journal of Machine Learning Research, 3(Feb):
1137–1155, 2003.
Sudeep Bhatia, Geoffrey P. Goodwin, and Lukasz Walasek. Trait associations for
Hillary Clinton and Donald Trump in news media: A computational analysis.
Social Psychological and Personality Science, 9(2):123–130, 2018.
Douglas Biber. Spoken and written textual dimensions in english: Resolving the
contradictory findings. Language, pages 384–414, 1986.
Lucas M. Bietti, Ottilie Tilston, and Adrian Bangerter. Storytelling as adaptive
collective sensemaking. Topics in Cognitive Science, 11:710 – 732, 2019.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation.
Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Lan-
guage (technology) is power: A critical survey of “bias” in NLP. In Pro-
ceedings of the 58th Annual Meeting of the Association for Computational Linguis-
tics, pages 5454–5476, Online, July 2020. Association for Computational Lin-
guistics. doi: 10.18653/v1/2020.acl-main.485. URL https://www.aclweb.
org/anthology/2020.acl-main.485.
Olivier Bodenreider. The unified medical language system (UMLS): Integrating
biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270, 2004.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and
Adam T. Kalai. Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. pages 4349–4357, 2016a.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and
Adam T Kalai. Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. In Advances in Neural Information Processing Sys-
tems, pages 4349–4357, 2016b.
Karen Bourrier and Mike Thelwall. The Social Lives of Books: Reading Victorian
Literature on Goodreads. Journal of Cultural Analytics, page 12049, February
2020. doi: 10.22148/001c.12049.
Margaret M. Bradley and Peter J. Lang. Affective norms for English words
(ANEW): Instruction manual and affective ratings. Technical report, 1999.
The Center for Research in Psychophysiology, University of Florida.
Amy Bruckman. Studying the amateur artist: A perspective on disguising data
collected in human subjects research on the internet. Ethics and Information
Technology, 4(3):217–231, 2002.
Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and
Richard Zemel. Understanding the origins of bias in word embeddings. In
International Conference on Machine Learning, pages 803–811, 2019.
Tarana Burke. “It made my heart swell to see women using this idea - one
that we call ‘empowerment through empathy’ #metoo”. Twitter, October 15,
2017a.
Tarana Burke. “to not only show the world how widespread and pervasive
sexual violence is, but also to let other survivors know they are not alone.
#metoo”. Twitter, October 15, 2017b.
Carma L. Bylund. Mothers’ involvement in decision making during the birthing
process: A quantitative analysis of women’s online birth stories. Health Com-
munication, 18(1):23–39, 2005.
Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived
automatically from language corpora contain human-like biases. Science, 356
(6334):183–186, 2017.
Lynn Clark Callister. Making meaning: Women’s birth narratives. Journal of
Obstetric, Gynecologic, & Neonatal Nursing, 33(4):508–518, 2004.
Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative
event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio,
June 2008. Association for Computational Linguistics. URL https://www.
aclweb.org/anthology/P08-1090.
Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative
schemas and their participants. In Proceedings of the Joint Conference of the 47th
Annual Meeting of the ACL and the 4th International Joint Conference on Natu-
ral Language Processing of the AFNLP, pages 602–610, Suntec, Singapore, Au-
gust 2009. Association for Computational Linguistics. URL https://www.
aclweb.org/anthology/P09-1068.
Stevie Chancellor, Michael L. Birnbaum, Eric D. Caine, Vincent M. B. Silenzio,
and Munmun De Choudhury. A taxonomy of ethical tensions in inferring
mental health states from social media. In Proceedings of the Conference on Fair-
ness, Accountability, and Transparency, FAT* ’19, page 79–88, New York, NY,
USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi:
10.1145/3287560.3287587. URL https://doi.org/10.1145/3287560.
3287587.
Danqi Chen and Christopher Manning. A fast and accurate dependency parser
using neural networks. In Proceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar,
October 2014. Association for Computational Linguistics. doi: 10.3115/v1/
D14-1082. URL https://aclanthology.org/D14-1082.
Esther H. Chen, Frances S. Shofer, Anthony J. Dean, Judd E. Hollander,
William G. Baxt, Jennifer L. Robey, Keara L. Sease, and Angela M. Mills. Gen-
der disparity in analgesic treatment of emergency department patients with
acute abdominal pain. Academic Emergency Medicine, 15(5):414–418, 2008.
Colin Cherry and Hongyu Guo. The unreasonable effectiveness of word rep-
resentations for Twitter named entity recognition. In Proceedings of the 2015
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado,
May–June 2015. Association for Computational Linguistics. doi: 10.3115/v1/
N15-1075. URL https://aclanthology.org/N15-1075.
Billy Chiu, Gamal Crichton, Anna Korhonen, and Sampo Pyysalo. How to
train good word embeddings for biomedical NLP. In Proceedings of the 15th
Workshop on Biomedical Natural Language Processing, pages 166–174, Berlin,
Germany, August 2016. Association for Computational Linguistics. doi:
10.18653/v1/W16-2922. URL https://www.aclweb.org/anthology/
W16-2922.
David Alan Cruse. Lexical Semantics. Cambridge University Press, 1986.
Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec,
and Christopher Potts. No country for old members: User lifecycle and
linguistic change in online communities. In Proceedings of the 22nd Interna-
tional Conference on World Wide Web, WWW ’13, page 307–318, New York, NY,
USA, 2013. Association for Computing Machinery. ISBN 9781450320351. doi:
10.1145/2488388.2488416. URL https://doi.org/10.1145/2488388.
2488416.
Janet S. de Moor, Lemuel Moyé, David Low, Edgardo Rivera, S. Eva Singletary,
Rachel T. Fouladi, and Lorenzo Cohen. Expressive writing as a presurgical
stress management intervention for breast cancer patients. Journal of the Soci-
ety for Integrative Oncology, 6(2), 2008.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer,
and Richard Harshman. Indexing by latent semantic analysis. Journal of the
American Society for Information Science, 41(6):391, 1990.
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, and Ju-
nichi Tsujii, editors. Proceedings of the 19th SIGBioMed Workshop on Biomedical
Language Processing, Online, July 2020. Association for Computational Lin-
guistics. URL https://aclanthology.org/2020.bionlp-1.0.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019.
Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL
https://aclanthology.org/N19-1423.
Brianna Dym and Casey Fiesler. Ethical and privacy considerations for research
using online fandom data. Transformative Works and Cultures, 33, 2020.
James F. English, Scott Enderle, and Rahul Dhakecha. Bad habits on Goodreads?
In preparation for James F. English and Heather Love, eds., Literary Studies and
Human Flourishing. Oxford UP, 2022.
Daniel A. Epstein, Nicole B. Lee, Jennifer H. Kang, Elena Agapie, Jessica
Schroeder, Laura R. Pina, James Fogarty, Julie A. Kientz, and Sean Munson.
Examining menstrual tracking to inform the design of personal informatics
tools. In Proceedings of the 2017 CHI Conference on Human Factors in Computing
Systems, CHI ’17, page 6876–6888, New York, NY, USA, 2017. Association for
Computing Machinery. ISBN 9781450346559. doi: 10.1145/3025453.3025635.
URL https://doi.org/10.1145/3025453.3025635.
Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. Understanding unde-
sirable word embedding associations. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics, pages 1696–1705, Florence,
Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/
P19-1166. URL https://www.aclweb.org/anthology/P19-1166.
Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: Understanding
topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on
Human Factors in Computing Systems, CHI ’16, page 4647–4657, New York, NY,
USA, 2016. Association for Computing Machinery. ISBN 9781450333627. doi:
10.1145/2858036.2858535. URL https://doi.org/10.1145/2858036.
2858535.
Casey Fiesler and Nicholas Proferes. “Participant” perceptions of Twitter re-
search ethics. Social Media + Society, 4(1):2056305118763366, 2018.
Stanley Fish. Is There a Text in This Class? The Authority of Interpretive Communi-
ties. Harvard University Press, Cambridge, Mass., June 1982. ISBN 978-0-674-
46726-2.
Michael Flor and Swapna Somasundaran. Sentiment analysis and lexical cohe-
sion for the story cloze task. In Proceedings of the 2nd Workshop on Linking Mod-
els of Lexical, Sentential and Discourse-level Semantics, pages 62–67, Valencia,
Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/
v1/W17-0909. URL https://www.aclweb.org/anthology/W17-0909.
Patrick S. Forscher, Calvin K. Lai, Jordan R. Axt, Charles R. Ebersole, Michelle
Herman, Patricia G. Devine, and Brian A Nosek. A meta-analysis of change
in implicit bias.
Jolene Galegher, Lee Sproull, and Sara Kiesler. Legitimacy, authority, and com-
munity in electronic support groups. Written Communication, 15(4):493–530,
1998.
Ryan J. Gallagher, Elizabeth Stowell, Andrea G. Parker, and Brooke Fou-
cault Welles. Reclaiming stigmatized narratives: The networked disclosure
landscape of #MeToo. Proc. ACM Hum.-Comput. Interact., 3(CSCW), November 2019.
doi: 10.1145/3359198. URL https://doi.org/10.1145/3359198.
Adriana Gallardo. How we collected nearly 5,000 stories of maternal harm.
ProPublica, 2018.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles
Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The
Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint
arXiv:2101.00027, 2020.
Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. Word embed-
dings quantify 100 years of gender and ethnic stereotypes. Proceedings of the
National Academy of Sciences, 115(16):E3635–E3644, 2018.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman
Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets
for datasets. Proceedings of the 5th Workshop on Fairness, Accountability, and
Transparency in Machine Learning, PMLR, 2018.
Shelagh K. Genuis and Jenny Bronstein. Looking for “normal”: Sense making
in the context of health disruption. Journal of the Association for Information
Science and Technology, 68(3):750–761, 2017.
Katherine J. Gold, Martha E. Boggs, Emeline Mugisha, and Christie Lancaster
Palladino. Internet message boards for pregnancy loss: Who’s on-line and
why? Women’s Health Issues: Official Publication of the Jacobs Institute of Women’s
Health, 22(1):e67–72, 2012.
Yoav Goldberg. Neural Network Methods for Natural Language Processing. Synthe-
sis Lectures on Human Language Technologies. Morgan & Claypool Publish-
ers, 2017.
Scott A. Golder and Bernardo A. Huberman. Usage patterns of collaborative
tagging systems. Journal of Information Science, 32(2):198–208, 2006.
Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and
Nanyun Peng. Content planning for neural story generation with Aristotelian
rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 4319–4338, Online, November 2020. As-
sociation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.
351. URL https://aclanthology.org/2020.emnlp-main.351.
Connie Golsteijn and Serena Wright. Using narrative research and portraiture
to inform design research. In IFIP Conference on Human-Computer Interaction,
pages 298–315. Springer, 2013.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016.
Geoffrey P Goodwin, Jared Piazza, and Paul Rozin. Moral character predom-
inates in person perception and evaluation. Journal of Personality and Social
Psychology, 106(1):148, 2014.
Andrew Gordon and Reid Swanson. Identifying personal stories in millions of
weblog entries. In Third International Conference on Weblogs and Social Media,
Data Challenge Workshop, San Jose, CA, volume 46, 2009.
Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge
extraction. 2013.
Amit Goyal, Ellen Riloff, and Hal Daumé III. Automatically producing plot
unit representations for narrative text. In Proceedings of the 2010 Conference
on Empirical Methods in Natural Language Processing, pages 77–86, Cambridge,
MA, October 2010. Association for Computational Linguistics. URL https:
//www.aclweb.org/anthology/D10-1008.
Anthony G. Greenwald, Debbie E. McGhee, and Jordan L.K. Schwartz. Measur-
ing individual differences in implicit cognition: The implicit association test.
Journal of Personality and Social Psychology, 74(6):1464, 1998.
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. A
knowledge-enhanced pretraining model for commonsense story generation.
Transactions of the Association for Computational Linguistics, 8:93–108, 2020.
doi: 10.1162/tacl_a_00302. URL https://aclanthology.org/2020.
tacl-1.7.
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word em-
beddings reveal statistical laws of semantic change. In Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 1489–1501, Berlin, Germany, August 2016. Association
for Computational Linguistics. doi: 10.18653/v1/P16-1141. URL https:
//aclanthology.org/P16-1141.
Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. Towards
a critical race methodology in algorithmic fairness. In Proceedings of the 2020
Conference on Fairness, Accountability, and Transparency, FAT* ’20, page 501–512,
New York, NY, USA, 2020. Association for Computing Machinery. ISBN
9781450369367. doi: 10.1145/3351095.3372826. URL https://doi.org/
10.1145/3351095.3372826.
Allison Hegel. Social Reading in the Digital Age. University of California, Los
Angeles, 2018.
Ryan Heuser. Word vectors in the eighteenth-century. In IPAM Workshop: Cul-
tural Analytics, 2016.
Paul Heymann, Andreas Paepcke, and Hector Garcia-Molina. Tagging human
knowledge. In Proceedings of the Third ACM International Conference on Web
Search and Data Mining, WSDM ’10, page 51–60, New York, NY, USA, 2010.
Association for Computing Machinery. ISBN 9781605588896. doi: 10.1145/
1718487.1718495. URL https://doi.org/10.1145/1718487.1718495.
Charles T. Hill and Donald E. Stull. Gender and self-disclosure. In Self-
Disclosure, pages 81–100. Springer, 1987.
Diane E. Hoffmann and Anita J. Tarzian. The girl who cried pain: A bias against
women in the treatment of pain. The Journal of Law, Medicine & Ethics, 28
(4 suppl):13–27, 2001.
Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan
Boyd-Graber, and Philip Resnik. Is automated topic model evaluation bro-
ken? The incoherence of coherence. Advances in Neural Information Processing
Systems, 34, 2021.
Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Au-
genstein, and Ryan Cotterell. Unsupervised discovery of gendered language
through latent-variable modeling. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics, pages 1706–1716, Florence,
Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/
P19-1167. URL https://www.aclweb.org/anthology/P19-1167.
Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Pro-
ceedings of the Tenth ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, KDD ’04, page 168–177, New York, NY, USA, 2004. As-
sociation for Computing Machinery. ISBN 1581138881. doi: 10.1145/1014052.
1014073. URL https://doi.org/10.1145/1014052.1014073.
Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal
Daumé III. Feuding families and former Friends: Unsupervised learning for
dynamic fictional relationships. In Proceedings of the 2016 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, pages 1534–1544, San Diego, California, June 2016. As-
sociation for Computational Linguistics. doi: 10.18653/v1/N16-1180. URL
https://www.aclweb.org/anthology/N16-1180.
Kokil Jaidka, Sharath Chandra Guntuku, Anneke Buffone, H Andrew Schwartz,
and Lyle H Ungar. Facebook vs. Twitter: Cross-platform differences in self-
disclosure and trait prediction. In Proceedings of the Twelfth International AAAI
Conference on Web and Social Media, pages 141–150, 2018.
Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. Skip n-grams
and ranking functions for predicting script events. In Proceedings of the 13th
Conference of the European Chapter of the Association for Computational Linguistics,
pages 336–344, Avignon, France, April 2012. Association for Computational
Linguistics. URL https://www.aclweb.org/anthology/E12-1034.
A. Cecile J.W. Janssens and Peter Kraft. Research conducted using data ob-
tained through online communities: ethical implications of methodological
limitations. PLoS Medicine, 9(10):e1001328, 2012.
Mukund Jha and Noémie Elhadad. Cancer stage prediction based on patient
online discourse. In Proceedings of the 2010 Workshop on Biomedical Natu-
ral Language Processing, pages 64–71, Uppsala, Sweden, July 2010. Associa-
tion for Computational Linguistics. URL https://aclanthology.org/
W10-1908.
Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling
Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony
Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database.
Scientific Data, 3:160035, 2016.
Adam N. Joinson. Knowing me, knowing you: Reciprocal self-disclosure in
internet-based surveys. CyberPsychology & Behavior, 4(5):587–591, 2001.
Adam N. Joinson and Carina B. Paine. Self-disclosure, privacy and the internet.
The Oxford Handbook of Internet Psychology, pages 237–252, 2007.
Adam N. Joinson, Ulf-Dietrich Reips, Tom Buchanan, and Carina B. Paine
Schofield. Privacy, trust, and self-disclosure online. Human–Computer Interac-
tion, 25(1):1–24, 2010.
Kenneth Joseph, Wei Wei, and Kathleen M. Carley. Girls rule, boys drool: Ex-
tracting semantic and affective stereotypes from Twitter. In Proceedings of the
2017 ACM Conference on Computer Supported Cooperative Work and Social Com-
puting, pages 1362–1374. ACM, 2017.
Sidney M. Jourard. The transparent self. Van Nostrand Reinhold Company, 1971.
Sidney M. Jourard and Paul Lasakow. Some factors in self-disclosure. The Jour-
nal of Abnormal and Social Psychology, 56(1):91, 1958.
Sudipta Kar, Suraj Maharjan, A. Pastor López-Monroy, and Thamar Solorio.
MPST: A corpus of movie plot synopses with tags. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation (LREC
2018), Miyazaki, Japan, May 2018. European Language Resources Associa-
tion (ELRA). URL https://www.aclweb.org/anthology/L18-1274.
Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. Automatic detection of
text genre. In 35th Annual Meeting of the Association for Computational Lin-
guistics and 8th Conference of the European Chapter of the Association for Com-
putational Linguistics, pages 32–38, Madrid, Spain, July 1997. Association
for Computational Linguistics. doi: 10.3115/976909.979622. URL https:
//www.aclweb.org/anthology/P97-1005.
Os Keyes. Stop mapping names to gender. https://ironholds.org/
names-gender/, 2017. Accessed: 2021-05-26.
Evgeny Kim, Sebastian Padó, and Roman Klinger. Investigating the relation-
ship between literary genres and emotional plot development. In Proceedings
of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage,
Social Sciences, Humanities and Literature, pages 17–26, Vancouver, Canada,
August 2017. Association for Computational Linguistics. doi: 10.18653/v1/
W17-2203. URL https://www.aclweb.org/anthology/W17-2203.
Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Tempo-
ral analysis of language through neural language models. In Proceedings of the
ACL 2014 Workshop on Language Technologies and Computational Social Science,
pages 61–65, Baltimore, MD, USA, June 2014. Association for Computational
Linguistics. doi: 10.3115/v1/W14-2517. URL https://aclanthology.
org/W14-2517.
Markus Knoche, Radomir Popović, Florian Lemmerich, and Markus
Strohmaier. Identifying biases in politically biased wikis through word em-
beddings. In Proceedings of the 30th ACM Conference on Hypertext and Social
Media, HT ’19, pages 253–257, New York, NY, USA, 2019. ACM. ISBN 978-1-
4503-6885-8. doi: 10.1145/3342220.3343658. URL http://doi.acm.org/
10.1145/3342220.3343658.
Christian Körner, Dominik Benz, Andreas Hotho, Markus Strohmaier, and Gerd
Stumme. Stop thinking, start tagging: Tag semantics emerge from collabora-
tive verbosity. In Proceedings of the 19th International Conference on World Wide
Web, WWW ’10, page 521–530, New York, NY, USA, 2010. Association for
Computing Machinery. ISBN 9781605587998. doi: 10.1145/1772690.1772744.
URL https://doi.org/10.1145/1772690.1772744.
Austin C. Kozlowski, Matt Taddy, and James A. Evans. The geometry of cul-
ture: Analyzing the meanings of class through word embeddings. American
Sociological Review, 84(5):905–949, 2019.
Sicong Kuang. Semantic and context-aware linguistic model for bias detection.
2016.
Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. Freshman or fresher? Quan-
tifying the geographic variation of language in online social media. In
ICWSM, pages 615–618, 2016.
Mei Chun Louisa Lam, Christine Urquhart, and Brenda L. Dervin. Sense-
Making/Sensemaking. Oxford University Press, United Kingdom, June 2016.
doi: 10.1093/obo/9780199756841-0112.
Thomas K. Landauer. Statistical semantics-analysis of the potential performance
of keyword information-systems, and a cure for an ancient problem. In Journal
of Psycholinguistic Research, volume 13, pages 495–496, 1984.
Thomas K. Landauer and Susan T. Dumais. A solution to Plato’s problem: The
latent semantic analysis theory of acquisition, induction, and representation
of knowledge. Psychological Review, 104(2):211, 1997.
Brian Larson. Gender as a variable in natural-language processing: Ethical
considerations. In Proceedings of the First ACL Workshop on Ethics in Nat-
ural Language Processing, pages 1–11, Valencia, Spain, April 2017. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/W17-1601. URL
https://www.aclweb.org/anthology/W17-1601.
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas
Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data
makes language models better. arXiv preprint arXiv:2107.06499, 2021.
Wendy G. Lehnert. Plot units and narrative summarization. Cognitive Science, 5:
293–331, 1981.
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix
factorization. Advances in Neural Information Processing Systems, 27, 2014.
Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity
with lessons learned from word embeddings. Transactions of the Association for
Computational Linguistics, 3:211–225, 2015. doi: 10.1162/tacl_a_00134. URL
https://aclanthology.org/Q15-1016.
Yao Li, Yubo Kou, Je Seok Lee, and Alfred Kobsa. Tell me before you stream me:
Managing information disclosure in video game live streaming. Proceedings
of the ACM on Human-Computer Interaction, 2(CSCW):1–18, 2018.
Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld.
S2ORC: The semantic scholar open research corpus. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, pages
4969–4983, Online, July 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.acl-main.447. URL https://aclanthology.org/
2020.acl-main.447.
Li Lucy and David Bamman. Gender and representation bias in GPT-3 gen-
erated stories. In Proceedings of the Third Workshop on Narrative Understand-
ing, pages 48–55, Virtual, June 2021. Association for Computational Linguis-
tics. doi: 10.18653/v1/2021.nuse-1.5. URL https://aclanthology.org/
2021.nuse-1.5.
Stephanie Lukin, Kevin Bowden, Casey Barackman, and Marilyn Walker.
PersonaBank: A corpus of personal narratives and their story intention
graphs. In Proceedings of the Tenth International Conference on Language Re-
sources and Evaluation (LREC’16), pages 1026–1033, Portorož, Slovenia, May
2016. European Language Resources Association (ELRA). URL https://
aclanthology.org/L16-1163.
Ian Lundberg, Arvind Narayanan, Karen Levy, and Matthew J Salganik. Pri-
vacy, ethics, and data access: A case study of the fragile families challenge.
Socius, 5:2378023118813023, 2019.
Haiwei Ma, C. Estelle Smith, Lu He, Saumik Narayanan, Robert A. Giaquinto,
Roni Evans, Linda Hanson, and Svetlana Yarosh. Write for life: Persisting
in online health communities through expressive writing and social support.
Proc. ACM Hum.-Comput. Interact., 1(CSCW), December 2017a. doi: 10.1145/3134708.
URL https://doi.org/10.1145/3134708.
Xiao Ma, Jeffery T Hancock, Kenneth Lim Mingjie, and Mor Naaman. Self-
disclosure and perceived trustworthiness of Airbnb host profiles. In Proceed-
ings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social
Computing, pages 2397–2409, 2017b.
Suraj Maharjan, Manuel Montes, Fabio A. González, and Thamar Solorio. A
genre-aware attention model to improve the likability prediction of books.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pages 3381–3391, Brussels, Belgium, October-November 2018. As-
sociation for Computational Linguistics. doi: 10.18653/v1/D18-1375. URL
https://www.aclweb.org/anthology/D18-1375.
Suman Kalyan Maity, Ayush Kumar, Ankan Mullick, Vishnu Choudhary, and
Animesh Mukherjee. Understanding book popularity on Goodreads. In Pro-
ceedings of the 2018 ACM Conference on Supporting Groupwork, GROUP ’18,
page 117–121, New York, NY, USA, 2018. Association for Computing Ma-
chinery. ISBN 9781450355629. doi: 10.1145/3148330.3154512. URL https:
//doi.org/10.1145/3148330.3154512.
Divine Maloney, Samaneh Zamanifard, and Guo Freeman. Anonymity vs. fa-
miliarity: Self-disclosure and privacy in social virtual reality. In 26th ACM
Symposium on Virtual Reality Software and Technology, pages 1–9, 2020.
Diane Maloney-Krichmar and Jennifer Preece. The meaning of an online health
community in the lives of its members: Roles, relationships and group dy-
namics. In IEEE 2002 International Symposium on Technology and Society (IS-
TAS’02). Social Implications of Information and Communication Technology. Pro-
ceedings (Cat. No. 02CH37293), pages 20–27. IEEE, 2002.
Lena Mamykina, Drashko Nakikj, and Noemie Elhadad. Collective sensemak-
ing in online health forums. In Proceedings of the 33rd Annual ACM Con-
ference on Human Factors in Computing Systems, CHI ’15, page 3217–3226,
New York, NY, USA, 2015. Association for Computing Machinery. ISBN
9781450331456. doi: 10.1145/2702123.2702566. URL https://doi.org/
10.1145/2702123.2702566.
Thomas Manzini, Lim Yao Chong, Alan W Black, and Yulia Tsvetkov. Black is to
criminal as caucasian is to police: Detecting and removing multiclass bias in
word embeddings. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages 615–621, Minneapolis, Min-
nesota, June 2019. Association for Computational Linguistics. doi: 10.18653/
v1/N19-1062. URL https://www.aclweb.org/anthology/N19-1062.
Lynne M. Markus. Toward a theory of knowledge reuse: Types of knowledge
reuse situations and factors in reuse success. Journal of Management Information
Systems, 18(1):57–93, 2001.
Nina Martin, Emma Cillekens, and Alessandra Freitas. The last person you’d
expect to die in childbirth. ProPublica, 2017a.
Nina Martin, Emma Cillekens, and Alessandra Freitas. Lost mothers. ProPublica,
2017b.
Erin L. Merz, Rina S. Fox, and Vanessa L. Malcarne. Expressive writing inter-
ventions in cancer patients: A systematic review. Health Psychology Review, 8
(3):339–361, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119, 2013a.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in
continuous space word representations. In Proceedings of the 2013 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013b.
Association for Computational Linguistics. URL https://aclanthology.
org/N13-1090.
David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew
McCallum. Optimizing semantic coherence in topic models. In Proceedings of
the 2011 Conference on Empirical Methods in Natural Language Processing, pages
262–272, Edinburgh, Scotland, UK., July 2011. Association for Computational
Linguistics. URL https://aclanthology.org/D11-1024.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasser-
man, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Ge-
bru. Model cards for model reporting. In Proceedings of the Conference on Fair-
ness, Accountability, and Transparency, FAT* ’19, page 220–229, New York, NY,
USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi:
10.1145/3287560.3287596. URL https://doi.org/10.1145/3287560.
3287596.
Barbara M. Montgomery. Verbal immediacy as a behavioral indicator of open
communication content. Communication Quarterly, 30(1), 1981.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv
Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and
cloze evaluation for deeper understanding of commonsense stories. In Pro-
ceedings of the 2016 Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Technologies, pages 839–
849, San Diego, California, June 2016. Association for Computational Lin-
guistics. doi: 10.18653/v1/N16-1098. URL https://www.aclweb.org/
anthology/N16-1098.
Aaron Mueller, Zach Wood-Doughty, Silvio Amir, Mark Dredze, and Ali-
cia Lynn Nobles. Demographic representation and collective storytelling in
the Me Too Twitter hashtag activism movement. Proc. ACM Hum.-Comput. In-
teract., 5(CSCW1), April 2021. doi: 10.1145/3449181. URL https://doi.org/
10.1145/3449181.
National Commission for the Protection of Human Subjects of Biomedical
and Behavioral Research, Bethesda, Md. The Belmont report: Ethical principles
and guidelines for the protection of human subjects of research. Superintendent of
Documents, 1978.
National Endowment for the Arts. Reading at Risk: A Survey of Literary Read-
ing in America, 2004.
Dong Nguyen, Dolf Trieschnigg, A. Seza Doğruöz, Rilana Gravel, Mariët The-
une, Theo Meder, and Franciska de Jong. Why gender and age prediction
from tweets is hard: Lessons from a crowdsourcing experiment. In Pro-
ceedings of COLING 2014, the 25th International Conference on Computational
Linguistics: Technical Papers, pages 1950–1961, Dublin, Ireland, August 2014.
Dublin City University and Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/C14-1184.
Pok-Ja Oh and Soo Hyun Kim. The effects of expressive writing interventions
for patients with cancer: A meta-analysis. In Oncology Nursing Forum, vol-
ume 43, 2016.
Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. Social
data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big
Data, 2:13, 2019.
Debra L Oswald, Eddie M Clark, and Cheryl M Kelly. Friendship maintenance:
An analysis of individual and dyad behaviors. Journal of Social and Clinical
Psychology, 23(3):413–441, 2004.
Jessica Ouyang and Kathy McKeown. Towards automatic detection of nar-
rative structure. In Proceedings of the Ninth International Conference on
Language Resources and Evaluation (LREC-2014), pages 4624–4631, Reyk-
javik, Iceland, May 2014. European Languages Resources Association
(ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/
pdf/1154_Paper.pdf.
Galen Panger. Reassessing the Facebook experiment: critical thinking about the
validity of big data research. Information, Communication & Society, 19(8):1108–
1126, 2016.
Dilisha Patel, Ann Blandford, Mark Warner, Jill Shawe, and Judith Stephen-
son. “I feel like only half a man”: Online forums as a resource for find-
ing a “new normal” for men experiencing fertility issues. Proc. ACM Hum.-
Comput. Interact., 3(CSCW), November 2019. doi: 10.1145/3359184. URL https:
//doi.org/10.1145/3359184.
Thomas Pavel. Literary Genres as Norms and Good Habits. New Literary His-
tory, 34(2):201–210, 2003. ISSN 0028-6087. URL https://www.jstor.org/
stable/20057776.
James W. Pennebaker. Writing about emotional experiences as a therapeutic
process. Psychological Science, 8(3):162–166, 1997.
James W. Pennebaker and Sandra K. Beall. Confronting a traumatic event: To-
ward an understanding of inhibition and disease. Journal of Abnormal Psychol-
ogy, 95(3):274, 1986.
James W. Pennebaker, Martha E. Francis, and Roger J. Booth. Linguistic inquiry
and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 2001.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global
vectors for word representation. In Proceedings of the 2014 Conference on Em-
pirical Methods in Natural Language Processing (EMNLP), pages 1532–1543,
Doha, Qatar, October 2014. Association for Computational Linguistics. doi:
10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162.
Bethany Percha and Russ B. Altman. A global network of biomedical relation-
ships derived from text. Bioinformatics, 34(15):2614–2624, 2018.
Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana
Volkova. Intrinsic and extrinsic evaluation of spatiotemporal text represen-
tations in Twitter streams. In Proceedings of the 2nd Workshop on Representa-
tion Learning for NLP, pages 201–210, Vancouver, Canada, August 2017. As-
sociation for Computational Linguistics. doi: 10.18653/v1/W17-2624. URL
https://aclanthology.org/W17-2624.
Karl Pichotta and Raymond J. Mooney. Learning statistical scripts with LSTM re-
current neural networks. In Thirtieth AAAI Conference on Artificial Intelligence,
2016.
Andrew Piper, Richard Jean So, and David Bamman. Narrative theory for
computational narrative understanding. In Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Processing, pages 298–311, On-
line and Punta Cana, Dominican Republic, November 2021. Association for
Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.26. URL
https://aclanthology.org/2021.emnlp-main.26.
Peter Pirolli. Rational analyses of information foraging on the web. Cognitive
Science, 29(3):343–373, 2005.
Peter Pirolli and Stuart Card. The sensemaking process and leverage points for
analyst technology as identified through cognitive task analysis. In Proceed-
ings of International Conference on Intelligence Analysis, volume 5, pages 2–4.
McLean, VA, USA, 2005.
J. D. Porter. Popularity/Prestige. Stanford Literary Lab, Pamphlet 17, 2018.
Vinodkumar Prabhakaran, William L. Hamilton, Dan McFarland, and Dan Ju-
rafsky. Predicting the rise and fall of scientific topics from trends in their
rhetorical framing. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1170–1180, Berlin,
Germany, August 2016. Association for Computational Linguistics. doi:
10.18653/v1/P16-1111. URL https://aclanthology.org/P16-1111.
Rebecca M. Puhl and Chelsea A. Heuer. The stigma of obesity: a review and
update. Obesity, 17(5):941–964, 2009.
Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ana-
niadou. Distributional semantics resources for biomedical text processing.
Languages in Biology and Medicine, pages 39–44, 2013.
Emilee Rader and Rick Wash. Influences on tag choices in del.icio.us. In Pro-
ceedings of the 2008 ACM Conference on Computer Supported Cooperative Work,
CSCW ’08, pages 239–248, New York, NY, USA, 2008. Association for Com-
puting Machinery. ISBN 9781605580074. doi: 10.1145/1460563.1460601. URL
https://doi.org/10.1145/1460563.1460601.
Janice A. Radway. Reading the Romance: Women, Patriarchy, and Popular Literature.
The University of North Carolina Press, Chapel Hill, November 1991. ISBN
978-0-8078-4349-9.
Stephen Ramsay. Reading Machines: Toward an Algorithmic Criticism. University
of Illinois Press, 2011.
Abhilasha Ravichander and A. Black. An empirical study of self-disclosure in
spoken dialogue systems. In SIGDIAL Conference, 2018.
Andrew J. Reagan, Lewis Mitchell, Dilan Kiley, Christopher M. Danforth, and
Peter Sheridan Dodds. The emotional arcs of stories are dominated by six
basic shapes. EPJ Data Science, 5:1–12, 2016.
Felix Reer and Nicole C. Krämer. Underlying factors of social capital acquisition
in the context of online-gaming: Comparing World of Warcraft and Counter-
Strike. Computers in Human Behavior, 36:179–189, 2014.
Jeremy Rosen. Literary Fiction and the Genres of Genre Fiction. Post45: Peer-
Reviewed, August 2018.
Rachel Rudinger, Chandler May, and Benjamin Van Durme. Social bias in
elicited natural language inferences. In Proceedings of the First ACL Workshop on
Ethics in Natural Language Processing, pages 74–79, Valencia, Spain, April 2017.
Association for Computational Linguistics. doi: 10.18653/v1/W17-1609. URL
https://www.aclweb.org/anthology/W17-1609.
Anna Rumshisky, Kirk Roberts, Steven Bethard, and Tristan Naumann, editors.
Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online,
November 2020. Association for Computational Linguistics. URL https:
//aclanthology.org/2020.clinicalnlp-1.0.
Diego Saez-Trumper, Carlos Castillo, and Mounia Lalmas. Social media news
communities: gatekeeping, coverage, and statement bias. In Proceedings of the
22nd ACM International Conference on Information & Knowledge Management,
pages 1679–1684. ACM, 2013.
Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and
Yejin Choi. Connotation frames of power and agency in modern films. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pages 2329–2334, Copenhagen, Denmark, September 2017. As-
sociation for Computational Linguistics. doi: 10.18653/v1/D17-1247. URL
https://aclanthology.org/D17-1247.
Roger C. Schank and Robert P. Abelson. Scripts, plans, goals, and understand-
ing: An inquiry into human knowledge structures. Psychology Press, 1977.
Alexandra Schofield and David Mimno. Comparing apples to apple: The ef-
fects of stemmers on topic models. Transactions of the Association for Compu-
tational Linguistics, 4:287–300, 2016. doi: 10.1162/tacl_a_00099. URL https:
//aclanthology.org/Q16-1021.
Alexandra Schofield, Mans Magnusson, and David Mimno. Pulling out the
stops: Rethinking stopword removal for topic models. In Proceedings of the
15th Conference of the European Chapter of the Association for Computational Lin-
guistics: Volume 2, Short Papers, pages 432–436, Valencia, Spain, April 2017a.
Association for Computational Linguistics. URL https://aclanthology.
org/E17-2069.
Alexandra Schofield, Laure Thompson, and David Mimno. Quantifying the
effects of text duplication on semantic models. In Proceedings of the 2017 Con-
ference on Empirical Methods in Natural Language Processing, pages 2737–2747,
Copenhagen, Denmark, September 2017b. Association for Computational
Linguistics. doi: 10.18653/v1/D17-1290. URL https://aclanthology.
org/D17-1290.
João Sedoc and Lyle Ungar. The role of protected class word lists in bias iden-
tification of contextualized word representations. In Proceedings of the First
Workshop on Gender Bias in Natural Language Processing, pages 55–61, Florence,
Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/
v1/W19-3808. URL https://www.aclweb.org/anthology/W19-3808.
Maya Sen and Omar Wasow. Race as a bundle of sticks: Designs that estimate
effects of seemingly immutable characteristics. Annual Review of Political Sci-
ence, 19, 2016.
Shilad Sen, Shyong K. Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski,
Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. Tagging, communi-
ties, vocabulary, evolution. In Proceedings of the 2006 20th Anniversary Con-
ference on Computer Supported Cooperative Work, CSCW ’06, pages 181–190,
New York, NY, USA, 2006. Association for Computing Machinery. ISBN
1595932496. doi: 10.1145/1180875.1180904. URL https://doi.org/10.
1145/1180875.1180904.
Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. Swivel: Im-
proving Embeddings by Noticing What’s Missing. arXiv:1602.02215, 2016.
URL http://arxiv.org/abs/1602.02215.
Gene Smith. Tagging: people-powered metadata for the social web. New Riders, 2007.
Jessie J. Smith, Saleema Amershi, Solon Barocas, Hanna Wallach, and Jen-
nifer Wortman Vaughan. REAL ML: Recognizing, exploring, and articulating
limitations of machine learning research. In Proceedings of the 2022 ACM Con-
ference on Fairness, Accountability, and Transparency, FAccT ’22, New York, NY,
USA, 2022. Association for Computing Machinery.
Efstathios Stamatatos, Nikos Fakotakis, and George Kokkinakis. Automatic text
categorization in terms of genre and author. Computational Linguistics, 26(4):
471–495, 2000. URL https://www.aclweb.org/anthology/J00-4001.
Donna E. Stewart and Simone Vigod. Postpartum depression. New England
Journal of Medicine, 375(22):2177–2186, December 2016.
Roger Suss. The hero with a thousand faces visits the doctor. Canadian Family
Physician (Medecin de famille canadien), 60(7):656, 2014.
Chris Sweeney and Maryam Najafian. A transparent framework for evaluat-
ing unintended demographic bias in word embeddings. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics, pages
1662–1667, Florence, Italy, July 2019. Association for Computational Lin-
guistics. doi: 10.18653/v1/P19-1162. URL https://www.aclweb.org/
anthology/P19-1162.
Chenhao Tan, Dallas Card, and Noah A. Smith. Friendships, rivalries, and
trysts: Characterizing relations between ideas in texts. In Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Vol-
ume 1: Long Papers), pages 773–783, Vancouver, Canada, July 2017. Asso-
ciation for Computational Linguistics. doi: 10.18653/v1/P17-1072. URL
https://aclanthology.org/P17-1072.
Timothy R. Tangherlini. Heroes and lies: Storytelling tactics among paramedics.
Folklore, 111(1):43–66, 2000.
Jaime Teevan, Christine Alvarado, Mark S. Ackerman, and David R. Karger. The
perfect search engine is not enough: A study of orienteering behavior in di-
rected search. In Proceedings of the SIGCHI Conference on Human Factors in Com-
puting Systems, CHI ’04, pages 415–422, New York, NY, USA, 2004. Association
for Computing Machinery. ISBN 1581137028. doi: 10.1145/985692.985745.
URL https://doi.org/10.1145/985692.985745.
Laure Thompson and David Mimno. Authorless topic models: Biasing models
away from known structure. In Proceedings of the 27th International Conference
on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA,
August 2018a. Association for Computational Linguistics. URL https://
aclanthology.org/C18-1329.
Laure Thompson and David Mimno. Authorless topic models: Biasing models
away from known structure. In Proceedings of the 27th International Conference
on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA,
August 2018b. Association for Computational Linguistics. URL https://
www.aclweb.org/anthology/C18-1329.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word representations: A
simple and general method for semi-supervised learning. In Proceedings of the
48th Annual Meeting of the Association for Computational Linguistics, pages 384–
394, Uppsala, Sweden, July 2010. Association for Computational Linguistics.
URL https://aclanthology.org/P10-1040.
William E. Underwood. The Life Cycles of Genres. Journal of Cultural Ana-
lytics, page 11061, May 2016. URL https://culturalanalytics.org/
article/11061-the-life-cycles-of-genres.
Thomas Vander Wal. Folksonomy definition and Wikipedia, 2005. URL http:
//www.vanderwal.net/random/entrysel.php?blog=1750.
Thomas Vander Wal. Folksonomy, 2007. URL https://www.vanderwal.
net/essays/051130/folksonomy.pdf.
Effy Vayena, Marcel Salathé, Lawrence C. Madoff, and John S. Brownstein. Eth-
ical challenges of big data in public health. Public Library of Science, 2015.
Ivan Vulić and Marie-Francine Moens. Bilingual word embeddings from non-
parallel document-aligned data applied to bilingual lexicon induction. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Lin-
guistics and the 7th International Joint Conference on Natural Language Process-
ing (Volume 2: Short Papers), pages 719–725, Beijing, China, July 2015. As-
sociation for Computational Linguistics. doi: 10.3115/v1/P15-2118. URL
https://aclanthology.org/P15-2118.
Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why
priors matter. Advances in Neural Information Processing Systems, 22, 2009.
Melanie Walsh and Maria Antoniak. The Goodreads ‘classics’: A computational
study of readers, Amazon, and crowdsourced amateur criticism. Journal of
Cultural Analytics, 4:243–287, 2021.
Mengting Wan and Julian J. McAuley. Item recommendation on monotonic be-
havior chains. In Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John
O’Donovan, editors, Proceedings of the 12th ACM Conference on Recommender
Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018, pages 86–94.
ACM, 2018. doi: 10.1145/3240323.3240369. URL https://doi.org/10.
1145/3240323.3240369.
Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian McAuley. Fine-
grained spoiler detection from large-scale review corpora. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics, pages
2605–2610, Florence, Italy, July 2019. Association for Computational Lin-
guistics. doi: 10.18653/v1/P19-1248. URL https://www.aclweb.org/
anthology/P19-1248.
Jianling Wang, Ziwei Zhu, and James Caverlee. User recommendation in con-
tent curation platforms. In Proceedings of the 13th International Conference on
Web Search and Data Mining, pages 627–635, 2020a.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang
Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Mer-
rill, et al. CORD-19: The COVID-19 open research dataset. ArXiv, 2020b.
Xi Wang, Kang Zhao, and Nick Street. Analyzing and predicting user partici-
pations in online health communities: A social support perspective. Journal of
Medical Internet Research, 19(4):e6834, 2017.
Yi-Chia Wang, Moira Burke, and Robert Kraut. Modeling self-disclosure in so-
cial networking sites. In Proceedings of the 19th ACM Conference on Computer-
Supported Cooperative Work & Social Computing, CSCW ’16, pages 74–85,
New York, NY, USA, 2016. Association for Computing Machinery. ISBN
9781450335928. doi: 10.1145/2818048.2820010. URL https://doi.org/
10.1145/2818048.2820010.
Rick Wash and Emilee Rader. Public bookmarks and private benefits: An anal-
ysis of incentives in social computing. Proceedings of the American Society for
Information Science and Technology, 44(1):1–13, 2007.
Jonathan Weber. Folksonomy and controlled vocabulary in LibraryThing. Un-
published Final Project, University of Pittsburgh, pages 5–6, 2006.
Karl E. Weick, Kathleen M. Sutcliffe, and David Obstfeld. Organizing and the
process of sensemaking. Organization Science, 16(4):409–421, 2005.
Miaomiao Wen and Carolyn Penstein Rosé. Understanding participant behav-
ior trajectories in online health support groups using automatic extraction
methods. In Proceedings of the 17th ACM International Conference on Supporting
Group Work, pages 179–188, 2012.
Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. Factors in-
fluencing the surprising instability of word embeddings. In Proceedings of the
2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–
2102, New Orleans, Louisiana, June 2018. Association for Computational Lin-
guistics. doi: 10.18653/v1/N18-1190. URL https://www.aclweb.org/
anthology/N18-1190.
Matthew Wilkens. Genre, computation, and the varieties of twentieth-century
U.S. fiction. Journal of Cultural Analytics, November 2016.
John E. Williams and Susan M. Bennett. The definition of sex stereotypes via the
adjective check list. Sex Roles, 1(4), 1975.
John E. Williams and Deborah L. Best. Sex stereotypes and trait favorability on
the adjective check list. Educational and Psychological Measurement, 1977.
John E Williams and Deborah L Best. Measuring sex stereotypes: A multination
study, Rev. Sage Publications, Inc, 1990.
Joseph Worsham and Jugal Kalita. Genre identification and the compositional
effect of genre in literature. In Proceedings of the 27th International Confer-
ence on Computational Linguistics, pages 1963–1973, Santa Fe, New Mexico,
USA, August 2018. Association for Computational Linguistics. URL https:
//www.aclweb.org/anthology/C18-1167.
Diyi Yang, Robert E. Kraut, Tenbroeck Smith, Elijah Mayfield, and Dan Jurafsky.
Seekers, providers, welcomers, and storytellers: Modeling social roles in on-
line health communities. In CHI, CHI ’19, pages 344:1–344:14, New York, NY,
USA, 2019a. ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300574.
URL http://doi.acm.org/10.1145/3290605.3300574.
Diyi Yang, Zheng Yao, Joseph Seering, and Robert Kraut. The channel mat-
ters: Self-disclosure, reciprocity and social support in online cancer support
groups. In CHI, CHI ’19, pages 31:1–31:15, New York, NY, USA, 2019b.
ACM. ISBN 978-1-4503-5970-2. doi: 10.1145/3290605.3300261. URL http:
//doi.acm.org/10.1145/3290605.3300261.
Alyson L. Young and Andrew D. Miller. “This girl is on fire”: Sensemaking
in an online health community for vulvodynia. In Proceedings of the 2019
CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 1–13,
New York, NY, USA, 2019. Association for Computing Machinery. ISBN
9781450359702. doi: 10.1145/3290605.3300359. URL https://doi.org/
10.1145/3290605.3300359.
Mohammadzaman Zamani, H. Andrew Schwartz, Johannes Eichstaedt,
Sharath Chandra Guntuku, Adithya Virinchipuram Ganesan, Sean Clous-
ton, and Salvatore Giorgi. Understanding weekly COVID-19 concerns
through dynamic content-specific LDA topic modeling. In Proceedings of
the Fourth Workshop on Natural Language Processing and Computational So-
cial Science, pages 193–198, Online, November 2020. Association for Com-
putational Linguistics. doi: 10.18653/v1/2020.nlpcss-1.21. URL https:
//aclanthology.org/2020.nlpcss-1.21.
Zachariah Zhang, Jingshu Liu, and Narges Razavian. BERT-XML: Large
scale automated ICD coding using BERT pretraining. In Proceedings of
the 3rd Clinical Natural Language Processing Workshop, pages 24–34, On-
line, November 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.clinicalnlp-1.3. URL https://aclanthology.org/
2020.clinicalnlp-1.3.
Chen Zhao, Pamela Hinds, and Ge Gao. How and to whom people share:
The role of culture in self-disclosure in online communities. In Proceedings
of the ACM 2012 Conference on Computer Supported Cooperative Work, CSCW
’12, pages 67–76, New York, NY, USA, 2012. Association for Computing Ma-
chinery. ISBN 9781450310864. doi: 10.1145/2145204.2145219. URL https:
//doi.org/10.1145/2145204.2145219.
Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. Learn-
ing gender-neutral word embeddings. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, pages 4847–4853, Brus-
sels, Belgium, October-November 2018. Association for Computational Lin-
guistics. doi: 10.18653/v1/D18-1521. URL https://www.aclweb.org/
anthology/D18-1521.
Arkaitz Zubiaga, Christian Körner, and Markus Strohmaier. Tags vs shelves:
From social tagging to social classification. In Proceedings of the 22nd ACM
Conference on Hypertext and Hypermedia, HT ’11, pages 93–102, New York, NY,
USA, 2011. Association for Computing Machinery. ISBN 9781450302562. doi:
10.1145/1995966.1995981. URL https://doi.org/10.1145/1995966.
1995981.