MACHINE LEARNING METHODS FOR
MARKETING MANAGERS AND POLICY MAKERS
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Piyush Anand
May 2021
© 2021 Piyush Anand
ALL RIGHTS RESERVED
MACHINE LEARNING METHODS FOR MARKETING MANAGERS AND
POLICY MAKERS
Piyush Anand, Ph.D.
Cornell University 2021
This dissertation is a combination of three papers that use machine learning
methods to investigate research questions of interest to marketing managers and
regulators.
In the first chapter, the authors investigate whether firms that face crises
have systematically different corporate culture, and whether employee reviews
provide early warnings. The authors examine these questions in the context of
the Wells Fargo consumer banking crisis, and compare the culture discussions in
online reviews posted by employees of Wells Fargo to those of other banks. They
measure two important dimensions of corporate culture – control of employees
and stability of processes, and focus on competition-based goals and outcomes.
They find that the sentiment of culture discussions on competition goals, and on
rules and stability, is more negative at Wells Fargo than at other banks, and
that this is visible as far back as ten quarters before the crisis was revealed.
These are the same causes of the crisis identified in a definitive post-mortem
report. Additionally, they identify another bank that could potentially be at
risk of a crisis, and find similar results for other consumer-harming crises at
General Motors in 2014, Chipotle in 2015, and Mylan in 2016.
In the second chapter, the authors use social media image data to estimate the
impact of taxes on underage vaping. Various states in the US have enacted taxes
to discourage e-cigarette usage, especially since underage usage has grown
significantly. Data on underage consumption are difficult to obtain since it is
illegal for underage consumers to purchase these products, and estimation of tax
impact has been limited by these constraints. The authors use publicly available
user-posted images on social media from Jan 2016 - Dec 2018 to measure the
impact of higher taxes on underage posting behavior. These posts are a rough
proxy for normalization, and potentially for consumption, among the underage
population. Age and other demographics are detected using an ensemble of image
analysis methods - Mask R-CNN (He et al., 2017) and Aggregated Residual Neural
Networks (Xie et al., 2017). The authors also develop methods to estimate
disguised posting of usage images, given their purported utilization by underage
users. Using generalized synthetic controls (Xu, 2017), the authors find that
only the states with higher taxes - Pennsylvania and California - saw a decline
in underage e-cigarette posts. California’s decline is preceded by an increase
in disguised posting, and Pennsylvania’s decline is accompanied by increased
engagement for the underage posts. The authors also estimate the impact of taxes
on posting by race and gender.
In the third chapter, the authors examine generative adversarial networks
(GANs) as a privacy-protecting approach to customer data transfer. As customer
privacy becomes increasingly important to marketers, the authors investigate
GANs’ ability to transfer a generative model instead of data, thereby avoiding
the process of sampling and anonymizing customer data for release in various
analytic use cases. The authors find that GANs excel in preserving desired
characteristics of the original data and protecting privacy as compared to
benchmarks. With real-world data, the authors find that GANs achieve double the
accuracy of the best benchmark. Additionally, they demonstrate GANs in
different marketing contexts, namely pricing for optimal profits and customer
targeting, and show that a single GAN can handle multiple problems. Finally,
they demonstrate volume and velocity advantages of GANs in handling larger data
and real-time data streams.
BIOGRAPHICAL SKETCH
Piyush Anand’s research examines how machine learning methods can enrich
managers’ and regulators’ understanding of consumer and firm behavior. His
dissertation focuses on topics of corporate culture and consumer harm crises us-
ing text analysis, health tax policy impact using image analysis, and customer
data privacy protection using machine learning methods. Prior to joining the
Ph.D. program at Cornell University, Piyush was a Category Manager at Amazon.
He obtained a Post Graduate Diploma in Management from the Indian Institute of
Management Ahmedabad, and a Bachelor of Technology from the Indian Institute of
Technology Guwahati.
ACKNOWLEDGEMENTS
My dissertation has benefited tremendously from the valuable feedback and sup-
port of my advisor, Dr. Vrinda Kadiyali. Throughout the different stages of my
PhD, she encouraged me to learn the different skills needed and to explore re-
search topics of managerial and regulatory importance. I am grateful to her for
her guidance during my doctoral studies.
I thank my committee members, Dr. Bharath Hariharan, Dr. David Mimno,
and Dr. Shawn Mankad for their feedback during my PhD. I am especially grate-
ful to Dr. Bharath Hariharan for his suggestions for the second chapter of my
dissertation. I thank Dr. Clarence Lee and Dr. Manoj Thomas for their insightful
suggestions during the course of our co-authored projects. I would also like to
express my gratitude to the marketing faculty and colleagues for their helpful
feedback, and to the NLP and computer vision community at Cornell for the
opportunity to present my work and for their suggestions. I am thankful for the
recognition and financial support of the ISMS Doctoral Dissertation Award, the
Shankar-Spiegel Doctoral Dissertation Proposal Award, the Dyckman Research
Grant, and the Byron E. Grote, MS ’77, Ph.D. ’81 Johnson Professional Scholarship.
I am indebted to my family for their support and encouragement - my parents,
Mrs. Asha Anand and Dr. Manoj Anand, my wife, Aanchal Raisahib, and my
siblings, Rahul Anand and Aarti Anand Ohri. I dedicate my dissertation to my
late grandfather - Prabh Dyal Anand, from whom I learned the values of patience,
hard work, and perseverance.
TABLE OF CONTENTS
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Employee Reviews as Leading Indicators of Consumer Harm Crises: A
Text Mining Approach 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Independent Directors of the Board of Wells Sales Practices Investi-
gation Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Culture and Textual Analysis Model . . . . . . . . . . . . . . . . . . 14
1.5.1 A framework for measuring dimensions of culture . . . . . 14
1.5.2 Text measure of culture . . . . . . . . . . . . . . . . . . . . . 17
1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.1 Wells Employees’ Discussion of Corporate Culture . . . . . 23
1.6.2 Risk Assessment of Other Banks . . . . . . . . . . . . . . . . 33
1.6.3 Generalizability to Other Consumer Facing Crises . . . . . . 34
1.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.9.1 Wells List of Firms . . . . . . . . . . . . . . . . . . . . . . . . 50
1.9.2 Seed Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.9.3 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . 53
1.9.4 Word2Vec Hierarchical Softmax . . . . . . . . . . . . . . . . 53
1.9.5 List of Other Consumer Facing Crises Firms . . . . . . . . . 54
2 Smoke and Mirrors: Impact of E-Cigarette Taxes on Underage Social Me-
dia Posting 56
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2 Electronic Cigarettes and State-Wide Taxes in the US . . . . . . . . . 59
2.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . 68
2.5.2 Detecting Demographics in Images . . . . . . . . . . . . . . 71
2.5.3 Detecting Disguising in Images . . . . . . . . . . . . . . . . . 80
2.5.4 Estimating the Effect of Tax Policies . . . . . . . . . . . . . . 90
2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.6.1 Impact of Taxes on Underage Vaping Posts . . . . . . . . . . 94
2.6.2 Impact of Taxes on Underage Disguising . . . . . . . . . . . 98
2.6.3 Impact of Taxes by Race and Gender . . . . . . . . . . . . . . 103
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
2.9.1 Number of Social Media Posts by Users and Proportion of
Posts Related to Vaping . . . . . . . . . . . . . . . . . . . . . 112
2.9.2 Detecting Disguising in Images . . . . . . . . . . . . . . . . . 113
2.9.3 Supplementary Analysis with Difference-in-Differences . . 114
2.9.4 Engagement Results . . . . . . . . . . . . . . . . . . . . . . . 123
2.9.5 Impact of Taxes by Race and Gender . . . . . . . . . . . . . . 139
2.9.6 Other Common Objects in Posts . . . . . . . . . . . . . . . . 152
3 Using Deep Learning to Overcome Privacy and Scalability Issues in Cus-
tomer Data Transfer 155
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2 Existing Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
3.3 Methodology: Extant Approach and Benchmarks . . . . . . . . . . 164
3.3.1 Extant versus Proposed Data Transfer Paradigm . . . . . . . 164
3.3.2 Benchmark Methodology . . . . . . . . . . . . . . . . . . . . 168
3.3.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 171
3.4 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
3.4.1 Generative Adversarial Networks . . . . . . . . . . . . . . . 184
3.4.2 Design of Neural Networks . . . . . . . . . . . . . . . . . . . 185
3.4.3 The Picture-Data Analogy and Extension to Heterogeneity . 187
3.4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
3.5 Empirical Context and Results . . . . . . . . . . . . . . . . . . . . . . 193
3.5.1 Accuracy - Privacy Trade-off . . . . . . . . . . . . . . . . . . 193
3.5.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
3.5.3 Generalizability to Marketing Problems . . . . . . . . . . . . 208
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
3.8.1 Inference Model and Information Loss . . . . . . . . . . . . . 224
3.8.2 Monte Carlo Experiment 1 Data . . . . . . . . . . . . . . . . 226
3.8.3 Monte Carlo Experiment 2 Data - Customer Targeting . . . . 227
3.8.4 Monte Carlo Experiment 3 Data - Tackling Multiple Market-
ing Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.8.5 GAN Design and Training . . . . . . . . . . . . . . . . . . . . 229
3.8.6 Relationship between Hyperparameters and Accuracy . . . 229
3.8.7 Model Architecture and Accuracy . . . . . . . . . . . . . . . 233
LIST OF TABLES
1.1 Number of Employee Reviews (ranked by assets as of 03/31/2018) 10
1.2 Most Similar Words for Culture Topics in Pros and Cons Sections
of Employee Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3 Employee Reviews: Discussions of Culture Topics . . . . . . . . . . 26
1.4 Semi-Elasticities for Wells in Discussions of Culture . . . . . . . . . 28
1.5 Employee Reviews: Discussions of Culture Topics Over Time . . . 31
1.6 Semi-Elasticities for Wells in Discussions of Culture Over Time . . 32
1.7 Generalizability to Other Crises . . . . . . . . . . . . . . . . . . . . 37
1.8 Generalizability to Other Crises – Elasticities for Overall Culture . 40
2.1 Number of Posts by Hashtag . . . . . . . . . . . . . . . . . . . . . . 64
2.2 Information Observed for Each Post . . . . . . . . . . . . . . . . . . 67
2.3 Summary Statistics for State-Week Level Variables . . . . . . . . . . 93
2.4 State-Wide Effect of Taxes: Underage . . . . . . . . . . . . . . . . . 117
2.5 State-Wide Effect of Taxes: Other Demographics . . . . . . . . . . . 118
2.6 Robustness Results for Table 2.4: Model Specifications . . . . . . . 120
2.7 State-Wide Effect of Taxes: Disguising . . . . . . . . . . . . . . . . . 121
2.8 Robustness Results for Table 2.7: Model Specifications . . . . . . . 122
2.9 Disguising and Common Objects . . . . . . . . . . . . . . . . . . . . 153
2.10 Underage and Common Objects . . . . . . . . . . . . . . . . . . . . 153
2.11 Underage, Disguising, and Common Objects . . . . . . . . . . . . 154
3.1 Description of Benchmark Methods . . . . . . . . . . . . . . . . . . 171
3.2 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 196
3.3 Distribution Metrics (Lower the Better) . . . . . . . . . . . . . . . . 198
3.4 Data Volume and Estimation Time . . . . . . . . . . . . . . . . . . . 204
3.5 Data Volume and Generator Compression . . . . . . . . . . . . . . 205
3.6 Price Markups from Eq. 3.9 for True Data and Benchmarks . . . . . 209
3.7 Optimal Profit Ratio from Eq. 3.10 for Benchmark Methods w.r.t
True Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
3.8 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 212
3.9 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 214
LIST OF FIGURES
1.1 Competing Values Framework . . . . . . . . . . . . . . . . . . . . . 15
1.2 Word2Vec Skip-gram Model . . . . . . . . . . . . . . . . . . . . . . 19
2.1 Number of Vaping Related Posts by State and Year in the US . . . 65
2.2 Building Blocks for CNNs . . . . . . . . . . . . . . . . . . . . . . . . 70
2.3 ResNet-18 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.4 Residual Block of ResNet . . . . . . . . . . . . . . . . . . . . . . . . 75
2.5 ResNeXt and ResNet Blocks . . . . . . . . . . . . . . . . . . . . . . . 77
2.6 Distribution of Age in Training Data . . . . . . . . . . . . . . . . . . 78
2.7 Example of Object Detection for Juul and Emoji . . . . . . . . . . . 81
2.8 Faster R-CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . 84
2.9 Mask R-CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . 87
2.10 Example of Annotated Data for Disguising . . . . . . . . . . . . . . 88
2.11 Effect of Taxes on Underage Posts . . . . . . . . . . . . . . . . . . . 94
2.12 Effect of Taxes on Underage Disguising . . . . . . . . . . . . . . . . 99
2.13 Number of Posts and Users based on Proportion Vaping Posts Cutoff . 112
2.14 Emoji List for Detection . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.15 Effect of Taxes on Likes: Underage . . . . . . . . . . . . . . . . . . . 124
2.16 Effect of Taxes on Comments: Underage . . . . . . . . . . . . . . . 127
2.17 Effect of Taxes on Likes: Underage Disguised . . . . . . . . . . . . 130
2.18 Effect of Taxes on Comments: Underage Disguised . . . . . . . . . 133
2.19 Effect of Taxes on Solo Faces in Posts . . . . . . . . . . . . . . . . . 136
2.20 Effect of Taxes on Gender: Female . . . . . . . . . . . . . . . . . . . 140
2.21 Effect of Taxes on Race: Asian . . . . . . . . . . . . . . . . . . . . . 143
2.22 Effect of Taxes on Race: Black . . . . . . . . . . . . . . . . . . . . . . 146
2.23 Effect of Taxes on Race: White . . . . . . . . . . . . . . . . . . . . . 149
3.1 Data Transfer Paradigm Comparison . . . . . . . . . . . . . . . . . 169
3.2 Benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
3.3 Proposed using GAN . . . . . . . . . . . . . . . . . . . . . . . . . . 169
3.4 Design of Generator Neural Network . . . . . . . . . . . . . . . . . 186
3.5 Design of Discriminator Neural Network . . . . . . . . . . . . . . . 187
3.6 Picture Data Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.7 A “Picture” in Panel Data Context: Without Heterogeneity. . . . . 189
3.8 A “Picture” in Panel Data Context: With Heterogeneity. . . . . . . 189
3.9 Distributional Accuracy for Synthetic Data . . . . . . . . . . . . . . 195
3.10 Performance on Monte Carlo Data. . . . . . . . . . . . . . . . . . . 199
3.11 Information Loss and Loss of Privacy on Real Data . . . . . . . . . 202
3.12 Generator Size and GAN Complexity . . . . . . . . . . . . . . . . . 206
3.13 Streaming Data and Information Loss . . . . . . . . . . . . . . . . . 208
3.14 Performance for Customer Targeting . . . . . . . . . . . . . . . . . 213
3.15 Performance for Tackling Multiple Problems . . . . . . . . . . . . . 216
3.16 GAN Training with Loss Function Gradients . . . . . . . . . . . . . 230
3.17 Information Loss vs. Neuron Dimensionality. . . . . . . . . . . . . 231
CHAPTER 1
EMPLOYEE REVIEWS AS LEADING INDICATORS OF CONSUMER HARM
CRISES: A TEXT MINING APPROACH
1.1 Introduction
Do firms that face crises have systematically different corporate culture? Can
online reviews posted by employees in public forums provide early warning of
potentially dysfunctional culture? We examine these questions for corporate
crises that cause consumer harm, first in the context of the Wells Fargo
consumer banking crisis. In this crisis, Wells Fargo (“Wells” hereon) created
millions of unauthorized bank and credit card accounts without its customers’
knowledge (Arnold 2016; Ochs 2016). These revelations came to light in
September 2016. Since then, much has been written in the press about how Wells’
corporate culture was directly responsible for this crisis. According to the
business press, some aspects of this allegedly problematic culture included
aggressive sales targets, lack of oversight by the leadership of different
divisions in the bank, and strong leaders in divisions who demanded loyalty.1
Since September 2016, Wells has paid up to $3 billion in federal fines alone,
with state fines on top of that, and has lost significant firm value.
This chapter is: Anand, Piyush, Vrinda Kadiyali, and Vishal Narayan (2020). “Employee
Reviews as Leading Indicators of Consumer Harm Crises: A Text Mining Approach.”
1Wells continues to struggle with finding fixes. For example, see “Charles Scharf Puts Stamp on
Wells Fargo With Overhaul of Reporting Lines”, Wall Street Journal, Feb 11 2020
In the aftermath of the consumer banking division crisis, Wells conducted its
own investigation into what led to the crisis, knowing that regulators were
closely watching its process and conclusions and hence having every incentive
to report accurately. As a result, a definitive investigative report on the
improper sales practices was produced by the independent directors of its board
(“the Report” henceforth).2 Unfortunately, even as the Wells crisis and its
extensive and extended effect on Wells’ profits have been discussed in the
press, there has been some recent concern that similar crises might be brewing
at other banks.3 As expected, this worries the regulatory authorities because
of the potential impact on consumers and investors.
2Independent Directors of the Board of Wells Fargo Company, Wells Fargo, April 10 2017
3See, e.g., Economist, June 14, 2018, “Other American banks may have misbehaved as Wells
Fargo did. Which ones?”. Recently, American Express was accused of using similar tactics in its
small business division – “AmEx Staff Misled Small-Business Owners to Boost Card Sign-Ups”,
Wall Street Journal, March 1 2020
We measure corporate culture at Wells before the crisis came to light in the
press, to see whether this measured corporate culture is different from that of
its peers, and whether it is concordant with the findings of the definitive
post-mortem investigation. We also want to understand if similar culture
problems currently afflict other banks. While corporate culture has been
discussed in the business press and in academic research, academic measurement
has traditionally relied on employee surveys, firm 10-K filings, etc. There are
no public-use Wells employee surveys available from before and after the crisis,
nor are there similar surveys available for other banks currently. Therefore,
following Bhandari et al. (2017) and Corritore, Goldberg, and Srivastava (2019),
we use textual data from anonymous employee reviews from a leading jobs website
in the US.4 We are especially
interested in estimating whether these employee reviews can provide any indica-
tions of dysfunctional culture before crises become public.5 The presence of the
Report provides the “ground truth” to see if employee reviews are valuable, and
differentiates our work from other research in this area. We use insights from an
organizational framework - competing values framework (CVF hereon; Cameron
et al. 2014) - to guide our measurement of corporate culture. This framework
categorizes corporate culture along two key dimensions - the extent of control of
employees and stability of processes, and the extent to which the firm is organized
to focus on internal structure or external market/competition-oriented goals.
By all accounts, Wells had a problematic culture before the crisis was reported
in the media. Therefore, we cannot use causal inference methods to isolate which
aspects of corporate culture caused the crisis. Instead, we compare Wells’ culture
to that of its bank rivals who did not face a corporate crisis like Wells (or at least,
did not have a crisis that became public by July 2018, when our data period
ends). We restrict ourselves to large national commercial banks as reported by the
Federal Reserve – banks with no foreign ownership and at least 100 branches in
the US as of March 31, 2018.6 Data for smaller banks is sparser. We obtain 34,306
4The literature on employee voice discussed how Employees are often the “silent” stakehold-
ers of organizations, despite employees being closest to the firm and having a significant effect on
firm (Chang, Oh, and Park 2017; Huang et al. 2015; Moniz 2017; Moniz and Jong 2014).
5As the previous paragraph indicates, the corporate culture at Wells could well have been
viewed positively by those rewarded for the achievement of aggressive sales targets, or even non-
cheating employees who found attractive the more free-wheeling culture with few oversights. It is
likely there were other employees with different views. In our measurement, we capture aggregate
positive and negative sentiment across all employees.
6Source: Federal Reserve Statistical Release - Large Commercial Banks, release date: March 31,
2018
3
reviews for 32 large national commercial US banks before the Wells crisis becomes
public, spanning the period May 2008 – Sep 2016.7
We then compare these measures of culture at Wells to those identified as the
cause of crisis by the Report. We find that employees at Wells discuss fewer
culture strengths and more culture weaknesses than employees at other large
national commercial banks. Importantly, these culture drawbacks overlap with
those identified as key causes of the crisis in the Report. Next, we analyze
employee reviews for the top 10 large national commercial banks in the US to see
how close their culture measures are to those in the pre-crisis employee reviews
of Wells. We find troubling similarity between the pre-crisis corporate culture
of Wells and today’s corporate culture at another bank.8
Additionally, we examine the generalizability of our results outside of the
banking industry by measuring culture discussions in employee reviews for three
other consumer-facing crises, at General Motors in 2014 (faulty ignition),
Chipotle in 2015 (food contamination), and Mylan in 2016 (exorbitant pricing of
a pharmaceutical product), starting from ten quarters before these crises became
public. This analysis shows similar results and provides convergent evidence on
the usefulness of
employee reviews to identify corporate culture dysfunction in not just Wells and
the banking industry, but in other industries too.
7See Table 1.1 for details on these banks
8We do not reveal the name of this bank in this paper
The substantive contribution of our paper is to demonstrate, across several
crises, that employee reviews can provide systematic signals of corporate crises
before these crises become public. Previous studies in marketing have examined
what happens to marketing outcomes like brand strength, consumer engagement,
etc. after product crises (Borah and Tellis 2016; Van Heerde, Helsen, and
Dekimpe 2007; Zhong and Schweidel 2020), the financial impact of such crises
(Chen, Ganesan, and Liu 2009), and the determinants of firm responses to such
crises (Liu, Liu, and Luo 2016). To the best of our knowledge, we are the first
to demonstrate the association of negative corporate culture with the occurrence
of crises, i.e., as a leading indicator before the crises are revealed in the
press. Previous scholars in marketing have used CVF (Deshpande and Farley 2004;
Lukas, Whitwell, and Heide 2013) to study firm culture. Our methodological
contribution is combining CVF and current text mining methods on a new source of
publicly available data in marketing: employee reviews. Employee reviews are
free from potential biases in employee survey data
(e.g., demand effects, order effects, social desirability bias). Our results of identi-
fying elements of problematic culture are likely to be useful both to managers and
regulators to reduce consumer harm. Managers of firms facing consumer harm
crises can use signals from reviews as a basis for investment in improvement of
culture. Regulators can use these publicly available signals to identify which firms
might be at risk of causing consumer harm, and accordingly direct their monitor-
ing efforts.
The rest of the paper proceeds as follows. In Section 1.2, we discuss related
literature. In Section 1.3, we discuss the data. Section 1.4 discusses the
Report. Section 1.5 details the culture framework and text mining model.
Section 1.6 presents results, and we conclude in Section 1.7.
1.2 Literature Review
Given the interdisciplinary nature of our research question, we discuss several
areas of research that are related to our work.
Consider first the definition of culture. Economists have defined culture as
comprising shared language, shared knowledge and established rules of behav-
ior in ways that are unique to a company (Cremer 1993). Graham et al. (2017)
find in their survey that executives characterize culture as a “beliefs
system”, “coordination mechanism”, and “invisible hand”. According to O’Reilly
(1989), an organizations scholar, culture comprises control systems and a
normative order. Therefore, culture can be efficiency-improving, but it can also become
too ingrained in suboptimal ways. This definition accounts for both formal and
informal culture-defining practices. We follow this definition.
With this definition in mind, we first discuss the literature on the impact of
corporate culture on various outcomes. O’Reilly, Chatman, and Caldwell (1991)
find better organizational performance when there is a fit between personal and
organizational values. Green et al. (2019) suggest good corporate culture is con-
nected to better stock returns. Culture has also been found to be a driver of radical
innovation (Tellis, Prabhu, and Chandy 2009), of merger success (Chang, Oh, and
Park 2017), of financial reporting risk (Ji, Rozenbaum, and Welch 2017), etc. In
Graham et al.’s (2017) survey, senior executives believe corporate culture is a
top-three driver of firm value, and 92% believe better culture would lead to
higher value. Guiso, Sapienza, and Zingales (2015b) find that an increase in
integrity is associated with increases in Tobin’s q and in profits.
The papers above provide important background motivation. However, our
paper differs from these papers on a number of dimensions. First, we are inter-
ested in trying to isolate the link between (dysfunctional) corporate culture and
(lower) corporate performance - specifically, the occurrence of the crisis. Our
focus on a negative outcome, a major crisis, is also quite new in the literature. The
closest paper to ours is Ji, Rozenbaum, and Welch (2017), who use employee
ratings of culture and values on an employee reviews website, and find that lower
ratings for culture and values are associated with reporting fraud. However, we
differ from Ji, Rozenbaum, and Welch (2017) on five major points. First, we use
CVF to measure the nature of discussions of culture from employee reviews, as
opposed to the ratings on a 5-point scale. This approach allows us to examine
culture in an organizational theory framework which is tied to agency theory and
competition. Second, we exploit the demarcated nature of employee reviews to
get separate measures for positive and negative discussions of culture. Third, our
substantive interest is in understanding the role of corporate culture in
consumer-facing crises, and not in accounting fraud. Fourth, we do text analysis to extract
semantic measures of discussions in text. Fifth, the existence of the Report allows
us to validate our measures as being potentially causal.
As expected, our work is also related to the literature on firm crises. These
papers focus on the effects of firm crises. Van Heerde, Helsen, and Dekimpe
(2007) find that a firm in crisis faces reduced effectiveness of its marketing
activities. Chen, Ganesan, and Liu (2009) find that a product-harm crisis leads
to a decrease in brand equity, consumer preferences, firm reputation, and market
share. Borah and
Tellis (2016) study the impact of product recalls on online discussions and social
media. Zhong and Schweidel (2020) use a Dirichlet process – hidden Markov
model to capture change in discussions on social media for recent brand crises.
As crisis events unfold, different stakeholders are affected negatively, and their
responses range from regulatory fines and penalties to decreased brand loyalty from
consumers, lower employee morale, increased negative mentions in media, and
lower financial performance (Liu, Liu, and Luo 2016; Pearson and Clair 1998; Wei,
Ouyang, and Chen 2017; Zhao, Zhao, and Helsen 2011). Unlike these papers, we
want to identify cultural leading indicators of crises rather than the effect of crises.
Importantly, this allows us to identify another at-risk bank. Furthermore, we use
CVF to infer correlates of crises, which can serve as leading indicators.
Another relevant stream of literature concerns the measurement of corporate
culture. Weber and Camerer (2003) and Burks and Krupka (2012) use laboratory
experiments to “treat” culture and observe outcomes (difficulty after merger and
ethics, respectively). As mentioned earlier, Guiso, Sapienza, and Zingales
(2015a, b) use the “Great places to work” survey of employees and corporations’
own descriptions of culture to measure differences between employees’ perception
of culture and the company’s self-stated culture. Our approach of text mining
employee reviews is close to Bhandari et al. (2017), who text-mined companies’
10-K statements to extract culture traits using Latent Dirichlet Allocation. In
terms of text mining methods, we are closest to Timoshenko and Hauser (2019)
who use Word2Vec (Mikolov et al. 2013) to capture semantic word embeddings
that are subsequently used as features in a hybrid deep learning model to capture
consumer needs from user generated content. We use Word2Vec in this paper to
extract discussions of culture from employee reviews.
1.3 Data
As discussed earlier, we collect publicly available employee reviews from a lead-
ing online jobs and reviews website in the U.S. We collect employee reviews for
Wells before the crisis became public. We collect all reviews from January 2008
to July 2018, for Wells and for 31 other banks. For comparability, and to ensure
there are enough reviews, we confined ourselves to the large commercial banks
with domestic ownership. As of March 2018, these banks had at least $300 million
in consolidated assets and at least 100 branches in the U.S.⁹ Table 1.1 lists the
number of reviews by year for each of these banks. Of the 46,385 reviews, 34,306
were posted before September 7, 2016, the date on which the crisis at Wells was
revealed (Arnold 2016; Ochs 2016). Seven banks (including Wells) account for 89.9%
of reviews. Banks with greater assets elicit more reviews, so our analysis controls
for assets.
9 The financial details of these banks, from the Federal Reserve, are listed in Appendix 1.9.1.
Table 1.1: Number of Employee Reviews (ranked by assets as of
03/31/2018)
This table shows the total number of reviews by bank and year (beginning May 1, 2008, ending on June 30, 2018)
Bank 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 Grand Total
JPMorgan Chase 192 164 317 399 618 852 1403 2020 1875 1279 599 9718
Bank of America 240 218 345 460 566 924 1389 2081 1918 1214 537 9892
Wells Fargo 224 169 352 463 720 891 1548 1999 1874 1366 550 10156
Citibank 145 119 173 158 197 322 445 621 722 571 253 3726
US Bank 66 54 73 101 148 229 375 480 510 369 172 2577
PNC Bank 13 14 71 67 122 194 300 467 464 364 160 2236
Capital One 74 74 84 90 171 243 372 686 687 574 340 3395
KeyBank 17 9 29 30 49 61 64 111 116 118 63 667
Citizens Bank 9 11 23 36 39 53 125 217 211 178 81 983
Huntington bank 5 6 9 12 18 33 69 112 115 97 49 525
Zions Bank 4 2 14 11 15 22 31 64 54 44 21 282
Peoples United Bank 2 3 3 4 9 20 22 27 37 24 15 166
First Tennessee Bank 2 7 6 8 20 12 17 22 29 27 25 175
BOK Bank 2 1 1 2 2 6 9 12 20 18 25 98
First National Bank of Pennsylvania 2 1 1 13 6 16 9 6 54
Associated Bank 2 3 2 7 9 13 15 30 16 35 8 140
Sterling National Bank 1 2 1 4 10 13 11 5 47
Valley National Bank 6 1 7 5 11 20 27 15 11 103
Webster Bank 1 5 6 9 12 14 10 5 62
TCF National Bank 4 3 14 20 22 45 63 74 74 43 20 382
MB Financial 2 1 3 6 5 9 9 13 12 12 9 81
First National Bank of Omaha 1 3 7 8 5 23 21 25 20 14 127
Old National Bank 1 1 1 3 1 5 3 6 15 3 3 42
Washington Federal Bank 2 4 6 29 22 35 21 8 127
Trustmark National Bank 6 3 6 11 11 5 2 44
Fulton Bank 1 1 4 2 5 11 18 9 1 52
Centerstate Bank 1 2 1 3 2 9
NBT Bank 1 2 1 4 5 10 3 26
Park National Bank 1 1 4 2 5 2 15
Woodforest National Bank 3 6 6 9 16 32 37 78 78 67 29 361
DBA First Convenience Bank 1 2 10 9 17 25 34 12 7 117
Grand Total 1009 867 1537 1904 2798 4005 6418 9270 9034 6523 3020 46385
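The 89.9% share reported in the text can be checked against the row totals of Table 1.1. The short sketch below assumes that the "7 banks" are the seven with the most reviews; the function name `share_of_reviews` is introduced here only for illustration.

```python
# Grand totals per bank, copied from the last column of Table 1.1.
# Assumption: the seven banks referred to in the text are the seven
# with the largest review counts.
top_seven = {
    "JPMorgan Chase": 9718,
    "Bank of America": 9892,
    "Wells Fargo": 10156,
    "Citibank": 3726,
    "US Bank": 2577,
    "PNC Bank": 2236,
    "Capital One": 3395,
}
total_reviews = 46385  # grand total from Table 1.1


def share_of_reviews(counts, total):
    """Return the fraction of all reviews accounted for by the given banks."""
    return sum(counts.values()) / total


share = share_of_reviews(top_seven, total_reviews)
print(f"{share:.1%}")
```

Under this assumption the computed share rounds to the figure stated in the text.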
For each review, we observe the date on which the review was posted, whether
the reviewer was working at the bank at the time of posting, and whether the
reviewer disclosed a job title; we refer to reviews without a job title as
anonymous. In our data, 59% of the reviews are posted by employees who were
working at the firm at the time of posting, i.e., current employees, while 41%
are posted by former employees. Reviewers disclose their job title in 86% of the
reviews, while the remaining 14% are anonymous. Content may differ between
anonymous and non-anonymous reviews, as well as between current and former
employees, so we control for these variables.
In the text-based portion of a review, the website has three prompts – pros,
cons, and advice to management. Following Lee and Bradlow (2011), we use
this demarcation to label as positive the sentiment of topics extracted from pros,
and as negative the sentiment of topics extracted from cons. The mean length of
employee reviews is 50.80 words (SD = 33.21). The pros section has a mean of
17.33 words (SD = 13.57), and the cons section has a mean of 20.15 words (SD =
16.96). Since both the depth of discussion of any topic and the number of topics are
likely affected by the length of the review, we control for the length of the reviews in our
analysis.
Finally, research on employee reviews from this website (Corritore, Goldberg,
and Srivastava 2019; Marinescu and Posner 2019) has reported evidence against
any systematic selection biases due to non-random sampling of employees who
post reviews on this website. Another bias could arise if Wells successfully pres-
sured employees to post positive reviews. Our finding that employees of Wells
post fewer positive reviews than those of comparable firms suggests that any such
pressure was perhaps ineffective.
1.4 Independent Directors of the Board of Wells Sales Practices
Investigation Report
In the aftermath of the Wells crisis, the independent directors of the board of Wells
conducted a sales practices investigation. The findings of the investigation were
made public on April 10, 2017, in a 110-page report that lists the root causes
of the improper sales practices at Wells. An excerpt from the report summary is
as follows:
“The root cause of sales practice failures was the distortion of the
Community Bank’s sales culture and performance management system,
which, when combined with aggressive sales management, created
pressure on employees to sell unwanted or unneeded products to
customers and, in some cases, to open unauthorized accounts.”
The report discusses the sales practices and culture at Wells, and how it led to
the crisis. The report highlights two aspects of corporate culture that led to these
poor sales practices. First, there was a high weight placed on performance as
measured by sales, since these were seen as directly linked to higher market share
and higher profits. Second, the decentralized structure of the bank with lack of
formal procedures was a major reason why the sales practices went unnoticed.
Even when there were reported incidents of improper sales practices, they were
treated as independent incidents and were not seen as manifestations of the
same underlying issue.
A more recent report was released by Wells on January 30, 2019. This report, a 104-page
document titled “Learning from the past, transforming for the future”,
discusses what Wells learned from the crisis and how it intends to prevent such
incidents from happening again. Both reports highlight that the root
cause of the crisis was the combination of a high sales focus with a decentralized
business model, and a failure of control functions that allowed the improper sales
practices to go unchecked. The findings of these reports serve as a benchmark for
our analysis.
1.5 Culture and Textual Analysis Model
1.5.1 A framework for measuring dimensions of culture
The key conceptual framework for measuring culture is the Competing Values
Framework (CVF). This framework categorizes culture on two dimensions (see
Figure 1.1). The first dimension is the degree of control of processes: one end of
the spectrum represents firms that are flexible and dynamic, while the other end
represents stability and order.¹⁰ Among other things, this dimension
captures the extent to which the firm formally deals with or needs to deal with
agency problems. The second dimension is the degree of market or internal focus: one end of
the spectrum represents firms that focus on integration and uniformity of various
activities inside the company, while the other end represents firms
that are more focused on externally oriented, specifically market-based, drivers of
activities.¹¹
Based on these two dimensions, four major culture types are obtained. First,
consider a company with a higher focus on stability and internal consistency. This
culture type places importance on predictability, conformity, and internal processes,
and has a lower focus on external markets (lower left quadrant). This
corporate culture, termed “control”, captures Weber’s theory of the organization of
a firm for the industrial age. In years past, conglomerates like GE or Westinghouse
were predominantly structured this way. Next, consider a company with
high process/employee stability and order and a high focus on markets rather than
on internal consistency (lower right quadrant). This corporate culture, which
is competition-oriented and termed “compete” in this framework, is related to
Williamson’s (1981) framework, where a firm’s key role is to transact
with outside entities. Additionally, for these firms, the goal is competitive strength
and shareholder value. Many companies of the old economy are described well
by this culture: P&G, Coca-Cola, IBM, etc.
Figure 1.1: Competing Values Framework
10 Among other things, this dimension captures the extent to which the firm formally deals with
or needs to deal with agency problems, including those caused by incentives set by the firm. This is directly
relevant to the Wells case.
11 This dimension captures, among other things, the extent of competitive pressures faced by a
firm. These competitive pressures can come from economic conditions in the industry (e.g., declining
revenues or high fixed costs) and compensation schemes of CEOs and managers (Aggarwal
and Samwick 1999). The Report suggests managers’ incentive compensation schemes were tied to
achieving aggressive competitive targets.
Third, consider a company with flexibility and dynamism in employee/process
control and an internal focus (upper left quadrant). This corporate
culture is called “collaborate”. These firms are characterized by loyalty to leaders,
tradition, commitment, and cohesion. These features can potentially substitute for
formal employee control/agency solutions. This culture is common in several
family firms and several Chinese firms (Chen 2001). Fourth, consider a firm with
flexible employee/process control and an emphasis on market transactions
(upper right quadrant). This culture, termed “create”, is most salient for the information age, where
companies face a rapidly changing industry landscape and have work patterns that
are best supported by temporary teams that form and dissolve over the course of
projects. These companies are also responsive to market pressure (e.g., because of
increasing returns technologies) and are focused on how to beat these pressures.
Two aspects of the framework are important to highlight for our application.
First, the framework is descriptive and not normative; there is no guidance on
what constitutes optimal culture. In our application, we measure the association
of culture and crisis by comparing crisis and non-crisis firms. Second, a company
can have multiple types of culture, in different divisions, and at different levels
within a division.
Cameron et al. (2014) argue that CVF has a high degree of congruence with
other well-known constructs of values, the way people think and their assump-
tions, such as McKenney and Keen (1974), Mitzroff and Killman (1978), and Myers
(1962). CVF has been used as well as adapted in management and organizations
literature to assess organizational culture (Hartnell, Ou, and Kinicki 2011; Lavine
2014; Panayotopoulou, Bourantas, and Papalexandris 2003). Using CVF, Desh-
pande and Farley (2004) found that collaborate and control cultures have a signif-
icantly negative impact on firm performance, while create and compete cultures
have a significantly positive impact on firm performance. Lukas, Whitwell, and
Heide (2013) found, in the context of product design and overprovision in markets,
that a higher focus on create and compete cultures in firms leads to a potential
mismatch between customer needs and firms’ product capability decisions.
Our paper differs from existing CVF literature in the following ways. First,
we have the Report that establishes the causes of the crisis, and therefore we can
verify the cultural causes of crises. Second, conditional on measuring problematic
culture at Wells, we use it as a yardstick to see if other companies are similarly at
risk. In the absence of theory-driven hypotheses about problematic culture, our
empirical approach to creating a yardstick for problematic culture is novel.
1.5.2 Text measure of culture
We are interested in estimating the nature and extent of discussions of four dimensions
of corporate culture: create, collaborate, control, and compete, which are
captured by a set of seed words listed in Appendix 1.9.2. Adding or dropping
a few randomly selected seed words does not change our results. Counts
of seed words are not an ideal measure of discussion, since these words are not extensively
used in online reviews (Puranam, Narayan, and Kadiyali 2017). Furthermore, using
seed words by themselves to estimate the extent of discussion also suffers from
researcher subjectivity bias, as the word lists may not exhaustively cover the nature
of discussion of these topics. We therefore use Word2Vec, which captures the semantic
similarity of words using the contexts in which they appear.
We take two steps to measure culture discussions using Word2Vec. First, we
train the Word2Vec model on the reviews data. The model learns N-dimensional
vector representations, subsequently referred to as embeddings, for each
word in the corpus. To construct the embedding for a culture topic, we take the average
of the embeddings of the seed words for that culture topic. Second,
we estimate the extent of discussion of culture topics in the employee reviews as
the cosine similarity between the culture topic embeddings obtained from the first step and the
words in the text, aggregated for each employee review. We describe these two
steps below. We discuss pre-processing of text in Appendix 1.9.3.
In the first step, the Word2Vec model estimates embeddings for words based
on the contexts they occur in. Consider the example phrase: “Manager pushes
sales targets, high stress job.” If we select a context window of 2 words and se-
lect the focal word as “targets”, then the model observes that the word “targets”
occurs with the context: “pushes”, “sales”, “high”, “stress”. That is, 2 words pre-
ceding the focal word and 2 words post the focal word are the context when we
set the context window as 2.12 The model then estimates embeddings such that
the probability of predicting this context given the focal word is maximized for
all possible focal word and context combinations observed in the corpus. This
prediction task is the skip-gram model for Word2Vec (also see Figure 1.2).
Figure 1.2: Word2Vec Skip-gram Model
This figure illustrates the skip-gram model for Word2Vec using the example “Manager pushes
sales targets, high stress job”. The focal word is “targets”, and the model predicts the probability
of observing the context words around the focal word.
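The context-window construction described above can be sketched in a few lines. This is only an illustration of how focal words are paired with their context words for a window of 2; the tokenization (lowercasing and dropping punctuation) and the helper name `context_pairs` are assumptions made for the example, not the pre-processing used in this chapter.

```python
def context_pairs(tokens, window):
    """Yield (focal_word, context_words) for every position in a sentence.

    The context consists of up to `window` words before and up to
    `window` words after the focal word.
    """
    for t, focal in enumerate(tokens):
        left = tokens[max(0, t - window):t]
        right = tokens[t + 1:t + 1 + window]
        yield focal, left + right


# Example phrase from the text, tokenized by hand (assumed tokenization).
tokens = ["manager", "pushes", "sales", "targets", "high", "stress", "job"]

# The tokens are unique here, so a dict keyed by focal word is safe.
pairs = dict(context_pairs(tokens, window=2))
print(pairs["targets"])
```

For the focal word "targets", the sketch returns the same four context words listed in the text. Words near the start or end of the phrase simply receive a shorter context.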
The skip-gram Word2Vec model maximizes the following objective function:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-J \le j \le J,\; j \ne 0} \log p(w_{t+j} \mid w_t) \tag{1.1}$$
where $T$ is the total number of focal words, $J$ is the context window size, and
$w_t$ is the focal word at position $t$. The probability of observing a context word $w_i$
given focal word $w_j$ is:
$$p(w_i \mid w_j) = \frac{\exp\left({v'_{w_i}}^{\top} v_{w_j}\right)}{\sum_{k=1}^{V} \exp\left({v'_{w_k}}^{\top} v_{w_j}\right)} \tag{1.2}$$
where $v_{w_j}$ represents the input vector for word $w_j$ and $v'_{w_i}$ represents the output
vector for word $w_i$, out of a vocabulary of size $V$. The model thus has $2 \times N \times V$
parameters to be estimated, where $N$ is the number of neurons in the
hidden layer. The denominator in the softmax poses a computational challenge
as the vocabulary size is often large, and the computational cost of the gradient
of $p(w_i \mid w_j)$ is directly proportional to $V$. To deal with this, we use the hierarchical
softmax approximation for the softmax (Mikolov et al. 2013). It has the advantage
that the computation cost is proportional to $\log_2 V$ instead of $V$ with the softmax.¹²
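To make the computational argument concrete, the sketch below evaluates the full softmax of Eq. 1.2 on toy vectors and compares the parameter count 2 × N × V with the approximate number of binary decisions needed under hierarchical softmax. All numbers are made up for illustration, and `tree_depth` assumes a balanced binary tree, which is a simplification of the tree actually used by Word2Vec implementations.

```python
import math


def softmax_probability(focal_input, output_vectors, context_index):
    """p(context | focal) under the full softmax over the vocabulary."""
    scores = [sum(a * b for a, b in zip(out, focal_input))
              for out in output_vectors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_index] / sum(exps)


def parameter_count(embedding_dim, vocab_size):
    """Input plus output vectors: 2 x N x V parameters."""
    return 2 * embedding_dim * vocab_size


def tree_depth(vocab_size):
    """Binary decisions per prediction, assuming a balanced binary tree."""
    return math.ceil(math.log2(vocab_size))


# Toy example: a vocabulary of three words with two-dimensional vectors.
focal = [0.5, -0.2]
outputs = [[0.1, 0.3], [0.4, -0.6], [-0.2, 0.2]]
probs = [softmax_probability(focal, outputs, i) for i in range(len(outputs))]
print(probs, sum(probs))

# Hypothetical model size: N = 100 dimensions, V = 20,000 words.
print(parameter_count(100, 20000), tree_depth(20000))
```

Each full-softmax evaluation touches every word in the vocabulary, whereas the tree-based approximation needs only on the order of log2 V steps, which is the saving the text refers to.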
The objective function for the skip-gram Word2Vec model with hierarchical softmax
is maximized using stochastic gradient descent. In order to measure culture sentiment,
we exploit the demarcated pros and cons sections of the reviews.
We train two Word2Vec models on the pros and cons sections of the reviews separately.
The seed words are the basis for forming culture topic embeddings. We
construct culture topic embeddings as the average of the embeddings for the seed
words corresponding to that culture topic. Thus, we get the positive embedding for
a culture topic $c$ as $v_{c_p}$, which is obtained from the Word2Vec model trained on the
pros section of the reviews, and we get the negative embedding for a culture topic
$c$ as $v_{c_n}$ from the cons section of the reviews.¹³
12 We discuss the details of hierarchical softmax in Appendix section 1.9.4.
Based on the word embeddings estimated from these models, we measure the
discussion of a culture topic as the average cosine similarity between the culture topic
embedding and the words in the pros (or cons) section of an employee review.
More formally, we measure positive (and negative) discussions of
culture topics in reviews as follows:
$$d_{c,s}(r_s, v_{c_s}) = \frac{1}{l_s} \sum_{w_j \in r_s} \frac{v_{w_j}^{\top} v_{c_s}}{\lVert v_{w_j} \rVert \, \lVert v_{c_s} \rVert} \tag{1.3}$$
where $d_{c,s}$ is one of the two cosine similarity measures (positive and negative
discussion, based on the sentiment $s$) for a culture topic $c$, $l_s$ is the length (number
of words) of section $r_s$ of the review, and $v_{w_j}$ is the word embedding for a word
$w_j$. The support for these measures is [-1, 1], with the discussion in a review being
perfectly similar to the culture topic for a score of 1, orthogonal to the culture
topic for a score of 0, and perfectly dissimilar to the culture topic for a score of
-1.¹⁴ Thus, for the reviews we construct eight measures of culture discussions:
positive discussion and negative discussion for each of the four culture topics.
13 The model hyper-parameters for the two Word2Vec models that are optimized by maximizing
the held-out log-likelihood on 5% of the corpus are: dimension of the embeddings, size of context
window, and number of iterations over the corpus. We changed the held-out corpus to 2%, 5%, and
10% during Word2Vec training and found that the optimal model hyper-parameters do not change.
14 We find empirically in our reviews data that these measures are in the range [0.053, 0.685],
implying that our discussion measures vary from low similarity to high similarity.
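A minimal sketch of the measure in Eq. 1.3 follows, using made-up two-dimensional embeddings. The seed words, the embedding values, and the choice to skip words that have no embedding are all assumptions made for this illustration; they are not taken from the trained models in this chapter.

```python
import math


def mean_vector(vectors):
    """Element-wise average, used to build a culture topic embedding."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def discussion_score(section_tokens, embeddings, topic_vector):
    """Average cosine similarity between the section's words and the topic.

    Assumption: words without an embedding are skipped, so the average is
    taken over the in-vocabulary words of the section.
    """
    sims = [cosine(embeddings[w], topic_vector)
            for w in section_tokens if w in embeddings]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical embeddings, as if taken from the model trained on cons sections.
embeddings = {
    "sales": [1.0, 0.0],
    "targets": [0.8, 0.6],
    "pressure": [0.6, 0.8],
    "friendly": [0.0, 1.0],
}

# Hypothetical seed words for the compete topic.
compete_topic = mean_vector([embeddings["sales"], embeddings["targets"]])

score = discussion_score(["pressure", "targets"], embeddings, compete_topic)
print(round(score, 4))
```

Because every term is a cosine similarity, the score always lies in [-1, 1], matching the support described in the text.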
For both positive and negative sentiments, we also construct an overall measure
of culture discussion. Our overall measure of discussion of culture in review r is a
weighted average of discussion measures of the four culture topics for that review
in a section s, given by:
$$d_{\text{overall},s} = \frac{\sum_{c} W_{c,s}\, d_{c,s}}{\sum_{c} W_{c,s}} \tag{1.4}$$
where $d_{c,s}$ is the measured discussion of culture topic $c$ in section $s$ (pros
or cons), and $W_{c,s}$ is the mean discussion for sentiment $s$ of topic $c$ across all
reviews. An important point to note is that our measures of discussion are based on
similarities to culture topics, i.e., how similar the content of a review is to a culture
topic, and not on the magnitude (e.g., high or low) of discussion of the culture
topic in the review. Furthermore, longer reviews could include a lower proportion
of culture topics if there is discussion of other topics. Therefore, we control for the
length of a review (in words) in our subsequent analysis in order to account for
volume of discussion in a review.
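The weighted average in Eq. 1.4 can be illustrated as follows. The scores and weights below are invented numbers for one section of one review; in the chapter the weights are the mean discussion of each topic across all reviews.

```python
def overall_discussion(topic_scores, topic_weights):
    """Weighted average of the four topic-level discussion measures."""
    weighted_sum = sum(topic_weights[c] * topic_scores[c] for c in topic_scores)
    total_weight = sum(topic_weights[c] for c in topic_scores)
    return weighted_sum / total_weight


# Hypothetical discussion measures for the cons section of one review.
scores = {"create": 0.20, "collaborate": 0.30, "control": 0.30, "compete": 0.40}

# Hypothetical weights: mean negative discussion of each topic across reviews.
weights = {"create": 0.25, "collaborate": 0.30, "control": 0.30, "compete": 0.35}

overall = overall_discussion(scores, weights)
print(round(overall, 4))
```

With equal weights the measure reduces to a simple mean of the four topic scores, and in general it always lies between the smallest and largest topic score.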
1.6 Results
The results section is organized in three parts. First, we discuss the analysis of the
employee reviews for Wells relative to other banks. Second, to see whether other banks
might be at risk of (future) crises, we assess how close our estimates of their
culture topics are to those of Wells. Finally, we study three consumer-facing crises
outside the banking industry.
1.6.1 Wells Employees’ Discussion of Corporate Culture
We discuss the results of the Word2Vec model. In Table 1.2, we present the top
20 words that are most similar to the culture topics in the pros and cons sections of
the reviews, respectively.
We first want to understand whether Wells’ culture is systematically different
from that of its rivals in the ten quarters prior to the crisis reveal in the press. Therefore,
we run the following regressions for each of the culture topics (including the overall
culture topic) and sentiment for the employee reviews in the 10 quarters leading
up to the crisis reveal:
$$d_{r,j,s} = \alpha_{j,s} + \beta\,\delta_{\text{Wells}} + \rho \sum_{qtr} \delta_{qtr} + \gamma R_r + \epsilon_{r,j,s} \tag{1.5}$$
Table 1.2: Most Similar Words for Culture Topics in Pros and Cons Sections
of Employee Reviews
This table lists the 20 most similar words (using cosine similarity) for the culture topics embeddings as estimated from the Word2Vec models trained on the pros (and
cons) sections of the reviews. A culture topic embedding is estimated as the average of the embeddings of its seed words as listed in the appendix.
Create-Pros Create-Cons Collaborate-Pros Collaborate-Cons Control-Pros Control-Cons Compete-Pros Compete-Cons
technically entrepreneurial inviting appreciative procedures onerous ridiculous aggressive
agenda creative emphasized humans written impede targets pressure
inspirational innovative demonstrated participation conservatively principles evaluation sales
innovation innovation seeks unified workflow flows exceeding quotes
transformation evolve participation resolutions straightforward bloated timelines cooker
insightful stifle eachother mistreatment micromanagement adapting reachable campaigns
acceptance unified appreciative instill mature prioritization unrealistic aggravating
entrepreneurial stifled enthusiasm interpersonal overly protocol struggle lofty
bold stifles differences acknowledgment timings structuring hitting obscene
innovative averse advocate foster predictable regs achievable exceeded
contributors explore principles recognizing defined rigidity strictly unachievable
tune progressive disabilities outgoing consistent needlessly hustle ridiculous
implementing creativity treating sought structured robust sells payoff
emphasized adapting admirable faith consistent bureaucracy production obsession
exceptionally leverage sincere affecting rigid abrupt producer upsell
dynamic horizontal moral sole equitable workflow target goals
prides ease fostered sympathy staffs interference producing baseline
agile methodology communicative smartest informal synergy scores stressing
savvy deeply consensus degrading swag unforgiving objective appealing
evolving necessity enthusiastic synergy affected formalized quotas berated
where $d_{r,j,s}$ is the discussion measure for review $r$ and sentiment $s$, and $j$ is one of the four culture topics
or the overall culture topic. $\delta_{\text{Wells}}$ is the dummy variable for whether the review
is for Wells. $\delta_{qtr}$ is the quarter fixed effect. $R_r$ are the review-level characteristics:
the log of the length of the review, whether it is by a current employee, whether it is
by an anonymous employee, and whether it is by a current and anonymous
employee. $\epsilon_{r,j,s}$ is the idiosyncratic error term. We cluster the standard errors at the
firm-year level to account for correlations in the error term across reviews of a firm
in a given year. Adding bank-specific fixed effects does not change the results.
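As an aid to interpreting β in Eq. 1.5, the sketch below shows that in a stripped-down version of the regression, with only a constant and the Wells dummy, the coefficient equals the difference in mean discussion between Wells and other banks. The data are invented, and the quarter fixed effects, review-level controls, and clustered standard errors of the actual specification are deliberately omitted.

```python
def ols_dummy_coefficient(y, d):
    """Slope from regressing y on a constant and a 0/1 dummy d."""
    n = len(y)
    y_bar = sum(y) / n
    d_bar = sum(d) / n
    covariance = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    variance = sum((di - d_bar) ** 2 for di in d)
    return covariance / variance


def mean(values):
    return sum(values) / len(values)


# Hypothetical discussion measures for six reviews; 1 marks a Wells review.
y = [0.33, 0.35, 0.31, 0.30, 0.29, 0.31]
wells = [1, 1, 1, 0, 0, 0]

beta = ols_dummy_coefficient(y, wells)
difference_in_means = (mean([yi for yi, di in zip(y, wells) if di == 1])
                       - mean([yi for yi, di in zip(y, wells) if di == 0]))
print(round(beta, 3), round(difference_in_means, 3))
```

In the full specification, β has the same reading, but as a difference conditional on quarter and review characteristics.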
Table 1.3 shows the results. For competition-focused goals (compete culture), we find
higher negative and higher positive discussions at Wells (β=0.024,
p<0.01 for negative discussions; β=0.002, p<0.01 for positive discussions). For
rules and procedures (control culture), we find lower negative discussions
(β=-0.004, p<0.01), while positive discussions are not significantly different from
zero. For create culture, we find that both positive and negative discussions are
lower (β=-0.008, p<0.01 for negative discussions; β=-0.007, p<0.01 for positive
discussions). Finally, for overall culture discussions we find that Wells has lower
positive discussion (β=-0.002, p<0.05) and higher negative discussion (β=0.004, p<0.01).
Table 1.3: Employee Reviews: Discussions of Culture Topics
This table reports the output from Eq. 1.5. The dependent variables are positive (and negative) discussions in employee reviews of the four culture topics - create,
collaborate, control, and compete, as well as the overall culture topic. Current employee and anonymous employees are dummy variables, and we also include their
interaction. Robust standard errors (in parentheses) are clustered at the firm-year level.
Positive Discussions Negative Discussions
VARIABLES Create Collaborate Control Compete Overall Create Collaborate Control Compete Overall
Wells -0.007*** -0.002* -0.001 0.002*** -0.002** -0.008*** -0.001 -0.004*** 0.024*** 0.004***
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
Log (Length of Review) -0.006*** -0.014*** -0.016*** -0.006*** -0.011*** -0.006*** -0.002*** -0.004*** -0.001 -0.003***
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
Current Employee 0.009*** 0.004*** 0.003*** 0.003*** 0.005*** 0.004*** -0.005*** 0.002*** 0.001 0.001
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001)
Anonymous Employee 0.007*** 0.003** 0.003*** 0.001 0.003*** 0.005*** 0.004*** 0.003*** -0.001 0.002***
(0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.002) (0.001)
Current and Anonymous Employee -0.003 0.001 -0.001 -0.001 -0.001 0.001 0.001 -0.001 -0.002 -0.001
(0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.002) (0.001) (0.002) (0.001)
Constant 0.262*** 0.355*** 0.330*** 0.315*** 0.319*** 0.238*** 0.298*** 0.290*** 0.300*** 0.284***
(0.003) (0.003) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.001)
Observations 20,099 20,099 20,099 20,099 20,099 18,826 18,826 18,826 18,826 18,826
R-squared 0.021 0.042 0.067 0.018 0.050 0.023 0.010 0.010 0.048 0.012
Statistical significance levels: *** p < 0.01, ** p < 0.05.
To enable comparison of coefficients across culture topics, we compute the
semi-elasticities (see Table 1.4) for $\delta_{\text{Wells}}$ from Equation 1.5. The standard errors
for the semi-elasticities are calculated using the delta method (Gupta et al. 1996).
From the net-sentiment (positive minus negative) semi-elasticities, we find that employees
at Wells negatively review the compete culture (semi-elasticity of -0.070,
p<0.01) and positively review the control culture (semi-elasticity of 0.013, p<0.01),
whereas the other net semi-elasticities are not significant. Recall that
these are the two main causes of the crisis identified in the Report: too much emphasis
on competitive sales goals, and too little oversight and control of poor practices.
It is interesting that employees view positively the control culture, which we know
from the Report was too loose and dysfunctional for the company. It is possible that
these reviews were written by employees who engaged in fraudulent sales practices,
or by employees who did not but still viewed the looser oversight positively. If these
control dysfunctions caused the crisis, as the Report suggests they did, the result
can be interpreted as employees viewing some of these dysfunctional culture aspects
as positive (despite being harmful for consumers). However, note that overall,
across all culture topics, the semi-elasticity is negative and significant (-0.020,
p<0.01), implying that Wells employees review more culture cons than pros.
Table 1.4: Semi-Elasticities for Wells in Discussions of Culture
This table reports the semi-elasticities for the dummy variables for Wells from Table 1.3. Robust
standard errors (in parentheses) are clustered at the firm-year level.
Sentiment
Culture Positive Negative Net
Create -0.029*** -0.037*** 0.008
(0.005) (0.004)
Collaborate -0.007* -0.001 -0.006
(0.003) (0.002)
Control -0.002 -0.015*** 0.013***
(0.004) (0.002)
Compete 0.008*** 0.078*** -0.070***
(0.002) (0.004)
Overall -0.006** 0.014*** -0.020***
(0.003) (0.002)
Statistical significance levels: *** p < 0.01, ** p < 0.05.
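The net column of Table 1.4 is the positive semi-elasticity minus the negative one. The short check below recomputes it from the reported point estimates; the values are copied from the table, and the helper name `net_sentiment` is introduced only for this illustration. It does not reproduce the delta-method standard errors.

```python
# (positive, negative, reported net) semi-elasticities copied from Table 1.4.
table_1_4 = {
    "Create": (-0.029, -0.037, 0.008),
    "Collaborate": (-0.007, -0.001, -0.006),
    "Control": (-0.002, -0.015, 0.013),
    "Compete": (0.008, 0.078, -0.070),
    "Overall": (-0.006, 0.014, -0.020),
}


def net_sentiment(positive, negative):
    """Net semi-elasticity: positive minus negative, to three decimals."""
    return round(positive - negative, 3)


for culture, (pos, neg, reported_net) in table_1_4.items():
    print(culture, net_sentiment(pos, neg), reported_net)
```

The recomputed values agree with the reported net column for all five rows.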
Next, we want to understand how long before the crisis reveal Wells’ culture
differed from that of its rivals. Therefore, we run the following regressions for each of the
culture topics (including the overall culture topic) and sentiment, interacting the Wells
dummy with quarter dummies for the employee reviews in the 10 quarters leading
up to the crisis reveal:
$$d_{r,j,s} = \alpha_{j,s} + \beta \sum_{qtr} \delta_{qtr} \times \delta_{\text{Wells}} + \rho \sum_{qtr} \delta_{qtr} + \gamma R_r + \epsilon_{r,j,s} \tag{1.6}$$
Table 1.5 shows the regression results. Again, to compare across different
culture topics over time, we focus on the semi-elasticities. We report the semi-elasticities
for $\delta_{qtr} \times \delta_{\text{Wells}}$ for the two sentiments with the four culture topics as
well as the overall culture topic in Table 1.6. Here we find that the most consistent
result, visible up to ten quarters ahead of the crisis reveal, is that employees
negatively review the compete culture. The net sentiment (positive discussion
minus negative discussion) semi-elasticities for compete culture are -0.053
or lower, and consistently significant at p<0.01 across the ten quarters. For
rules and procedures (control culture), the net sentiment
semi-elasticities are positive and significant up until three
quarters ahead of the crisis. For other culture topics, we do not find any trends in the
quarters ahead of the crisis. Finally, for the overall culture topic we find that employees
reviewed negatively across the ten quarters ahead of the crisis (with the exception
of quarter 4). This suggests that employees discussed more negative aspects of
compete culture, whereas they reviewed the control culture positively up until
three quarters ahead of the crisis.¹⁵ This window of 2.5 years before the crisis reveal
shows the potential usefulness, for managers and regulators, of using employee reviews
as a leading indicator of firm crises.
Summarizing, we find that employees at Wells exhibit more negative views of
the overall culture at Wells. Across the four culture topics, we see that this is true
especially for the compete culture. Interestingly, employees viewed rules
and procedures (control culture) more positively; however, these views disappear
in the three quarters ahead of the crisis. This is consistent with the findings of the
Report, which identified the lack of formal controls at Wells that let the improper
and aggressive sales practices continue uncorrected, which subsequently caused
the crisis. Since the Report recognizes these as key causes of the crisis, we have
greater confidence in the ability of employee reviews to capture critical weaknesses
of corporate culture that potentially caused the crisis. Employee reviews
are publicly available for both own and rival companies, and over time, they can
be used by managers to benchmark against their competitors and against firms in other
industries that faced crises. Regulators can also use these reviews to monitor both
individual firm and industry-level risks.
15 We did not observe any substantial changes in the number of reviews for any bank over time.
Thus, these results are not driven by a sudden increase or decrease in the number of reviews.
Table 1.5: Employee Reviews: Discussions of Culture Topics Over Time
This table reports the output from Eq. 1.6. The dependent variables are positive (and negative) discussions in employee reviews of the four culture topics (create,
collaborate, control, and compete), as well as the overall culture topic. Controls (not reported in the table for brevity) include log of the length of review, current
employee dummy, anonymous employee dummy, and their interaction. Robust standard errors (in parentheses) are clustered at the firm-year level.
Positive Discussions Negative Discussions
Wells Create Collaborate Control Compete Overall Create Collaborate Control Compete Overall
Pre 10 Qtr -0.006* 0.002 -0.001 0.001 -0.001 -0.010*** -0.003** -0.004*** 0.024*** 0.003***
(0.003) (0.002) (0.001) (0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001)
Pre 9 Qtr -0.005*** -0.004* 0.001 0.004*** -0.001 -0.007*** 0.003*** -0.003*** 0.028*** 0.006***
(0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001)
Pre 8 Qtr -0.010*** -0.006*** -0.003** 0.003*** -0.004*** -0.009*** 0.003*** -0.006*** 0.030*** 0.006***
(0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.002) (0.000)
Pre 7 Qtr -0.012*** -0.010*** -0.005*** 0.001 -0.006*** -0.008*** 0.002 -0.004*** 0.026*** 0.005***
(0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.002) (0.001) (0.002) (0.001)
Pre 6 Qtr -0.010*** -0.001 0.002 0.003** -0.001 -0.011*** -0.001 -0.007*** 0.027*** 0.003***
(0.003) (0.001) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.002) (0.001)
Pre 5 Qtr -0.006*** -0.001 0.001 0.004*** -0.001 -0.008*** -0.001 -0.003*** 0.025*** 0.004***
(0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.000)
Pre 4 Qtr -0.006** 0.001 0.003*** 0.004*** 0.001 -0.006*** -0.001 -0.004*** 0.020*** 0.003***
(0.003) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001)
Pre 3 Qtr -0.009*** -0.002 -0.001 0.002* -0.002 -0.011*** -0.001** -0.004*** 0.028*** 0.004***
(0.002) (0.002) (0.003) (0.001) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001)
Pre 2 Qtr -0.007*** -0.004** -0.004*** 0.001 -0.003*** -0.009*** -0.001 -0.006*** 0.018*** 0.001
(0.003) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001)
Pre 1 Qtr -0.005 -0.001 -0.002*** 0.001 -0.002 -0.006*** -0.001 -0.002** 0.021*** 0.004***
(0.003) (0.002) (0.001) (0.000) (0.001) (0.002) (0.001) (0.001) (0.003) (0.001)
Observations 20,099 20,099 20,099 20,099 20,099 18,826 18,826 18,826 18,826 18,826
R-squared 0.021 0.042 0.068 0.019 0.050 0.023 0.010 0.010 0.049 0.013
Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05.
Table 1.6: Semi-Elasticities for Wells in Discussions of Culture Over Time
This table reports the semi-elasticities for the joint dummy variables for Wells and quarter coefficient from Table 1.5 for culture topics. Robust standard errors (in parentheses) are
clustered at the firm-year level.
Create Collaborate Control Compete Overall Culture
Wells Positive Negative Net Positive Negative Net Positive Negative Net Positive Negative Net Positive Negative Net
Pre 10 Qtr -0.023* -0.046*** 0.023* 0.006 -0.011** 0.017* -0.002 -0.014*** 0.011*** 0.001 0.077*** -0.075*** -0.003 0.010*** -0.013**
(0.012) (0.006) (0.007) (0.004) (0.004) (0.003) (0.005) (0.005) (0.006) (0.002)
Pre 9 Qtr -0.020*** -0.029*** 0.009 -0.013* 0.010*** -0.023*** 0.005 -0.011*** 0.015*** 0.014*** 0.088*** -0.074*** -0.003 0.023*** -0.025***
(0.007) (0.009) (0.007) (0.003) (0.005) (0.003) (0.004) (0.004) (0.004) (0.002)
Pre 8 Qtr -0.039*** -0.040*** 0.001 -0.018*** 0.011*** -0.029*** -0.011** -0.021*** 0.011* 0.010*** 0.096*** -0.086*** -0.013*** 0.021*** -0.034***
(0.010) (0.008) (0.006) (0.002) (0.005) (0.003) (0.003) (0.007) (0.004) (0.002)
Pre 7 Qtr -0.047*** -0.033*** -0.014 -0.031*** 0.006 -0.036*** -0.017*** -0.013*** -0.004 0.004 0.084*** -0.080*** -0.021*** 0.019*** -0.040***
(0.009) (0.008) (0.005) (0.006) (0.006) (0.004) (0.003) (0.005) (0.004) (0.003)
Pre 6 Qtr -0.039*** -0.048*** 0.009 -0.001 -0.001 0.001 0.005 -0.026*** 0.031*** 0.008** 0.086*** -0.078*** -0.004 0.012*** -0.016***
(0.010) (0.008) (0.004) (0.002) (0.004) (0.004) (0.004) (0.005) (0.004) (0.003)
Pre 5 Qtr -0.021*** -0.038*** 0.016 -0.002 -0.001 -0.001 0.003 -0.012*** 0.014*** 0.012*** 0.079*** -0.068*** -0.001 0.015*** -0.016***
(0.008) (0.007) (0.007) (0.003) (0.004) (0.003) (0.003) (0.005) (0.005) (0.002)
Pre 4 Qtr -0.022** -0.024*** 0.002 0.004 -0.004 0.008 0.011*** -0.015*** 0.026*** 0.012*** 0.065*** -0.053*** 0.003 0.011*** -0.008
(0.011) (0.007) (0.006) (0.003) (0.003) (0.004) (0.003) (0.004) (0.004) (0.002)
Pre 3 Qtr -0.037*** -0.048*** 0.012 -0.005 -0.005** 0.001 -0.003 -0.014*** 0.010 0.006* 0.089*** -0.083*** -0.008 0.015*** -0.022***
(0.008) (0.006) (0.006) (0.002) (0.009) (0.004) (0.003) (0.007) (0.005) (0.003)
Pre 2 Qtr -0.028*** -0.039*** 0.011 -0.012** -0.005 -0.008 -0.013*** -0.022*** 0.001* 0.003 0.058*** -0.055*** -0.011*** 0.004 -0.016***
(0.010) (0.008) (0.006) (0.003) (0.003) (0.004) (0.002) (0.005) (0.004) (0.003)
Pre 1 Qtr -0.019 -0.028*** 0.010 -0.003 -0.001 -0.002 -0.007*** -0.008** 0.001 0.001 0.068*** -0.067*** -0.006 0.014*** -0.020***
(0.013) (0.007) (0.006) (0.004) (0.002) (0.003) (0.001) (0.008) (0.004) (0.004)
Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05.
1.6.2 Risk Assessment of Other Banks
Next, we analyze whether the culture discussions by employees of any bank in our data
are similar to the culture discussions by Wells employees. For this purpose, we
compare the culture discussions of other banks' employees for the four quarters after
September 2016 with Wells' culture discussions before September 2016. We employ
the Kolmogorov–Smirnov (KS) test to estimate this distance. The KS statistic is
the maximum absolute difference between the cumulative distribution functions
of two distributions. More formally, for two samples P and Q, the two-sample
KS statistic is given by:
\[ KS(P, Q) = \max_i \left| C^{P}_i - C^{Q}_i \right| \tag{1.7} \]
where C^P is the cumulative distribution function of the distribution P (and C^Q
that of Q). Thus, with the Wells pre-crisis employee reviews as the benchmark, we
estimate the KS test for the employee reviews of the other banks over the 4 quarters
post crisis. In the KS test, a lower deviation value implies that the two distributions
are closer.
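The statistic in Eq. 1.7 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ks_statistic` and the score lists are hypothetical, and the sketch computes only the distance, not the associated p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_p, sample_q):
    """Two-sample KS statistic: max |ECDF_P(x) - ECDF_Q(x)| over observed x."""
    p_sorted = sorted(sample_p)
    q_sorted = sorted(sample_q)
    if not p_sorted or not q_sorted:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs only change at observed values,
    # so it is enough to evaluate the gap at those points.
    for x in set(p_sorted) | set(q_sorted):
        cdf_p = bisect_right(p_sorted, x) / len(p_sorted)
        cdf_q = bisect_right(q_sorted, x) / len(q_sorted)
        max_gap = max(max_gap, abs(cdf_p - cdf_q))
    return max_gap


# Hypothetical culture-discussion scores, for illustration only.
wells_pre_crisis = [0.10, 0.20, 0.30, 0.40]
similar_bank = [0.10, 0.20, 0.30, 0.40]
distant_bank = [0.60, 0.70, 0.80, 0.90]

print(ks_statistic(wells_pre_crisis, similar_bank))  # 0.0: identical samples
print(ks_statistic(wells_pre_crisis, distant_bank))  # 1.0: no overlap
```

A value near zero means the other bank's reviews are distributed much like the Wells benchmark, which is the sense in which a lower deviation indicates closer distributions.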
For the purposes of this analysis, we restrict ourselves to the 10 largest
banks in the US and estimate the cumulative significant total deviation of employee
reviews for the culture topics discussed in the pros and cons sections, as compared
to Wells employee reviews for the three quarters prior to the crisis reveal.
We find that one bank has reviews with zero cumulative deviation of culture dis-
cussions (p < 0.05), that is, closest to Wells’ reviews.16 To the extent that the Report
helps us identify these culture measures as causing the crisis, this similarity of cul-
tures should be worrying for both managers and regulators. While the above text
analysis is useful to uncover troubling cultural similarity between pre-crisis Wells
Fargo and current banks, it cannot reveal the causes of this problematic culture at
this bank any more than it did at Wells. This is outside of the scope of our paper.
1.6.3 Generalizability to Other Consumer Facing Crises
As discussed previously, the Wells crisis Report provides the ground truth, a yardstick
to verify whether employee reviews can indeed capture potentially causal
culture dysfunctions. We now turn to employee reviews from three other crisis
firms. Since we are now more confident about the causal effects of dysfunctional
culture on corporate crises, we test whether these results generalize to other
consumer-facing crises. We measure culture differences between crisis and non-crisis
companies in three other consumer-facing crises that were unveiled during 2014-16:
General Motors, Chipotle Mexican Grill, and Mylan.17 We briefly describe
these crises.
In the General Motors crisis in 2014 (GM hereafter), GM had failed to identify
and report faulty ignition switches in 2.6 million of its cars manufactured
16For liability reasons, we do not disclose the name of this bank.
17We found these crises by searching for the keywords “consumers”, “firm”, and “crisis” in the Wall
Street Journal and New York Times across all articles for the period Jan 2014 – Dec 2016.
during 2000-14. GM self-initiated a recall of these cars on Feb 7, 2014, which was
followed by a regulator-initiated investigation on Feb 26, 2014.18 In the Chipotle
Mexican Grill E. coli crisis in 2015 (Chipotle hereafter), the first
outbreak happened in Aug 2015, with more than 55 people infected by Jan 2016.
As a result, the FDA launched a criminal investigation against Chipotle on Jan
6, 2016.19 In the Mylan crisis in 2016, Mylan increased the price of EpiPen, its drug
that stops life-threatening allergic reactions, in May 2016 for the 17th time
since 2007, up 568% to $608 per dose from its price of $94 per dose in 2007. Mylan's
CEO had to appear before the US House Committee on Oversight and
Government Reform to explain the reasons for the price increases.20 Although these crises
are very different from the Wells crisis, they share the common feature of substantial
consumer harm that prompted regulatory investigation.
We randomly sample 20 firms with the same SIC industry code from COMPUSTAT
to find comparable firms for these three crisis-facing firms. The firms are
listed in Appendix 1.9.5. There are 39,651 reviews across these firms for the
period Jan 2008 - April 2018. Using the culture word embeddings
in the pros and cons sections, we calculate the similarity of the reviews to the culture
topics. We estimate probit regressions to test whether the culture discussions are
significant predictors of these crises. More formally, the regression is specified as
18See CNN Business, Feb 13, 2015, “51 deaths linked to GM ignition switch flaw”; NHTSA May
2014 Recall Notice to GM.
19See USA Today, Oct 31, 2015, “Chipotles close in Ore., Wash., after 22 sick from E. coli”; NBC
News, Jan 6, 2016, “Chipotle Says It Faces Criminal Investigation in Food Illness Case”.
20See WSJ, Aug 24, 2016, “Mylan Faces Scrutiny Over EpiPen Price Increases”; FHCOGR, Sep
21, 2016, “Full Committee Hearing: Reviewing the Rising Price of EpiPens”.
follows for each of the ten quarters leading up to the crisis reveal:
\[ y^{*}_{r,t,f} = \alpha + \sum_{c} \sum_{s} \beta_{c,s} D_{r,c,s} + \gamma_r L_r + \rho X_f + \text{Ind.F.E.} + \varepsilon_{r,t,f} \tag{1.8} \]
Where y_{r,t,f} is the indicator that review r in quarter t is for a crisis firm,
D_{r,c,s} is the measured discussion of culture topic c with sentiment s, L_r
is the log length of the review, and X_f are firm-level characteristics: log of assets and
profitability in the previous quarter. We include industry fixed effects, and cluster
the standard errors at the firm level. For ease of comparison of coefficients, we
once again convert them to semi-elasticities. Table 1.7 shows the semi-elasticities.
Similar to the findings from Wells, we find that employees review the compete
culture negatively at these crisis firms; these effects are visible three quarters ahead
of the crisis. As at Wells, employees write positively about the control
culture at crisis firms nine to ten quarters ahead of the crisis. Unlike at Wells, these effects
are no longer significant after that. Furthermore, the negative views of
compete culture are visible starting from six quarters ahead of
crises, and are consistently negative and significant starting from three quarters
ahead of crises. Combining the analysis from Wells and these three crises, the
common elements are employees' negative discussion of compete culture and the finding that
employee reviews of culture can be leading indicators of upcoming crises.
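As a rough illustration of how a probit specification like Eq. 1.8 maps a review's culture discussions into a crisis-firm probability, the sketch below evaluates the standard normal CDF at the latent index. All coefficient values, the function names, and the example review are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Phi(z), the standard normal CDF used as the probit link."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def crisis_probability(alpha, betas, discussions, gamma, log_length,
                       rho, firm_controls, industry_effect=0.0):
    """P(review is from a crisis firm) = Phi(latent index), as in a probit model."""
    index = alpha + industry_effect + gamma * log_length
    # Sum over culture topic c and sentiment s of beta_{c,s} * D_{r,c,s}.
    index += sum(beta * discussions.get(key, 0.0) for key, beta in betas.items())
    # Firm-level controls: log assets and lagged profitability.
    index += sum(r * x for r, x in zip(rho, firm_controls))
    return standard_normal_cdf(index)


# Hypothetical coefficients and one hypothetical review, for illustration only.
betas = {("compete", "negative"): 2.0, ("control", "positive"): -1.0}
review = {("compete", "negative"): 0.30, ("control", "positive"): 0.10}
p = crisis_probability(alpha=-1.0, betas=betas, discussions=review,
                       gamma=0.05, log_length=4.0, rho=(0.0, 0.0),
                       firm_controls=(10.0, 0.02))
print(round(p, 3))
```

In practice the coefficients would be estimated by maximum likelihood with a statistics package; the sketch only shows how the fitted index translates into a probability.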
Table 1.7: Generalizability to Other Crises
This table lists elasticities from the probit regressions estimated for the crises of Mylan, Chipotle, and GM. Controls include log (lagged assets), lagged profitability, log
(length of review), and industry fixed effects. Robust standard errors (in parentheses) are clustered at the firm level.
Quarter Prior to Crisis
Variables 10 9 8 7 6 5 4 3 2 1
Create Positive 1.066 3.760*** 0.928 -0.892 1.115 2.004 4.444*** 0.076 1.966* 2.137***
(1.634) (0.869) (0.797) (1.734) (1.187) (1.377) (1.137) (1.823) (1.187) (0.747)
Negative 3.156*** 5.536*** 1.335* 2.654** 2.391 0.141 0.403 4.643*** 3.062** 2.376**
(1.090) (1.230) (0.796) (1.099) (2.216) (1.325) (2.185) (1.494) (1.226) (1.021)
Net -2.090 -1.777 -0.406* -3.546 -1.275 1.864 4.042 -4.567* -1.096 -0.238
Collaborate Positive -0.257 -2.808** -3.172*** -0.742 -0.154 -1.080 -2.348 -0.004 1.670* 2.177**
(1.339) (1.357) (1.201) (1.111) (1.046) (1.022) (1.559) (1.137) (0.887) (1.059)
Negative -2.198 -3.364 -2.041 -0.097 -4.297** -0.750 -4.532** -0.053 -1.747 0.614
(1.974) (2.229) (1.799) (0.696) (1.684) (1.425) (2.114) (1.976) (1.174) (0.771)
Net -2.455 0.556 -1.131 -0.645 4.143** -0.331 2.184 0.050 3.416** 1.563
Control Positive 6.052*** 2.015** 0.222 -4.060*** -2.402*** -1.332 0.153 -0.218 -0.669 -1.128**
(1.747) (0.925) (2.568) (1.021) (0.660) (1.050) (0.892) (0.875) (0.905) (0.478)
Negative -2.543 -1.991 0.093 -2.872** -3.133 1.215 1.248 -0.213 -3.376*** -2.631***
(2.254) (1.411) (1.035) (1.258) (2.080) (1.455) (2.558) (1.525) (0.607) (0.653)
Net 8.595*** 4.006*** 0.129 -1.188 0.732 -2.547 -1.010 -0.005 2.707*** 1.503
Compete Positive -7.183*** 1.520 0.310 3.202 -1.017 -4.244** -3.454* -2.314 -5.453** -6.393***
(2.026) (2.031) (1.913) (2.272) (1.607) (2.083) (1.928) (1.646) (2.331) (2.112)
Negative -0.820 -1.354 2.073** 3.797** 2.755*** 2.316** 1.338 1.753 2.269** 1.743*
(1.396) (1.663) (0.962) (1.638) (0.835) (0.915) (2.522) (1.224) (1.089) (1.030)
Net -6.363*** 2.874 -1.763 -0.595 -3.771** -6.560*** -4.793 -4.066** -7.722*** -8.136***
Observations 524 463 751 693 925 1,074 1,227 1,061 1,451 1,959
Pseudo R2 0.076 0.094 0.086 0.098 0.056 0.085 0.081 0.095 0.071 0.053
Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05.
We are also interested in whether overall culture is predictive of these crises, so
we reformulate Eq. 1.8 to include the overall culture topic. More formally, we estimate
the following regression for each of the 10 quarters prior to the crisis reveal:
\[ y^{*}_{r,t,f} = \alpha + \sum_{c} \sum_{s} \beta_{c,s} OC_{r,s} + \gamma_r L_r + \rho X_f + \text{Ind.F.E.} + \varepsilon_{r,t,f} \tag{1.9} \]
where OC_{r,s} is the measured overall culture in a review r with sentiment s. For brevity, we
directly report the elasticities of overall culture in Table 1.8. We find
that the net sentiment elasticity associated with overall culture is negative starting
five quarters ahead of the crises. It is significant for quarters 5, 3, and 1 ahead of
crises (β=-6.282, p<0.01; β=-8.310, p<0.01; and β=-2.672, p<0.01, respectively), and
consistent in sign, i.e. negative, for quarters 4 and 2 ahead of crises.
Summarizing, compared to non-crisis firms in the same industry, employee
reviews from the automotive, restaurant, and pharmaceutical crises all show negative
views of culture several quarters ahead of the crisis revelation. In all these
crises, employees express negative views on “compete” or competition-oriented
goals. In the Wells crisis, per the Report and the consistent evidence in employee
reviews, we know this culture (along with dysfunctional “control” or oversight
functions) caused the crisis. While we do not have an externally validated
source of causes for the other three crises, we can cautiously claim that aggressive
competition-oriented goals might cause these crises; at the least, employees' negative
views of a focus on competition-oriented goals appear to be leading indicators
of corporate crises. We also do not find evidence that financial variables
cause these crises. This information can be useful for regulators, as they can measure
corporate culture using employee reviews and subsequently flag potentially
“at-risk” firms. Finally, investors and activist shareholders can
use these measures of culture to identify and fix firms with problematic cultures.
Table 1.8: Generalizability to Other Crises – Elasticities for Overall Culture
This table lists the elasticities for the overall culture topic (and financial) variables for the crises of Mylan, Chipotle, and GM. Controls include log of the length of review
and industry fixed effects. Robust standard errors (in parentheses) are clustered at the firm level.
Quarter Prior to Crisis
Variables 10 9 8 7 6 5 4 3 2 1
Overall Culture Positive 0.076 3.626*** -1.598** -2.250* -1.000 -2.988*** 0.095 -1.650** -0.718 -0.737
(0.967) (0.639) (0.813) (1.264) (0.776) (0.896) (0.441) (0.685) (0.796) (0.558)
Negative -1.898 1.467 2.157 2.830 -2.292** 3.295** -1.169 6.660*** 0.724 1.936**
(2.149) (1.170) (1.524) (2.309) (0.989) (1.516) (0.820) (1.724) (1.513) (0.862)
Net 1.974 2.160 -3.755** -5.080* 1.292 -6.282*** -1.074 -8.310*** -1.442 -2.672***
Firm Financials Assets -0.107 -0.087 -0.061 -0.147 -0.083 -0.037 -0.149 -0.200 -0.053 -0.027
(0.122) (0.140) (0.122) (0.126) (0.126) (0.131) (0.151) (0.177) (0.148) (0.137)
Profitability 1.701 0.881 0.559 2.764 0.179 -0.027 4.012 5.909 0.983 0.092
(1.522) (1.469) (0.586) (2.183) (0.358) (0.480) (3.301) (3.721) (0.942) (0.806)
Observations 524 463 751 693 925 1,074 1,227 1,061 1,451 1,959
Pseudo R2 0.046 0.069 0.078 0.082 0.042 0.075 0.062 0.089 0.049 0.031
Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05.
1.7 Conclusion
In this paper, we empirically measure the corporate culture at Wells before the crisis
became publicly known, and examine whether other banks might currently be
at risk based on their corporate culture. We also show the generalizability of our
findings to three other consumer-facing crises. Unlike traditional corporate culture
research, which uses survey-based methods and asks participants specifically
for information on culture, we use publicly available anonymous employee reviews
on a leading job reviews website in the US to extract discussions of culture.
The post-mortem Report serves the key role of corroborating and
verifying whether employee reviews at Wells, written before the crisis occurred,
capture the causes of the crisis. We find that employee reviews of corporate culture are
leading indicators of corporate crisis: employee reviews are more negative overall
than at competitor firms, and especially negative about the compete,
or profit-and-sales goal, culture.
Our work makes substantive contributions, as well as managerial and regulatory
contributions. Employee reviews provide a free source of information that
managers can use to assess corporate culture and benchmark against competitors.
For regulators, employee reviews can provide valuable information on the internal
workings and corporate culture of firms, which can be pivotal in flagging at-risk
firms and subsequently preventing consumer harm.
There are several limitations of our work. First, we have studied consumer
harm crises. We have not considered crises such as accounting frauds, sexual harassment,
or significant environmental violations. Therefore, we are unable to
generalize what types of dysfunctional cultures might cause these other crises.
Second, we are not able to causally separate dissatisfied employees from problematic
cultures, and we cannot identify other potential causes of problematic cultures
either, since we do not have data from when Wells and the other three crisis firms
had non-problematic cultures.
There are several avenues for further research. The measures of culture implemented
here can be used to study firm responses to a variety of natural experiments,
such as changes in laws (e.g. requirements on diversity of the board) and other external
shocks (e.g. the impact of elections on employee satisfaction). With more
research on measures of corporate culture under different conditions and
changes, we hope that a more theory-based and empirically generalizable
understanding of corporate culture will emerge (see Puranam et al. 2018).
1.8 References
Aggarwal, Rajesh K., and Andrew A. Samwick (1999). ”Executive compen-
sation, strategic competition, and relative performance evaluation: Theory and
evidence.” The Journal of Finance 54.6: 1999-2043.
Anginer, Deniz, Asli Demirguc-Kunt, Harry Huizinga, and Kebin Ma (2018).
”Corporate governance of banks and financial stability.” Journal of Financial Eco-
nomics 130.2: 327-346.
Arnold, Chris (2016). ”Former Wells Fargo employees describe toxic sales cul-
ture, even at HQ.” NPR. October 4.
Bhandari, Avishek, Babak Mammadov, Maya Thevenot, and S. Hamidreza
Vakilzadeh. (2017). ”The Invisible Hand: Corporate Culture and Its Implications
for Earnings Management.”
Borah, Abhishek, and Gerard J. Tellis (2016). ”Halo (spillover) effects in social
media: do product recalls of one brand hurt or help rival brands?.” Journal of
Marketing Research 53.2: 143-160.
Burks, Stephen V., and Erin L. Krupka (2012). ”A multimethod approach to
identifying norms and normative expectations within a corporate hierarchy: Evi-
dence from the financial services industry.” Management Science 58.1: 203-217.
Cameron, Kim S., Robert E. Quinn, Jeff DeGraff, and Anjan V. Thakor. Com-
peting values leadership. Edward Elgar Publishing, 2014.
Chang, Sea-Jin, Ji Yeol Jimmy Oh, and Kwangwoo Park (2017). ”The power of
silent voices: Employee satisfaction and acquirer stock performance.”
Chen, Ming-Jer (2001). Inside Chinese Business: A Guide for Managers World-
wide. Harvard Business Press.
Chen, Yubo, Shankar Ganesan, and Yong Liu (2009). ”Does a firm’s product-
recall strategy affect its financial value? An examination of strategic alternatives
during product-harm crises.” Journal of Marketing 73.6: 214-226.
Corritore, Matthew, Amir Goldberg, and Sameer B. Srivastava (2019). ”Duality
in diversity: How intrapersonal and interpersonal cultural heterogeneity relate to
firm performance.” Administrative Science Quarterly
Crémer, Jacques (1993). ”Corporate culture and shared knowledge.” Industrial
and Corporate Change 2.3: 351-386.
Deshpandé, Rohit, and John U. Farley (2004). ”Organizational culture, mar-
ket orientation, innovativeness, and firm performance: an international research
odyssey.” International Journal of Research in Marketing 21.1: 3-22.
Fiordelisi, Franco, and Ornella Ricci (2014). ”Corporate culture and CEO
turnover.” Journal of Corporate Finance 28: 66-82.
Graham, John R., Campbell R. Harvey, Jillian Popadak, and Shivaram Rajgopal
(2017). Corporate culture: Evidence from the field. No. w23255. National Bureau
of Economic Research.
Green, T. Clifton, Ruoyan Huang, Quan Wen, and Dexin Zhou (2019). ”Crowd-
sourced employer reviews and stock returns.” Journal of Financial Economics
134.1: 236-251.
Guiso, Luigi, Paola Sapienza, and Luigi Zingales (2015a). ”Corporate culture,
societal culture, and institutions.” American Economic Review 105.5: 336-39.
Guiso, Luigi, Paola Sapienza, and Luigi Zingales (2015b). ”The value of corpo-
rate culture.” Journal of Financial Economics 117.1: 60-76.
Gupta, Sachin, Pradeep Chintagunta, Anil Kaul, and Dick R. Wittink (1996).
”Do household scanner data provide representative inferences from brand
choices: A comparison with store data.” Journal of Marketing Research 33.4: 383-
398.
Hartnell, Chad A., Amy Yi Ou, and Angelo Kinicki (2011). ”Organizational
culture and organizational effectiveness: a meta-analytic investigation of the com-
peting values framework’s theoretical suppositions.” Journal of Applied Psychol-
ogy 96.4: 677.
Huang, Minjie, Pingshu Li, Felix Meschke, and James P. Guthrie (2015). ”Fam-
ily firms, employee satisfaction, and corporate performance.” Journal of Corpo-
rate Finance 34: 108-127.
Ji, Yuan, Oded Rozenbaum, and Kyle T. Welch (2017). ”Corporate culture
and financial reporting risk: Looking through the glassdoor.” Available at SSRN
2945745.
Lavine, Marc (2014). ”Paradoxical leadership and the competing values frame-
work.” The Journal of Applied Behavioral Science 50.2: 189-205.
Lee, Thomas Y., and Eric T. Bradlow (2011). ”Automated marketing research
using online customer reviews.” Journal of Marketing Research 48.5: 881-894.
Liu, Angela Xia, Yong Liu, and Ting Luo (2016). ”What drives a firm’s choice of
product recall remedy? The impact of remedy cost, product hazard, and the CEO.”
Journal of Marketing 80.3: 79-95.
Lukas, Bryan A., Gregory J. Whitwell, and Jan B. Heide (2013). ”Why do cus-
tomers get more than they need? How organizational culture shapes product ca-
pability decisions.” Journal of Marketing 77.1: 1-12.
Marinescu, Ioana Elena, and Eric A. Posner (2019). ”Why Has Antitrust Law
Failed Workers?.” Available at SSRN 3335174.
McKenney, James L., and Peter GW Keen (1974). ”How managers’ minds
work.” Harvard Business Review 52.3 : 79-90.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean
(2013). ”Distributed representations of words and phrases and their composition-
ality.” In Advances in neural information processing systems, pp. 3111-3119.
Mitroff, I. I., and R. H. Kilmann (1978). ”Methodological Approaches to Social
Science: Integrating divergent concepts and theories.”
Moniz, Andy (2017). ”Inferring employees’ social media perceptions of corpo-
rate culture and the link to firm value.” Available at SSRN 2768091.
Moniz, Andy, and Franciska de Jong (2014). ”Sentiment analysis and the im-
pact of employee satisfaction on firm earnings.” European Conference on Infor-
mation Retrieval. Springer, Cham.
Myers, Isabel Briggs (1962). ”The Myers-Briggs Type Indicator: Manual
(1962).”
Ochs, Susan M (2016). ”The leadership blind spots at Wells Fargo.” Harvard
Business Review 10.
O’Reilly, Charles (1989). ”Corporations, culture, and commitment: Motivation
and social control in organizations.” California Management Review 31.4: 9-25.
O’Reilly III, Charles A., Jennifer Chatman, and David F. Caldwell (1991).
”People and organizational culture: A profile comparison approach to assessing
person-organization fit.” Academy of Management Journal 34.3: 487-516.
Panayotopoulou, Leda, Dimitris Bourantas, and Nancy Papalexandris (2003).
”Strategic human resource management and its effects on firm performance: an
implementation of the competing values framework.” International Journal of
Human Resource Management 14.4: 680-699.
Pearson, Christine M., and Judith A. Clair (1998). ”Reframing crisis manage-
ment.” Academy of Management Review 23.1: 59-76.
Puranam, Dinesh, Vishal Narayan, and Vrinda Kadiyali (2017). ”The effect of
calorie posting regulation on consumer opinion: A flexible latent Dirichlet alloca-
tion model with informative priors.” Marketing Science 36.5: 726-746.
Puranam, Phanish, Yash Raj Shrestha, Vivianna Fang He, and Georg von
Krogh (2018). ”Algorithmic induction through machine learning: Using predic-
tion to theorize”, working paper, SSRN.
Tellis, Gerard J., Jaideep C. Prabhu, and Rajesh K. Chandy (2009). ”Radical in-
novation across nations: The preeminence of corporate culture.” Journal of Mar-
keting 73.1: 3-23.
Timoshenko, Artem, and John R. Hauser (2019). ”Identifying customer needs
from user-generated content.” Marketing Science 38.1: 1-20.
Van Heerde, Harald, Kristiaan Helsen, and Marnik G. Dekimpe (2007). ”The
impact of a product-harm crisis on marketing effectiveness.” Marketing Science
26.2: 230-245.
Weber, Roberto A., and Colin F. Camerer (2003). ”Cultural conflict and merger
failure: An experimental approach.” Management Science 49.4: 400-415.
Wei, Jiuchang, Zhe Ouyang, and Haipeng Chen (2017). ”Well known or well
liked? The effects of corporate reputation on firm value at the onset of a corporate
crisis.” Strategic Management Journal 38.10: 2103-2120.
Williamson, Oliver E (1981). ”The economics of organization: The transaction
cost approach.” American Journal of Sociology 87.3: 548-577.
Zhao, Yi, Ying Zhao, and Kristiaan Helsen (2011). ”Consumer learning in a tur-
bulent market environment: Modeling consumer choice dynamics after a product-
harm crisis.” Journal of Marketing Research 48.2: 255-267.
Zhong, Ning, and David A. Schweidel (2020). ”Capturing changes in social
media content: a multiple latent changepoint topic model.” Marketing Science
1.9 Appendix
1.9.1 Wells List of Firms
The table below lists the set of large national commercial banks in the US based on
consolidated assets as of March 31, 2018.
Bank Name    Consolidated Assets (Bil $)    Domestic Assets (Bil $)    Domestic Branches
JP Morgan Chase 2,198 1,676 5,115
Bank of America 1,765 1,661 4,484
Wells Fargo 1,716 1,662 5,913
Citibank 1,406 821 706
US Bank 452 442 3,138
PNC Bank 368 364 2,515
Capital One 289 289 620
KeyBank 135 135 1,217
Citizens Bank 122 122 797
Huntington National Bank 104 104 1,018
Zions Bank 66 66 435
Peoples United Bank 43 43 403
First National Bank of Tennessee 40 40 347
BOK Bank 33 33 124
First National Bank of Pennsylvania 31 31 416
Associated Bank 31 31 225
Sterling National Bank 30 30 126
Valley National Bank 29 29 239
Webster Bank 26 26 166
TCF National Bank 23 23 331
MB Bank 20 20 106
First National Bank of Omaha 19 19 127
OLD National Bank 17 17 197
Washington Federal Bank 15 15 237
Trustmark National Bank 13 13 202
Fulton Bank 11 11 110
Community Bank 10 10 224
Centerstate Bank 10 10 129
NBT Bank 9.16 9.16 152
Park National Bank 7.46 7.46 108
Woodforest National Bank 5.7 5.7 729
DBA First Convenience Bank 1.89 1.89 307
Source: Federal Reserve Statistical Release - Large Commercial Banks, release date: March 2018
1.9.2 Seed Words
We derive the seed words from the descriptions of the four culture types in the
competing values framework as words that qualitatively describe the culture type
of interest.
Create            Collaborate     Control            Compete
entrepreneurial   family          control            market
dynamic           mentor          controlling        competition
innovative        mentors         coordinate         competitive
agile             parent          coordination       profits
initiative        loyalty         consistent         profit
freedom           loyal           efficient          sales
flexibility       tradition       schedule           results
individuality     commitment      structure          driven
entrepreneur      cohesion        efficiency         tough
dynamism          cohesive        stability          demanding
innovation        morale          stable             succeed
creative          teamwork        smooth             external
experimentation   team            predictability     strong
adaptation        consensus       predictable        hard
fad               participation   centralized        success
fast-failure      participate     formalized         tasks
risk-taking       concern         structured         decisive
leading-edge      sensitivity     procedures         productivity
bold              sensitive       rules              deadlines
future            friendly        policies           performance
                  sharing         micromanagement    targets
                  facilitator     paperwork          aggressive
                  everyone        red-tape           difficult
                  employees       roadblocks         perform
                  people          directive          business
                  community                          layoffs
                                                     goals
                                                     goal
                                                     customer
                                                     customers
                                                     pressure
                                                     numbers
                                                     products
                                                     product
                                                     money
                                                     overtime
We note that the number of seed words varies across the four culture topics: 20
seed words for create culture, 26 seed words for collaborate culture, 25 seed words
for control culture, and 36 seed words for compete culture. We argue that, because these
seed words are drawn directly from the culture descriptions in Cameron et al.
(2014) and are not frequently used in the reviews data, this variation should not be
a concern in our context.
An alternative to this seed word list is the list used by Bhandari et al. (2017).
They base their seed words on the approach of Fiordelisi and Ricci (2014),
who measure culture discussions from firms’ 10-K filings using a bag-of-words
approach. Comparing our seed word list to that of Bhandari et al. (2017), we
find the following overlap: 11 out of 20 of our seed words for create culture, 6
out of 26 of our seed words for collaborate culture, 5 out of 25 of our seed words for
control culture, and 20 out of 36 of our seed words for compete culture overlap with
their seed word list. We argue that the context of Bhandari et al. (2017) is formal
financial disclosures (different from employee reviews on a jobs website), and thus
the difference in seed words is not a concern.
1.9.3 Data Pre-Processing
We now describe the pre-processing steps we conduct on the reviews corpus before
training the Word2Vec models. First, we convert all text to lower case so that we
treat words such as “Company” and “company” in the same manner. Second,
since we are interested in the nature and meaning of discussions in text, that is,
how words are related to each other in their meaning, we remove commonly occurring
stop words, such as “is”, “the”, and “of”, that do not contribute to the meaning
of the discussions.21 Finally, we exclude reviews that do not have any textual data.
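The three steps above can be sketched as follows. This is a minimal illustration and not the exact pipeline used in this chapter: the stop word set is a small hypothetical subset (the chapter uses the Stanford CoreNLP list), the tokenizer is a simple regular expression, and the function name `preprocess_reviews` is an assumption made for this example.

```python
import re

# Hypothetical, abbreviated stop word list; the chapter uses the CoreNLP list.
STOP_WORDS = {"is", "the", "of", "a", "an", "and", "to", "in"}

def preprocess_reviews(reviews):
    """Lower-case, tokenize, drop stop words, and skip reviews without text."""
    corpus = []
    for text in reviews:
        if text is None or not text.strip():
            continue  # exclude reviews that do not have any textual data
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if tokens:  # assumption: also skip reviews made up only of stop words
            corpus.append(tokens)
    return corpus

sample_reviews = [
    "The Company is focused on sales goals",
    "",
    None,
    "Pressure of the numbers",
]
corpus = preprocess_reviews(sample_reviews)
```

The resulting list of token lists is the usual input format for Word2Vec training routines.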
1.9.4 Word2Vec Hierarchical Softmax
For the hierarchical softmax, the vocabulary is represented as a binary Huffman tree,
and a word is reached by a random walk from the root of the tree, with frequent words
being closer to the root. Suppose we are interested in reaching a word
w in the tree. Let L(w) be the length of the path to reach the word w from the
root, n(w, j) be the jth node on this path, and ch(n) be an arbitrary fixed child of node
n. Thus, j = 1 is the first node on the path, and by definition, n(w, 1) is the root
of the tree. Similarly, at the end of the path, j = L(w) is the word w itself, thus
n(w, L(w)) = w. The probability of observing a context word w given an input
word wi is then given by:
21Stop words source: CoreNLP, Stanford (2016).
$$p(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\left( [[\, n(w, j+1) = ch(n(w, j)) \,]] \cdot {v'_{n(w, j)}}^{\top} v_{w_i} \right) \tag{1.10}$$
Where [[n(w, j + 1) = ch(n(w, j))]] is an indicator function which takes value 1 if
the designated child ch(n(w, j)) of node n(w, j) is the next node n(w, j + 1) on the
path, and takes value −1 otherwise, so that the probabilities of the two children of
every node sum to one. σ(x) is the sigmoid function that is given by:
$$\sigma(x) = \frac{1}{1 + \exp(-x)} \tag{1.11}$$
The hierarchical softmax alleviates the computational burden of the softmax
by reducing the cost of evaluating each probability to the order of log2 V, where V is
the vocabulary size, and each word has one representation
instead of the two representations under the softmax specification.
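The path-product form of equation 1.10 can be illustrated on a toy tree with four words. This is a minimal sketch, assuming the bracket term is coded as +1 when the path continues to the designated child and −1 otherwise (the coding under which the word probabilities sum to one). The tree layout, the vector dimension, and the random vectors are hypothetical and serve only as an illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hs_probability(path, inner_vectors, input_vector):
    """Probability of a word as a product of sigmoids along its tree path.

    path: list of (inner_node_id, sign) pairs from the root down to the word,
    where sign is +1 if the next node is the designated child and -1 otherwise.
    """
    prob = 1.0
    for node_id, sign in path:
        prob *= sigmoid(sign * dot(inner_vectors[node_id], input_vector))
    return prob

# Hypothetical tree: root 0 with inner nodes 1 and 2, and four leaf words.
paths = {
    "a": [(0, +1), (1, +1)],
    "b": [(0, +1), (1, -1)],
    "c": [(0, -1), (2, +1)],
    "d": [(0, -1), (2, -1)],
}

rng = random.Random(0)
dim = 5
inner_vectors = {n: [rng.uniform(-1, 1) for _ in range(dim)] for n in (0, 1, 2)}
input_vector = [rng.uniform(-1, 1) for _ in range(dim)]

probs = {w: hs_probability(p, inner_vectors, input_vector) for w, p in paths.items()}
total = sum(probs.values())  # sums to one because sigmoid(x) + sigmoid(-x) = 1
```

Each word probability requires only as many sigmoid evaluations as the length of its path, which is what yields the logarithmic cost in the vocabulary size for a balanced tree.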
1.9.5 List of Other Consumer Facing Crises Firms
The table below lists the randomly selected firms for the crises of General Motors,
Chipotle, and Mylan. The number of firms differs across crises because it is limited
by the number of firms in the same industry, defined by the same COMPUSTAT
SIC industry code.
General Motors | Chipotle Mexican Grill | Mylan Inc
Daimler AG | Ark Restaurants Corp. | Abbott Laboratories
Federal Signal Corp. | Bj’s Restaurants Inc. | Astrazeneca PLC
Ford Motor Co. | Buffalo Wild Wings Inc. | Avanir Pharmaceuticals Inc.
General Motors Co. | Chipotle Mexican Grill Inc. | Baxter International Inc.
Honda Motor Co. Ltd. | Dennys Corp. | Depomed Inc.
Lci Industries | Domino’s Pizza Inc. | Dr Reddy’s Laboratories Ltd.
Navistar International Corp. | Frisch’s Restaurants Inc. | Endo International PLC
Nissan Motor Co Ltd. | Jack In The Box Inc. | Glaxosmithkline PLC
Oshkosh Corp. | Jamba Inc. | Hospira Inc.
Paccar Inc. | Kona Grill Inc. | Ionis Pharmaceuticals Inc.
Spartan Motors Inc. | Mcdonald’s Corp. | Jazz Pharmaceuticals PLC
Subaru Corp. | O’Charley’s Inc. | Lifevantage Corp.
Tata Motors Ltd. | Panera Bread Co. | Mannatech Inc.
Tesla Inc. | Red Robin Gourmet Burgers | Map Pharmaceuticals Inc.
Tower International Inc. | Ruby Tuesday Inc. | Mylan NV
Toyota Motor Corp. | Ruths Hospitality Group Inc. | Novo Nordisk A/S
Volvo AB | Sodexo | Nu Skin Enterprises
 | Sonic Corp. | Par Pharmaceuticals Hldgs.
 | Texas Roadhouse Inc. | Regeneron Pharmaceuticals
 | Wendy’s Co. | Taro Pharmaceuticl Inds. Ltd.
 | | Teva Pharmaceuticals
CHAPTER 2
SMOKE AND MIRRORS: IMPACT OF E-CIGARETTE TAXES ON
UNDERAGE SOCIAL MEDIA POSTING
2.1 Introduction
E-cigarettes are battery-operated devices that vaporize liquids to deliver nicotine-infused
aerosols, and their usage is referred to as vaping.1 Vaping has increased
dramatically in the last few years,2 and the US Surgeon General report on e-cigarettes
raises concerns that vaping among youth and young adults has reached
alarming levels.3 This is worrisome to regulators as these products contain addictive
nicotine, pose severe health consequences, and are reversing the decades-long
trend of declining underage smoking. Reliable data on underage consumption are
not available since underage consumption is illegal due to minimum age sales laws.
Other data sources such as large-scale national surveys are expensive and time
consuming to conduct, and researchers have traditionally relied on smaller-scale
surveys (Barrington-Trimis et al., 2016; Soneji et al., 2017). We propose studying
images of vaping posted on a leading social media site. This posting behavior
is a rough proxy for influencing and normalizing behavior, and possibly for consumption
behavior among the underage population.
This chapter is: Anand, Piyush, and Vrinda Kadiyali (2020). “Smoke and Mirrors: Impact of
E-Cigarette Taxes on Underage Social Media Posting”.
1See, e.g., National Institute on Drug Abuse, 2020, Vaping Devices (Electronic Cigarettes)
2Barshad, A., April 7, 2018. The juul is too cool. New York Times
3U.S. Department of Health and Human Services, 2016, E-cigarette Use Among Youth and
Young Adults - A Report of the Surgeon General
We are especially interested in examining the effect of tax policies on posting
behavior. We investigate state-wide regulations of California, Kansas, Pennsylvania,
and West Virginia that impose taxes on e-cigarette products, and estimate
their impact on vaping posting behavior from publicly available user-posted images
on a leading social media website in the US. We detect the posting user’s age,
gender, and race, as well as disguising (emoji overlay) in images, an important confounding
factor,4 using an ensemble of image analysis methods from computer
vision: Mask R-CNN (He et al., 2017) and Aggregated Residual Neural Networks
(Xie et al., 2017). Managers and regulators are likely to be concerned with underage
vaping posts as well as their disguising on social media since these undermine their
efforts to denormalize youth vaping. Additionally, tax impact on posting by gender
and race will be of importance to regulators and managers given the concerns
of unequal health outcomes for minority groups.
Our dataset consists of 388,593 scraped social media images that span the duration
of Jan 2016 – Dec 2018. From these images we extract the
posters’ age, gender, race, and disguising (emoji overlay). To detect these demographics,
we use Aggregated Residual Neural Networks (Xie et al., 2017) trained on the
UTKFace (Zhang and Qi, 2017) and FairFace (Kärkkäinen and Joo, 2019) datasets,
and achieve a test-set accuracy of 93.2% for underage detection, 97.8% for gender
detection, and 86.37% average precision for race detection. To detect disguising,
we use Mask R-CNN (He et al., 2017) on an RA-annotated dataset of 5,040 images,
4See, e.g., Daily Mail, Revealed: How teenagers use secret emoji code to deal Class A drugs
on Snapchat and Instagram as gangs target ’digital savvy’ school pupils, July, 2017; Drug Addic-
tion Now, Emojis Give Youth a New Way to Communicate About Substance Abuse, Oct 2018
and achieve a test-set accuracy of 95.15%. We estimate the causal effect of tax
policies on posting behavior using the generalized synthetic controls method (Xu,
2017).5 We find that the higher taxes in Pennsylvania and California
were effective in deterring underage vaping posts on social media. California’s
decline in underage posting is preceded by increased disguised posts, and Pennsylvania’s
decline in underage posting is accompanied by increased engagement.
Kansas and West Virginia, with lower taxes, did not see lower underage vaping
image posting. Kansas had an increase in posts with race: Black. Our
conjecture for this unusual result is a possible confound: in 2015, a year prior
to its vaping tax in 2016, Kansas increased its cigarette tax significantly, by 63%. The
cigarette industry has historically advertised heavily to Black consumers, resulting in greater
rates of consumption. We would need consumption data for both e-cigarettes and regular
cigarettes to verify our conjecture. Regardless, from a regulatory point of view,
increased posting by Black users post-tax is likely worrisome.
We advance the image literature by estimating disguising behavior from images
and constructing a labeled dataset for disguising. Our work builds on a growing
literature in marketing that uses images to address questions of interest to marketers
(Burnap et al., 2019; Dew et al., 2019; Liu et al., 2020; Zhang et al., 2017).
Managers might find our methods of detecting and flagging inappropriate
usage of their products useful, since heightened regulatory oversight poses significant
risk to business viability. Regulators can monitor whether the taxes are effective
in denormalizing youth vaping on social media.
5We find similar results with difference-in-difference estimators - see Appendix 2.9.3.
The rest of this chapter is organized as follows. In Section 2.2, we discuss
the e-cigarette industry and state tax policies. In Section 2.3, we discuss relevant
literature. In Section 2.4, we discuss the data. In Section 2.5, we discuss the image
analysis and causal inference methods. Section 2.6 presents the results, and we
conclude in Section 2.7.
2.2 Electronic Cigarettes and State-Wide Taxes in the US
Electronic cigarettes were a $3.6 billion industry in 2018 in the US,6 with JUUL,
which was launched in 2015, accounting for $1 billion in sales in 2018.7 Regulators
are concerned with its rising popularity as vaping among teenagers and youth has
reached alarming levels, and research has yet to reach a consensus on whether vapor
products are as harmful as (or more harmful than) cigarettes. From the perspective
of the managers and investors of firms in this industry, this heightened regulatory
concern poses significant risk to their businesses’ viability as regulators have restricted
their product sales, and are debating the possibility of banning e-cigarettes
entirely.
Monitoring taxes imposed to restrict usage of e-cigarettes would be of critical
interest to both regulators and managers. Many states have passed legislation
that treats vaping as equivalent to smoking and imposes additional restrictions.
6Statista, Electronic cigarettes (e-cigarettes) dollar sales in the United States from 2014 to 2018
(in billion U.S. dollars)
7Reuters, Altria says Juul sales skyrocket to $1 billion in 2018, Jan 31, 2019
Of the key states in the US, California, Kansas, Pennsylvania, and West Virginia
passed taxes on e-cigarettes.8 California amended the definition of the term
“smoking” to include vaporization-based devices and increased taxes by 27.3% of
wholesale cost in April 2017.9 Kansas enforced a tax of $0.20/milliliter of e-liquid
for e-cigarettes starting July 2016, with somewhat mixed design and implementation.10
Pennsylvania enforced a tax of 40% of wholesale price starting August
2016.11 West Virginia added a tax of $0.075/milliliter of e-liquid for e-cigarettes starting
July 2016.12 A typical pack of four Juul pods contains 2.8 milliliters of e-liquid13 and
thus became subject to a tax of $0.56 in Kansas and $0.21 in West Virginia. While we
do not have wholesale prices for Juul pods, assuming a conservative 75% margin
for retailers leads to a wholesale price of $4 for a pack that retails at $16. This corresponds
to conservative estimates of a per-pack tax increase of $1.09 in California
and a per-pack tax of $1.60 in Pennsylvania. Thus, we note that Pennsylvania and California
introduced much higher taxes than Kansas and West Virginia.
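The per-pack tax figures above follow from simple arithmetic, reproduced in the short sketch below. All inputs are the numbers stated in the text (pack volume, retail price, assumed retailer margin, and statutory rates). The variable names are chosen for this illustration only.

```python
ML_PER_PACK = 2.8      # e-liquid in a pack of four pods, in milliliters
RETAIL_PRICE = 16.00   # retail price per pack, in dollars
RETAIL_MARGIN = 0.75   # assumed (conservative) retailer margin

# Wholesale price implied by the assumed margin: 16 * (1 - 0.75) = 4.
wholesale_price = RETAIL_PRICE * (1 - RETAIL_MARGIN)

per_ml_taxes = {"Kansas": 0.20, "West Virginia": 0.075}          # dollars per ml
ad_valorem_taxes = {"California": 0.273, "Pennsylvania": 0.40}   # share of wholesale

tax_per_pack = {}
for state, rate in per_ml_taxes.items():
    tax_per_pack[state] = round(rate * ML_PER_PACK, 2)
for state, rate in ad_valorem_taxes.items():
    tax_per_pack[state] = round(rate * wholesale_price, 2)
```

Under these assumptions the per-pack tax in Pennsylvania is roughly three times that in Kansas and more than seven times that in West Virginia.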
We investigate the impact of these tax policies on vaping posting behavior
on social media. Specifically, we estimate the prevalence of underage vaping,
8Note that Minnesota enacted taxes on e-cigarettes in 2013, and in 2015 the states of Delaware,
Louisiana, and North Carolina also enacted taxes. We are unable to analyze these policies given
the limitation of time span since most of our data are on / after Q2, 2015. We exclude these four
states in all our analysis.
9CNBC: Feds give big tobacco new headache as California taxes proving hazardous to cigarette
sales, July 28, 2017
10Kansas Health Institute (2016), E-Cigarette Policy, Regulation and Marketing (February 2016).
Kansas later revisited its taxes and reduced the tax to 5 cents / ml, offering credit to retailers who
might have already paid these taxes and delayed effective date to July 1, 2017. See: E-Cigarette
Tax Fix Moves Forward In Kansas, KCUR, April 2017. For our analysis purposes, we consider July
1, 2016 as the tax enactment date.
11Pennsylvania Department of Revenue (2016): Other Tobacco Products Tax
12West Virginia State Tax Department: E-cigarette Liquids Excise Tax FAQ
13Juul: Discover More About JUULpods & Flavors
identify the demographics of individuals in posts, and estimate the extent of disguising
from social media images. We plan to extend our study to include the
recent stricter regulations enacted by the states of New York, Massachusetts, and
Michigan in September 2019, and also estimate the interaction of these regulations
with the COVID health crisis of 2020. From a managerial perspective, firms can
use near real-time image data to detect inappropriate usage of their vape products.
Regulators will also be interested in the impact of regulation on social media
posting behavior as they monitor denormalization of vaping among youth. Deep
learning methods from the computer vision literature are particularly suited to extracting
these variables, which traditional structured data lack, while large-scale surveys are
often expensive to conduct. This motivates the need to study e-cigarettes using
image analysis.
2.3 Literature Review
Three areas of research are related to our work. The first stream of literature relevant
to our work is the growing literature in marketing that uses images and
image analysis methods. These include developing new product designs and predicting
customer aesthetic appeal (Burnap et al., 2019), logo creation (Dew et al.,
2019), presence of human faces in images and engagement on Twitter (Li and Xie,
2020), brand features from social media images (Liu et al., 2020), and image features
and rental demand on Airbnb (Zhang et al., 2017). We differ from the above
papers as we combine image analysis methods with causal inference. Further-
more, our methodological contribution is detecting disguising from images.
The second stream of relevant literature is on health policies and regulations.14
Some relevant papers study the impact of calorie posting regulations on discussions
of health in online reviews (Puranam et al., 2017), price elasticities for tobacco
products that incorporate addiction (Gordon and Sun, 2015), physician payment
disclosures and prescription behavior (Guo et al., 2020), and recreational marijuana
legalization and online cannabis-related search (Wang et al., 2019). We differ from
the above papers as our substantive interest is in underage (illegal) consumption
of vaping products, and we use data from images posted on social media.
The third stream of literature is on e-cigarette consumption and regulation,15
where research has studied optimal taxation policies (Allcott and Rafkin, 2020),
effects of e-cigarette taxation policies on cigarette consumption (Chen and Rao,
2020), minimum age sales laws and cigarette consumption (Dave et al., 2019),
the Minnesota e-cigarette tax (Pesko and Warman, 2017; Saffer et al., 2018), and e-cigarette
advertising (Tuchman, 2019). We differ from the above papers as follows:
we estimate underage posting and disguising from social media images, which
cannot be estimated using consumption data or textual data. Furthermore, con-
14There is also literature in machine learning that has studied health topics. For example, Twitter
posts and discussions of Juul (Allem et al., 2018), detecting obesity from satellite images (Maharana
and Nsoesie, 2018), detecting binge drinking from social media (ElTayeby et al., 2017), and
predicting age from facial images (Rothe et al., 2015).
15There is also literature in public health that has studied e-cigarettes and their health effects. For
example, e-cigarettes as a cessation device for smoking (Barbeau et al., 2013; Brown et al., 2014),
pulmonary toxicity of e-cigarettes (Chun et al., 2017), and youth e-cigarette usage and subsequent
smoking habits (Barrington-Trimis et al., 2016; Miech et al., 2017; Soneji et al., 2017).
sumption data is non-existent given the illegality of underage consumption.
2.4 Data
As discussed earlier, we scraped publicly available images about vaping from
January 2015 – January 2019 from a leading social media website in the US on
which users post images. We adopted the following procedure to determine
which posts are about vaping: we first scraped all posts with the hashtag “juul”,16
and from these posts, we identified the 10 most commonly co-occurring hashtags
to identify other vaping-related posts.17 We limit ourselves to 10 hashtags because of
scraping limitations: there were 37,841 unique hashtags in the posts that we scraped
with the hashtag “juul”.
These 10 hashtags occurred in 10.37% of the posts, whereas the average hashtag
occurred in 0.0026% of the posts. Furthermore, scraping these 10 hashtags was
costly and time consuming despite parallelizing on three online virtual servers.
Since we are interested in estimating effects of state-wide tax policies in the US,
we restrict the scraping to US-based posts based on the location tagged in the
posts. The resulting sample has 785,431 US-based posts across these 11 hashtags.
Table 2.1 lists the hashtags and number of posts scraped.
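The hashtag selection step described above can be sketched as a simple co-occurrence count. This is an illustration on hypothetical toy posts, not the scraping code used for this chapter. The function name `top_cooccurring_hashtags` and the representation of a post as a list of hashtags are assumptions made for the example.

```python
from collections import Counter

def top_cooccurring_hashtags(posts, seed="juul", k=10):
    """Return the k hashtags that most often appear alongside the seed hashtag."""
    counts = Counter()
    for tags in posts:
        tags = {t.lower() for t in tags}
        if seed in tags:
            counts.update(tags - {seed})  # count every other hashtag once per post
    return [tag for tag, _ in counts.most_common(k)]

# Hypothetical posts, each given as the list of hashtags it carries.
toy_posts = [
    ["juul", "vape", "vapelife"],
    ["juul", "vape"],
    ["JUUL", "vapenation", "vape"],
    ["juul", "vapelife"],
    ["coffee", "vape"],  # ignored: does not carry the seed hashtag
]
top_tags = top_cooccurring_hashtags(toy_posts, k=2)
```

In this toy example the two selected hashtags are "vape" and "vapelife", which would then be scraped in addition to the seed hashtag.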
Figure 2.1 shows the (log) number of posts by state and year. We find that in
2015, the number of posts was much higher in California relative to other states.
By 2018, we see that the number of posts across states has increased, suggesting wider
16We start with the hashtag “juul” since Juul is the largest company in this industry.
17We plan to add more hashtags in future research.
Table 2.1: Number of Posts by Hashtag
This table lists the number of social media posts that were scraped for the duration of Jan
2015 – Jan 2019.
Hashtag Number of Posts
vape 540,009
vapenation 81,409
vapelife 64,935
vapeshop 33,297
vapelyfe 19,185
juul 18,282
vapecommunity 9,965
vapeporn 9,282
vapefam 7,767
juulvapor 693
juulpods 607
spread of posting practices across the US.
We exclude 2015 data and Jan 2019 data due to data sparsity, which leaves us
with a total of 750,819 posts.18 To remove posts that are potentially spam posts,
or posted by vape shops, we next scrape all the users’ profiles and calculate the
18We note that the number of posts in 2015 and Jan 2019 is 4.5% of the total data, and sparse
for many states. Furthermore, for 2019 we do not have data for entire Q1 2019 (due to scraping
constraints).
Figure 2.1: Number of Vaping Related Posts by State and Year in the US
This figure shows the log number of social media posts that were scraped by year and
state
Table 2.2: Information Observed for Each Post
This table lists the information that is scraped for each of the social media posts
Variable Description
Post ID Unique identifier for the post
User ID Unique identifier for the user who made the post
Timestamp Time when the post was posted by the user
Latitude Latitude of the post’s location
Longitude Longitude of the post’s location
Num Likes Number of likes as of the scraping date
Num Comments Number of comments as of the scraping date
Caption Text associated with the post
vaping related posts as a percentage of their total posts. We exclude the posts of
users for whom more than 25% of total posts are vaping related.19 Thus, we are left
with 388,593 posts after excluding those whose users have posted more than 25%
of their posts about vaping. Table 2.2 shows the information observed
for each of these posts. We observe when the post was posted, the user’s total
number of posts, latitude and longitude of location, numerical count of likes and
comments for post, text caption, and an associated image. We estimate a post’s
state location based on its latitude and longitude.
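The user-level filter described above can be sketched as follows. This is a minimal illustration on hypothetical records: the field names `post_id` and `user_id`, the function name, and the treatment of a share of exactly 25% (kept) are assumptions made for the example.

```python
from collections import Counter

def filter_heavy_vaping_posters(vaping_posts, total_posts_by_user, cutoff=0.25):
    """Keep posts of users whose share of vaping related posts is at most the cutoff."""
    vaping_counts = Counter(post["user_id"] for post in vaping_posts)
    kept = []
    for post in vaping_posts:
        user = post["user_id"]
        share = vaping_counts[user] / total_posts_by_user[user]
        if share <= cutoff:  # users above the cutoff are treated as spam or vape shops
            kept.append(post)
    return kept

# Hypothetical scraped vaping posts and total post counts from user profiles.
vaping_posts = [
    {"post_id": "p1", "user_id": "u1"},
    {"post_id": "p2", "user_id": "shop"},
    {"post_id": "p3", "user_id": "shop"},
    {"post_id": "p4", "user_id": "shop"},
    {"post_id": "p5", "user_id": "u2"},
]
total_posts_by_user = {"u1": 40, "shop": 4, "u2": 4}

kept_posts = filter_heavy_vaping_posters(vaping_posts, total_posts_by_user)
kept_ids = [post["post_id"] for post in kept_posts]
```

Here the account "shop" has 75% vaping related posts and is dropped, while the two other users are retained.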
2.5 Methodology
First, we discuss the intuition behind convolutional neural networks, which form
the basis for the deep learning methods for image analysis. Second, we discuss the
methods to estimate demographics and detect disguising in images: Aggregated
Residual Neural Networks (Xie et al., 2017) and Mask R-CNN (He et al., 2017).
Finally, we discuss generalized synthetic controls (Xu, 2017) for causal inference.
2.5.1 Convolutional Neural Networks
We first describe the intuition behind convolutional neural networks (CNNs
henceforth) and the building blocks of a CNN - convolutional layers and pool-
ing layers.
A convolutional layer is characterized by several filters, or kernels, that are ap-
plied in a sliding manner (i.e. from one pixel to the next). The weights of these
kernels are learned during the training process. Figure 2.2 shows an illustration of
a convolutional layer on a greyscale image.20 The image has dimensions of 5 × 5,
19Figure 2.13 in the Appendix 2.9.1 shows number of users and posts at different cutoffs.
20This example is for illustrative purposes with a greyscale image (thus 1 channel). In practice,
images have three channels corresponding to red, blue, and green colors, and are of much larger
dimensions (a 1 Megapixel image is usually 1024 × 1024 dimension). The conceptual process re-
and each pixel is represented as a pixel value (unit normalized). The kernel is a
3×3 matrix with weights w1 : w9 and an overall offset, or bias b,
which are learned during training. The weights of the kernel (w1 : w9) and the image
pixel values in the current window are multiplied element-wise and summed (in a
sum-product manner), and the kernel bias b is then added to obtain the output feature
map values. In practice, convolutional
layers have several kernels (ranging from 64 to 512), and the intuition
is that these kernels learn different local correlations. For example, if an image of a
face is passed through a convolutional layer, these kernels can learn to detect local
correlations such as face boundaries, nose, ears, and other facial features.
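The sliding sum-product operation can be written out directly for a single-channel image. The sketch below is for illustration only and assumes a square image, a square kernel, a stride of one pixel, and no padding. The toy pixel values and the kernel weights are hypothetical, whereas in a trained network the weights and bias are learned.

```python
def conv2d_single_channel(image, kernel, bias=0.0):
    """Slide a square kernel over a square image (stride 1, no padding)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    feature_map = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = bias
            for i in range(k):
                for j in range(k):
                    # sum-product of kernel weights and the pixels under the window
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 5 x 5 greyscale image with unit-normalized pixel values.
image = [[(r * 5 + c) / 24 for c in range(5)] for r in range(5)]
# Hypothetical 3 x 3 kernel that simply picks the center pixel of each window.
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

feature_map = conv2d_single_channel(image, kernel, bias=0.5)
```

A 3 × 3 kernel on a 5 × 5 image yields a 3 × 3 feature map, since the kernel fits in three positions along each dimension.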
A pooling layer is used to introduce down-sampling, i.e. it reduces the dimensionality
of the input feature map by retaining a single statistic of each local
window of the input feature map, such as the maximum value. Figure 2.2 shows
an illustration of a max-pooling layer. There are several variants of pooling layers,
such as max-pooling (maximum value in the window) and average pooling (average
value in the window), among others. The intuition for pooling layers is that these
statistics (maximum/average) are approximately sufficient for CNNs to perform
tasks such as image classification.
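Both pooling variants can be expressed with one small function that applies a chosen statistic to each window. This is a sketch under stated assumptions: the window size of 2 and the stride of 1 are illustrative defaults (in practice the stride often equals the window size), and the input feature map is hypothetical.

```python
def pool2d(feature_map, window=2, stride=1, statistic=max):
    """Retain one statistic (max by default) of each local window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            values = [
                feature_map[r + i][c + j]
                for i in range(window)
                for j in range(window)
            ]
            pooled_row.append(statistic(values))
        pooled.append(pooled_row)
    return pooled

def mean(values):
    return sum(values) / len(values)

# Hypothetical 3 x 3 input feature map.
fm = [[1, 3, 2], [4, 6, 5], [7, 9, 8]]
max_pooled = pool2d(fm)                  # max-pooling
avg_pooled = pool2d(fm, statistic=mean)  # average pooling
```

With these inputs, max-pooling gives [[6, 6], [9, 9]] and average pooling gives [[3.5, 4.0], [6.5, 7.0]].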
CNNs are especially useful to analyze images, as kernels and pooling layers
capture spatial correlations in images, that is, nearby pixels in an image are connected
(i.e. local connectivity; for example, if a pixel in an image is one of the many
pixels of a “dog”, then its adjacent pixels are also likely to be those of a “dog”).
mains the same regardless of image size and color.
Figure 2.2: Building Blocks for CNNs
Convolutional Layer
A 3 × 3 kernel with weights w1 : w9 and bias b is applied in a sliding manner on an image
with dimensions 5 × 5. The resulting feature map (right) is obtained by element-by-element
multiplication of image pixel values and kernel weights, summed with a bias added. Notice that the
top left element in the feature map on the right corresponds to the kernel operation with
the image (solid orange box on the image on the left). Other elements in the feature map
are obtained by sliding the kernel over the image.
Max Pooling Layer
A 2 × 2 max pooling operation takes the maximum value in the window as the output for
the feature map, and is applied in a sliding manner over the input feature map. Notice
that the top left element in the output feature map (right) corresponds to the maximum
value in the 2 × 2 window (solid blue box on the input feature map on the left). Other
elements in the output feature map are obtained by sliding the window over the input
feature map.
Furthermore, CNNs can also handle translations of objects in images (i.e. transla-
tion invariance, for example if the “dog” is moved to a different location or scale,
i.e. different in size in the image, then the feature values for the “dog” also adjust
accordingly).
Typically, a CNN has a sequence of several convolutional layers, pooling layers,
and non-linear activations such as ReLU to construct feature maps. The scope
(or coverage) of the feature maps changes with the depth of the layer in a CNN: feature maps
obtained from the shallower layers, i.e. the first few layers, capture more local correlations,
whereas feature maps obtained from the deeper layers, i.e. layers towards the
end of a CNN, capture more global correlations. This is because, for instance, a 3×3
filter in the first layer has a coverage of 3 × 3 pixels in the original image, whereas
a 3×3 filter in the second layer has a coverage of 5×5 pixels in the original image.
Finally, these feature maps can then be used in classifiers such as fully-connected
layers or support vector machines for tasks such as object detection and image
classification.
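The growth in coverage with depth can be computed with the standard receptive field recursion, sketched below. The function name and the representation of layers as (kernel size, stride) pairs are assumptions made for this illustration.

```python
def receptive_field(layers):
    """Coverage (in input pixels, per dimension) after each layer.

    layers: list of (kernel_size, stride) pairs ordered from first to last layer.
    """
    field, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes

# Three stacked 3 x 3 convolutions with stride 1.
coverage = receptive_field([(3, 1), (3, 1), (3, 1)])
```

For stacked 3 × 3 filters with a stride of one, the coverage is 3, 5, and 7 pixels per dimension after the first, second, and third layers, matching the 3 × 3 and 5 × 5 figures given in the text for the first two layers.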
2.5.2 Detecting Demographics in Images
We now describe the approach to estimate age, gender, and race from faces
in images. We train three models to detect these demographics as classification
tasks:
1. Age: Underage, i.e. less than 21 years old, or not
2. Gender: Female or male
3. Race: Asian, Black, White, and Others
We use a class of convolutional neural networks, Aggregated Residual Neural
Nets (Xie et al., 2017, ResNeXts henceforth), to classify demographics in images.
Residual Neural Nets (He et al., 2016, ResNets henceforth) tackle the degradation
problem associated with deep learning models: as the number of layers in a neural
network increases, optimizing the non-linear layers becomes harder. One main
reason for this issue is vanishing gradients: with more layers, the gradients of the
loss function can go towards zero for the initial layers, and the initial layers are
unable to update their parameters. ResNets overcome this limitation by changing
the learning objective of the neural network layers from learning functional
mappings to learning residuals. They do so by adding identity connections, i.e. input feature
maps are added to the output after every few convolutional layers. ResNeXts
build on this further with an additional parameter of cardinality, which we discuss
next.
Model Architecture
We first briefly discuss a ResNet-18 model, i.e. a ResNet with 18 layers, and
subsequently discuss how ResNeXts build on it. The network takes in as input an
image and applies 64 convolutional filters of size 7 × 7 to obtain 64 feature maps.
This is followed by 16 convolutional layers with feature maps varying from 64 to
512.21 Figure 2.3 shows the ResNet-18 architecture.
ResNet-18 has residual blocks which consist of 2 convolutional layers, given
by F(x) = C2[ReLU(C1(x))], where x is the input to the residual block, and C1 and C2
are the two convolutional layers in the residual block. ReLU is the rectified linear
unit activation function, given by:
ReLU(x) = max(0, x) (2.1)
The rectified linear unit activation function enhances the capability of the neu-
ral network to learn nonlinearities in the data. We use this activation function
after each layer in the network, excluding the last fully connected layer. A resid-
ual block is followed by an identity mapping as shown in Figure 2.4 (i.e., adding
the input at the end of the residual block), and is given by:
y = F(x) + x (2.2)
where y is the input to the subsequent residual block, and F(x) is the residual
mapping that is learned during training. F(x) approximates H(x) − x, where H(x)
is the underlying functional mapping. The reason for choosing the residual mapping
F(x) is that it is easier to optimize non-linear convolutional layers to push the
21ResNet-18 applies max-pool layer after the first convolutional layer, and average pool layer
after the last convolutional layer. The model architecture, number of filters, and number of layers
is from He et al. (2016)
Figure 2.3: ResNet-18 Architecture
This figure shows the architecture for the ResNet-18 model. The orange box represents the
input image of dimensions 224×224 pixels that is fed to the network. The convolutional
layers are denoted as “a×b conv, c”, where a×b is the size of the convolutional filter in
pixels, and c represents the number of feature maps, i.e. filters used for that layer. A
“×2” next to a convolutional block indicates that there are two such convolutional
layers. Solid black lines represent the residual connections that maintain the dimensionality
of the previous layer, whereas dotted black lines represent the residual connections
that double the dimensionality from the previous layer. The final layer is the fully connected
layer whose output size depends on the task. For instance, for age classification
the output size will be 2, whereas for race classification the output size will be 4.
Figure 2.4: Residual Block of ResNet
This figure shows a residual block for the ResNet-18 model. It consists of two convolutional
layers, with ReLU activation after the first layer. The input to the residual block, x,
is added back after the two convolutional layers as an identity mapping.
residuals (H(x) − x) to zero rather than expect the non-linear layers to learn the
exact functional mappings. He et al. (2016) also empirically demonstrate the ad-
vantage of the residual mapping approach across several image tasks (ImageNet
ILSVRC 2015 and COCO 2015 competitions).
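The residual block y = F(x) + x with F(x) = C2[ReLU(C1(x))] can be illustrated with a small numerical sketch. As a simplifying assumption, dense matrices stand in for the two convolutional layers, and the weights are hypothetical. The point of the example is that the block reduces to the identity mapping when the residual F(x) is pushed to zero.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(weights, vector):
    # Stand-in for a convolutional layer: a plain matrix-vector product.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, weights1, weights2):
    fx = linear(weights2, relu(linear(weights1, x)))  # F(x) = C2[ReLU(C1(x))]
    return [f + xi for f, xi in zip(fx, x)]           # y = F(x) + x

x = [1.0, -2.0, 0.5]

# With all weights at zero the residual is zero and the block is the identity.
zeros = [[0.0] * 3 for _ in range(3)]
identity_output = residual_block(x, zeros, zeros)

# With hypothetical non-zero weights the block adds a small correction to x.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w2 = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
y = residual_block(x, w1, w2)
```

In the second case the output is approximately [1.1, -2.0, 0.55]: the negative entry is zeroed by the ReLU inside F(x) but is still carried forward unchanged by the identity connection.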
We also apply batch normalization before every layer to prevent scale effects,
such that the scale of the data does not affect model training. In total, ResNet-18
consists of 17 convolutional layers and a fully connected layer at the end. This
fully connected layer gives a c-dimensional output, where c is the number of
classes for our data. That is, for the age and gender classification tasks we have c = 2,
and for the race classification task we have c = 4. Model predictions are obtained by
applying a softmax operation to the output of the fully connected layer. Specifically,
the softmax operation applied on model output xi for a class i with a total of
c classes is given by:
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{c} e^{x_j}} \tag{2.3}$$
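The softmax operation in equation 2.3 can be sketched in a few lines. Subtracting the largest output before exponentiating is a common numerical safeguard that leaves the result unchanged. It is an implementation choice assumed here, not something specified in the text, and the example outputs are hypothetical.

```python
import math

def softmax(outputs):
    """Convert c model outputs into c class probabilities that sum to one."""
    shift = max(outputs)  # numerical stability; cancels out in the ratio
    exps = [math.exp(v - shift) for v in outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the final fully connected layer for c = 4 classes.
probs = softmax([2.0, 1.0, 0.1, -1.0])
predicted_class = probs.index(max(probs))
```

The predicted class is the one with the largest probability, which is also the class with the largest raw output.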
We now describe the ResNeXt model (Xie et al., 2017), which builds further on
ResNets. Xie et al. (2017) propose an architectural modification to ResNets that
allows for the split-transform-merge procedure. They empirically demonstrate
that adding parallel paths in the residual blocks increases the model’s capacity
and flexibility by reducing the risk of over-adapting the model architecture to a specific
dataset, and allows for better training as the split paths are lower dimensional features
that require less computational cost to optimize. More formally, with arbitrary
transformation functions Ti and cardinality C, the aggregate transformation
is given by:
$$F(x) = \sum_{i=1}^{C} T_i(x) \tag{2.4}$$
With ResNeXt models, it is straightforward to switch to ResNet models if we
set the cardinality to 1. Figure 2.5 shows a block of ResNet and its ResNeXt equiv-
alent. There are 32 parallel paths, i.e. Ti, which are summed up at the end to
maintain dimensionality. The identity shortcut mapping is subsequently added
to fit the residual neural networks framework.
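The aggregation in equation 2.4, followed by the identity shortcut, can be sketched as below. The individual paths are hypothetical stand-ins (an element-wise scaling followed by ReLU) rather than the bottleneck convolutions of an actual ResNeXt block. The sketch only illustrates how 32 parallel transformations are summed and how a cardinality of one recovers a single-path residual block.

```python
def aggregated_residual(x, transforms):
    """y = x + sum_i T_i(x): split-transform-merge plus the identity shortcut."""
    aggregated = [0.0] * len(x)
    for transform in transforms:
        aggregated = [a + t for a, t in zip(aggregated, transform(x))]
    return [a + xi for a, xi in zip(aggregated, x)]

def make_path(scale):
    # Hypothetical path T_i: element-wise scaling followed by ReLU.
    return lambda x: [max(0.0, scale * v) for v in x]

x = [1.0, -1.0, 2.0]
paths = [make_path(0.01 * (i + 1)) for i in range(32)]  # cardinality C = 32

y = aggregated_residual(x, paths)                # ResNeXt-style block
single_path = aggregated_residual(x, paths[:1])  # cardinality 1: ResNet-style block
```

Because every path maps the input to an output of the same dimensionality, the paths can be summed element-wise and then added to the shortcut without any reshaping.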
We use the ResNeXt model with 101 layers and cardinality of 32 and bottle-
Figure 2.5: ResNeXt and ResNet Blocks
This figure shows the block for residual neural networks: ResNet Block on the left, and
the block for aggregate residual network: ResNeXt Block on the right. The ResNeXt block
has a cardinality of 32, i.e. there are 32 parallel paths of smaller dimensionality which are
aggregated and subsequently summed with the identity mapping.
neck size of 8. The final two layers in the demographic detection models are fully
connected layers with sizes of 2048 and 512 respectively, and the output of the
model is obtained using a softmax.
Training
We first describe the training data used to train the demographic models. We
use the UTKFace and FairFace datasets. The UTKFace dataset consists of 23,708
images which are labeled with age in years and gender (male or female). Figure 2.6
77
Figure 2.6: Distribution of Age in Training Data
This figure shows the distribution of age in the training data of UTKFace. The dotted
gray line represents the threshold for underage, i.e. points to the left of the dotted gray
line represent underage faces with an age of < 21 years old.
shows the distribution of training data with different age labels.
The dataset consists of 20.54% of the data represents underage faces, i.e. with
age of less than 20. There are 47.7% female faces in the UTKFace dataset. We chose
the FairFace dataset for race labels since it has balanced distribution in the dataset
78
across different the different races – 14.1% for race: Black, 19.1% for race: White,
26.6% for race: Asian, and 41.2% across other races – Latino, Indian, and Middle
Eastern. For the age and gender classification tasks, we use the UTKFace dataset
which comprises of 21,263 training images and 2,445 test images. For the race
classification task, we sub-sample 14,863 training images and 1,407 test images
from the FairFace dataset.22
We next describe the loss function. We use the cross-entropy loss function to
train the model. The loss for a batch of size n with a total of c object classes is
given by:
$$\mathrm{loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log(\hat{y}_{ij}) \tag{2.5}$$
where $y_{ij}$ is equal to 1 if image i belongs to class j, and 0 otherwise. $\hat{y}_{ij}$ is the
model predicted output for image i and class j, with values in the range of [0, 1].
The gradient of this loss function is then back-propagated through the layers using stochastic
gradient descent to learn the model parameters. More formally, the following
update procedure is used:
$$\theta_r \leftarrow \theta_r - \eta \frac{\partial\, \mathrm{loss}}{\partial \theta_r} \tag{2.6}$$
where θr are the parameters of the network and η is the learning rate. We
use the ADAM optimizer Kingma and Ba (2014), a variant of stochastic gradient descent
that uses momentum and adaptive per-parameter step sizes. We train the three models
for 100 epochs, i.e. passes over the data, with a mini-batch size of 50 images. We start
with randomly initialized model parameters and use an initial learning rate of 0.01.
The learning rate is decreased by a factor of 10 after every 8 consecutive epochs
without improvement in accuracy on the test dataset. On the test dataset, the age
model has a precision of 93.2% in detecting underage faces, the gender model has
a precision of 97.8% in detecting female faces, and the race model has an average
precision of 86.37% for detecting the four race classes.
22 Note that the FairFace dataset comprises 100,000 images; however, we use a random
subset for computational feasibility reasons, as the computation time and GPU requirements
increase substantially with larger training data.
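The cross-entropy loss and a single gradient step can be illustrated with the short Python sketch below. It follows the usual convention in which the loss carries a negative sign and parameters move against the gradient. As a simplifying assumption, the logits themselves are treated as the free parameters, so this is plain gradient descent on a toy problem rather than ADAM on a ResNeXt network; the labels, logits, and learning rate are hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, predictions):
    """Average cross-entropy over a batch; labels are one-hot rows."""
    n = len(labels)
    total = 0.0
    for y_row, p_row in zip(labels, predictions):
        # Only the true class (y = 1) contributes to the sum.
        total += sum(y * math.log(p) for y, p in zip(y_row, p_row) if y > 0)
    return -total / n

def gradient_step(logits, labels, learning_rate):
    """One descent step on the logits, treated here as the free parameters."""
    n = len(labels)
    updated = []
    for z_row, y_row in zip(logits, labels):
        p_row = softmax(z_row)
        # For softmax followed by cross-entropy, d loss / d z_ij = (p_ij - y_ij) / n.
        grad = [(p - y) / n for p, y in zip(p_row, y_row)]
        updated.append([z - learning_rate * g for z, g in zip(z_row, grad)])
    return updated

labels = [[1, 0, 0], [0, 1, 0]]                  # hypothetical one-hot labels
logits = [[0.2, 0.1, -0.3], [0.0, 0.4, 0.1]]     # hypothetical model scores
loss_before = cross_entropy(labels, [softmax(z) for z in logits])
logits = gradient_step(logits, labels, learning_rate=0.01)
loss_after = cross_entropy(labels, [softmax(z) for z in logits])
```

In this toy example the loss after the step is lower than before, which is the behavior the update rule is designed to produce for a sufficiently small learning rate.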
2.5.3 Detecting Disguising in Images
We now turn to object detection and segmentation methods from computer
vision literature to detect disguising behavior in images. We use Mask R-CNN He
et al. (2017) to detect disguising behavior in the form of emoji overlay in images.23
There are several media discussions of emoji usage on social media that motivate
our task for detecting emoji overlay in images.24 Note that the disguising measure
obtained by detecting emoji in images provides a lower bound, as there can be
other forms of disguising such as hiding Juul behind persons or other objects.25
23 As a robustness check, we conducted an RA-based validation to check if the 85 emojis are used in
social media images for disguising. The results of this paper are substantially similar when we
restrict to only the 20 emojis (of the 85 emojis) most commonly used for disguising.
24 See, e.g., Daily Mail, Revealed: How teenagers use secret emoji code to deal Class A drugs
on Snapchat and Instagram as gangs target ’digital savvy’ school pupils, July, 2017; Drug Addiction
Now, Emojis Give Youth a New Way to Communicate About Substance Abuse, Oct 2018.
25 We can also identify other objects that commonly co-occur with emoji and use these objects as
measures of disguising. Note however that such analysis will only be correlation based and not
a definitive measure of disguising. We report the objects that commonly co-occur with
emojis in Appendix 2.9.6.
Figure 2.7: Example of Object Detection for Juul and Emoji
The figure on the left is an image with a Juul device, and the figure on the right uses an
emoji to disguise Juul.
Figure 2.7 shows an example of disguising using emoji overlay in images.
We first briefly describe object detection methods. There are two steps in these
methods: region proposal generation and object classification. In the first step,
the region proposal block gives candidate regions, i.e., parts of the image that
potentially contain an “object”. In the second step, the object detection block takes
the object candidates and classifies these objects among different object classes.
Previously, several region proposal models have been used in the literature, with
selective search Uijlings et al. (2013) being a commonly used method. Selective
search uses low level image features of color, texture, size, and fill to combine pixels
in an image to generate candidate object regions using pixel similarity. We refer
the readers to Uijlings et al. (2013) for further description of the selective search
method.
Having a separate region proposal block is computationally expensive. Furthermore,
it fails to take advantage of correlations between the region proposal generation
task and the object detection task. That is, if a region in an image is an object
candidate, then its features will be useful for the object detection task. Recent
CNN based methods capitalize on this and share convolutional layers. This has
been employed in recent methods such as Faster R-CNN Ren et al. (2015) and Mask
R-CNN, which use shared convolutional layers as inputs to both the region proposal
block and the object detection block. This eliminates the need for a separate region
proposal method, thereby reducing computation time for training object detection
methods. Furthermore, sharing convolutional features for region proposal
and object detection leads to higher accuracy. We use Mask R-CNN in this paper
to detect disguising in images.
Model Architecture and Training
We first describe the Faster R-CNN Ren et al. (2015) approach and then dis-
cuss how Mask R-CNN builds on it. Faster R-CNN builds on Fast R-CNN which
uses VGG-16 neural network Simonyan and Zisserman (2014) as the backbone to
compute convolutional features. VGG-16 is a 16-layer neural network, with 13
convolutional layers and 3 fully connected layers.
Faster R-CNN has two blocks: the region proposal generation block and the
object detection block, which share the convolution layers. Figure 2.8 shows the
Faster R-CNN architecture. The shared convolutional layers are used by both the
region proposal generation block (middle block in Figure 2.8: region proposal
block) and the object detection block (final block in Figure 2.8: object detection
block).
To generate region proposals, we slide a 3x3 filter convolutional layer, predicting
k = 9 “anchors”, i.e. centers of region proposals at each location.26 This
is followed by two fully connected layers which predict the bounding box co-ordinates
and an “objectness” score, i.e., whether the anchor box contains an object
or not. The detection criterion for the region proposal block is intersection over union
(IOU), i.e. the amount of overlap between the ground truth boxes and predicted boxes.
Region proposals with IOU > 0.5 are taken as positive training samples, and those
with IOU < 0.5 are taken as negative training samples.27 256 region proposals are
generated for each image, 128 positive and 128 negative proposals. These region
proposals, i.e. bounding box co-ordinates and “objectness” scores, are then passed
to the object detection block. An ROI Pooling layer is used to quantize the features
in the regions for subsequent classification. The ROI Pooling layer divides each region
into a 7x7 grid and takes the maximum value of the features in each cell to output a 7x7
feature map. This feature map is then passed to two separate fully connected
layers: one outputs the bounding box co-ordinates, and the other, followed by a softmax,
gives the object class with the maximum probability.
26 k = 9 anchor boxes are calculated at each slide of the convolutional filter since we predict at 3
different aspect ratios (1 : 1, 1 : 2, 2 : 1) and 3 different scales (128 × 128, 256 × 256, 512 × 512).
27 We follow the standard practice in computer vision literature of intersection over union > 0.5
as the threshold for a positive detection during evaluation.
Figure 2.8: Faster R-CNN Architecture
This figure shows the architecture for Faster R-CNN. The shared convolution layers provide
features which are used by both the region proposal model and the object detection
model. The region proposal model applies a convolutional layer that outputs intermediate
features. These features are then fed to two fully connected layers which give, for
k regions, 2k class scores (object or no-object) and 4k co-ordinates for region proposals.
These region proposals are then fed to the object detection model, which applies an ROI
Pooling layer to the region proposals on the features of the shared convolutional layers.
This is followed by two fully connected layers to generate intermediate features: the region
of interest feature vector. Two fully connected layers then output the predicted class and
bounding box co-ordinates for the region of interest.
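The anchor construction (3 scales by 3 aspect ratios, k = 9) and the intersection-over-union criterion described for the region proposal block can be sketched in Python as follows. This is an illustrative sketch, not the implementation used here: the function names, the reading of each ratio as width:height, and the example ground-truth box are our assumptions.

```python
import math

def make_anchors(center_x, center_y, scales=(128, 256, 512),
                 ratios=((1, 1), (1, 2), (2, 1))):
    """Return k = len(scales) * len(ratios) boxes as (x1, y1, x2, y2)."""
    anchors = []
    for scale in scales:
        area = scale * scale
        for ratio_w, ratio_h in ratios:
            # Keep the area fixed while changing the width:height ratio.
            width = math.sqrt(area * ratio_w / ratio_h)
            height = area / width
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

anchors = make_anchors(300, 300)
ground_truth = (240, 230, 370, 360)   # hypothetical annotated box
# Anchors overlapping the ground truth by more than 0.5 are positive samples.
labels = [iou(anchor, ground_truth) > 0.5 for anchor in anchors]
```

In this example only the anchor whose size and shape roughly match the annotated box is labeled positive; the larger and elongated anchors fall below the 0.5 threshold.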
Mask R-CNN builds on Faster R-CNN, as it also outputs masks for the regions
of interest. A mask (pixel mapping of objects in the image) is useful for object
segmentation, i.e. extracting regions (pixel level) of the image where the object
is present. He et al. (2017) argue that the ROI Pooling layer of Faster R-CNN is
appropriate for the object classification task; however, the quantization of features done
in the ROI Pooling layer leads to sub-optimal segmentation. They suggest ROI
Align instead. ROI Align avoids any quantization of features as it performs bi-linear
interpolation of features. We refer the readers to the He et al. (2017) paper
for a detailed description of the bi-linear interpolation operations performed in
the ROI Align layer.
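The difference between quantizing a sampling location and interpolating at it can be illustrated with a small Python sketch. This is a simplified stand-in for the two layers, not their actual implementation: it samples a single point from a hypothetical 2x2 feature map, and it assumes the coordinates lie inside the map.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Interpolate a 2-D grid (list of rows) at real-valued column x and row y."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(feature_map[0]) - 1)
    y1 = min(y0 + 1, len(feature_map) - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom

def quantized_sample(feature_map, x, y):
    """Snap to the enclosing grid cell, as a quantizing layer would."""
    return feature_map[int(math.floor(y))][int(math.floor(x))]

feature_map = [[0.0, 1.0],
               [2.0, 3.0]]   # hypothetical feature values
aligned = bilinear_sample(feature_map, 0.5, 0.5)
snapped = quantized_sample(feature_map, 0.5, 0.5)
```

At the midpoint of the four cells, interpolation blends all four neighboring values, whereas snapping keeps only the top-left cell. This loss of sub-cell position is the kind of misalignment that matters for pixel-level masks.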
Mask R-CNN keeps the architecture of Faster R-CNN the same up until the
object detection block, i.e. the shared convolutional layers and region proposal
block are exactly the same. Figure 2.9 shows the Mask R-CNN architecture. For
brevity, the architecture shown is after the region proposals are generated.28
Different from Faster R-CNN, Mask R-CNN uses 5 scales and 3 aspect
ratios, for a total of 15 anchor boxes per location. After the ROI Align operation, a
convolutional layer is used to generate the intermediate region of interest feature
maps. These are then fed to two separate fully connected layers to predict object
class and bounding box co-ordinates. There are also two additional convolutional
layers that predict the object mask of dimension 14x14. The non-linear activation
for the hidden layers is ReLU. The shared convolutional layer block comprises
the convolutional layers of the ResNeXt-101 block as described in Section 2.5.
28 This is done because Mask R-CNN and Faster R-CNN have the same architecture until the region
proposal generation block.
Training
We first describe the training dataset for detecting emojis in images. The emojis
for detection are given in Figure 2.14 in Appendix 2.9.2. There are a total of
85 emojis in our library. We manually sub-sampled 1000 images from our data
with a Juul present based on visual inspection. We then superimposed the emojis
over the Juul device in the images. For each of the 85 emojis, we randomly sample
50 of those 1000 images as training images, for a total of 4,250 (85 × 50) training
images. For these images, we
constructed the masks, i.e. annotated the regions in the image where the emoji is
present, using the open source VIA image annotation tool Dutta and Zisserman
(2019). We appointed an RA to do these tasks. Figure 2.10 shows an image from
our training data with annotations. The Mask R-CNN output is similar to the
annotation format: which emoji is detected, i.e. object class, the bounding box for
the emoji, and the pixels in the image where the emoji is present, i.e. mask.
Figure 2.9: Mask R-CNN Architecture
This figure shows the architecture (post region proposal generation) for Mask R-CNN.
The shared convolution layers provide features which are used by both the
region proposal model and the object detection model. The proposals from the
region proposal model (same as Faster R-CNN) are fed to the ROI Align layer
and subsequently through a convolutional layer to obtain intermediate features:
the region of interest feature vector. Two fully connected layers then output the
predicted class and bounding box co-ordinates for the region of interest. Masks are
generated using two subsequent convolutional layers.
Figure 2.10: Example of Annotated Data for Disguising
The figure shows a heart emoji used to disguise Juul usage. The annotated data consists of
a bounding box (pink box), object class (emoji “0”), and mask (pixels shaded to annotate
heart emoji).
We next describe the loss function for Mask R-CNN. The loss consists of three
components: classification loss, bounding box regression loss, and mask loss. The
classification loss Lcls is the cross-entropy loss (as defined in Section 2.5) for whether
the predicted object class is the true object class. The bounding box regression loss
Lreg quantifies the discrepancy between the predicted bounding box and the ground truth
bounding box. The mask loss Lmask is the per-pixel loss for whether the predicted pixel
mask matches the ground truth mask, computed as the average cross-entropy loss (as defined
in Eq. 2.5). More formally, the overall loss function for Mask R-CNN is given as:
$$L = \frac{1}{N_{cls}} \sum_{i} L_{cls}(p_i, p_i^{*}) + \frac{1}{N_{reg}} \sum_{i} p_i^{*} L_{reg}(t_i, t_i^{*}) + \frac{1}{N_{mask}} \sum_{i} \sum_{j} \mathbb{1}_{jk} L_{mask}(\delta_{jk}, \delta_{jk}^{*}) \tag{2.7}$$
where, for a box i, $p_i$ is the predicted probability of detecting an object, $p_i^{*}$ is
the ground truth for whether the box contains an object or not, $t_i$ are the predicted
coordinates, and $t_i^{*}$ are the ground truth coordinates.29 A coordinate consists of
four variables: x, y, w, and h, where x and y are the two-dimensional co-ordinates
for the center of the bounding box, and w and h are the width and height respectively
of the bounding box; these are subsequently scaled around the center of
the bounding box and log-transformed. The bounding box regression loss Lreg is:
$$L_{reg}(t_i, t_i^{*}) = \sum_{v \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_i^{v} - t_i^{*v}) \tag{2.8}$$
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & |x| \geq 1 \end{cases}$$
29 We refer the readers to the Mask R-CNN paper for details on the transformations for the
coordinates.
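The bounding box regression loss of Eq. 2.8 can be written directly in Python. The sketch below is illustrative only; the predicted and target coordinates are hypothetical values assumed to be already transformed.

```python
def smooth_l1(x):
    """Quadratic near zero, linear in the tails."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def box_regression_loss(predicted, target):
    """Sum of smooth L1 terms over the four box coordinates (Eq. 2.8)."""
    return sum(smooth_l1(predicted[v] - target[v]) for v in ("x", "y", "w", "h"))

# Hypothetical (already transformed) coordinates for one box.
predicted = {"x": 0.2, "y": -0.1, "w": 1.8, "h": 0.0}
target = {"x": 0.0, "y": 0.0, "w": 0.3, "h": 0.0}
loss = box_regression_loss(predicted, target)
```

Here the small errors in x and y are penalized quadratically, while the large error in w is penalized linearly, so a single badly predicted coordinate does not dominate the gradient.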
The mask loss is a pixel-by-pixel average cross-entropy loss. It takes as input
the feature value for pixel j in the 14 × 14 predicted mask, with $\delta_{jk}$ being the predicted
mask value (whether pixel j is predicted to belong to the mask for object class k), and
$\delta_{jk}^{*}$ the ground truth mask value.
We train the model using back-propagation with stochastic gradient descent
(as defined in Section 2.5) for 20,000 iterations with a mini-batch size of 1,000 re-
gions (2 images, 500 region proposals per image). We follow the same learning
rate schedule as He et al. (2017), starting with a learning rate of 0.01, weight de-
cay of 0.0001, and momentum of 0.9. To test the accuracy of the method, we held
out 100 randomly selected images with emoji and 100 randomly selected images
without emoji in them. We tested whether the method is able to predict which
images have emoji in them. Our model achieved precision of 92.46% and recall
of 98% for an F1 score of 95.15%. Note that we do not have a direct benchmark
for comparing performance since there is no emoji detection dataset or challenge
to the best of our knowledge. The closest benchmark task we found is the object
detection task for 200 object categories in the ILSVRC 2017 challenge, where the
best performing model has a mean average precision of 73.14%.30
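The reported F1 score follows from the reported precision and recall as their harmonic mean. The short Python check below is only a worked calculation using the two figures stated in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for the emoji detection model.
f1 = f1_score(0.9246, 0.98)
f1_percent = round(100 * f1, 2)
```

With a precision of 92.46% and a recall of 98%, the calculation gives 95.15%, matching the reported F1 score.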
2.5.4 Estimating the Effect of Tax Policies
We use the generalized synthetic control method Xu (2017) to estimate the effect of
state taxes on vaping posting behavior. This method unifies the synthetic control
method with the interactive fixed effects (IFE) model. Counterfactuals for treated
units are imputed in three steps. In the first step, the IFE model is estimated
on the control data. In the second step, the factor loadings for each treated unit
are estimated by linearly projecting treated units’ pretreatment data on the factor
space. Finally, these estimated factor loadings and factors are used to impute
treated counterfactuals. The generalized synthetic control method has been used
in marketing literature Guo et al. (2020); Pattabhiramaiah et al. (2019), and builds
further on the synthetic control method Abadie et al. (2010), which has been used
extensively in marketing literature Li (2019); Tirunillai and Tellis (2017).
30 See ILSVRC 2017: ImageNet Large Scale Visual Recognition Challenge 2017 results, accessible
from: http://image-net.org/challenges/LSVRC/2017/results#det
Suppose there are N states, with T states that passed taxes (the treatment
states) and C states that did not pass any taxes. Let Yit be the dependent outcome
variable for state i at time period t. We assume a parametric form for Yit given by
a linear factor model as:
$$Y_{it} = \delta_{it} D_{it} + \beta X_{it} + \lambda_i F_t + \varepsilon_{it} \tag{2.9}$$
where Dit is a dummy variable that takes the value of 1 when state i is treated
at time period t, and Xit is the set of state-time varying covariates: per-capita consumption
of cigarettes in the preceding year, (log) population in the previous
quarter, and (log) per-capita income in the previous quarter. λiFt is the linearly
additive factor component of the model,31 and εit is the idiosyncratic error term.
The IFE model parameters (β, λi, Ft)
are estimated by minimizing the sum of squared residuals subject to
the constraints that the factors and factor loadings be unit normalized and orthogonal.32
We then estimate the factor loadings for treated units in the second
step by minimizing the mean squared error of the predicted outcomes in the pretreatment
period. In the final step, we impute the treated counterfactuals in the
post-treatment period using the factor loadings estimated in step 2 and the model
parameters estimated in step 1. In our analysis, the states that did not pass taxes
are considered as control units.33
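Steps two and three of this procedure can be illustrated in a deliberately reduced setting: one factor, no covariates, and a factor path that is assumed to be already available from step one. The Python sketch below is not the estimator of Xu (2017) in full; all numbers and function names are hypothetical.

```python
def estimate_loading(pre_outcomes, pre_factors):
    """Least-squares projection of pretreatment outcomes on a single factor."""
    numerator = sum(y * f for y, f in zip(pre_outcomes, pre_factors))
    denominator = sum(f * f for f in pre_factors)
    return numerator / denominator

def impute_counterfactual(loading, post_factors):
    """Untreated outcome implied by the factor model in the post period."""
    return [loading * f for f in post_factors]

# Hypothetical factor path, assumed to come from step 1 on the control states.
factors = [1.0, 1.2, 1.5, 1.7, 2.0, 2.1]
treatment_start = 4                      # index of the first treated period

# Hypothetical outcomes for one treated state.
treated_outcomes = [2.0, 2.4, 3.0, 3.4, 3.1, 3.2]

# Step 2: project pretreatment outcomes on the factor to get the loading.
loading = estimate_loading(treated_outcomes[:treatment_start],
                           factors[:treatment_start])
# Step 3: impute the counterfactual and take the gap to the observed outcome.
counterfactual = impute_counterfactual(loading, factors[treatment_start:])
effects = [y - y0 for y, y0 in zip(treated_outcomes[treatment_start:], counterfactual)]
```

In this toy example the pretreatment outcomes are exactly twice the factor, so the estimated loading is 2 and the gaps in the two treated periods are negative, i.e. the observed outcome lies below the imputed counterfactual.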
In this manner, treated counterfactuals are imputed for the post-treatment period,
and the average treatment effect across all treated states at time period t is given
as:
$$\widehat{ATT}_t = \frac{1}{T} \sum_{i \in T} \left[ Y_{it}(1) - \hat{Y}_{it}(0) \right] \tag{2.10}$$
where Yit(1) is the observed treated outcome for state i at time period t, and
Ŷit(0) is the imputed treated counterfactual. Confidence intervals are computed
as percentile confidence intervals from a parametric bootstrap with 1000 bootstrap
samples.
31 These factors allow for capturing varying unobserved heterogeneities, such as unit-specific
linear trends or quadratic trends, auto-regressive components, and more generally heterogeneities
of unobserved random variables that can be decomposed into the multiplicative form of a time
factor and a state loading. This is especially useful in our context to control for unobserved
changes in states over time.
32 We chose from zero to three factors, using cross-validation with a leave-one-out procedure
to minimize the mean squared percentage error.
33 We also exclude the states of Minnesota, Louisiana, North Carolina, and Delaware which had
previously enacted taxes.
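The averaging in Eq. 2.10 and the percentile interval can be sketched as follows. This is an illustration under stated assumptions: the observed and imputed outcomes are hypothetical, and the bootstrap draws are simulated placeholders standing in for the estimates that a parametric bootstrap would produce; the resampling procedure itself is not implemented here.

```python
import random

def att(observed, imputed):
    """Average gap between observed and imputed outcomes across treated states."""
    gaps = [y1 - y0 for y1, y0 in zip(observed, imputed)]
    return sum(gaps) / len(gaps)

def percentile_interval(draws, level=0.90):
    """Percentile confidence interval from a list of bootstrap estimates."""
    ordered = sorted(draws)
    lower_index = round((1 - level) / 2 * (len(ordered) - 1))
    upper_index = round((1 + level) / 2 * (len(ordered) - 1))
    return ordered[lower_index], ordered[upper_index]

# Hypothetical outcomes for four treated states at one time period.
observed = [12.0, 4.0, 9.0, 6.0]
imputed = [17.0, 5.5, 9.5, 6.0]
att_hat = att(observed, imputed)

# Placeholder draws standing in for 1000 bootstrap estimates of the ATT.
rng = random.Random(0)
bootstrap_draws = [att_hat + rng.gauss(0.0, 1.0) for _ in range(1000)]
lower, upper = percentile_interval(bootstrap_draws)
```

The 90% level is chosen to match the confidence intervals shown in the figures of this chapter.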
2.6 Results
We first report the summary statistics of the state-quarter level data. Table 2.3
reports the summary statistics for the observed covariates and image extracted
variables. We use the ResNeXt model as described in Section 2.5 to
estimate underage, gender (female), and race in images, and use the Mask R-CNN
model as described in Section 2.5 to detect disguising. We construct the dependent
variables for each week as a moving average over the past month, similar to the
approach followed by Hollenbeck (2018). At the state-week level, there are on average
3.75 underage faces and a total of 111.62 faces (both adult and underage).
The extent of underage disguising is 0.28 on average. There are on average 66.03
female faces. The averages by race are 18.84, 21.58, and 64.93 for Asian,
Black, and White vaping posts.
Table 2.3: Summary Statistics for State-Week Level Variables
This table shows the summary statistics for state level variables of cigarette consumption
per capita, (log) population, (log) per-capita income. Variables extracted from images
(underage, underage disguising, female, race: Asian, race: Black, and race: White) are
winsorized at the 5% and 95% levels.
Variable                    Mean       Std. Dev.   Min        Max
Underage                    3.74866    5.04659     0          19
Underage Disguising         0.27832    0.59836     0          2
Female                      66.03131   82.64968    0          317
Race: Asian                 18.84004   23.63845    0          90
Race: Black                 21.57946   27.87771    0          107
Race: White                 64.93459   80.45354    0          311
Cigarette Consumption       3.71763    0.43070     2.58776    4.57780
(Log) Population            15.29676   0.97821     13.26772   17.49185
(Log) Per-Capita Income     10.79572   0.16077     10.47622   11.22916
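The two preprocessing operations mentioned here, a trailing moving average and winsorization at the 5% and 95% levels, can be sketched in Python as below. The four-week window as a reading of "past month", the nearest-rank quantile rule, and the example counts are all assumptions of this sketch, and the order in which the two operations were applied is not specified in the text.

```python
def moving_average(values, window=4):
    """Trailing mean over the current and previous window - 1 observations."""
    averaged = []
    for t in range(len(values)):
        start = max(0, t - window + 1)
        chunk = values[start:t + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to the empirical lower and upper quantiles (nearest rank)."""
    ordered = sorted(values)
    low = ordered[round(lower * (len(ordered) - 1))]
    high = ordered[round(upper * (len(ordered) - 1))]
    return [min(max(v, low), high) for v in values]

# Hypothetical weekly counts of underage faces for one state.
weekly_counts = [0, 2, 4, 6, 40, 3, 5, 1]
smoothed = moving_average(weekly_counts)

# Winsorizing a hypothetical sample of 21 values clips the two extremes.
clipped = winsorize(list(range(21)))
```

The moving average spreads the spike in the fifth week over the following weeks, and winsorization replaces the smallest and largest values with the 5% and 95% quantiles rather than dropping them.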
The rest of the results section is organized in three parts. First, we report the
average treatment effects of taxes for underage vaping from social media images.
Second, we report the results for underage disguising in social media images af-
ter regulation. Third, we test whether there was a differential effect of taxes on
demographics of race and gender.
Figure 2.11: Effect of Taxes on Underage Posts
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for number of underage vaping
posts. The first plot shows the overall treatment effects, and the remaining four plots show
the results for California, Kansas, Pennsylvania, and West Virginia.
[Figure 2.11 panel: "Underage: All Treated States"; x-axis: Time relative to Treatment; y-axis: Coefficient]
2.6.1 Impact of Taxes on Underage Vaping Posts
We estimate the effect of taxes using the generalized synthetic control method. Figure
2.11 shows the average treatment effects across all treated states and for each of
the four states of California, Kansas, Pennsylvania, and West Virginia.
[Figure 2.11 panels: "Underage: CA", "Underage: KS", "Underage: PA", "Underage: WV"; x-axis: Time relative to Treatment; y-axis: Coefficient]
We find a decrease in underage vaping posts across the treated states in
weeks 30-90 after tax introduction. We next turn to each individual state and estimate
the treatment effects. We find that California and Pennsylvania had a decline
in underage vaping posts; however, California's decline came with a delay of about
a year after taxes, as the effect is significant only after week 70. Note that California
saw a decline of 5-12 underage vaping posts per week, as compared to an average of 17.17
underage posts per week in the pretreatment period. This corresponds to a reduction of
at least 29.12% (5/17.17) after tax introduction. Pennsylvania saw a decline of 1.25-7
underage vaping posts per week during weeks 20-55, as compared to an average of
5.39 underage vaping posts per week, corresponding to a decline of at least 23.19%
(1.25/5.39) after tax introduction.34 The timing of these effects is similar to other
papers in the marketing and addiction products literature Tuchman (2019). We do not
observe such a decline in underage faces for the states of Kansas and West Virginia
after passing taxes.
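The percentage reductions quoted for California and Pennsylvania correspond to the lower end of each estimated weekly decline divided by the pretreatment average. The Python snippet below simply reproduces that arithmetic from the numbers stated in the text.

```python
def percent_reduction(decline, baseline):
    """Decline in weekly posts as a percentage of the pretreatment average."""
    return 100.0 * decline / baseline

# Lower end of the estimated weekly decline and pretreatment weekly average.
california = round(percent_reduction(5, 17.17), 2)
pennsylvania = round(percent_reduction(1.25, 5.39), 2)
```

This yields 29.12% for California and 23.19% for Pennsylvania, the two figures reported above.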
As the next step, we measure the impact of taxes on engagement measures for
posts that contain underage faces. We use the following three measures: average
number of likes for these posts; average number of comments on these posts;
and proportion of posts with solo faces, i.e. posts that contain only one face. Figures
2.15, 2.16, and 2.19 in Appendix 2.9.4 show the estimated average treatment effects.
We find that Pennsylvania and West Virginia had an increase in the number
of likes for posts with underage faces, and that this effect is significant after week
60 post tax introduction. We do not find significant effects for the number of comments
on underage posts. This evidence suggests that while Pennsylvania saw a
decline in underage vaping posts, engagement with the remaining posts on social media
increased post tax introduction. The West Virginia finding of increased engagement
is puzzling.
34 Supplementary analysis for effectiveness of taxes with a difference-in-differences approach is
in Appendix 2.9.3. We find consistent evidence that California and Pennsylvania had a decline
in underage vaping posts.
Thus, we find that state taxes of California and Pennsylvania were effective in
deterring underage youth social media posting related to vaping, whereas Kansas
and West Virginia did not see such a decline. The Pennsylvania tax effectiveness
is weakened by the finding of increased engagement of the remaining posts. Reg-
ulators will be interested in these findings as higher engagement and influence of
vaping posts on social media can counter regulatory efforts to denormalize youth
vaping.
2.6.2 Impact of Taxes on Underage Disguising
We estimate the effect of taxes on underage disguised posts in a manner similar
to that of Section 2.6.1. Figure 2.12 shows the results.
Figure 2.12: Effect of Taxes on Underage Disguising
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for number of underage vaping
posts with disguising. The first plot shows the overall treatment effects, and the remaining
four plots show the results for California, Kansas, Pennsylvania, and West Virginia.
[Figure 2.12 panels: "Underage Disguising: All Treated States", "Underage Disguising: CA", "Underage Disguising: KS", "Underage Disguising: PA", "Underage Disguising: WV"; x-axis: Time relative to Treatment; y-axis: Coefficient]
We find that California had an increase in disguising after the tax was passed; however,
this effect goes away after week 25. The effect is as strong as a 100% increase in underage
disguising compared to the pretreatment levels (1.75 underage disguised
posts per week). We do not observe similar effects for the other states.35
As the next step, we measure the impact of taxes on engagement measures for posts
with disguising. We use the following three measures: proportion of posts with
underage faces, i.e. posts with disguising that contain an underage face; average
number of likes for these posts; and average number of comments on these posts.
We do not find significant effects for any of these measures. Figures 2.17 and 2.18 in
Appendix 2.9.4 show the average treatment effects.
Combined with the findings of underage vaping (Section 2.6.1), these findings
are important for regulators as our results suggest that while underage users tried
to disguise or deceive about e-cigarette usage in their social media posts in Cali-
fornia, the effect subsequently subsided and California taxes did lead to a decline
in underage vaping posts. Importantly, managers will also be interested in moni-
toring disguising as they aim to curtail unintended usage of their vaping products
to ensure regulatory compliance.
35 We then conduct additional analysis with a difference-in-differences approach, and report the
results in Appendix 2.9.3. We find convergent evidence of increased disguising in social media
posts for California.
2.6.3 Impact of Taxes by Race and Gender
We estimate the effect of taxes on race and gender of posting in the same man-
ner as Section 2.6.1. Appendix 2.9.5 shows the results for the average treatment
effects for each of the four states of California, Kansas, Pennsylvania, and West
Virginia. Our preliminary analysis suggests that Pennsylvania had a decline in
female vaping posts for the weeks 20-55 post tax passing. We do not find such
effects for other states. The research on differential gendered effects of vaping is
currently nascent and unclear. Nonetheless, managers and regulators might find
it useful to know trends of posting by gender, as research on gendered effects of
vaping becomes clearer.
For race, we find that Kansas saw a decline for race: Asian and White, whereas
it had an increase for race: Black. This is an unusual result given the expectation of a
decrease in consumption with a tax increase. A possible confound is that in 2015, a
year prior to the vaping tax in 2016, Kansas increased its cigarette tax significantly, by
63%. The cigarette industry has historically advertised heavily to Black consumers,
resulting in greater rates of consumption. We would need consumption data for both
e-cigarettes and regular cigarettes to verify our conjecture. Regardless, from a regulatory
point of view, increased posting by Black users post-tax is likely worrisome. Health
outcomes already disfavor minority groups, and these results go against regulatory
efforts to reduce health disparities.36 For managers, these results are concerning as
they highlight the future possibility of bad publicity or litigation resulting from
differential impact on minority groups.
36 For health outcome inequalities by race, see: CDC Health Disparities and Inequalities Report
- United States, 2013
2.7 Discussion
Vaping has become a concern for health regulators as nicotine addiction from
vaping causes pulmonary harm to youth and also increases the probability of sub-
sequent smoking. Lack of data on underage consumption makes it harder to mea-
sure the efficacy of taxation and other regulation. We utilize social media images
as a rough proxy of normalization of vaping and potentially of consumption. We
find that the states with higher taxes, California and Pennsylvania, had a decline in
the number of underage vaping posts. California’s decline in underage vaping posts
was preceded by an increase in disguising activity, and Pennsylvania’s decline in
underage vaping posts was accompanied by increased engagement for the under-
age posts. There also appear to be differential effects by race in Kansas.
Our work has both regulatory and managerial implications. These findings are
worrisome for regulators aiming to denormalize vaping among underage users,
and among minorities who already have less favorable health outcomes. From a
managerial perspective, our methods can be used by firms to detect inappropriate
usage of their products and gauge regulatory risk as firms must carefully navi-
gate the regulatory landscape. Continued regulatory violations could potentially
endanger their viability as a business.
We plan to extend our work on the following three fronts. First, we plan to extend
our data to estimate how long the tax effects persist. Second, we plan to study
the regulations passed in September 2019 by the state of Massachusetts, which
imposed a four-month ban on all vaping products, and by the states of Michigan
and New York, which banned sales of flavored e-cigarettes. Finally, we plan to
estimate the interaction effects of these regulations with the 2020 COVID health
crisis.
2.8 References
Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods
for comparative case studies: Estimating the effect of California’s tobacco control
program. Journal of the American Statistical Association, 105(490):493–505.
Allcott, H. and Rafkin, C. (2020). Optimal regulation of e-cigarettes: Theory and
evidence. Technical report, National Bureau of Economic Research.
Allem, J.-P., Dharmapuri, L., Unger, J. B., and Cruz, T. B. (2018). Characterizing
JUUL-related posts on Twitter. Drug and Alcohol Dependence, 190:1–5.
Barbeau, A. M., Burda, J., and Siegel, M. (2013). Perceived efficacy of e-cigarettes
versus nicotine replacement therapy among successful e-cigarette users: a qual-
itative approach. Addiction Science & Clinical Practice, 8(1):5.
Barrington-Trimis, J. L., Urman, R., Berhane, K., Unger, J. B., Cruz, T. B., Pentz,
M. A., Samet, J. M., Leventhal, A. M., and McConnell, R. (2016). E-cigarettes
and future cigarette use. Pediatrics, 138(1):e20160379.
Brown, J., Beard, E., Kotz, D., Michie, S., and West, R. (2014). Real-world effec-
tiveness of e-cigarettes when used to aid smoking cessation: a cross-sectional
population study. Addiction, 109(9):1531–1540.
Burnap, A., Hauser, J. R., and Timoshenko, A. (2019). Design and evaluation
of product aesthetics: a human-machine hybrid approach. Available at SSRN
3421771.
Callaway, B. and Sant’Anna, P. H. (2018). Difference-in-differences with multiple
time periods and an application on the minimum wage and employment. arXiv
preprint arXiv:1803.09015.
Chen, J. and Rao, V. R. (2020). A dynamic model of rational addiction with stock-
piling and learning: An empirical examination of e-cigarettes. Management Sci-
ence.
Chun, L. F., Moazed, F., Calfee, C. S., Matthay, M. A., and Gotts, J. E. (2017). Pul-
monary toxicity of e-cigarettes. American Journal of Physiology-Lung Cellular and
Molecular Physiology, 313(2):L193–L206.
Dave, D., Feng, B., and Pesko, M. F. (2019). The effects of e-cigarette minimum
legal sale age laws on youth substance use. Health economics, 28(3):419–436.
Dew, R., Ansari, A., and Toubia, O. (2019). Letting logos speak: Leveraging mul-
tiview representation learning for data-driven logo design. Available at SSRN
3406857.
Dutta, A. and Zisserman, A. (2019). The via annotation software for images, audio
and video. In Proceedings of the 27th ACM International Conference on Multimedia,
pages 2276–2279.
ElTayeby, O., Eaglin, T., Abdullah, M., Burlinson, D., Dou, W., and Yao, L. (2017).
Detecting drinking-related contents on social media by classifying heteroge-
neous data types. In International Conference on Industrial, Engineering and Other
Applications of Applied Intelligent Systems, pages 364–373. Springer.
Gordon, B. R. and Sun, B. (2015). A dynamic model of rational addiction: Evalu-
ating cigarette taxes. Marketing Science, 34(3):452–470.
Guo, T., Sriram, S., and Manchanda, P. (2020). Let the sunshine in: The impact
of industry payment disclosure on physician prescription behavior. Marketing
Science, 39(3):516–539.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings
of the IEEE international conference on computer vision, pages 2961–2969.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hollenbeck, B. (2018). Online reputation mechanisms and the decreasing value of
chain affiliation. Journal of Marketing Research, 55(5):636–654.
Kärkkäinen, K. and Joo, J. (2019). Fairface: Face attribute dataset for balanced
race, gender, and age. arXiv preprint arXiv:1908.04913.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
Li, K. T. (2019). Statistical inference for average treatment effects estimated by
synthetic control methods. Journal of the American Statistical Association, pages
1–16.
Li, Y. and Xie, Y. (2020). Is a picture worth a thousand words? an empirical study
of image content and social media engagement. Journal of Marketing Research,
57(1):1–19.
Liu, L., Dzyabura, D., and Mizik, N. (2020). Visual listening in: Extracting brand
image portrayed on social media. Marketing Science, 39(4):669–686.
Maharana, A. and Nsoesie, E. O. (2018). Use of deep learning to examine the asso-
ciation of the built environment with prevalence of neighborhood adult obesity.
JAMA network open, 1(4):e181535–e181535.
Miech, R., Patrick, M. E., O’Malley, P. M., and Johnston, L. D. (2017). E-cigarette
use as a predictor of cigarette smoking: results from a 1-year follow-up of a
national sample of 12th grade students. Tobacco control, 26(e2):e106–e111.
Pattabhiramaiah, A., Sriram, S., and Manchanda, P. (2019). Paywalls: Monetizing
online content. Journal of Marketing, 83(2):19–36.
Pesko, M. and Warman, C. (2017). The effect of prices and taxes on youth cigarette
and e-cigarette use: Economic substitutes or complements? Available at SSRN
3077468.
Puranam, D., Narayan, V., and Kadiyali, V. (2017). The effect of calorie posting
regulation on consumer opinion: a flexible latent dirichlet allocation model with
informative priors. Marketing Science, 36(5):726–746.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time
object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99.
Rothe, R., Timofte, R., and Van Gool, L. (2015). Dex: Deep expectation of apparent
age from a single image. In Proceedings of the IEEE International Conference on
Computer Vision Workshops, pages 10–15.
Saffer, H., Dench, D., Dave, D., and Grossman, M. (2018). E-cigarettes and adult
smoking. Technical report, National Bureau of Economic Research.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556.
Soneji, S., Barrington-Trimis, J. L., Wills, T. A., Leventhal, A. M., Unger, J. B., Gib-
son, L. A., Yang, J., Primack, B. A., Andrews, J. A., Miech, R. A., et al. (2017). As-
sociation between initial use of e-cigarettes and subsequent cigarette smoking
among adolescents and young adults: a systematic review and meta-analysis.
JAMA pediatrics, 171(8):788–797.
Tirunillai, S. and Tellis, G. J. (2017). Does offline tv advertising affect online chat-
ter? quasi-experimental analysis using synthetic control. Marketing Science,
36(6):862–878.
Tuchman, A. E. (2019). Advertising and demand for addictive goods: The effects
of e-cigarette advertising. Marketing Science, 38(6):994–1022.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders, A. W. (2013).
Selective search for object recognition. International journal of computer vision,
104(2):154–171.
Wang, P., Xiong, G., and Yang, J. (2019). Frontiers: Asymmetric effects of recre-
ational cannabis legalization. Marketing Science.
Xiang, S. and Li, H. (2017). On the effects of batch and weight normalization in
generative adversarial networks. arXiv preprint arXiv:1704.03971.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual
transformations for deep neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1492–1500.
Xu, Y. (2017). Generalized synthetic control method: Causal inference with inter-
active fixed effects models. Political Analysis, 25(1):57–76.
Zhang, S., Lee, D. D., Singh, P. V., and Srinivasan, K. (2017). How much is an
image worth? airbnb property demand estimation leveraging large scale im-
age analytics. Airbnb Property Demand Estimation Leveraging Large Scale Image
Analytics (May 25, 2017).
Zhang, Z., Song, Y., and Qi, H. (2017). Age progression/regression by conditional
adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE.
2.9 Appendix
2.9.1 Number of Social Media Posts by Users and Proportion of
Posts Related to Vaping
Figure 2.13 shows the number of posts and users based on proportion vaping
posts cutoff.
Figure 2.13: Number of Posts and Users based on Proportion Vaping
Posts Cutoff.
This figure shows the number of posts and users by cutoff for posts
2.9.2 Detecting Disguising in Images
Figure 2.14 shows the emojis used in the object detection method of Section 2.5.
Figure 2.14: Emoji List for Detection
2.9.3 Supplementary Analysis with Difference-in-Differences
We now discuss the staggered difference-in-differences approach of Callaway and
Sant’Anna (2018) (SDID henceforth). The traditional approach of difference-in-
differences with propensity score matching is not suitable in our context because
treatment periods vary across states. Callaway and Sant’Anna (2018) show the
efficacy of SDID in handling variation in treatment periods and demonstrate its
applicability by estimating the impact of federal minimum wage regulations
during 2007-2010 on teen employment.
We first introduce the notation. Suppose there are T time periods. Let Dt be a
binary indicator for whether a unit is treated in time period t, Gg a binary
indicator for whether the unit is first treated in time period g, and C a binary
indicator for whether the unit is in the control group, i.e., it is never treated.
The authors propose a generalized propensity score method to match treated units
with control units. The generalized propensity score pg is given by:
pg(X) = P(Gg = 1|X,C +Gg = 1)
where pg(X) is the probability that a unit with observed covariates X is first
treated in time period g, conditional on the unit belonging either to the group
first treated in period g or to the never-treated group. The generalized propensity
score is estimated using a logit model. The outcome variable at time t is Yt, with
Yt(1) and Yt(0) denoting the potential outcomes with and without treatment. Thus,
the average treatment effect on the treated, ATT (g, t), for units first treated in
time period g, evaluated at time period t (t ≥ g), is given by:
ATT (g, t) = E(Yt(1) − Yt(0)|Gg = 1)
Apart from the standard assumptions required for difference-in-differences,
SDID requires a modified parallel trends assumption for identification. Specifically,
SDID assumes conditional parallel trends, i.e., parallel trends hold conditional on
covariates:
E(Yt(0) − Yt−1(0)|X,Gg = 1) = E(Yt(0) − Yt−1(0)|X,C = 1)
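The role of this assumption can be seen in a short derivation. The following is a sketch rather than the authors' own exposition: it assumes that conditional parallel trends holds in every period from g to t, and it adds a no-anticipation assumption (the observed outcome of the treated group in period g − 1 equals Yg−1(0)), which the text does not state explicitly.

```latex
\begin{align*}
ATT(g,t) &= E[Y_t(1) - Y_t(0) \mid G_g = 1] \\
         &= E[Y_t(1) - Y_{g-1}(0) \mid G_g = 1] - E[Y_t(0) - Y_{g-1}(0) \mid G_g = 1], \\[4pt]
E[Y_t(0) - Y_{g-1}(0) \mid X, G_g = 1]
         &= \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) \mid X, G_g = 1] \\
         &= \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) \mid X, C = 1] \\
         &= E[Y_t - Y_{g-1} \mid X, C = 1].
\end{align*}
```

The unobserved counterfactual trend of the treated group is thus replaced by the observed change between periods g − 1 and t among never-treated units with the same covariates.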
With these assumptions, the average treatment effect for a unit first treated in
time period g, evaluated at time t, is a weighted average of the ’long differences’
of the outcome variable. The weights are based on the generalized propensity scores.
Thus, the average treatment effect, ATT (g, t), is given non-parametrically as:

\[
ATT(g,t) = E\left[\left(\frac{G_g}{E[G_g]} - \frac{\dfrac{p_g(X)\,C}{1 - p_g(X)}}{E\left[\dfrac{p_g(X)\,C}{1 - p_g(X)}\right]}\right)\left(Y_t - Y_{g-1}\right)\right]
\]
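The estimator above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names `fit_logit` and `att_gt`, the Newton-step logit fit, and the simulated data are all hypothetical, and expectations are replaced by sample means over units that are either first treated in period g or never treated.

```python
import numpy as np


def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Fit a logit model by Newton's method; returns coefficients, intercept first."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def att_gt(y_t, y_base, G, C, X):
    """Inverse-probability-weighted ATT(g, t) for one cohort g.

    y_t    : outcome in period t
    y_base : outcome in period g - 1 (the long-difference baseline)
    G      : 1 if the unit is first treated in period g, else 0
    C      : 1 if the unit is never treated, else 0
    X      : covariates used in the generalized propensity score
    """
    keep = (G + C) == 1                      # cohort g or never treated
    y_t, y_base, G, C, X = y_t[keep], y_base[keep], G[keep], C[keep], X[keep]
    beta = fit_logit(X, G)
    Z = np.column_stack([np.ones(len(G)), X])
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))    # estimated p_g(X)
    w_treated = G / G.mean()
    odds = p * C / (1.0 - p)
    w_control = odds / odds.mean()
    return float(np.mean((w_treated - w_control) * (y_t - y_base)))


# Simulated example (hypothetical data): the true effect is 2.0 and the
# untreated trend depends on X, so an unweighted comparison is biased.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
G = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.5 * X))).astype(float)
C = 1.0 - G
y_base = rng.normal(size=n)
y_t = y_base + 1.0 + 0.5 * X + 2.0 * G + rng.normal(scale=0.5, size=n)

estimate = att_gt(y_t, y_base, G, C, X)
d_y = y_t - y_base
naive = d_y[G == 1].mean() - d_y[C == 1].mean()
print(f"weighted ATT(g, t): {estimate:.3f}, unweighted difference: {naive:.3f}")
```

In this simulation the weighting matters because treated units have systematically higher X; reweighting the never-treated units by the estimated odds p/(1 − p) aligns their covariate distribution with that of the treated cohort before the long differences are compared.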
In our context, the observed state-quarter level covariates consist of the (lagged)
log of population, log of per-capita income, average cost of cigarettes, proportion
of non-adults (less than 18 years old), and proportion of online searches related to
vaping. These variables are similar to the ones used by Abadie et al. (2010) to
evaluate California’s tobacco control program. The pre-period for each state runs
from 2016 Q1 until its regulation takes effect.
We report the results for four quarters post taxes for the outcome variables
of proportion of underage posts, proportion of female posts, proportion of posts
with race: Asian, proportion of posts with race: Black, proportion of posts with
race: White, and proportion of posts with disguising (overall disguising).
Table 2.4 reports the results for underage vaping posts. Table 2.6 reports the
average treatment effects of taxes on underage vaping in social media posts with
additional covariates of Youth Behavioral Risk Survey, and with lagged outcome
variables.
Table 2.5 reports the results for gender and race. Table 2.7 reports the average
treatment effects of taxes on disguising in social media posts. Table 2.8 reports
the average treatment effects of taxes on disguising in social media posts with
additional covariates of Youth Behavioral Risk Survey, and with lagged outcome
variables.
Table 2.4: State-Wide Effect of Taxes: Underage
This table reports the average treatment effect of taxes on underage vaping in social
media posts for the treated states of California, Kansas, Pennsylvania, and West Virginia.
Q1, Q2, Q3, and Q4 denote the first through fourth quarters after treatment. Controls
include the lagged log of population, log of per-capita income, average cost of cigarettes,
per-capita consumption of cigarettes, proportion of the population that is non-adult, and
online search volume for vaping-related searches. Standard errors (in parentheses) are
clustered at the state level.
Q1 Q2 Q3 Q4
California -0.01289** -0.0134** -0.00272 -0.01341***
(0.00647) (0.00683) (0.00716) (0.00596)
Kansas -0.00637 -0.05086*** -0.03876*** -0.05068***
(0.01444) (0.00516) (0.01044) (0.00557)
Pennsylvania -0.00028 0.00465 -0.01068*** 0.00367*
(0.00842) (0.00689) (0.00356) (0.00204)
West Virginia -0.01578*** -0.00086 0.01298 0.00813
(0.00518) (0.0079) (0.01897) (0.0105)
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
Table 2.5: State-Wide Effect of Taxes: Other Demographics
This table reports the average treatment effect of taxes on other demographics: gender
(female), race - Asian, race - Black, and race - White. Q1, Q2, Q3, and Q4 denote the
first through fourth quarters after treatment. Controls include the lagged log of
population, log of per-capita income, average cost of cigarettes, per-capita consumption
of cigarettes, proportion of the population that is non-adult, and online search volume
for vaping-related searches. Standard errors (in parentheses) are clustered at the state level.
Gender
Q1 Q2 Q3 Q4
California 0.01389 0.00171 0.00467 -0.00441
(0.01888) (0.01336) (0.00337) (0.01591)
Kansas -0.06566*** -0.08426** -0.04711*** -0.02098
(0.02737) (0.03996) (0.01576) (0.01981)
Pennsylvania 0.00134 0.05355*** 0.00168 -0.01485
(0.02497) (0.01898) (0.01203) (0.02819)
West Virginia -0.01385 -0.00442 0.10064*** 0.09348***
(0.05503) (0.00966) (0.02833) (0.0323)
Race-Asian
Q1 Q2 Q3 Q4
California 0.01389 0.00171 0.00467 -0.00441
(0.01888) (0.01336) (0.00337) (0.01591)
Kansas -0.06566*** -0.08426** -0.04711*** -0.02098
(0.02737) (0.03996) (0.01576) (0.01981)
Pennsylvania 0.00134 0.05355*** 0.00168 -0.01485
(0.02497) (0.01898) (0.01203) (0.02819)
West Virginia -0.01385 -0.00442 0.10064*** 0.09348***
(0.05503) (0.00966) (0.02833) (0.0323)
Race-Black
Q1 Q2 Q3 Q4
California 0.01167 -0.00017 0.01317 0.02706***
(0.01819) (0.01587) (0.01019) (0.00691)
Kansas 0.08646*** 0.1848*** 0.01704 0.07787***
(0.0249) (0.0482) (0.03286) (0.03556)
Pennsylvania 0.01503 -0.00506 -0.01845 0.01469
(0.01242) (0.01647) (0.01586) (0.01241)
West Virginia -0.0441 0.01229 -0.06886*** -0.03161
(0.07026) (0.01237) (0.02884) (0.03065)
Race-White
Q1 Q2 Q3 Q4
California -0.01228 0.01546 0.01151 -0.01182***
(0.01699) (0.01053) (0.01741) (0.00498)
Kansas 0.00204 -0.15076*** 0.06235 -0.04311
(0.02775) (0.04512) (0.04192) (0.04105)
Pennsylvania -0.01881** -0.01003 0.04818*** -0.01311
(0.00918) (0.01672) (0.01732) (0.01368)
West Virginia 0.13592*** 0.26979*** 0.32453*** 0.1401***
(0.04457) (0.0376) (0.01746) (0.04665)
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
Table 2.6: Robustness Results for Table 2.4: Model Specifications
These tables report the average treatment effect of taxes on underage vaping in social
media posts for the treated states of California, Kansas, Pennsylvania, and West Virginia.
Q1, Q2, Q3, and Q4 denote the first through fourth quarters after treatment. Standard
errors (in parentheses) are clustered at the state level.
With YBRSS Variables - Additional controls include youth cigarette usage,
e-cigarette usage, and marijuana usage. Note that YBRSS did not collect data for
Kansas during the years of 2015 or 2017.
Q1 Q2 Q3 Q4
California 0.00454 -0.00894*** -0.00039 0.00348
(0.00447) (0.00288) (0.00476) (0.00481)
Kansas
Pennsylvania -0.00385 -0.0061 -0.02091*** 0.00472***
(0.00346) (0.00746) (0.00645) (0.0019)
West Virginia 0.01563 -0.01159 0.00764 0.01324***
(0.00999) (0.01386) (0.00981) (0.0015)
With Lagged Outcome Variables - Controls are lagged outcome variables.
Q1 Q2 Q3 Q4
California -0.00084 -0.01713*** -0.00348 -0.00644*
(0.00404) (0.00345) (0.00367) (0.00388)
Kansas -0.05059* -0.05003*** -0.047*** -0.04677***
(0.02671) (0.00867) (0.00512) (0.01129)
Pennsylvania -0.02155* 0.0029 -0.01127*** 0.00221
(0.01192) (0.00417) (0.00339) (0.00498)
West Virginia -0.0258 -0.00267 0.01144*** 0.01357
(0.02721) (0.00815) (0.00472) (0.01017)
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
Table 2.7: State-Wide Effect of Taxes: Disguising
This table reports the average treatment effect of taxes on disguising behavior in social
media posts. Q1, Q2, Q3, and Q4 denote the first through fourth quarters after treatment.
Controls include the lagged log of population, log of per-capita income, average cost of
cigarettes, per-capita consumption of cigarettes, proportion of the population that is
non-adult, and online search volume for vaping-related searches. Standard errors (in
parentheses) are clustered at the state level.
Q1 Q2 Q3 Q4
California 0.00351 0.01063*** 0.01061 0.01116***
(0.00826) (0.00476) (0.00685) (0.00321)
Kansas 0.07775*** 0.05825*** 0.06205*** 0.05966***
(0.0135) (0.00919) (0.01726) (0.01702)
Pennsylvania -0.0056 -0.00658 0.00197 -0.00314
(0.01104) (0.00844) (0.0057) (0.00745)
West Virginia 0.02403 0.02367 -0.00185 0.03642
(0.05309) (0.03105) (0.0396) (0.04608)
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
Table 2.8: Robustness Results for Table 2.7: Model Specifications
These tables report the average treatment effect of taxes on disguising in social media
posts for the treated states of California, Kansas, Pennsylvania, and West Virginia.
Q1, Q2, Q3, and Q4 denote the first through fourth quarters after treatment. Standard
errors (in parentheses) are clustered at the state level.
With YBRSS Variables - Additional controls include youth cigarette usage,
e-cigarette usage, and marijuana usage. Note that YBRSS did not collect data for
Kansas during the years of 2015 or 2017.
Q1 Q2 Q3 Q4
California 0.004 0.01477 0.00498 -0.00058
(0.00555) (0.01246) (0.01384) (0.0178)
Kansas
Pennsylvania 0.00511 -0.0003 -0.00256 0.00265
(0.0083) (0.00512) (0.0095) (0.00462)
West Virginia -0.02193 -0.00358 -0.05324*** -0.03171***
(0.01254) (0.01076) (0.01018) (0.01277)
With Lagged Outcome Variables - Controls are lagged outcome variables.
Q1 Q2 Q3 Q4
California 0.01202** 0.01525*** 0.01765*** 0.01154*
(0.00571) (0.00545) (0.00592) (0.00692)
Kansas 0.05291*** 0.05031*** 0.04655*** 0.0438***
(0.00681) (0.00606) (0.00655) (0.00658)
Pennsylvania 0.00811 0.01728 0.0068 0.01089
(0.01111) (0.01372) (0.01221) (0.00901)
West Virginia -0.02168 -0.00946 -0.08358*** -0.02417***
(0.01434) (0.00783) (0.02585) (0.00573)
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
2.9.4 Engagement Results
Figure 2.15 reports the effect of taxes on the number of likes for underage vaping
posts. We find an overall increase after week 75, and Pennsylvania and West Virginia
see an increase in the number of likes starting around week 60 after the tax
introduction.
Figure 2.16 reports the effect of taxes on the number of comments for underage
vaping posts. We do not observe any significant effects.
Figure 2.17 reports the effect of taxes on the number of likes for underage vaping
posts with disguising. We do not find a substantial increase after the tax introduction.
Figure 2.18 reports the effect of taxes on the number of comments for underage
vaping posts with disguising. We do not find a substantial increase after the taxes.
Figure 2.19 reports the effect of taxes on solo faces in vaping posts. We do not
find a substantial increase after the taxes.
Figure 2.15: Effect of Taxes on Likes: Underage
This figure reports the results of the generalized synthetic control method for gap be-
tween treated and counterfactual (and 90% confidence intervals) for the outcome variable
of number of likes for underage vaping posts. The first plot shows the overall treatment
effects, and the remaining four plots show the results for California, Kansas, Pennsylva-
nia, and West Virginia
[Figure 2.15 plots omitted. Panels: Underage Likes: All Treated States; Underage Likes: CA; Underage Likes: KS; Underage Likes: PA; Underage Likes: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.16: Effect of Taxes on Comments: Underage
This figure reports the results of the generalized synthetic control method for gap be-
tween treated and counterfactual (and 90% confidence intervals) for the outcome vari-
able of number of comments for underage vaping posts. The first plot shows the overall
treatment effects, and the remaining four plots show the results for California, Kansas,
Pennsylvania, and West Virginia
[Figure 2.16 plots omitted. Panels: Underage Comments: All Treated States; Underage Comments: CA; Underage Comments: KS; Underage Comments: PA; Underage Comments: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.17: Effect of Taxes on Likes: Underage Disguised
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of likes for underage disguised vaping posts. The first plot shows the overall
treatment effects, and the remaining four plots show the results for California, Kansas,
Pennsylvania, and West Virginia
[Figure 2.17 plots omitted. Panels: Disguising Likes: All Treated States; Disguising Likes: CA; Disguising Likes: KS; Disguising Likes: PA; Disguising Likes: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.18: Effect of Taxes on Comments: Underage Disguised
This figure reports the results of the generalized synthetic control method for gap be-
tween treated and counterfactual (and 90% confidence intervals) for the outcome variable
of number of comments for underage disguised vaping posts. The first plot shows the
overall treatment effects, and the remaining four plots show the results for California,
Kansas, Pennsylvania, and West Virginia
[Figure 2.18 plots omitted. Panels: Underage Comments: All Treated States; Disguising Comments: CA; Disguising Comments: KS; Disguising Comments: PA; Disguising Comments: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.19: Effect of Taxes on Solo Faces in Posts
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of posts with solo faces. The first plot shows the overall treatment effects, and
the remaining four plots show the results for California, Kansas, Pennsylvania, and West
Virginia
[Figure 2.19 plots omitted. Panels: Solo Faces: All Treated States; Solo Faces: CA; Solo Faces: KS; Solo Faces: PA; Solo Faces: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
2.9.5 Impact of Taxes by Race and Gender
Figure 2.20 reports the effect of taxes on female faces in vaping posts. We find
that Pennsylvania had a decline in female vaping-related posts in weeks 20-55
after the tax introduction.
Figure 2.21 reports the effect of taxes on the number of posts with race: Asian.
We find an overall decline after the tax introduction, which is seen in
Pennsylvania and Kansas.
Figure 2.22 reports the effect of taxes on the number of posts with race: Black.
We find an overall decline after the tax introduction; Pennsylvania shows a
decrease, whereas Kansas shows an increase.
Figure 2.23 reports the effect of taxes on the number of posts with race: White.
We find an overall decline after the tax introduction, which is seen in
Pennsylvania and Kansas.
Figure 2.20: Effect of Taxes on Gender: Female
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of female vaping related posts. The first plot shows the overall treatment effects,
and the remaining four plots show the results for California, Kansas, Pennsylvania, and
West Virginia
[Figure 2.20 plots omitted. Panels: Female: All Treated States; Female: CA; Female: KS; Female: PA; Female: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.21: Effect of Taxes on Race: Asian
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of Asian vaping related posts. The first plot shows the overall treatment effects,
and the remaining four plots show the results for California, Kansas, Pennsylvania, and
West Virginia
[Figure 2.21 plots omitted. Panels: Race Asian: All Treated States; Race Asian: CA; Race Asian: KS; Race Asian: PA; Race Asian: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.22: Effect of Taxes on Race: Black
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of Black vaping related posts. The first plot shows the overall treatment effects,
and the remaining four plots show the results for California, Kansas, Pennsylvania, and
West Virginia
[Figure 2.22 plots omitted. Panels: Race Black: All Treated States; Race Black: CA; Race Black: KS; Race Black: PA; Race Black: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
Figure 2.23: Effect of Taxes on Race: White
This figure reports the results of the generalized synthetic control method for gap between
treated and counterfactual (and 90% confidence intervals) for the outcome variable of
number of White vaping related posts. The first plot shows the overall treatment effects,
and the remaining four plots show the results for California, Kansas, Pennsylvania, and
West Virginia
[Figure 2.23 plots omitted. Panels: Race White: All Treated States; Race White: CA; Race White: KS; Race White: PA; Race White: WV. Horizontal axis: Time relative to Treatment. Vertical axis: Coefficient.]
2.9.6 Other Common Objects in Posts
We use a Mask R-CNN model with an architecture similar to that described in Section
2.5 that has been pre-trained on the ImageNet dataset. This allows us to detect the 80
most common classes from the ImageNet dataset.
We then apply this pre-trained model to our social media dataset. The five most
common objects (from the ImageNet objects) in our data are: bottle, dining table, cup,
cell phone, and book. We use the model's detections to construct the following three
tables. Table 2.9 reports the proportion of posts with disguising that contain these
common objects and compares it with the proportion in posts without disguising.
Table 2.10 reports the proportion of posts with underage faces that contain these
common objects and compares it with the proportion in posts without underage
faces. Finally, Table 2.11 compares the proportion of posts with disguising that
contain these common objects to the proportion of posts with underage faces that
contain them.
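The proportions in these tables can be computed from per-post detection output along the following lines. This is a minimal sketch, assuming the detector's output has already been reduced to a list of class labels per post; the function name `object_shares` and the example posts are hypothetical.

```python
from collections import Counter


def object_shares(detections):
    """Share of posts that contain each detected object label.

    detections: one iterable of labels per post, as returned by some
    object detector (hypothetical format). A label that appears several
    times in a post is counted once for that post.
    """
    if not detections:
        return {}
    counts = Counter()
    for labels in detections:
        counts.update(set(labels))
    n_posts = len(detections)
    return {label: count / n_posts for label, count in counts.items()}


# Four illustrative posts; the last one contains no detected objects.
posts = [["bottle", "cup", "bottle"], ["book"], ["bottle"], []]
shares = object_shares(posts)
print(shares)
```

Applying the same function separately to posts with and without disguising (or with and without underage faces) gives the two columns that each table compares.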
Table 2.9: Disguising and Common Objects
This table reports the difference in proportion of posts with common objects in posts that
have disguising and in the posts that do not have disguising.
Object         % of Posts with Disguising    % of Posts without Disguising    Difference
               (N = 22,049)                  (N = 22,045)
bottle         31.444%                       29.996%                          1.447%***
dining table   10.041%                       8.269%                           1.772%***
cup            9.660%                        7.321%                           2.339%***
cell phone     8.105%                        9.634%                           -1.53%***
book           7.461%                        7.013%                           0.448%*
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
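The text does not state which test underlies the significance stars in Table 2.9. A pooled two-proportion z-test is one standard choice for comparing two shares, and the sketch below applies it to the bottle and book rows. The choice of test and the names `two_proportion_ztest` and `stars` are assumptions made for illustration.

```python
import math


def two_proportion_ztest(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


def stars(p_value):
    """Map a p-value to the star notation used in the tables."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""


# Shares and sample sizes taken from the bottle and book rows of Table 2.9.
z_bottle, p_bottle = two_proportion_ztest(0.31444, 22049, 0.29996, 22045)
z_book, p_book = two_proportion_ztest(0.07461, 22049, 0.07013, 22045)
print("bottle:", stars(p_bottle))
print("book:", stars(p_book))
```

Under this assumed test, the bottle difference is significant at the 1% level and the book difference only at the 10% level, which agrees with the stars reported in the table.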
Table 2.10: Underage and Common Objects
This table reports the difference in proportion of posts with common objects in posts that
have underage faces and in the posts that do not have underage faces.
Object         % of Posts with Underage Faces    % of Posts with Non-Underage Faces    Difference
               (N = 9,910)                       (N = 9,910)
bottle         41.463%                           34.299%                               7.164%***
dining table   10.121%                           9.213%                                0.908%**
cup            8.234%                            9.203%                                -0.969%**
cell phone     10.464%                           10.444%                               0.02%
book           10.989%                           7.598%                                3.391%***
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
Table 2.11: Underage, Disguising, and Common Objects
This table reports the difference in proportion of posts with common objects in posts that
have underage faces and in the posts that have disguising.
Object         % of Posts with Underage Faces    % of Posts with Disguising    Difference (Underage - Disguising)
               (N = 9,910)                       (N = 22,049)
bottle         41.463%                           31.444%                       10.019%***
dining table   10.121%                           10.041%                       0.08%
cup            8.234%                            9.660%                        -1.426%***
cell phone     10.464%                           8.105%                        2.36%***
book           10.989%                           7.461%                        3.528%***
Statistical significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1
CHAPTER 3
USING DEEP LEARNING TO OVERCOME PRIVACY AND SCALABILITY
ISSUES IN CUSTOMER DATA TRANSFER
3.1 Introduction
Firms’ sensitive customer data are highly sought after by researchers who use
statistical and econometric models for causal and predictive analyses. The chal-
lenges to obtaining this data entail both privacy and scalability issues. Marketers,
for example, who need to build pricing and targeting models for consumer pack-
aged goods, require access to sales data at either the household or store level, as
well as the corresponding prices of given brands. While prices are publicly ob-
servable in stores and through promotion and advertisements, customer privacy
concerns, legal restrictions, or firms’ concerns regarding disclosure of valuable
information to competitors, are impediments to external sharing of sales data.
Therefore, traditional methods of external data release, e.g., through a third-party
vendor such as The AC Nielsen Company, require a high transaction cost due
This chapter is: Lee, Clarence, and Piyush Anand (2020). “Using Deep Learning to Overcome
Privacy and Scalability Issues in Customer Data Transfer”. Appendix 3.8.7 has been shortened for
brevity and due to document length limitations.
Researcher(s) own analyses calculated (or derived) based in part on data from Nielsen Con-
sumer LLC and marketing databases provided through the NielsenIQ Datasets at the Kilts Center
for Marketing Data Center at The University of Chicago Booth School of Business. The conclu-
sions drawn from the NielsenIQ data are those of the researcher(s) and do not reflect the views of
NielsenIQ. NielsenIQ is not responsible for, had no role in, and was not involved in analyzing and
preparing the results reported herein.
to prohibitive nondisclosure agreements (NDA) and restricted data usage agree-
ments (DUA).
Central to the NDAs and DUAs is the original data provider’s need to control
the privacy and accuracy of the data released. The current paradigm widely used
to facilitate this exchange is the transfer of small samples of customer data
that are anonymized through obfuscation or aggregation. On the one hand, the
larger the amount of data firms release to researchers, the more accurate are
the price elasticity and targeting estimates. Firms therefore have incentives to
release more data that are as unobfuscated as possible. On the other hand, firms
incur fewer privacy risks the smaller the data sample released and the higher
the degree of obfuscation. This trade-off between accuracy and privacy in data
disclosures has been extensively discussed in prior literature: real-world
situations drive the data provider to exert control along this trade-off
(Duncan and Stokes, 2004).
Exacerbating the transaction cost of this process is the actual transfer of the
data itself. In the age of big data and digital commerce, the four V’s of big data
gain significance: volume, velocity, variety, and veracity (Chintagunta et al., 2016;
Ansari and Li, 2018). In this paper, we focus on the volume and velocity aspects of
big data, as they can present nontrivial obstacles to the data transfer itself. While
researchers seeking to maximize the accuracy and generalizability of the data have
the incentive to acquire as much data from providers as possible, transferring and
housing large amounts of customer data can require nontrivial technical know-
how and significant data storage costs. Furthermore, the velocity of data that
refreshes into a data provider’s databases, often a matter of seconds, can vastly
outpace the speed of a single data exchange. Therefore, the need arises for an
approach to customer data transfer that can potentially address these issues.
Recent developments in deep learning offer the possibility of training a gener-
ative model that can mimic data generating distributions with an unprecedented
degree of accuracy (Goodfellow et al., 2013). Generative adversarial networks
(GANs henceforth) provide a flexible framework that can train two neural net-
works – a discriminator model (discriminator henceforth) and a generator model
(generator henceforth) – simultaneously by pitting them against each other. GANs
involve training a generative model that generates synthetic data and simulta-
neously training a discriminator model able to distinguish between the real and
generated synthetic data, resulting in the generator mimicking the firm-side data
generating distribution with a high degree of accuracy. This obviates the need to
share private and sensitive data with the generator, and also allows for updating
the generator as additional real-time data arrive. In contrast to estimation tech-
niques that first estimate a model on the firm’s side then subsequently transfer
some form of “data” such as actual data or synthetic data to the researcher, the
decoupled nature of this training algorithm has both privacy and scalability ad-
vantages.
We propose an approach for preserving customer privacy that involves transfer
of a generator (from GANs) as opposed to the aforementioned traditional
approaches. This also provides improved privacy protection: no private and
sensitive customer data leave the servers of the firm, as only the
discriminator, which is housed inside the firm’s firewalls, has access to the
private data. Furthermore, we find that our proposed method has scalability
advantages. The volume and velocity aspects of big data require any analysis to
be sufficiently flexible to handle large volumes of newly arriving data;
accordingly, this method’s data exchange cost, measured in computational and
logistical time, does not grow in proportion to data size. Further, marketers
will be interested in whether our proposed approach can tackle marketing
problems. We show two things along these lines. First, as a proof of concept, we
show that two marketing problems, price markups for optimal profits and customer
targeting, can be effectively tackled using our proposed approach. Second, we
also show that a firm need not train multiple GANs to tackle different problems.
That is, a single GAN trained on the firm data can be used to solve both
marketing problems. Thus, in this paper, we build on the privacy literature in
marketing and additionally analyze data scalability and the ability to tackle
marketing problems. We therefore explore the following research questions:
• Accuracy: How well do GANs mimic the data generating process (DGP)?
• Privacy: How well do GANs preserve privacy in the event that the trans-
ferred generator is compromised?
• Scalability: How do GANs accommodate the volume and velocity aspects
of big data?
• Applicability: How well do GANs perform on marketing problems of price
markups for optimal profits and customer targeting? Can one GAN perform
both tasks?
We find that GANs perform exceptionally well against benchmark methods
in terms of accuracy of replicating the original data, as evaluated via the
standard accuracy-privacy framework from prior literature. GANs outperform
benchmark methods in terms of mimicking the true data, both in density plots and
as measured using the Kolmogorov-Smirnov test, the Jensen-Shannon divergence, and
the Kullback-Leibler divergence.1 Furthermore, by modifying the training algorithm
to incorporate customer heterogeneity, the firm can control the accuracy-privacy
trade-off. In both cases, we find that GANs have lower information loss and lower
loss in privacy as compared to benchmarks. We also validate our findings on the
Nielsen household-level data and find that our accuracy-privacy results hold.
Next, we note that GANs, leveraging the “on-line” nature of stochastic gradi-
ent descent (SGD), are designed to handle both volume and velocity. In terms of
data volume, we find that the SGD framework scales well with respect to the size
of the data set, due to its distributed nature allowing for out-of-the-box parallel
CPU or GPU computing. The order of magnitude of the per iteration computa-
tion time does not grow according to the data size, but grows instead according
to factors under the researcher’s control, such as mini-batch size, number of train-
1We discuss in the Appendix 3.8.6 that increasing the GAN complexity (number of neurons)
reduces information loss. However, the improvements in information loss have a point of dimin-
ishing returns after an optimum value of number of neurons.
ing iterations, and GAN complexity. We find that training time per iteration only
increases marginally as we increase data volume from one thousand rows of data
to ten million rows of data, and stays the same order of magnitude. Furthermore,
GANs tackle the velocity aspects because the decoupled estimation nature of GANs
requires the transfer of only the gradients of the objective function, as opposed
to a costly transfer of an entire data set. We find that transferring a generator instead of data
is cheaper, due to lower file size, especially when the data volume grows large.2
This allows for a lightweight automated exchange method between the two par-
ties, such as the use of an application program interface (API), to “stream” the
latest gradients to the generator in the exchange process. We find that the infor-
mation loss converges faster when we stream the gradients, as opposed to redo-
ing the entire training with the new data. This light-weight, automated exchange
method also has logistical benefits. The traditional data transfer approach from
the synthetic data literature requires the involvement of trained data scientists for
each synthetic data set generated subsequent to the inflow of a substantial amount
of new data. This process can be both error prone and costly to firms as well as
researchers. The automated exchange process potentially alleviates this problem.3
Finally, we find that GANs perform well on the two tasks of optimal price
markups and customer targeting, as compared to benchmarks. The optimal prof-
its obtained from GANs are a minimum of 99.96% of the true profits, as com-
2Note that the data provider can also train a generator on its own end and transfer the trained
generator to the researcher. Our proposed approach is indifferent to which of these approaches
the data provider chooses.
3We note that quantifying the nature of API calls (volume, network bandwidth, server
requirements, among others) is not the primary focus of our paper, and argue that GANs can be
trained using API calls with “sufficient” network bandwidth.
pared to the closest benchmark: Top Coding with optimal profits at a minimum
of 69.91%. Customer targeting model accuracy from GANs is an order of magni-
tude higher than the nearest benchmark, Swap 20. We also find that GANs can
handle both these tasks simultaneously, i.e., a single GAN can tackle both these
problems and outperforms benchmarks with a loss of accuracy of 1.39% as compared
to the closest benchmark: Rounding with a loss of 2.07%. Throughout
these three contexts, GANs also outperform benchmarks on the accuracy-privacy
trade-off. These results stem from GANs’ ability to mimic the data generation
process closely while providing higher privacy protection.
To the best of our knowledge, our investigation is part of the small but growing
literature in marketing that utilizes GANs. In contrast to studies in the
computer science literature, we explore how GANs can be used to address the
trade-off between accuracy and privacy. Our most important contribution is to
demonstrate a proof of concept in which a simple GAN can generate
privacy-preserving clones of the real data that are capable of tackling multiple
marketing problems.
3.2 Existing Literature
Existing work in the privacy literature in marketing and economics focuses on
protection of data under the paradigm of transfer of true data between parties.
Security is afforded by masking true data via a predetermined mechanism and
accepting the trade-off between privacy and usefulness, such as targetability
(Goldfarb and Tucker, 2011). Past work in the marketing and statistics
literature on synthetic data protection has discussed, for example, such data
masking mechanisms as (i) aggregation (Steenburgh et al., 2003; Christen et al.,
1997; Tenn, 2006; Link, 1995), (ii) swapping (Reiter, 2010), (iii)
truncation/rounding (Schneider et al., 2018), and (iv) random noise addition
(Reiter, 2005). These varied benchmark methods and associated performance
metrics are used by Schneider et al. (2018) to evaluate their proposed data
protection schemes for point-of-sale data. Following the tradition of synthetic
data transfer (Abowd et al., 2012; Hu et al., 2014; Schneider and Abowd, 2015),
in which the provider generates synthetic data for transfer to the user,
Schneider et al. (2018) propose a Bayesian generalized linear model (GLM) for
generating protected synthetic data ex-post data creation. Our work differs from
the extant synthetic data protection literature in that GANs can generate
synthetic data for purposes of predictive modeling and inference via the
“lightweight” transfer of a generator instead of synthetic data in the transfer
process. The contributions of this paper entail the examination of the desirable
properties of this paradigm shift, which are data volume scalability,
transfer-file compression, and data streaming capabilities.
A growing stream of literature on utilizing machine learning in marketing has
developed in response to the call for integrating methods from computer science
and statistics to address the “V’s” of big data: volume, velocity, variety and verac-
ity (Chintagunta et al., 2016; Ansari and Li, 2018). For example, Liu et al. (2016)
leverage a combination of cloud computing, text mining, and machine learning to
handle massive volumes of online social platform data to forecast sales, and Tim-
oshenko and Hauser (2019) use a convolutional neural network to identify cus-
tomer needs from user-generated content. The latter neural network, estimated
using stochastic gradient descent (SGD), scales well on volume of data as well
as computing requirements. Without being restricted to large computer cloud
clusters, model training can, with the proper settings, be performed on a laptop.
Rafieian and Yoganarasimhan (2018) use the Extreme Gradient Boosting method
that enables scalability in the prediction of click-through rates for mobile
advertisements. Puranam et al. (2017) use a scalable Bayesian topic model to estimate
the impact of New York City calorie posting regulation on discussions of
health-related topics in restaurant reviews. A fully automated system designed by Culotta
and Cutler (2016) to estimate brand ratings from near real-time keyword Twitter
data addresses the velocity of big data. We build upon this stream of literature by
demonstrating that, when GANs are implemented on the backbone of SGD-type
training, the latter’s scalability properties carry over to considerations of volume
and velocity associated with implementing algorithms for privacy protection.
Lastly, this paper builds on the small but growing literature in marketing that
employs GANs. Burnap et al. (2019) use an ensemble of deep learning methods to
predict the aesthetic appeal of automotive designs as a means of augmenting the
aesthetic design process. They use GANs to generate product aesthetic proposals.
Malik and Singh (2019) discuss different deep learning methods in computer
vision and note that GANs have enabled realistic image generation. Our work
differs from that reported in this literature in that we demonstrate that GANs
can provide a scalable and privacy-preserving approach that can be used to solve
multiple marketing problems.
3.3 Methodology: Extant Approach and Benchmarks
We first compare the extant data transfer paradigm with our proposed data
transfer paradigm. We then evaluate our methodology using
benchmark methods.
3.3.1 Extant versus Proposed Data Transfer Paradigm
In this section, we first examine the extant data transfer paradigm and its asso-
ciated obstacles. This is illustrated in Figure 3.1. We then demonstrate how our
proposed data transfer approach may alleviate these obstacles.
Current approaches involving data transfer from a firm to researchers often
require the researchers to sign legally binding contracts such as non-disclosure
agreements (NDA) and data usage agreements (DUA) in order to access the data.
Once the researchers sign these contracts, the firm then sets up mechanisms to
transfer the data to the researchers. There are three broad decisions that the
data provider makes. The first is whether to provide the full data for all of its
customers or for a sub-sampled set of its customers. The second is whether to
provide the true data, i.e. its actual data, or “synthetic” data, such as data
generated using a synthetic data generation method (see, for example, Schneider
et al. (2018)). The third decision is the level of obfuscation or aggregation
applied to the data in order to protect privacy. Obfuscation options include top
coding, i.e. truncating at a certain percentile; rounding, i.e. rounding the
data to a certain digit; and swapping, i.e. randomly swapping sales data in a
certain set of observations. The data provider can also choose to aggregate data
at a certain level, for example at the product line level or the market level.
Inherent to the third decision is the firm’s attempt to trade off the accuracy of
the data shared with researchers against its need to protect the privacy of the
data shared. These data are then transferred to the researchers, who apply
research methods such as reduced form analysis, structural econometrics, or
machine learning methods, for results comprising a combination of inference,
prediction, and counterfactuals.
There are three major concerns with this approach. First, data privacy is a con-
cern: Once the data leave the confines of the firm, the firm has very little control
over the data protection. The data are vulnerable to hacking, which would cre-
ate a significant privacy breach for the firm. Second, there is the generalizability
issue since the transferred data are often much smaller compared to the firm’s en-
tire customer base. Third, the data transfer process is slow and time consuming,
and this increases the firm’s transaction costs each time the research methods are
trained on new data.
Our approach eliminates the need for any real customer data or synthetic data
to leave the firm. Instead, we propose transferring a generator to the researcher.
The generator is trained in an adversarial framework, and the discriminator sits
inside the firm’s walls.4 Note that the generator, which sits with the researchers,
never accesses the private data. Only the discriminator can access the private
data, and the generator is trained using the gradients of the discriminator’s loss
function. Thus the generator can generate data up to the size of the full population
of a given firm’s customers, and can be retrained using a semi-automated interface
(API) such that little or no manual intervention is needed.
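The division of labor described above can be illustrated with a minimal sketch. It is not the model used in this chapter: it assumes a one-dimensional sales variable, an affine generator, and a logistic discriminator, and the names FirmSide, discriminator_step, generator_gradient, and train_generator are hypothetical stand-ins for the firm's API. The only point is that the researcher-side loop updates the generator from gradients returned by the firm and never reads the private data.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

class FirmSide:
    """Hypothetical firm-side service: holds the private data and the discriminator."""

    def __init__(self, private_sales, lr=0.05, seed=0):
        self._data = np.asarray(private_sales, dtype=float)  # stays behind the firewall
        self._rng = np.random.default_rng(seed)
        # Assumed logistic discriminator D(x) = sigmoid(w * x + c)
        self.w, self.c, self.lr = 0.0, 0.0, lr

    def _d(self, x):
        return sigmoid(self.w * x + self.c)

    def discriminator_step(self, fake):
        """One SGD step on the loss -mean(log D(real)) - mean(log(1 - D(fake)))."""
        real = self._rng.choice(self._data, size=fake.size)
        d_real, d_fake = self._d(real), self._d(fake)
        grad_w = np.mean(-(1.0 - d_real) * real) + np.mean(d_fake * fake)
        grad_c = np.mean(-(1.0 - d_real)) + np.mean(d_fake)
        self.w -= self.lr * grad_w
        self.c -= self.lr * grad_c

    def generator_gradient(self, fake):
        """Gradient of the generator loss -mean(log D(fake)) w.r.t. each fake sample."""
        return -(1.0 - self._d(fake)) * self.w / fake.size

def train_generator(firm, steps=50, batch=256, lr=0.05, seed=1):
    """Researcher-side loop: synthetic samples go out, only gradients come back."""
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0  # assumed affine generator G(z) = a * z + b
    for _ in range(steps):
        z = rng.standard_normal(batch)
        fake = a * z + b
        firm.discriminator_step(fake)
        g = firm.generator_gradient(fake)  # d(loss)/d(fake), shape (batch,)
        a -= lr * np.sum(g * z)            # chain rule: d(fake)/da = z
        b -= lr * np.sum(g)                # chain rule: d(fake)/db = 1
    return a, b

# Simulated "private" sales, for illustration only
firm = FirmSide(np.random.default_rng(42).normal(3.0, 1.0, size=10_000))
a, b = train_generator(firm)
```

In this toy setting the generator's location parameter b drifts toward the mean of the private data even though train_generator only ever sees the gradient vector g. A linear discriminator can match little more than the mean, so a realistic implementation would use neural networks for both players.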
Of the existing approaches, an important one is that of Schneider et al. (2018).
We note that while the Schneider et al. (2018) approach has been demonstrated
on store-level point-of-sale data, it can conceptually be extended to
consumer-level data. However, a key difference from our approach is that
Schneider et al. (2018) requires prior knowledge of the data generating process,
which is embedded in the synthetic data generation process itself. We argue that
our approach is agnostic to the data generating process (the GAN model is not
explicitly trained using a specific inference model), and its only objective is
to “mimic” the data-generation process
of the true data.5
Thus, our approach allows us to tackle the three primary concerns of tradi-
tional data transfer approaches. First, our approach offers higher privacy pro-
tection because no customer data leaves the control of the firm. We empirically
4Note that the term “adversarial” comes from the name of the deep learning model: Genera-
tive Adversarial Networks. The “adversaries” in this context are the generator and discriminator
which “compete” with each other, i.e. the generator creates data in an attempt to fool the discrim-
inator into classifying it as real data, and the discriminator has to classify the true data as different
from the fake data.
5See section 3.3.3 for discussion of how we measure effectiveness in approximating the
data-generating process of the true data.
demonstrate that, should the generator on the researcher’s side be hacked, our ap-
proach’s privacy protection remains superior to that of the benchmark methods.
Second, the generalizability concern is potentially alleviated because the gener-
ator can generate data up to the size of the firm’s customer population. Third,
with new streaming customer data, the use of a semi-automated application pro-
gramming interface (API) significantly reduces the transaction costs for the firm,
as well as reducing the time needed to update the generator controlled by the
researchers.
Thus, our proposed approach has privacy advantages, reduced transactional
costs for both firms and researchers, and scalability advantages when compared
to traditional data transfer approaches. An important point to note is that the
firm can choose when and how the researcher gets the generator. There are two
possible approaches to training the generator in our paradigm, both of which will
lead to the same results. In the first approach, the firm trains both the generator
and the discriminator on its end, and hands over the generator to the researcher
once the generator is trained. In this situation, the researcher starts with a
pre-trained generator, and can update it as and when new data arrive with the firm.
In the second approach, the researcher starts with an uninitialized generator at its
end, and trains the generator from scratch by making API calls to the discriminator
residing inside the firm’s walls. In this situation, the researcher makes API calls
for each of the training iterations as it updates the generator parameters. Note
that the generator obtained after training in either of the approaches would be the
same, and the firm can choose whether it wishes to pass on a trained generator to
the researcher, or ask the researcher to train a generator from scratch.
3.3.2 Benchmark Methodology
In this section we describe our methodology for evaluating our proposed GANs
and benchmarks. We do so along the following six dimensions:
1. Data characteristics: To what extent do the probability distribution statistics
(e.g. probability density function, KL divergence) and other distributional
characteristics differ?
2. Information loss: To what extent do results differ from model-based
analyses, such as price elasticity coefficient estimates from regressions?
3. Privacy: How well does the proposed model protect customer privacy
compared to benchmark methods?
4. Volume: How well do the proposed model’s training speed and information
transfer size scale with the volume of data?
5. Velocity: How does continual estimation compare to restart estimation of
the model with the arrival of new customer data?
6. Generalizability to Marketing Problems:
• Optimal Price Markups: How high are the optimal profits as compared
to those obtained from true data?
• Customer Targeting: How accurate are the targeting models as compared
to those trained on the true data?
• Tackling Multiple Marketing Problems With One GAN: Can a single
GAN trained on the full firm data be used to generate synthetic data that
can solve multiple marketing problems?
Figure 3.1: Data Transfer Paradigm Comparison
Figure 3.2: Benchmark.
Figure 3.3: Proposed using GAN
We utilize methods commonly used in existing literature for data protection as
benchmarks against which to compare our proposed approach (Table 3.1), ranging
from aggregation (e.g., at the market level) to obfuscation (e.g., adding random
noise). Schneider et al. (2018) find data protection schemes to generally entail
a trade-off between accuracy and privacy; the goal of the seven benchmark
methods, which include using true data, is to show how the respective metrics
compare along these two dimensions. We modify these benchmark methods, which
Schneider et al. (2018) apply to store-level point-of-sales data, to utilize
household-level sales and pricing data, while preserving the panel structure of
the data set. Similar to Schneider et al. (2018), we protect only the sales
variables of the individual households, with brand prices being public and
observable in stores.
Table 3.1: Description of Benchmark Methods
No.  Benchmark Method              Description
1    “True” or unprotected data    Original household-level sales data without any protection.
2    Random noise                  Observations are binned into deciles based on sales, and random noise is added to the sales in each decile.
3    Rounding                      Sales are rounded to the nearest hundredth place.
4    Top coding                    Sales greater than the 95th percentile are truncated.
5    20% swapping                  20% of observations are divided into two groups and their sales data exchanged.
6    50% swapping                  50% of observations are divided into two groups and their sales data exchanged.
7    Market Level                  For each week, sales are summed and prices averaged across households to the market level.
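The protection schemes in Table 3.1 can be sketched in a few lines of NumPy. This is an illustrative reading of the table rather than the code used in the chapter: the noise scale, the number of decimals used for rounding, and the random pairing used for swapping are assumptions, and all function names are hypothetical.

```python
import numpy as np

def add_decile_noise(sales, rng, scale=0.1):
    """Bin observations into sales deciles and add noise within each decile.
    The noise scale (a fraction of the within-decile std) is an assumption."""
    out = sales.astype(float).copy()
    edges = np.percentile(sales, np.arange(10, 100, 10))
    bins = np.digitize(sales, edges)  # decile index 0..9
    for d in range(10):
        m = bins == d
        if m.any():
            out[m] += rng.normal(0.0, scale * sales[m].std(), size=m.sum())
    return out

def round_sales(sales, decimals=2):
    """Round sales; decimals=2 (hundredths) is an assumed reading of Table 3.1."""
    return np.round(sales, decimals)

def top_code(sales, pct=95):
    """Truncate sales above the given percentile."""
    return np.minimum(sales, np.percentile(sales, pct))

def swap(sales, frac, rng):
    """Pick a fraction of observations, split them into two groups, exchange sales."""
    out = sales.copy()
    k = int(frac * len(sales)) // 2 * 2  # even number of observations to swap
    idx = rng.choice(len(sales), size=k, replace=False)
    a, b = idx[: k // 2], idx[k // 2:]
    out[a], out[b] = sales[b], sales[a]
    return out

def market_level(week, sales, price):
    """Sum sales and average prices across households within each week."""
    weeks = np.unique(week)
    total = np.array([sales[week == w].sum() for w in weeks])
    mean_price = np.array([price[week == w].mean() for w in weeks])
    return weeks, total, mean_price

rng = np.random.default_rng(0)
sales = rng.gamma(2.0, 50.0, size=1000)  # simulated household sales
swapped_20 = swap(sales, 0.20, rng)
top_coded = top_code(sales)
```

Swapping leaves the marginal distribution of sales unchanged and only breaks the link between a household and its sales, whereas top coding and rounding alter the values themselves.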
3.3.3 Performance Metrics
Comparison of Data Characteristics
We utilize three measures commonly employed in the statistics and marketing
literature: KL divergence; Jensen-Shannon divergence; and the Kolmogorov-
Smirnov statistic. We do so in order to measure the distance between the real
data and the synthetic data generated from GANs and benchmarks.
The KL and Jensen-Shannon divergences provide, respectively, asymmetric
and symmetric distance measures of the distribution of the true data relative
to the synthetic data generated by a protection method. We also calculate the
Kolmogorov-Smirnov (KS) statistic as a quantitative estimate of the maximum
difference in two cumulative distribution functions. The KS statistic has an addi-
tional advantage that it exists regardless of the support of the two distributions
(Toubia and Netzer, 2016).
The Kullback-Leibler divergence (Kullback and Leibler, 1951) is a measure
of relative-entropy between two probability distributions, P and Q. For discrete
probability distributions, we have:
\[
D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{3.1}
\]
The KL divergence for distributions P and Q measures how much extra information
is needed to encode data from P when Q is used in its place; in Bayesian terms,
it is the information gained in moving from Q as the prior to P as the posterior
distribution. The closer the KL divergence to 0, the more “similar” the
distributions P and Q.6 In order to see its ties to maximum likelihood
estimation, we can write $D_{KL}(P \| Q) = LL(P, P) - LL(P, Q)$, where
$LL(P, Q) = E_P[\log Q]$ is the log-likelihood
6Note that the KL divergence is not symmetric, as the amount of information needed to go
from distribution P to Q need not be the same as the amount of information needed to go from
distribution Q to P, whereas the Jensen-Shannon divergence is a symmetric measure.
of observing the data from P given the parameters of the distribution Q (Eguchi
and Copas, 2006). Thus, minimizing the KL divergence DKL(P||Q) is equivalent
to obtaining the maximum likelihood estimates for the distribution Q.
The Jensen-Shannon divergence (Lin, 1991), a symmetric measure of the infor-
mation difference between two distributions, can be formulated in terms of the
KL divergence. In the information sciences literature, it has been used to measure
distances between distributions, and also provide the upper and lower bounds for
the Bayes probability of error7. The JSD for discrete distributions P and Q, with
average distribution A = 0.5(P+Q), is given by:
\[
JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| A) + \frac{1}{2} D_{KL}(Q \| A) \tag{3.2}
\]
Finally, we use the Kolmogorov-Smirnov test as a quantitative estimate of the
maximum difference in cumulative distribution functions and corresponding sig-
nificance levels. The KS-test for two samples, P and Q, is given by:
\[
KS(P, Q) = \max_{i} \left| C_{p,i} - C_{q,i} \right| \tag{3.3}
\]
where Cp is the cumulative distribution function associated with distribution
P.
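The three distance measures in Eqs. 3.1 to 3.3 can be computed for discrete distributions as in the following sketch. The function names are hypothetical, and the histogram helper assumes that the real and synthetic samples are binned on a common grid before comparison.

```python
import numpy as np

def kl_divergence(p, q):
    """Eq. 3.1: sum_i P(i) log(P(i) / Q(i)); terms with P(i) = 0 contribute zero."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Eq. 3.2: average KL divergence to the mixture A = 0.5 (P + Q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    a = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, a) + 0.5 * kl_divergence(q, a)

def ks_statistic(p, q):
    """Eq. 3.3: maximum absolute difference between the two discrete CDFs."""
    return float(np.max(np.abs(np.cumsum(p) - np.cumsum(q))))

def to_histogram(x, edges):
    """Bin a sample on shared edges and normalize to a probability vector."""
    counts, _ = np.histogram(x, bins=edges)
    return counts / counts.sum()

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
kl_pq, kl_qp = kl_divergence(p, q), kl_divergence(q, p)  # differ: KL is asymmetric
js_pq, js_qp = js_divergence(p, q), js_divergence(q, p)  # equal: JSD is symmetric
ks = ks_statistic(p, q)
```

Note that kl_divergence returns infinity whenever Q assigns zero probability to a bin where P is positive, whereas the Jensen-Shannon divergence and the KS statistic remain finite, which is consistent with the remark above on the support of the two distributions.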
7See discussion in Lin (1991) on the derivation of the upper and lower bounds for the Bayes
probability of error using the Jensen-Shannon divergence
Information Loss
To calculate information loss, we first define a commonly used inference frame-
work to estimate coefficients (β) from the true data. We then estimate the same
coefficients using our proposed approach and benchmarks, denoted β̂.
For a multiple regression framework with continuous independent variables of
prices P and dependent variables of sales S, we propose
the following log-log regression in a standard panel data setting with entity i,
brand j, and time period t:
\[
\ln S_{ijt} = \mu_j + \mu_{ij} + \beta_j \ln P_{ijt} + \sum_{k \neq j} \beta_k \ln P_{ikt} + \epsilon_{ijt}, \tag{3.4}
\]
where k indexes the other brands of interest, $\mu_j$ is the brand-specific
intercept term, $\mu_{ij}$ the household-level random effects term drawn from a
normal distribution $N(0, \sigma_\mu)$, and $\epsilon_{ijt}$ is the unobserved,
independent error term. This log-log regression framework has been used widely
in marketing and economics (Leeflang and Wittink, 2000), modeling continuous
dependent variables such as store sales,
worker wages, and customer demand.
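As a simplified illustration of the log-log specification, the sketch below simulates data for one focal brand and one competing brand and recovers the own-price and cross-price elasticities by pooled ordinary least squares. It deliberately omits the household-level random effects term, and the "true" coefficient values and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated log prices for the focal brand and one competing brand
ln_p_own = rng.normal(0.0, 0.3, n)
ln_p_other = rng.normal(0.0, 0.3, n)

# Hypothetical true parameters: intercept, own-price and cross-price elasticity
mu, beta_own, beta_cross = 1.0, -2.0, 0.5
ln_s = mu + beta_own * ln_p_own + beta_cross * ln_p_other + rng.normal(0.0, 0.1, n)

# Pooled OLS on [1, ln P_own, ln P_other]; no random effects in this sketch
X = np.column_stack([np.ones(n), ln_p_own, ln_p_other])
coef, *_ = np.linalg.lstsq(X, ln_s, rcond=None)
mu_hat, beta_own_hat, beta_cross_hat = coef
```

Because both sides are in logs, beta_own_hat is read directly as the percentage change in sales for a one percent change in own price.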
With the above inference model, we measure mean absolute percentage dif-
ference (MAPD, Christen et al. (1997)) as a measure of information loss. MAPD
provides an estimate of how good are the coefficient estimates from our proposed
approach and benchmarks as compared to those obtained from the true data, as it
quantifies the difference between the regression estimates. More formally, MAPD
for J number of coefficients of interest is given by:
\[
\mathrm{MAPD} = \frac{1}{J} \sum_{j=1}^{J} \left| \frac{\hat{\beta}_j - \beta_j}{\beta_j} \right| \times 100\%, \tag{3.5}
\]
where $\hat{\beta}_j$ is the estimated coefficient of interest on protected data,
$\beta_j$ is the estimated coefficient on real data, and J refers to the number of
relevant coefficients to be analyzed using a statistical modeling technique
(e.g., regression).8 The aforementioned metric is not bound to the specific
inference model defined above and can be applied more generally to estimates
from other reduced-form or structural
models.
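A short sketch of the MAPD calculation in Eq. 3.5 follows. The function name and the coefficient values are illustrative assumptions, not estimates from the chapter.

```python
import numpy as np

def mapd(beta_hat, beta_true):
    """Mean absolute percentage difference between protected-data and true-data coefficients."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    return float(np.mean(np.abs((beta_hat - beta_true) / beta_true)) * 100.0)

# Two own-price elasticities: each estimate is off by 10% of the true value,
# so the MAPD is approximately 10 (percent).
example = mapd(beta_hat=[-1.8, -1.1], beta_true=[-2.0, -1.0])
```

The metric is undefined when a true coefficient is exactly zero, so in practice it is applied to coefficients, such as own-price elasticities, that are expected to be bounded away from zero.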
Loss of Privacy
In the manner of Schneider et al. (2018), we employ maximum loss of privacy
(MLP) as the metric for data protection. In order to compute MLP for the data,
we first define the loss in privacy (LP) metric. Schneider et al. (2018) define the
LP metric as the “intruder’s” confidence in the data to identify an entity. Thus,
we employ the LP measure for a customer i (from n customers and across T time
periods) as follows:
8We use the brands’ own price elasticities as the coefficients of interest in the subsequent sec-
tions when MAPD is reported. We discuss the inference model and MAPD in detail in the Ap-
pendix 3.8.1.
√ ∑n
− 1
∑T
LPi = 1 + n [ P(Ŷit = IDi′ |S it, P 2it)] . (3.6)
′ Ti =1 t=1
where $P(\hat{Y}_{it} = ID_{i'} \mid S_{it}, P_{it})$ is the probability of identifying an
observation $Y_{it}$ as belonging to a customer $ID_{i'}$ given the observed sales $S_{it}$
and prices $P_{it}$, normalized by the probability for customer 1:
$P(\hat{Y}_{it} = ID_{1} \mid S_{it}, P_{it})$. Thus,
$\frac{1}{T} \sum_{t=1}^{T} P(\hat{Y}_{it} = ID_{i'} \mid S_{it}, P_{it})$ is the mean
probability (mean across all time periods) of identifying a customer i in the data.
We compute $P(\hat{Y}_{it} = ID_{i'} \mid S_{it}, P_{it})$ as follows:
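\[
\ln\left( \frac{P(\hat{Y}_{it} = ID_{i'} \mid S_{it}, P_{it})}{P(\hat{Y}_{it} = ID_{1} \mid S_{it}, P_{it})} \right)
= \sum_{j=1}^{J} a_{i'j} \ln S_{ijt} + \sum_{j=1}^{J} b_{i'j} \ln P_{ijt},
\quad i = 1, \ldots, n; \; i' = 2, \ldots, n; \; t = 1, \ldots, T \tag{3.7}
\]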
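The multinomial logit in Eq. 3.7 maps log-odds relative to customer 1 into identification probabilities. The sketch below shows only that mapping for a single observation; the coefficient arrays a and b are taken as given (in practice they would be estimated), and the function name and array shapes are assumptions made for the example.

```python
import numpy as np

def identification_probs(ln_s, ln_p, a, b):
    """Probabilities that one observation belongs to each of n customers.

    ln_s, ln_p : arrays of shape (J,), log sales and log prices for J brands
    a, b       : arrays of shape (n - 1, J), coefficients for customers 2..n
    Customer 1 is the baseline, so its log-odds are fixed at zero.
    """
    log_odds = a @ ln_s + b @ ln_p              # shape (n - 1,)
    logits = np.concatenate([[0.0], log_odds])  # prepend baseline customer 1
    logits = logits - logits.max()              # guard against overflow
    expv = np.exp(logits)
    return expv / expv.sum()

# Example with n = 4 customers and J = 2 brands (all values hypothetical)
ln_s = np.log(np.array([3.0, 1.5]))
ln_p = np.log(np.array([2.0, 2.5]))
a = np.array([[0.4, -0.2], [0.0, 0.0], [-0.3, 0.1]])
b = np.zeros((3, 2))
probs = identification_probs(ln_s, ln_p, a, b)
```

Averaging such probability vectors over the T time periods of a customer gives the mean identification probabilities referred to above.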
Note that Eqs. 3.6 and 3.7 for loss in privacy can be extended to include further
unprotected variables such as marketing variables, customer visit counts, and other
similar variables depending on the data context. With the LP metric in hand, we
define the MLP metric, which measures the maximum loss of privacy across
all customers in the data set; it serves as the measure of privacy for the least
privacy-protected customer in the data.9
9To account for out-of-sample fit, we calculate the above metrics using a leave-one-out cross-
validation procedure, as specified by Schneider et al. (2018). Furthermore, we use the MLP of the
true data as the upper-bound on the MLPs for all other methods.
\[
\mathrm{MLP} = \max\{LP_1, \ldots, LP_n\}. \tag{3.8}
\]
Trade-off between Information Loss and Privacy Protection
The risk-utility curve introduced by Duncan and Stokes (2004) describes the
fundamental trade-off between the risk of confidential data disclosure and the
utility of a data set for analysis. From this stream of literature, we know that
firms and regulators collect data with the underlying promise that the data will
be kept confidential. In order to honor this confidentiality pledge, firms need
to share data such that the risk of disclosure is minimized. De-identification,
i.e. removing identifiers such as names, addresses, and phone numbers from the
data, is not sufficient to reduce disclosure risks to acceptable levels. “Data
snoopers”, i.e. entities with authorized access to the data but goals of
uncovering individuals in the data, can link the data to other data sets that
have names and identifiers associated with them, and with such “linkage” the
data can be re-identified. They argue that masking strategies, such as data
coarsening, top-coding, and aggregating, reduce disclosure risk as the data
become less identifiable. However, the data utility, i.e. the quality of
inference from the masked data, also declines, since the perturbations, or
noise, added to the data affect the inference that can be drawn from the data.
This inherent trade-off between disclosure risk and data utility is the essence
of the stream of literature that looks at the accuracy-privacy trade-off.
Using this concept to quantify the trade-offs between the two measures of ac-
curacy and loss of privacy, we compare the performance of our generator against
those of the benchmark methods. Similar to Schneider et al. (2018), we plot var-
ious methods’ information loss (utility of data) against the loss of privacy (risk
of disclosure). We further explore how incorporating heterogeneity informs the
privacy trade-off.
Data Volume Scalability: Training Speed
In this section, we examine scalability in terms of training time as the number of rows
of data (N) to be protected grows. One challenge in comparing speed of training using
SGD is that the training algorithm can accommodate an arbitrary number of iterations.
We therefore run the training algorithm well past the number of iterations at which the
loss function becomes stationary on visual inspection. We then measure total run time
and run time per iteration to examine how run time scales with the volume of data.
Data Volume Scalability: Information Size
An additional benefit of using a GAN is that the size of information passed be-
tween parties in big data settings is significantly less when only a generative
model is being transferred, and not actual data. By incorporating the data gen-
erating process, the generator effectively serves as a data compression algorithm.
Size measured by information transfer is a function of GAN complexity measured
by number of neurons, as opposed to the size of the data set.
Data Velocity Scalability
We examine here how the “on-line” nature of SGD can be exploited to train the
GAN as new data stream into the provider. First we train a GAN to convergence,
subsequently referred to as the baseline model. Then, we explore how the SGD
responds to a single burst of new data by simulating a small new training dataset
from the data generating distribution. We then run two versions of the proposed
model. In the first, the new data are “streamed” into the baseline model, and in
the second, the training is “restarted” by retraining on the combination of new
and old training data. We then compare the point at which both training methods
regain the same level of information loss in the presence of new data.
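The streamed-versus-restarted comparison can be illustrated with a deliberately simple stand-in. The sketch below uses single-example SGD on a one-parameter linear model, not a GAN, and counts the updates each strategy needs before the loss falls below a tolerance. The data, learning rate, and tolerance are hypothetical.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def sgd_steps_to_tolerance(w, data, lr=0.05, tol=1e-6, max_steps=10_000):
    """Run single-example SGD until the MSE on `data` is at most `tol`.

    Returns (number of updates used, final weight).
    """
    for step in range(max_steps):
        if mse(w, data) <= tol:
            return step, w
        x, y = data[step % len(data)]
        w += lr * 2 * x * (y - w * x)  # descent step on (y - w * x) ** 2
    return max_steps, w


# Hypothetical noiseless data from y = 2x: an old batch and a new "burst".
old_data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
new_data = [(x, 2.0 * x) for x in (1.5, 2.5)]

# Baseline model trained to convergence on the old data.
_, w_base = sgd_steps_to_tolerance(0.0, old_data)
# "Streamed": continue from the baseline weight using only the new data.
streamed_steps, _ = sgd_steps_to_tolerance(w_base, new_data)
# "Restarted": retrain from scratch on the combined old and new data.
restart_steps, _ = sgd_steps_to_tolerance(0.0, old_data + new_data)
```

In this toy setting the warm-started model needs fewer updates because the new data come from the same data generating distribution. The experiment described above makes the analogous comparison on information loss for the GAN.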
Generalizability to Marketing Problems: Price Markups and Optimal Profits
Marketing managers are interested in estimating price markups for their products
in order to obtain optimal profits based on their customers' behavior. We now
discuss how we evaluate price markups and optimal profits from our proposed
approach and compare with those obtained from benchmark methods, following
the approach given in Schneider et al. (2018).
We use the Monte Carlo data and compute the price elasticities using Eq. 3.4.
We first estimate the price markups for the original data and for data from the
benchmark methods as:
$$P_i = \frac{C_i}{1 + \beta_i} \tag{3.9}$$
where $P_i$ is the price markup for brand $i$, and $C_i$ is the cost for brand $i$. As
the next step, we compute the optimal profit ratio using the following equation:
$$\frac{\Pi_i^{*}}{\Pi_i} = \frac{\beta_i^{*}\,(\beta_i + 1)}{\beta_i\,(\beta_i^{*} + 1)} \left( \frac{\beta_i + 1}{\beta_i^{*} + 1} \right)^{\beta_i} \tag{3.10}$$
where $\Pi_i^{*}$ is the profit obtained for brand $i$ using the price obtained from Eq.
3.9 with the price elasticity obtained using the benchmark methods, i.e. $\beta_i^{*}$,
and $\Pi_i$ is the profit obtained for brand $i$ using the price elasticity obtained from
the true data, i.e. $\beta_i$. We use the ratio of the optimal profits obtained from the
benchmark methods to those obtained from the true data for each of the brands, and report
the optimal profit ratio, i.e. $\Pi_i^{*} / \Pi_i$. This metric helps us estimate the
relative loss in optimal profit from using the benchmark methods, as opposed to the
profits obtained from the true data.
Generalizability to Marketing Problems: Customer Targeting
We now discuss an application of GANs to customer targeting models. We borrow
the purchase model from Park and Park (2016) and briefly discuss the setup. Park
and Park (2016) use click-stream data of an online retailer to predict purchase
based on online visits and marketing efforts, by modeling the visit behavior and
purchase behavior in their proposed model. As a proof of concept, we estimate
their purchase behavior model.10
Similar to our Monte Carlo data, we construct purchase behavior for 200 customers
and 52 weeks using the purchase probability model from Park and Park (2016). More
formally:
$$u_{ij} = b_{ij} + \beta_{v,ij} V_{ij} + \beta_{m,ij} M_{ij} + \beta_{pp}\,\delta_{pp,ij} + \epsilon_{ij} \tag{3.11}$$
$$P_{ij} = 1 \quad \text{if } u_{ij} > 0 \tag{3.12}$$
where $P_{ij}$ indicates whether customer $i$ makes a purchase in week $j$, which we
observe when the utility $u_{ij}$ is greater than zero. In the customer's utility
function, $V_{ij}$ is the log of the visits made by the customer to the store so far,
$M_{ij}$ is a dummy for whether the customer was marketed to in week $j$, and
$\delta_{pp,ij}$ is a dummy for whether the customer made a purchase in the week
preceding week $j$. $\epsilon_{ij}$ is the random error term drawn from a type I extreme
value distribution. We borrow the random coefficients from Park and Park (2016) and
discuss them in further detail in Appendix 3.8.3. We consider the data constructed in
this manner as the true data, with the purchase variable (whether a customer purchased
in the current week) as the private, protected data, and the other variables as the
public data.
Thus, with the true data and the benchmark methods, we estimate the following
targeting model and calculate the MAPD between the true data coefficients and the
benchmark data coefficients as a measure of targeting accuracy:
$$P(Y_{ij} = 1) = \frac{\exp\left(b_{ij} + \beta_{v,ij} V_{ij} + \beta_{m,ij} M_{ij} + \beta_{pp}\,\delta_{pp,ij}\right)}{1 + \exp\left(b_{ij} + \beta_{v,ij} V_{ij} + \beta_{m,ij} M_{ij} + \beta_{pp}\,\delta_{pp,ij}\right)} \tag{3.13}$$
10We model the visit probabilities of customers as draws from a random uniform distribution,
and give further details in Appendix 3.8.3.
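To make the targeting model concrete, the following sketch evaluates the logit purchase probability of Eq. 3.13 for one customer-week. The function name, coefficient values, and covariates are hypothetical placeholders, not the values borrowed from Park and Park (2016).

```python
import math


def purchase_probability(b, beta_v, beta_m, beta_pp,
                         log_visits, marketed, purchased_last_week):
    """Logit purchase probability in the form of Eq. 3.13."""
    v = (b + beta_v * log_visits + beta_m * marketed
         + beta_pp * purchased_last_week)
    return 1.0 / (1.0 + math.exp(-v))  # equals exp(v) / (1 + exp(v))


# Hypothetical coefficients and covariates for one customer-week.
p = purchase_probability(b=-1.0, beta_v=0.5, beta_m=0.8, beta_pp=0.3,
                         log_visits=math.log(4), marketed=1,
                         purchased_last_week=0)
```

Estimating these coefficients once on the true data and once on each protected data set yields the two coefficient vectors that the MAPD compares.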
Generalizability to Marketing Problems: Tackling Multiple Marketing Prob-
lems With One GAN
We now discuss a context to evaluate whether a single GAN can handle multiple
marketing problems. As a proof of concept, we construct Monte Carlo data for
customer purchases when the firms set prices and choose combinations of other
marketing instruments, namely product feature and product display.
In this setting, a customer in a given week observes publicly available prices
and marketing variables for the five brands and subsequently makes purchases
across the five brands. Consistent with our procedure before, we follow the log-
log model as the data generating process. The data generating process specifica-
tion is along the lines of Schneider et al. (2018), as they model purchase behavior
of consumers based on observed prices and marketing mix variables along the
lines of the market response model of SCAN*PRO. More formally:
$$\ln S_{ijt} = \mu_j + \mu_{ij} + \beta_j \ln P_{ijt} + \ln(\delta_{fj}) F_{ijt} + \ln(\delta_{dj}) D_{ijt} + \ln(\delta_{fdj}) FD_{ijt} + \epsilon_{ijt}, \tag{3.14}$$
where $S_{ijt}$ is the sales made to customer $i$ for brand $j$ in week $t$, $\mu_j$ is
the brand-specific random effect, $\mu_{ij}$ is the customer-brand random effect, and
$P_{ijt}$ is the price observed by customer $i$ for brand $j$ at time $t$. $F_{ijt}$,
$D_{ijt}$, and $FD_{ijt}$ are dummy variables for whether brand $j$ was featured,
displayed, and both featured and displayed to customer $i$ during time $t$. The price distribution
and coefficients are the same as those described in the Appendix 3.8.2 and the
Appendix 3.8.4. We consider the data constructed in this manner as the true data,
with sales variable (how much a customer purchased in a current week) as the
private, protected data, and the other variables as the public data.
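A single draw from the data generating process in Eq. 3.14 can be sketched as follows. The parameter values are hypothetical placeholders (the values used in the study are given in the Appendix), and the normally distributed error term is an assumption, since the error distribution is not stated in this section.

```python
import math
import random


def simulate_log_sales(mu_j, mu_ij, beta_j, price, feature, display, both,
                       delta_f, delta_d, delta_fd, sigma=0.1, rng=random):
    """Draw ln S_ijt from the log-log response model of Eq. 3.14.

    `feature`, `display`, and `both` are the 0/1 dummies F, D, and FD.
    """
    eps = rng.gauss(0.0, sigma)  # assumed normal error (not from the text)
    return (mu_j + mu_ij + beta_j * math.log(price)
            + math.log(delta_f) * feature
            + math.log(delta_d) * display
            + math.log(delta_fd) * both
            + eps)


# Hypothetical customer-brand-week: the brand is featured but not displayed.
rng = random.Random(0)
log_sales = simulate_log_sales(mu_j=1.0, mu_ij=0.2, beta_j=-2.0, price=5.0,
                               feature=1, display=0, both=0,
                               delta_f=1.5, delta_d=1.3, delta_fd=2.0,
                               rng=rng)
```

Repeating the draw over customers, brands, and weeks gives a panel in which sales are the private column and prices and marketing dummies are the public columns.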
Through this exercise, we aim to measure the effectiveness of GANs, as compared to
benchmarks, in capturing both price elasticities and the effects of marketing
variables of interest, such as brand feature and brand display. This also helps
us evaluate whether a single GAN can solve multiple marketing problems.
3.4 Proposed Model
3.4.1 Generative Adversarial Networks
In this section we describe the GAN method. The generator takes as input the draws
of a random variable z and public data x, and outputs generated data G(z|x; θg), where
θg are the generator's parameters, which are learned during the training process. The
discriminator takes as input both the real, private data y and the generated data
G(z|x; θg), and attempts to distinguish between the real and the generated data in a
binary classification task. The discriminator's parameters are θd, which are learned
during the training process. Following the design of Mirza and Osindero (2014),
Conditional Generative Adversarial Networks have the following objective function:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{y \sim p_{data}(y)}\left[\log D(y \mid x; \theta_d)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid x; \theta_g) \mid x; \theta_d)\right)\right] \tag{3.15}$$
The objective function has theoretical links to both Kullback-Leibler and
Jensen-Shannon Divergence (Goodfellow et al., 2014), and the underlying intu-
ition is that the training procedure minimizes the distance between the distribu-
tion of the real and distribution of the generated data. Goodfellow et al. (2014) also
provide theoretical guarantees that pg, i.e. generated data distribution, converges
to pd, i.e. the true data distribution.11
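For reference, the link to the Jensen-Shannon divergence can be written out following the unconditional derivation in Goodfellow et al. (2014). The conditioning on x is suppressed here for brevity, and the symbols D* (the optimal discriminator for a fixed generator) and JSD are notation introduced for this illustration.

```latex
D^{*}(y) = \frac{p_{data}(y)}{p_{data}(y) + p_{g}(y)}, \qquad
V(D^{*}, G) = -\log 4 + 2\,\mathrm{JSD}\left(p_{data} \,\|\, p_{g}\right)
```

Because the Jensen-Shannon divergence is non-negative and equals zero only when the two distributions coincide, the objective attains its minimum of -log 4 exactly when the generated data distribution matches the true data distribution.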
11Goodfellow et al. (2014) derive theoretical guarantees for convergence in sections 4.1 and 4.2 of
their paper, and argue that the generated data distribution converges to the true data distribution
when the discriminator is allowed to reach its optimum at each iteration. We rely on this theoretical
guarantee for convergence, and note that in our experiments we set the number of iterations to
100,000, as we found that the objective function stopped improving well before this number
of iterations.
GANs have been traditionally used in the computer vision literature, where the
generator learns the mapping, parameterized by θg, from random noise to the space of
real images, while the discriminator classifies images as real or fake in this min-max
game. GANs have been shown to be able to generate realistic images of faces
(Radford et al., 2015; Chen et al., 2016), and several other categories of images
such as home interiors, animals, and vehicles (Kim and Bengio, 2016; Wang and
Liu, 2016). In section 3.4.3, we discuss how we extend GANs to train on customer
level data.
3.4.2 Design of Neural Networks
We next discuss the design of the generator and discriminator neural networks.
We base the design of both generator and discriminator on the original conditional
GAN from Mirza and Osindero (2014). These are neural networks with only
“fully-connected” hidden layers, and are represented in Figures 3.4 and 3.5.
As proof of concept, we employ only one hidden layer for both neural networks. We
use ReLU activation in the discriminator and Leaky ReLU activation in the generator.
We also use batch normalization in the generator.12 From previous studies, we
understand that neural networks perform well in classification and data generation
tasks due to the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989):
given enough neurons in the hidden layers, a neural network can approximate any
continuous function arbitrarily well. Essentially, this property allows the generator
to represent the true DGP of the original unprotected data in a semi-parametric
fashion: the more neurons included in each hidden layer, the better able the network
is to capture the true DGP. As the number of parameters grows, however, the model
places a greater estimation “burden” on the data, which can potentially yield a
nonlinear relationship between the number of neurons and the accuracy of the
generator. We thus explore the effect of hyperparameters, such as the number of
neurons, on accuracy and privacy metrics in Appendix 3.8.6.
12We discuss the data flows in GANs and robustness to different network architectures in
Appendix 3.8.5 and 3.8.7 respectively.
Figure 3.4: Design of Generator Neural Network
  Inputs (Random Noise and Prices)
    Dimensions: N X (Z+K)
    N observations of Z-dimensional random noise and prices
  Hidden Layer (Fully Connected)
    Dimensions: (Z+K) X H
    Number of neurons: H
    Each neuron has (Z+K) weights and 1 bias term
    Activation: Leaky ReLU(x)
  Output (Generated Data)
    Dimensions: N X K
    N observations of K generated sales
Figure 3.5: Design of Discriminator Neural Network
  Inputs (Sales and Prices)
    Dimensions: N X (K+K)
    N observations of sales and prices
  Hidden Layer (Fully Connected)
    Dimensions: (K+K) X H
    Number of neurons: H
    Each neuron has (K+K) weights and 1 bias term
    Activation: ReLU(x): max(0,x)
  Output (Probability of Real Data)
    Dimensions: N X 1
    N probabilities: whether the data is from real data or from generated data
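The dimensions in Figures 3.4 and 3.5 can be traced with a small forward pass written in plain Python. This is only a sketch under stated assumptions: the sizes N, Z, K, and H are tiny hypothetical values, the Leaky ReLU slope of 0.2, the linear generator output, and the sigmoid discriminator output are assumptions, and batch normalization is omitted.

```python
import math
import random


def leaky_relu(v, slope=0.2):  # slope is an assumed value
    return v if v > 0 else slope * v


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(rows, weights, biases, activation=None):
    """Fully connected layer: rows is N x d_in, weights is d_in x d_out."""
    out = []
    for row in rows:
        z = [sum(x * w[j] for x, w in zip(row, weights)) + biases[j]
             for j in range(len(biases))]
        out.append([activation(v) for v in z] if activation else z)
    return out


def init(d_in, d_out, rng):
    """Small random weights and zero biases for one layer."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return w, [0.0] * d_out


rng = random.Random(0)
N, Z, K, H = 4, 3, 5, 8  # hypothetical sizes; the study uses H = 512

noise = [[rng.gauss(0.0, 1.0) for _ in range(Z)] for _ in range(N)]
prices = [[rng.uniform(3.0, 9.0) for _ in range(K)] for _ in range(N)]

# Generator: (noise, prices) -> hidden layer (Leaky ReLU) -> K generated sales.
g_w1, g_b1 = init(Z + K, H, rng)
g_w2, g_b2 = init(H, K, rng)
g_in = [z + p for z, p in zip(noise, prices)]  # N x (Z+K)
fake_sales = dense(dense(g_in, g_w1, g_b1, leaky_relu), g_w2, g_b2)

# Discriminator: (sales, prices) -> hidden layer (ReLU) -> probability of real.
d_w1, d_b1 = init(K + K, H, rng)
d_w2, d_b2 = init(H, 1, rng)
d_in = [s + p for s, p in zip(fake_sales, prices)]  # N x (K+K)
prob_real = dense(dense(d_in, d_w1, d_b1, relu), d_w2, d_b2, sigmoid)
```

Note that only the discriminator's input would ever contain the real sales; the generator sees noise and prices alone.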
3.4.3 The Picture-Data Analogy and Extension to Heterogeneity
We now describe our extension of GANs to train on customer data. Conditional
GANs were originally designed to mimic the data-generating process for pictures,
when given a particular vector that conditions on labels. Among the most well
known examples is that of the MNIST dataset of handwritten digits. Given a
vector denoting a number between zero and nine, it trains a generator to produce
a picture of the designated handwritten digit. In essence, it teaches the generator
to write. We see a direct parallel between the numerical matrices of which pictures
are composed and the panel-data format often used in marketing, economics, and
statistics research.
Figure 3.6: Picture Data Analogy
The connection between pictures and data is illustrated in Figure 3.6. Just as
in the realm of computer vision, the conditional GAN “conditions” on a label and
then generates a picture of a handwritten digit, the proposed GAN can condition
on a matrix of unprotected data columns X and generate a data matrix Y. To carry
this analogy further, we define what would constitute a “picture” in the panel data
setting. Figure 3.7 presents an example of a Nielsen Scanner Panel household data
set in which rows correspond to a household’s weekly observations and columns
to weekly sales and advertising spend per brand. We then need to protect the
column “sales”, i.e. treat it as the private data, and share the rest of the columns,
i.e. treat them as public data. Effectively, we treat each weekly observation as
equivalent to a “picture” in the machine vision context: the conditional GAN takes a
random noise matrix and the row-specific X's as inputs, and generates a “picture” of
the protected variable, sales (i.e., Y). The assumption in this analogy is that the
information in each row of data is sufficient to characterize the data generating
patterns of each week of sales. In this way, each row is treated as independently and
identically distributed across customers and weeks.
Figure 3.7: A “Picture” in Panel Data Context: Without Heterogeneity.
Figure 3.8: A “Picture” in Panel Data Context: With Heterogeneity.
Note that in the presence of considerable customer heterogeneity, such as K
types of customers, this type of picture data analogy becomes less effective at
capturing existing differences across customers. Heterogeneity implies that there
exist unique segment averages in the X and Y variables, for each type of customer,
that differentiate them from customers in the other segments. Practically speaking,
this implies that each block of customer data (of T rows), rather than each data row,
would be i.i.d. A simple way to capture this heterogeneity would be to treat as a
“picture” each block of customer data, as grouped by the time series structure (see
Figure 3.8).
We therefore define the two variants of our proposed models as (a) without
heterogeneity (No Het.), and (b) with heterogeneity (Het.), depending on how we
treat the picture equivalent in the data, i.e. either each customer-week as a picture,
or each block of customer data across time periods as a picture. In the generator
without heterogeneity - GAN (No Het.), we define each unit of the mini-batch
samples as random draws at the row level. In the generator with heterogeneity
- GAN (Het.), we define each unit of the mini-batch samples as a block of customers.
We use the Jensen-Shannon divergence, Kullback-Leibler divergence, and KS-test as
discussed in Section 3.3.3 to get quantitative estimates of how well GAN (Het.) and
GAN (No Het.) mimic the data generating process.
Accuracy and privacy results are compared and discussed in the following
section. By changing the way the GAN is trained, we hypothesize that the data
provider can trade off between accuracy and privacy with a single, easy switch.
Our analysis treats sales as the protected data, and the rest of the data as public
data. This approach is consistent with existing literature (for example, see Schnei-
der et al. (2018)).
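The difference between the two variants comes down to the sampling unit of a mini-batch, which the following sketch makes explicit. The panel layout (a dictionary from customer ID to that customer's weekly rows) and the function names are hypothetical choices for illustration.

```python
import random


def sample_no_het(panel, n_rows, rng):
    """GAN (No Het.): draw individual customer-week rows at random."""
    rows = [row for block in panel.values() for row in block]
    return rng.sample(rows, n_rows)


def sample_het(panel, n_customers, rng):
    """GAN (Het.): draw whole customers; each block keeps its T rows together."""
    ids = rng.sample(sorted(panel), n_customers)
    return [panel[c] for c in ids]


# Hypothetical panel: 6 customers x 4 weeks, each row = (customer, week).
panel = {c: [(c, t) for t in range(4)] for c in range(6)}
rng = random.Random(0)

row_batch = sample_no_het(panel, n_rows=8, rng=rng)        # mixes customers
block_batch = sample_het(panel, n_customers=2, rng=rng)    # one block per customer
```

Switching between the two samplers is the "single, easy switch" referred to above: row-level sampling discards the customer identity of each row, while block-level sampling preserves it.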
3.4.4 Training
We now discuss the training process for the conditional GAN. Recall from the last
sub-section that there are two possible approaches we can take for the data: Het.,
where we consider one customer’s data for all weeks as a picture equivalent, and
No Het., where we randomly sample across customers and weeks to generate a
picture equivalent. For the purposes of illustration, we use the notation for the
Het. case below.13
13We consider k=52 weeks as the duration; thus each customer has 52 weeks of purchase data,
which constitute one picture for training purposes. For the No Het. case, we randomly
sample 52 customer-weeks across the entire data as rows to construct a picture equivalent.
We estimate the parameters of the generator, θg, and the discriminator, θd, via
stochastic gradient descent (SGD) with momentum using the ADAM optimizer
(Kingma and Ba, 2014). Stochastic gradient descent updates the parameter θg (and
similarly θd) based on the loss function for the generator Jg(θg) (Jd(θd) for the
discriminator) for a mini-batch of the data of size n customers using the following
update procedure, in which the generator takes a descent step and the discriminator
an ascent step:
$$\theta_g \leftarrow \theta_g - \eta_g \nabla_{\theta} J_g(\theta_g), \qquad J_g(\theta_g) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 - D(G(z_i, p_i, \theta_g), \theta_d)\right) \tag{3.16}$$
$$\theta_d \leftarrow \theta_d + \eta_d \nabla_{\theta} J_d(\theta_d), \qquad J_d(\theta_d) = \frac{1}{n} \sum_{i=1}^{n} \left[ \log D(s_{r,i}, p_i, \theta_d) + \log\left(1 - D(G(z_i, p_i, \theta_g), \theta_d)\right) \right] \tag{3.17}$$
where ηg, ηd are the learning rates and n is the mini-batch size of the data
(number of customers) sampled in the iteration.14 sr,i are the “real” sales observed
for customer i in the true data, pi are the prices observed by the customer, and
G(zi, pi) are the “generated” sales for the customer, which we get from the generator
with random noise zi. Thus, the discriminator serves as a binary classifier: it
maximizes the objective function Jd(θd), thereby minimizing the probability of
incorrectly labelling the “generated” data as real and maximizing the probability of
correctly labelling the “real” data as real. The generator minimizes the objective
function Jg(θg), which is equivalent to maximizing the probability of fooling the
discriminator, i.e., generating data such that the “generated” data are more likely
to be labelled as real. We discuss in Appendix 3.8.5 the training process for
our GAN and the gradient flows used to update the parameters.
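The two mini-batch objectives can be computed directly from the discriminator's output probabilities. The sketch below assumes those probabilities are already available; the function names and the numbers are hypothetical.

```python
import math


def discriminator_objective(d_real, d_fake):
    """J_d: mean of log D(real) + log(1 - D(generated)) over the mini-batch."""
    n = len(d_real)
    return sum(math.log(r) + math.log(1.0 - f)
               for r, f in zip(d_real, d_fake)) / n


def generator_objective(d_fake):
    """J_g: mean of log(1 - D(generated)) over the mini-batch."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)


# Hypothetical discriminator outputs for a mini-batch of n = 3 customers.
d_real = [0.9, 0.8, 0.7]  # D(real sales, prices)
d_fake = [0.2, 0.4, 0.1]  # D(generated sales, prices)

j_d = discriminator_objective(d_real, d_fake)
j_g = generator_objective(d_fake)

# A generator that fools the discriminator more often attains a lower J_g.
j_g_better = generator_objective([0.6, 0.7, 0.5])
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, in which case J_d equals -log 4.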
This approach offers several advantages. First, because the GAN training
framework allows for the separation of generator and discriminator, the generator
needs only the loss function Jg(θg), and uses the gradient ∇θJg(θg) to update its
parameters. Note that the private, protected data on customer sales, sr,i, are
available only to the discriminator via its objective function Jd(θd). Second, open
source software like Tensorflow allows for scalable parallel computing on graphics or
tensor-processing units (Abadi et al., 2016). Third, the optimization is done in
mini-batches to update parameters, which allows for the scalability advantages of
“on-line” training. We explore in detail these scalability advantages provided by
stochastic gradient descent in our results.
14ADAM adapts the effective step size for each parameter during training, with ηg and ηd
serving as the base learning rates. We refer readers to Kingma and Ba (2014) for a detailed
description of the ADAM optimizer.
3.5 Empirical Context and Results
This section is organized in three parts. First, we demonstrate the effectiveness of
GANs on the accuracy and privacy protection metrics as compared to benchmark
methods using Monte Carlo data. Furthermore, we validate the accuracy-privacy
trade-off on real-world data. Second, using Monte Carlo data, we explore the
scalability advantages of GANs, i.e. how GANs handle volume and velocity of data.
Finally, as a proof of concept, we show the generalizability of GANs. That is, we
demonstrate that GANs can be used to tackle the marketing problems of setting
prices for optimal profits and customer targeting. Furthermore, we also demonstrate
that a single GAN can handle both contexts combined.
3.5.1 Accuracy - Privacy Trade-off
In this section we estimate how well GANs perform on the accuracy and privacy
metrics as compared to benchmark methods using Monte Carlo data, and subsequently
validate the results on real-world Nielsen data.
Monte Carlo Experiment
In this Monte Carlo experiment, we generate household-level customer data for
five representative brands using the data generating process specified in Section
3.3.3. The data context is thus similar to the real world Nielsen data. We take as
a starting point 200 customers over a span of 52 weeks for five brands’ sales and
prices. Note that brand prices are public data (i.e. accessible by both researcher
and the firm), whereas sales are private data (i.e. accessible only by the firm).
Table 3.2 reports summary statistics for the Monte Carlo data. We discuss further
details of the Monte Carlo data in the Appendix 3.8.2.
Distributional Accuracy
We first examine the distributional accuracy of the synthetic data generated by the
proposed GAN relative to the true data. Figure 3.9 compares data densities for
log-sales of all brands protected by various methods. For purposes of this analysis,
the 20% and 50% swap methods do not provide meaningful comparisons, as the
distribution of the variable of interest (sales), by construction, does not change
after merely swapping sales in the data.15
Across the five brands, we find that GANs fit the true data most closely, and that
the fit for GAN (Het.) is better than for GAN (No Het.). The next best fit is from
top-coding, which, by construction, differs from the true data beyond the truncation
point at the 95th percentile. We also find that random noise and rounding fluctuate
around the true data and have a poorer fit. This finding provides preliminary
evidence that the GAN most closely fits the true data.
15Nor would market level data be considered, as this analysis aggregates data across customers
at the weekly level and the distribution range would not be amenable to comparison using
methods that track individual customer level sales.
Figure 3.9: Distributional Accuracy for Synthetic Data
[Density plots of log sales, one panel per brand (Brand 1 to Brand 5). x-axis: log Sales;
y-axis: Density. Series: GAN (Het. H:512), GAN (No Het. H:512), Random, Rounding,
Top Coding, True Data.]
Table 3.2: Summary Statistics for Monte Carlo Data (N=10,400)
This table shows the summary statistics for the Monte Carlo data generated using the
process described in Appendix 3.8.2. The prices are public data, i.e. observed by
both the researcher and the firm, and the sales are private data, i.e. visible only to
the firm. Note that the generator in the GAN model has access to only the public data
(prices), and the private data (sales) never leaves the walls of the firm.
Variable           Mean    Std. Dev.    Min     Max
Price (Brand 1)    8.69    0.93         5.06    12.24
Price (Brand 2)    5.08    0.50         2.92    7.11
Price (Brand 3)    5.48    0.50         3.51    7.34
Price (Brand 4)    5.94    0.39         4.37    7.45
Price (Brand 5)    4.15    0.77         0.90    7.40
Sales (Brand 1)    0.07    0.09         0       1.40
Sales (Brand 2)    0.11    0.15         0       3.75
Sales (Brand 3)    0.05    0.09         0       4.38
Sales (Brand 4)    0.05    0.06         0       1.21
Sales (Brand 5)    0.12    0.19         0       4.87
In order to get quantitative and robust measures of the fit of the different
approaches to the true data, we estimate the distance between the true data
distribution and that of the data protected by various methods (proposed and
benchmark). In Table 3.3, we examine the corresponding distribution metrics, namely,
JSD, KL divergence, and the KS statistic. Examining the JSD metric, we observe the
lowest value for GAN (Het.): 0.0213. The rounding benchmark follows with a JSD of
0.0288, closely followed by GAN (No Het.) at 0.0307. This finding indicates that the
probability distributions for GANs and the true data are close. When we consider the
KL divergence metric, we find that GAN (Het.) also has the lowest value, 0.0231.
Thus, the GAN (Het.) probability distribution is closest to the true data
distribution. This conclusion also holds for the KS statistic, which is the maximum
absolute difference between the cumulative distribution functions of two
distributions: GAN (Het.) registers the lowest score, 0.0077. Note that GAN (No Het.)
also beats the best performing benchmark on the KS test, with a value of 0.0322 as
opposed to 0.05 for Top Coding. Thus, through these three different metrics, we find
that the GAN (Het.) distribution is closest to the true data across all measures of
statistical differences in distributions. This provides confirmatory empirical
evidence that GANs best mimic the true data.
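The three distribution metrics can be sketched with the standard library alone. The histogram binning, the small constant guarding against empty bins, and the sample values are assumptions for illustration; they are not the settings used to produce Table 3.3.

```python
import math
from bisect import bisect_right


def histogram(sample, edges):
    """Relative frequencies of `sample` over the bins defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in sample:
        k = min(max(bisect_right(edges, v) - 1, 0), len(counts) - 1)
        counts[k] += 1
    return [c / len(sample) for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
               for v in a + b)


# Hypothetical log-sales samples from the true and the protected data.
true_log_sales = [-3.1, -2.7, -2.4, -2.0, -1.6]
synthetic_log_sales = [-3.2, -2.8, -2.3, -2.1, -1.5]
edges = [-4.0, -3.0, -2.0, -1.0]

p = histogram(true_log_sales, edges)
q = histogram(synthetic_log_sales, edges)
jsd = js_divergence(p, q)
kl = kl_divergence(p, q)
ks = ks_statistic(true_log_sales, synthetic_log_sales)
```

All three metrics equal zero when the two samples are identical, and lower values indicate a protected distribution closer to the true one.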
Balance between Accuracy and Privacy
We use the information loss metrics to examine accuracy and maximum loss of
protection (MLP) for the benchmarks. Although the separability of the GAN provides
a first layer of protection, the MLP metric gives us quantitative estimates of the
loss in privacy in the event that the transferred generator is hacked. Employing the
case of a compromised generator, we investigate the likelihood that the generated
data can be traced back to the original IDs of customers.
Table 3.3: Distribution Metrics (Lower the Better)
Model            JSD       KL        KS
Random Noise     2.147     3.8847    0.1173
Rounding         0.0288    0.0274    0.1036
Top Coding       0.4718    0.8474    0.0500
GAN (No Het.)    0.0307    0.0419    0.0332
GAN (Het.)       0.0213    0.0231    0.0077
We find evidence consistent with those in the Nielsen data. Figure 3.10 shows
the results. Benchmark methods for random noise, rounding, and top-coding
have lower loss of information, MAPD, but higher loss in privacy protection com-
pared to other benchmark methods. The 20% swap has a much lower information
loss as compared to the 50% swap, which by construction has information loss,
MAPD, of approximately 50. 50% swap, however, has much better privacy protec-
tion than other individual customer level benchmark methods. The market-level
benchmark method offers the best privacy protection, MLP of 0, by construction,
but comes with high information loss of 56.
Figure 3.10: Performance on Monte Carlo Data.
[Scatter plot of information loss (MAPD, y-axis) against maximum loss of protection
(MLP, x-axis, 0.00 to 0.04). Points: Market, Swap 50, Swap 20, Random Noise, Rounding,
Top Coding, No Het. (H: 512), Het. (H: 512), and True Data. Legend: Benchmark,
GAN (Het.), GAN (No Het.), True Data.]
Ideally, we want to be at the bottom left of the MAPD-MLP plot, with low
information loss and low loss of privacy. We find that our proposed generators
show consistently lower information loss and superior privacy protection than all
the benchmark methods. Specifically, we find lower information loss, in terms of
MAPD for GAN (Het.) than for GAN (No Het.). GAN (Het.) with 512 neurons has
an MAPD of 1.2, which is a 4.6 times improvement in accuracy as compared to the
best benchmark method, which is top-coding, with an MAPD of 5.3. This finding
is consistent with JSD and KS Statistic measures obtained in the previous section.
We find, however, with lower information loss comes a trade-off regarding pri-
vacy protection. GAN (No Het.) has significantly superior privacy protection
than GAN (Het.), but with higher loss in information. Interestingly, we find that
GANs (No Het.) have MLP of 0.0035, which is closest to the market-level data as
compared to each of the other methods: The loss of information varies between
4.6 and 10, which is significantly superior to the information loss for 50% swap.
At the cost of potential privacy loss, GAN (Het.) has much lower information
loss than all other methods. Furthermore, despite this trade-off, we find that our
proposed generators occupy the bottom left of the MAPD-MLP plot, thus indicat-
ing that relative to the benchmark methods they offer a superior overall balance
between accuracy and privacy.16
Real Data Validation: Nielsen Data
We apply the proposed and benchmark methods for protecting a data set in a
“real-world” setting using the 2006 Nielsen Household Panel and Retail Scan-
ner data sets. Both have been studied extensively in the marketing literature and
are used by marketing practitioners. Although our method should be applicable
to any data transfer setting in downstream applications using any class of inference
models, these canonical data provide a natural proof of concept for examining
real-world performance related to information and privacy loss.
16We explore the relationship between model parameters: number of neurons, and the accuracy
of GANs in the Appendix 3.8.6. We also explore the robustness of model’s architecture such as
activation functions, batch normalization, and the noise distribution used to generate data in the
Appendix 3.8.7.
To demonstrate the applicability of our proposed method on a reasonably large
data set in a real world setting, our initial analysis uses the Nielsen data set to
construct a sample with at least ten thousand rows composed of data for two
hundred households across fifty weeks for the year 2006. We define variables
similar to those used by Hendel and Nevo (2006) and Schneider et al. (2018).
Following Hendel and Nevo (2006), we examine consumer purchases in the
liquid detergent category aggregated at the brand level for the leading brands, Tide,
Cheer, All, and Wisk, with the remainder combined into an Others brand. The unit of
observation is the household-week, and we observe purchases ($ amount) of each brand
by each household, and the prices ($ amount) observed during that week for each
of the brands. We consider prices as the publicly available data, and treat sales as
the private data to which only the data provider has access. We thus create a dataset
of 200 randomly sampled households that made at least ten purchases in the year
2006. We then estimate the private data, i.e. sales from benchmark methods and
from the data generated by our proposed GANs. In order to estimate accuracy, we
compute coefficients from the true data and benchmarks for Eq. 3.4, estimate the
MAPD metric using Eq. 3.5, and estimate the loss of privacy metric MLP using
Eq. 3.8. Figure 3.11 illustrates our examination of information and privacy loss
with the proposed and benchmark methods.17
17We modify Top Coding (99.9 percentile instead of 95) and Random Noise (centiles instead of
deciles) to increase the difficulty of the benchmark comparison, as the 95% and deciles have higher
information loss and we did not want the real-data benchmark to be easier than the Monte Carlo
data setting. Rounding is modified to the nearest dollar instead of nearest cent (hundredth place)
or nearest 10th cent (tenth place), since in the true data the sales are often ending in 9 cents (for
example $3.89 is rounded to $4.00)
Figure 3.11: Information Loss and Loss of Privacy on Real Data
[Scatter plot of information loss (MAPD, y-axis, 0 to 75) against maximum loss of
protection (MLP, x-axis, 0.0 to 1.5). Points: Market Level, Swap 50%, Swap 20%,
Random Noise, Rounding, Top Coding, GAN (Het.), GAN (No Het.), and True Data.
Legend: Benchmark, GAN (Het.), GAN (No Het.), True Data.]
We find that as compared to the best benchmark method, i.e., top-coding, with
MAPD of 11%, GAN (Het.) has an MAPD of 5.6%.18 GANs therefore double the
accuracy of the best benchmark method. GAN (No Het.), despite a higher MAPD
of 45% compared to 5.6% for GAN (Het.), has the lowest loss in privacy among
the non-aggregate benchmarks, with an MLP of 0.15 compared to 0.31 for 50%
swap. Overall, we find that our proposed generators consistently outperform the
benchmark data protection methods in terms of lower information loss and higher
privacy protection.
18Both GAN (Het.) and GAN (No Het.) have 512 neurons each. We discuss how the number of neurons affects accuracy in Appendix 3.8.6.
3.5.2 Scalability
In this section, we examine how GANs scale along two dimensions of big data, volume and velocity: how well GANs scale with the volume of data in terms of model estimation time and transferred information size, and how well they handle newly arriving (i.e., streaming) data. For the purposes of this section, we use the Monte Carlo data described earlier and summarized in Table 3.2.
Estimation Time
In this section, we discuss the relationship between the volume of data and the estimation time for GANs. We vary the size of the data from ∼1,000 rows (N), i.e., 10 customers (Nc) and 102 weeks (T) per customer, to ∼10 million rows, i.e., 100,000 customers and 102 weeks per customer. We report the total estimation time and the time per iteration in Table 3.4.19 We observe that training time per iteration increases only marginally with data volume, from 6.33 milliseconds per iteration with 1,000 rows of data to 7.55 milliseconds per iteration with ten million rows of data. Training time thus stays within the same order of magnitude even as we substantially increase the volume of data. This observation can be attributed to the SGD algorithm used to train the GAN: its parameters are updated in each iteration using only a sample of the data, because of which the proposed generator scales well with respect to the volume of data. However, we note that the training time may increase due to other factors, such as the SGD mini-batch sample size and GAN complexity, both of which are controlled by the researcher and can be adjusted according to the available computing equipment.
19We run the GANs with 512 neurons and a mini-batch size of 128 customers in TensorFlow 1.4 on a computer with the following configuration: Intel Core i9-9000X 10 Core 3.3 GHz, 64 GB RAM, and Titan Xp GPU (Pascal), for 100,000 iterations. We use this as the training stopping point because the RMSE between the real data and the synthetically generated samples stabilizes prior to this point, implying GAN convergence.
Table 3.4: Data Volume and Estimation Time
This table shows the estimation time for GANs based on the volume of the data. Nc is the number of customers in the data, T is the number of time periods in the data, and N is the volume of data, i.e., the number of rows in the data, which equals Nc × T. Estimation time is the total time in minutes needed to estimate a GAN model, and we also report the time per iteration (in milliseconds).
Rows of Data (N)    Nc         T      Estimation Time (min)    Time per Iteration (ms)
∼1,000              10         102    10.55                    6.33
∼10,000             100        102    10.65                    6.39
∼100,000            1,000      102    10.79                    6.47
∼1,000,000          10,000     102    11.03                    6.62
∼10,000,000         100,000    102    12.58                    7.55
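As a quick consistency check on Table 3.4, the total estimation time should equal the time per iteration multiplied by the 100,000 training iterations stated in the footnote. The following Python sketch reproduces the totals from the per-iteration times, up to rounding. The names `MS_PER_ITERATION`, `ITERATIONS`, and `total_minutes` are ours and purely illustrative; they are not part of the original implementation.

```python
# Per-iteration training times (ms) copied from Table 3.4, keyed by rows of data.
MS_PER_ITERATION = {
    1_000: 6.33,
    10_000: 6.39,
    100_000: 6.47,
    1_000_000: 6.62,
    10_000_000: 7.55,
}
ITERATIONS = 100_000  # number of training iterations stated in the footnote


def total_minutes(ms_per_iteration, iterations=ITERATIONS):
    """Convert a per-iteration time in milliseconds into total minutes."""
    return ms_per_iteration * iterations / 1000.0 / 60.0


if __name__ == "__main__":
    for rows, ms in MS_PER_ITERATION.items():
        print(f"{rows:>10,} rows: {total_minutes(ms):.2f} min")
```

Under this reading, the total time is driven by the fixed number of iterations rather than by the number of rows, which is why a 10,000-fold increase in rows changes the total by only about two minutes.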
Transferred Information Size
Table 3.5: Data Volume and Generator Compression
This table shows the transferred information size for GANs based on the volume of the data. Nc is the number of customers in the data, T is the number of time periods in the data, and N is the volume of data, i.e., the number of rows in the data, which equals Nc × T. Original File Size is the size of the true data, Generator Size is the size of the model checkpoints that are transferred to a researcher in order to instantiate a fully trained generator, and File Compression Factor is the ratio of Original File Size to Generator Size.
Rows of Data (N)    Nc         T      Original File Size    Generator Size    File Compression Factor (Original / Generator)
∼1,000              10         102    180 KB                7.79 MB           0.023
∼10,000             100        102    1.8 MB                7.79 MB           0.23
∼100,000            1,000      102    18 MB                 7.79 MB           2.28
∼1,000,000          10,000     102    185 MB                7.79 MB           23.75
∼10,000,000         100,000    102    1,719.21 MB           7.79 MB           220.67
We next examine the relationship between data volume and transferred file size. As expected, the original file size that would otherwise need to be transferred (using comparable benchmark methods) grows with the number of rows of data. When we transfer the generator instead, the size of the file grows only in proportion to GAN complexity. Table 3.5 shows the volume of information transferred. We observe that, whereas the generator file is large relative to the data (7.79 MB) in a small data setting of 1,000 rows, this approach excels with the transfer of larger amounts of data, on the order of 100,000 rows or more.20 Figure 3.12 shows an almost linear relationship between generator size and the number of neurons. This is due to the number of generator parameters increasing linearly with the number of neurons. We note that in settings closer to real-world situations (e.g., more than one million rows), GANs achieve a file compression factor of at least twenty-three, i.e., the information transferred with GANs is at least twenty-three times smaller than the original data file. These findings assure us that our proposed generator scales well for transferred information size with respect to data volume and GAN complexity.
Figure 3.12: Generator Size and GAN Complexity
[Figure: Model Size (MB), vertical axis, plotted against Number of Neurons (128, 256, 512, 768, 1024), horizontal axis.]
20The data size reported is the size of the checkpoint data that TensorFlow saves for the generator parameters. The generator uses 512 neurons.
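The file compression factor in Table 3.5 is simply the original file size divided by the generator size. A minimal Python sketch of that arithmetic follows, using the sizes reported in the table; the function names are hypothetical, and we assume 180 KB corresponds to 0.18 MB.

```python
GENERATOR_MB = 7.79  # generator checkpoint size reported in Table 3.5

# Original file sizes from Table 3.5 in MB, keyed by rows of data.
# Assumption: 180 KB is treated as 0.18 MB.
ORIGINAL_MB = {
    1_000: 0.18,
    10_000: 1.8,
    100_000: 18.0,
    1_000_000: 185.0,
    10_000_000: 1719.21,
}


def compression_factor(original_mb, generator_mb=GENERATOR_MB):
    """Ratio of original file size to generator size (Original / Generator)."""
    return original_mb / generator_mb


def generator_is_smaller(original_mb, generator_mb=GENERATOR_MB):
    """True when transferring the generator is cheaper than transferring the data."""
    return compression_factor(original_mb, generator_mb) > 1.0


if __name__ == "__main__":
    for rows, size in ORIGINAL_MB.items():
        print(f"{rows:>10,} rows: factor {compression_factor(size):.3f}")
```

Because the generator size is fixed for a given architecture, the factor exceeds one only once the original data file is larger than the generator checkpoint, which in Table 3.5 happens between 10,000 and 100,000 rows.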
Data Velocity Scalability
We examine information loss when the algorithm is trained on in-flowing data. Figure 3.13 compares the traditional “restart” method, in which the GAN is trained from scratch with each new inflow of data, with the “streaming” method, in which training continues from the previously learned parameter estimates. We find that with streaming, information loss stabilizes sooner, and the MAPD is lower over the first 50 thousand iterations than with restart, in which the GAN parameters are learned from scratch. This finding results from employing stochastic gradient descent as the training method: training of the GAN parameters simply continues, now with more data. More generally, the “online” nature of SGD can be exploited in GANs with continuously streaming data to update the GAN parameters as soon as new data arrive.
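The intuition behind streaming versus restart can be illustrated on a much simpler model than a GAN. The sketch below is an illustrative analogy, not the dissertation's implementation: it trains a linear regression with mini-batch SGD, then compares continuing from the previous parameters ("streaming") with reinitializing at zero ("restart") once new data arrive. All names, the coefficient values, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np


def sgd(X, y, w, steps, lr=0.05, batch=32, seed=0):
    """Run mini-batch SGD on squared error, starting from the parameters w."""
    rng = np.random.default_rng(seed)
    w = w.copy()
    for _ in range(steps):
        idx = rng.integers(0, len(X), size=batch)  # sample one mini-batch
        residual = X[idx] @ w - y[idx]
        w -= lr * 2.0 * X[idx].T @ residual / batch
    return w


def make_data(n, true_w, rng):
    """Simulate n rows from a linear model with small Gaussian noise."""
    X = rng.normal(size=(n, len(true_w)))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y


rng = np.random.default_rng(42)
true_w = np.array([-1.5, 0.5, 2.0])  # hypothetical "true" coefficients

X_old, y_old = make_data(500, true_w, rng)  # data available initially
X_new, y_new = make_data(500, true_w, rng)  # newly arriving data

# Train on the initial data until (approximate) convergence.
w_old = sgd(X_old, y_old, np.zeros(3), steps=500)

X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])

# Streaming: continue from w_old. Restart: begin again from zeros.
w_stream = sgd(X_all, y_all, w_old, steps=20, seed=1)
w_restart = sgd(X_all, y_all, np.zeros(3), steps=20, seed=1)

err_stream = np.linalg.norm(w_stream - true_w)
err_restart = np.linalg.norm(w_restart - true_w)
print(f"streaming error: {err_stream:.4f}, restart error: {err_restart:.4f}")
```

After the same small number of additional iterations, the warm-started parameters stay close to the target while the restarted ones are still converging, which mirrors the pattern described for Figure 3.13.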
Figure 3.13: Streaming Data and Information Loss
[Figure: Information Loss (MAPD), vertical axis, plotted against Number of Iterations (0 to 200,000), horizontal axis, for the baseline, restart, and streaming models.]
3.5.3 Generalizability to Marketing Problems
In this section, we demonstrate, as a proof of concept, how GANs generalize to the marketing contexts of setting prices for optimal profits and customer targeting, and we also demonstrate that a single GAN can tackle both problems of price setting and customer targeting. We do so using a series of Monte Carlo datasets. For the purpose of this analysis, we focus on GAN (Het.), since our prior results show that GAN (Het.) achieves higher accuracy than GAN (No Het.). Furthermore, GAN (Het.) performs better than the benchmarks on the accuracy-privacy trade-off.
Price Markups and Optimal Profit Ratio
We now discuss how GANs compare with benchmarks on setting price markups for optimal profits. We use the Monte Carlo dataset from before, as described in Table 3.2. Table 3.6 shows the markup percentages w.r.t. cost for each of the five brands. These markups are obtained using Eq. 3.9, which uses the price elasticities computed from the true data and from the data produced by each protection method. We find that, for each of the brands, the price markup estimated from GAN (Het.) is closer to the markup from the true data than the markup from any benchmark.
Table 3.6: Price Markups from Eq. 3.9 for True Data and Benchmarks
This table shows the price markups (as % of cost) for optimal profits obtained from Eq. 3.9 for the true data, GANs, and other benchmarks. We obtain these price markups for each of the five brands.
Method           Brand 1    Brand 2    Brand 3    Brand 4    Brand 5
True Data         229.35     226.72      98.15     162.68     110.32
GAN (Het.)        241.98     234.36      99.34     165.22     113.09
Random Noise      205.46    -558.21    -114.40      79.02    -233.61
Rounding          140.34     163.86     118.49     190.83     819.11
Swap 20            84.45     543.69      64.36      97.25     923.39
Swap 50          -332.17    -333.26     104.31    -106.79    -289.60
Top Coding        122.76     147.22      95.99     120.00     396.21
Next, we estimate optimal profit ratios using Eq. 3.10. Table 3.7 shows the ratio of the optimal profits obtained from each protection method to the optimal profits obtained from the true data. We find that the optimal profits obtained from GAN (Het.) are at least 99.96% of those obtained from the true data for each of the five brands, and that GAN (Het.) matches or outperforms every benchmark method on every brand. Among the benchmark methods, the closest is Top Coding, whose optimal profit ratio varies from 69.91% for brand 5 to 99.99% for brand 3; GAN (Het.) matches Top Coding for brand 3 and outperforms it for the other four brands.
Table 3.7: Optimal Profit Ratio from Eq. 3.10 for Benchmark Methods w.r.t. True Data
This table shows the optimal profit ratios (as % of profits obtained from the true data) using Eq. 3.10 for GANs and other benchmarks. We obtain these optimal profit ratios for each of the five brands.
Method           Brand 1    Brand 2    Brand 3    Brand 4    Brand 5
GAN (Het.)       99.96%     99.98%     99.99%     99.99%     99.99%
Random Noise     99.81%     31.41%      3.64%     90.22%      5.65%
Rounding         96.20%     98.34%     99.11%     99.52%     44.63%
Swap 20          84.65%     90.25%     95.64%     94.93%     40.99%
Swap 50          31.96%     31.41%     99.91%     16.97%      5.65%
Top Coding       93.85%     97.05%     99.99%     98.22%     69.91%
This finding suggests that managers can use GANs to make pricing decisions that lead to higher profits than benchmark approaches allow. Further, GANs fare better on the accuracy-privacy trade-off for this Monte Carlo data (see Figure 3.10). Thus, GANs can provide a suitable alternative to the true data for marketing managers who use customer sales data to compute price markups and optimize profits.
Customer Targeting
In order to estimate customer targeting accuracy for GANs and traditional benchmarks, we generate a Monte Carlo dataset using the process described in Section 3.3.3. The data comprise 200 customers and 52 weeks, for a total of 10,400 observations. For each customer-week, we observe whether the customer was marketed to (dummy variable: Marketing), whether the customer made a purchase in the previous week (dummy variable: Previous Purchase), and how many times the customer has visited the store so far: log(Visits So Far). The private data, and the outcome variable of interest, is whether the customer makes a purchase in the current week (dummy variable: Purchase). With this data, we estimate the MAPD between the coefficients of Eq. 3.13 estimated from the true data and from the benchmark data. Note that since the outcome variable is binary (whether or not a customer purchased), the benchmark methods of random noise, rounding, and top coding do not apply, as they are applicable only to continuous outcome variables. Thus, we benchmark GANs against the Swap 20 and Swap 50 methods, using the logistic regression described in Eq. 3.13 to model purchase behavior.
Table 3.8: Summary Statistics for Monte Carlo Data (N=10,400)
This table shows the summary statistics for the Monte Carlo data generated using the process described in Section 3.3.3. The following variables are public, i.e., observed by both the firm and the researcher: Marketing, Previous Purchase, and log(Visits So Far). The Purchase variable is private data, i.e., visible only to the firm. Note that the generator in the GAN model has access to only the public data, and the private data never leaves the walls of the firm.
Variable              Mean    Std. Dev.    Min    Max
Marketing             0.10    0.30         0      1
Previous Purchase     0.24    0.43         0      1
log(Visits So Far)    2.38    0.84         0      3.83
Purchase              0.24    0.43         0      1
Figure 3.14 shows the accuracy and privacy results for GANs and benchmark methods. We find that GANs achieve an MAPD of 0.1% in the customer targeting model, which is lower than that of the benchmarks: Swap 20, with an MAPD of 10.04%, and Swap 50, with an MAPD of 28.88%. Further, GANs achieve higher privacy protection than the benchmarks, with an MLP of 0.319 compared to 0.355 for both Swap 20 and Swap 50.
This finding suggests that marketing managers who need to build customer targeting models will obtain substantially higher accuracy with GANs than with the other benchmarks. Furthermore, GANs offer better privacy protection, alleviating the privacy concerns of data providers who share data. GANs can therefore provide a suitable alternative to both the true data and the benchmarks for marketing managers interested in building customer targeting models.
Figure 3.14: Performance for Customer Targeting
[Figure: Information Loss (MAPD), vertical axis, plotted against Maximum loss of protection (MLP), horizontal axis, for True Data, GAN (Het.), Swap 20, and Swap 50.]
Tackling Multiple Marketing Problems With One GAN
Since the purpose of a GAN is to generate privacy-protected synthetic data, we test whether data generated from GANs can be used to run a variety of inference models similar to those that are possible on the true data. Specifically, we test, as a proof of concept, whether a single GAN can handle the combined marketing problems of pricing and targeting. We generate a Monte Carlo dataset using the process described in Section 3.3.3. The data comprise 200 customers and 52 weeks across 5 brands, for a total of 10,400 observations.
For each customer-week, we observe the public data across the five brands: whether the brand was featured to the customer (dummy variable: Feature), whether the brand was displayed to the customer (dummy variable: Display), and the price: log(Price). The private data, and the outcome variable of interest, is how much of a certain brand the customer purchases during a week: log(Sales). Table 3.9 shows the summary statistics for the Monte Carlo data.
Table 3.9: Summary Statistics for Monte Carlo Data (N=10,400)
This table shows the summary statistics for the Monte Carlo data generated using the process described in Section 3.3.3. The data is for 200 customers for 52 weeks across 5 brands. The following variables are public for each brand, i.e., observed by both the firm and the researcher: Display, Feature, and Price. The Sales variable for each brand is private data, i.e., visible only to the firm. Note that the generator in the GAN model has access to only the public data, and the private data never leaves the walls of the firm.
Variable                Mean     Std. Dev.    Min      Max
Display (Brand 1)        0.25    0.43          0       1
Display (Brand 2)        0.15    0.36          0       1
Display (Brand 3)        0.05    0.21          0       1
Display (Brand 4)        0.16    0.36          0       1
Display (Brand 5)        0.10    0.30          0       1
Feature (Brand 1)        0.30    0.46          0       1
Feature (Brand 2)        0.20    0.40          0       1
Feature (Brand 3)        0.15    0.36          0       1
Feature (Brand 4)        0.25    0.43          0       1
Feature (Brand 5)        0.10    0.30          0       1
log (Price Brand 1)      1.08    0.17          0.13    1.60
log (Price Brand 2)      1.59    0.21          0.07    2.20
log (Price Brand 3)      2.07    0.13          1.42    2.47
log (Price Brand 4)      2.48    0.08          2.14    2.75
log (Price Brand 5)      2.30    0.10          1.80    2.62
log (Sales Brand 1)     -0.67    1.40         -5.67    5.14
log (Sales Brand 2)     -2.76    1.31         -7.24    4.11
log (Sales Brand 3)     -3.92    1.47         -7.98    5.57
log (Sales Brand 4)     -4.65    1.48         -9.44    1.88
log (Sales Brand 5)     -2.46    1.31         -6.79    4.69
We report the MAPD and MLP results from Eq. 3.14 in Figure 3.15.21 We find that GANs outperform the benchmark methods: GANs have an MAPD of 0.0139, i.e., a 1.39% difference in the price elasticities and in the coefficients for feature, display, and their interaction term. The only benchmark that comes close is rounding, with an MAPD of 0.0207, whereas the other benchmarks have an MAPD an order of magnitude higher, at about 0.2 or more. Furthermore, we find that GANs lie at the leftmost corner of the accuracy-privacy plot and provide higher privacy protection than the benchmarks. Thus, our empirical evidence suggests that GANs can indeed incorporate multiple marketing problems in a single model, and that this model outperforms the benchmarks on the accuracy-privacy trade-off.
The finding that GANs can tackle multiple marketing problems will be of
much interest to data providers as well as researchers. Data providers need to
train only one GAN model on their entire dataset which can subsequently be used
by researchers to draw multiple inferences such as pricing and customer targeting.
21Note that we do not consider the market-aggregated benchmark, since feature and display for a brand are observed at the customer-week level; aggregating them across multiple customers therefore yields a weak benchmark.
Figure 3.15: Performance for Tackling Multiple Problems
[Figure: Information Loss (MAPD), vertical axis, plotted against Maximum loss of protection (MLP), horizontal axis, for True Data, GAN (Het.), Rounding, Random Noise, Swap 20, Swap 50, and Top Coding.]
3.6 Discussion
In this paper we address the concerns of researchers who need access to firms’
sensitive customer data and present a novel approach that differs from traditional
data transfer approaches. We address the three concerns firms and researchers
have regarding data transfer: (i) our approach is effective in preserving the pri-
vacy of sensitive customer data with higher accuracy; (ii) our proposed genera-
tive model scales to big data; (iii) our proposed approach can be used to tackle
multiple marketing problems. Additionally, by using the picture-data analogy to
incorporate heterogeneity, we enable a given firm to control the trade-off between
privacy and accuracy.
Our proposed method, Generative Adversarial Networks, consists of two competing neural networks, a discriminator network and a generator network, the construction of which allows for separability. This decoupled nature has both privacy and scalability advantages. The privacy advantages derive from only the discriminator accessing the real data on the firm’s side, thereby ensuring that no real data leaves the walls of the firm. The scalability advantages derive from only the gradients of the loss function being passed from the discriminator to the generator. The researcher, with the generator neural network, can generate data that mimic the true data to a high degree of accuracy.
We test our generative models on four datasets, household scanner panel data from AC Nielsen and three Monte Carlo customer datasets, to validate the accuracy of our proposed generative model in comparison to benchmarks. We find that data generated from GANs have probability distributions closest to those of the true data and outperform benchmarks on the accuracy-privacy trade-off. We also evaluate GANs on the marketing problems of optimal price markups for profit maximization and customer targeting, and on the ability to tackle multiple marketing problems with a single GAN. We find that GANs outperform benchmarks on these marketing problems and also alleviate data providers’ logistical and computational overheads, as the data providers need to train only one GAN model that can tackle several marketing problems.
We also address the scalability concerns that are typical for big data. First, we find that our generator scales effectively with respect to data volume and velocity: the training time per iteration is of the same order of magnitude for data volumes ranging from one thousand rows to ten million rows. Second, we find that transferring the generator requires less information than transferring the true data when the data volume is on the order of hundreds of thousands of rows or more. Finally, we demonstrate that stochastic gradient descent (SGD) allows us to handle streaming data; i.e., because the generator training can be resumed without much loss in informational value, it scales effectively to new data.
An important limitation of our GAN model is that we currently do not model
consumer dynamics. This concern can be addressed by modifying the GANs to
incorporate attention, which can enable us to capture a possible source of hetero-
geneity.
In conclusion, we present a novel scalable approach as a proof of concept
for data transfer, which demonstrates improved privacy protection compared to
benchmark methods and can be used to solve several marketing problems. In
light of recent regulatory concerns over data privacy, our findings have significant
implications for firms, consumers, and regulators, as privacy protection becomes
increasingly important for marketers.
3.7 References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Abowd, J., Gittings, R. K., McKinney, K., Stephens, B., Vilhuber, L., and Woodcock, S. (2012). Dynamically consistent noise infusion and partially synthetic data as confidentiality protection measures for related time series.
Ansari, A. and Li, Y. (2018). Big data analytics. In Handbook of Marketing Analytics. Edward Elgar Publishing.
Burnap, A., Hauser, J. R., and Timoshenko, A. (2019). Design and evaluation of product aesthetics: A human-machine hybrid approach. Available at SSRN 3421771.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180.
Chintagunta, P., Hanssens, D. M., and Hauser, J. R. (2016). Editorial: Marketing science and big data. Marketing Science, 35(3):341–342.
Christen, M., Gupta, S., Porter, J. C., Staelin, R., and Wittink, D. R. (1997). Using market-level data to understand promotion effects in a nonlinear model. Journal of Marketing Research, pages 322–334.
Culotta, A. and Cutler, J. (2016). Mining brand perceptions from Twitter social networks. Marketing Science, 35(3):343–362.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.
Duncan, G. T. and Stokes, S. L. (2004). Disclosure risk vs. data utility: The R-U confidentiality map as applied to topcoding. Chance, 17(3):16–20.
Eguchi, S. and Copas, J. (2006). Interpreting Kullback–Leibler divergence with the Neyman–Pearson lemma. Journal of Multivariate Analysis, 97(9):2034–2040.
Goldfarb, A. and Tucker, C. (2011). Online display advertising: Targeting and obtrusiveness. Marketing Science, 30(3):389–404.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
Hendel, I. and Nevo, A. (2006). Sales and consumer inventory. The RAND Journal of Economics, 37(3):543–561.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.
Hu, J., Reiter, J. P., and Wang, Q. (2014). Disclosure risk evaluation for fully synthetic categorical data. In International Conference on Privacy in Statistical Databases, pages 185–199. Springer.
Kim, T. and Bengio, Y. (2016). Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.
Leeflang, P. S. and Wittink, D. R. (2000). Building models for marketing decisions: Past, present and future. International Journal of Research in Marketing, 17(2-3):105–126.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
Link, R. (1995). Are aggregate scanner data models biased? Journal of Advertising Research, 35(5):RC8–RC8.
Liu, X., Singh, P. V., and Srinivasan, K. (2016). A structured analysis of unstructured big data by leveraging cloud computing. Marketing Science, 35(3):363–388.
Malik, N. and Singh, P. V. (2019). Deep learning in computer vision: Methods, interpretation, causation and fairness. Interpretation, Causation and Fairness (May 28, 2019).
Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
Park, C. H. and Park, Y.-H. (2016). Investigating purchase conversion by uncovering online visit patterns. Marketing Science, 35(6):894–914.
Puranam, D., Narayan, V., and Kadiyali, V. (2017). The effect of calorie posting regulation on consumer opinion: A flexible latent Dirichlet allocation model with informative priors. Marketing Science, 36(5):726–746.
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Rafieian, O. and Yoganarasimhan, H. (2018). Targeting and privacy in mobile advertising.
Reiter, J. P. (2005). Estimating risks of identification disclosure in microdata. Journal of the American Statistical Association, 100(472):1103–1112.
Reiter, J. P. (2010). Multiple imputation for disclosure limitation: Future research challenges. Journal of Privacy and Confidentiality, 1(2).
Schneider, M. J. and Abowd, J. M. (2015). A new method for protecting interrelated time series with Bayesian prior distributions and synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(4):963–975.
Schneider, M. J., Jagpal, S., Gupta, S., Li, S., and Yu, Y. (2018). A flexible method for protecting marketing data: An application to point-of-sale data. Marketing Science.
Steenburgh, T. J., Ainslie, A., and Engebretson, P. H. (2003). Massively categorical variables: Revealing the information in zip codes. Marketing Science, 22(1):40–57.
Tenn, S. (2006). Avoiding aggregation bias in demand estimation: A multivariate promotional disaggregation approach. Quantitative Marketing and Economics, 4(4):383–405.
Timoshenko, A. and Hauser, J. R. (2019). Identifying customer needs from user-generated content. Marketing Science, 38(1):1–20.
Toubia, O. and Netzer, O. (2016). Idea generation, creativity, and prototypicality. Marketing Science, 36(1):1–20.
Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.
Wittink, D. R., Addona, M. J., Hawkes, W. J., and Porter, J. C. (1988). SCAN*PRO: The estimation, validation and use of promotional effects based on scanner data. Internal Paper, Cornell University.
3.8 Appendix
3.8.1 Inference Model and Information Loss
We now describe information loss and the inference model. As an example of a multiple regression framework with continuous independent and dependent variables, consider the following log-log regression in a standard panel data setting with entity i, brand j, and time period t:
$$\ln Y_{ijt} = \delta_j + \beta_j \ln X_{ijt} + \sum_{k \neq j} \beta_k \ln X_{ikt} + \varepsilon_{ijt}, \tag{3.18}$$
for k covariates of interest, where $\delta_j$ is the brand-level intercept. This log-log regression framework has been used widely in marketing and economics to model continuous dependent variables such as store sales, worker wages, and customer demand.
One context that falls under this framework is the SCAN*PRO market response model (Leeflang and Wittink, 2000), which has been used extensively by AC Nielsen and consumer goods manufacturers (Schneider et al., 2018). The SCAN*PRO model uses a log-log framework with own and competitor pricing as covariates and sales as the dependent variable.22 This corresponds to a data generating process similar to a nonlinear customer response function, such as the multiplicative specification proposed by Wittink et al. (1988) for store-level data. Therefore, we formally capture weekly customer sales for multiple brands:
$$S_{ijt} = \alpha_{ij} P_{ijt}^{\beta_j} \prod_{k \neq j} P_{ikt}^{\beta_k} e^{\varepsilon_{ijt}}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T, \tag{3.19}$$
which translates to a log-log regression model:
$$\ln S_{ijt} = \mu_j + \mu_{ij} + \beta_j \ln P_{ijt} + \sum_{k \neq j} \beta_k \ln P_{ikt} + \varepsilon_{ijt}, \tag{3.20}$$
where $S_{ijt}$ denotes the household's weekly purchase in dollars and $P_{ijt}$ the price in dollars for customer i, product brand j, and week t. $P_{ikt}$ are the prices observed by the customer for competitor brands; $\mu_j$ is the brand-specific intercept term and $\mu_{ij}$ the household-level random effects term drawn from a normal distribution $N(0, \sigma_\mu)$. For log sales models in which the independent variables are log prices, $\beta_j$, the price coefficient, is also the price elasticity; $\varepsilon_{ijt}$ is the unobserved, independent error term.
22We exclude promotion indicator variables in this proof of concept in order to focus on the method’s power to protect continuous independent and dependent variables. We will explore how to protect indicator variables in future research.
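To illustrate why the price coefficient in a log-log model such as Eq. 3.20 can be read as a price elasticity, the following Python sketch simulates a single brand and recovers the elasticity by ordinary least squares. This is a deliberately simplified illustration: it omits competitor prices and the household random effects, and the intercept, elasticity, price distribution, and noise level are all assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
true_elasticity = -1.5  # assumed own-price elasticity for the example

# Assumption: log-normally distributed prices, so prices are always positive.
log_price = rng.normal(1.0, 0.3, size=n)

# Simplified version of Eq. 3.20: intercept + elasticity * log(price) + error.
log_sales = 0.2 + true_elasticity * log_price + 0.3 * rng.normal(size=n)

# Ordinary least squares on [1, log(price)].
X = np.column_stack([np.ones(n), log_price])
coef, *_ = np.linalg.lstsq(X, log_sales, rcond=None)
estimated_elasticity = coef[1]

# Elasticity interpretation: relative change in predicted sales
# when the price rises by 1%.
sales_ratio = 1.01 ** estimated_elasticity
print(f"estimated elasticity: {estimated_elasticity:.3f}")
print(f"predicted sales change for a 1% price increase: {100 * (sales_ratio - 1):.2f}%")
```

The slope on log price is close to the assumed elasticity, and a 1% price increase changes predicted sales by roughly that percentage, which is the sense in which $\beta_j$ is an elasticity.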
Mean absolute percentage difference (MAPD), proposed by Christen et al. (1997) as a measure of the difference between the regression estimates from a market-level aggregation method and from the true data, allows for assessing the relative difference between a protection method and the true data. Essentially, MAPD measures the average absolute percentage difference of coefficient estimates across the J coefficients of interest:
$$\mathrm{MAPD} = \frac{1}{J} \sum_{j=1}^{J} \left| \frac{\hat{\beta}_j - \beta_j}{\beta_j} \right| \times 100\%, \tag{3.21}$$
where $\hat{\beta}_j$ is the estimated coefficient of interest on the protected data, $\beta_j$ is the estimated coefficient on the real data, and J refers to the number of relevant coefficients to be analyzed using a statistical modeling technique (e.g., regression). We use the brands’ own-price elasticities as the coefficients of interest in the subsequent sections when MAPD is reported.
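The MAPD in Eq. 3.21 is straightforward to compute once both sets of coefficients are available. The sketch below is a minimal Python version; the function name `mapd` and the example coefficient values are ours and are used only for illustration.

```python
import numpy as np


def mapd(beta_protected, beta_true):
    """Mean absolute percentage difference (Eq. 3.21), returned in percent.

    beta_protected: coefficients estimated on the protected data.
    beta_true: coefficients estimated on the real data (must be nonzero).
    """
    beta_protected = np.asarray(beta_protected, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    relative_diff = np.abs((beta_protected - beta_true) / beta_true)
    return float(np.mean(relative_diff) * 100.0)


if __name__ == "__main__":
    # Hypothetical elasticities: deviations of 10% and 5% from the true values.
    print(mapd([-1.65, -1.9], [-1.5, -2.0]))
```

In the hypothetical example, the two coefficients deviate from their true values by 10% and 5%, so the MAPD is their average; a protected dataset that reproduces the true coefficients exactly has an MAPD of zero.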
3.8.2 Monte Carlo Experiment 1 Data
In this section, we describe the Monte Carlo data generation process for experiment 1. We use this data to benchmark GANs and other data protection models. We use the following equation (from Section 3.8.1) to generate the sales data S for 5 brands (j), 200 customers (i), and 52 weeks (t):
$$\ln S_{ijt} = \mu_j + \delta_{ij} + \beta_j \ln P_{ijt} + \varepsilon_{ijt}$$
We choose the following parameter values: $\mu = (-0.1, -0.05, -0.09, -0.3, -0.1)$ is the inherent preference for each brand j, and $\delta_{ij} \sim N(0, 2)$ is the customer-specific random effect of customer i for brand j. For the price elasticities, we borrow the values from Christen et al. (1997): $\beta = (-1.5, -1.7, -2.01, -1.98, -1.9)$. We draw prices for brand j from a normal distribution, $P_j \sim N(\mu_{pj}, \sigma_{pj})$, where $\mu_p = (8.68, 5.09, 5.48, 5.94, 4.15)$ and $\sigma_p = (0.93, 0.5, 0.5, 0.39, 0.77)$.
With the above parameter settings, we generate the customer sales data that are subsequently used as the “true” data in the Monte Carlo experiment.
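The data generating process above can be sketched in a few lines of Python. The parameter vectors are taken from the text; everything else is an assumption made for illustration and labeled as such in the code: we treat the second argument of each normal distribution as a standard deviation, draw prices independently for every customer, brand, and week, and use a standard normal error term, none of which the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, n_brands, n_weeks = 200, 5, 52

# Parameter values stated in the text.
mu = np.array([-0.1, -0.05, -0.09, -0.3, -0.1])      # brand preference
beta = np.array([-1.5, -1.7, -2.01, -1.98, -1.9])    # price elasticities
mu_p = np.array([8.68, 5.09, 5.48, 5.94, 4.15])      # mean price per brand
sigma_p = np.array([0.93, 0.5, 0.5, 0.39, 0.77])     # price spread per brand

shape = (n_customers, n_brands, n_weeks)

# Assumption: the 2 in N(0, 2) is a standard deviation.
delta = rng.normal(0.0, 2.0, size=(n_customers, n_brands))

# Assumption: prices are drawn independently per customer, brand, and week.
price = rng.normal(mu_p[None, :, None], sigma_p[None, :, None], size=shape)

# Assumption: standard normal error term.
eps = rng.normal(0.0, 1.0, size=shape)

log_sales = (
    mu[None, :, None]
    + delta[:, :, None]
    + beta[None, :, None] * np.log(price)
    + eps
)
sales = np.exp(log_sales)  # "true" sales, indexed by (customer, brand, week)
print(sales.shape)
```

The resulting array holds one sales value per customer, brand, and week; in the setting described above, this array plays the role of the private data and the prices play the role of the public data.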
3.8.3 Monte Carlo Experiment 2 Data - Customer Targeting
We construct purchase behavior for 200 customers and 52 weeks using the pur-
chase probability model from Park and Park (2016). More formally:
u_{ij} = b_{ij} + \beta_{visit,ij} \, visits_{ij} + \beta_{marketing,ij} \, marketing_{ij} + \beta_{prevpurchase} \, \delta_{prevpurchase,ij} + \epsilon_{ij} \qquad (3.22)
P_{ij} = 1 \text{ if } u_{ij} > 0 \qquad (3.23)
We borrow the random coefficient values from Park and Park (2016):
b_ij ∼ N(−2.547, 0.804), β_visit,ij ∼ N(0.369, 0.406),
β_marketing,ij ∼ N(0.310, 0.522), and β_prevpurchase = −0.618. We consider the
data constructed in this manner as the true data.
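A rough sketch of equations (3.22) and (3.23) follows. Several pieces are assumptions, because this section does not specify them: j is treated as the week index, the random coefficients are drawn once per customer, the second argument of each N(·, ·) is read as a standard deviation, the error term is standard normal, and the `visits` and `marketing` covariates are placeholder draws. The function name is hypothetical.

```python
import random

BETA_PREV = -0.618  # coefficient on the previous-purchase indicator


def simulate_experiment2(n_customers=200, n_weeks=52, seed=0):
    """Return purchases[i][j] in {0, 1} for customer i and period j."""
    rng = random.Random(seed)
    purchases = []
    for _ in range(n_customers):
        # Random coefficients (assumed to be customer-level draws).
        b = rng.gauss(-2.547, 0.804)
        beta_visit = rng.gauss(0.369, 0.406)
        beta_mkt = rng.gauss(0.310, 0.522)
        prev, history = 0, []
        for _ in range(n_weeks):
            visits = rng.randint(0, 3)     # placeholder covariate
            marketing = rng.randint(0, 1)  # placeholder covariate
            u = (b + beta_visit * visits + beta_mkt * marketing
                 + BETA_PREV * prev + rng.gauss(0.0, 1.0))
            prev = 1 if u > 0 else 0       # purchase rule, equation (3.23)
            history.append(prev)
        purchases.append(history)
    return purchases


purchases = simulate_experiment2()
print(len(purchases), len(purchases[0]))
```

The previous-purchase indicator is carried forward inside the loop, so each week's utility depends on the simulated outcome of the week before.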
3.8.4 Monte Carlo Experiment 3 Data - Tackling Multiple Marketing Problems
The data generating process specification is along the lines of Schneider et al.
(2018). More formally:
\ln S_{ijt} = \mu_j + \mu_{ij} + \beta_j \ln P_{ijt} + \ln(\delta_{fj}) F_{ijt} + \ln(\delta_{dj}) D_{ijt} + \ln(\delta_{fdj}) FD_{ijt} + \epsilon_{ijt}, \qquad (3.24)
The price distribution and coefficients are the same as those described in
Appendix 3.8.2. The feature, display, and feature-and-display coefficients are
from Schneider et al. (2018). To simulate whether a brand was featured,
displayed, or both for a customer-brand-week, we draw randomly from a uniform
distribution, with the following thresholds for feature:
(0.3, 0.2, 0.15, 0.25, 0.1), and the following thresholds for display:
(0.25, 0.15, 0.05, 0.15, 0.1).23 Thus, with this data generating process, we
construct a dataset of 200 customers for 52 weeks and 5 brands, and compare the
MAPD of GANs against that of the benchmarks for the price coefficients and for
the marketing variable coefficients on feature, display, and both.
23We note that these thresholds are chosen to illustrate a proof of concept, and
that they should not influence the subsequent results.
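One way to read the threshold-based draw is sketched below. The exact mechanics are an assumption: a brand is taken to be featured (displayed) when an independent uniform draw on [0, 1) falls below its feature (display) threshold, and the three indicators F, D, and FD are coded as mutually exclusive (feature only, display only, both). The function name is hypothetical.

```python
import random

FEATURE_THRESHOLD = [0.3, 0.2, 0.15, 0.25, 0.1]
DISPLAY_THRESHOLD = [0.25, 0.15, 0.05, 0.15, 0.1]


def draw_promotions(n_customers=200, n_weeks=52, seed=0):
    """Return (i, j, t, F, D, FD) rows of promotion indicators."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_customers):
        for j in range(len(FEATURE_THRESHOLD)):
            for t in range(n_weeks):
                # Assumption: independent uniform draws for feature and display.
                featured = rng.random() < FEATURE_THRESHOLD[j]
                displayed = rng.random() < DISPLAY_THRESHOLD[j]
                f = int(featured and not displayed)   # feature only
                d = int(displayed and not featured)   # display only
                fd = int(featured and displayed)      # both
                rows.append((i, j, t, f, d, fd))
    return rows


promo_rows = draw_promotions()
print(len(promo_rows))
```

Under this reading, the share of featured weeks for each brand converges to its threshold as the number of draws grows.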
3.8.5 GAN Design and Training
Figure 3.16 shows the training process, including the flow of gradients used to
update the parameters. In each iteration, the generator samples a mini-batch of
size n, i.e., n image equivalents of customer data. Each image-equivalent input
i in the mini-batch consists of a k × m random noise matrix, where k is the
number of time periods and m is the dimension of the noise vector sampled for
each time period. The generator also observes the publicly available prices
across brands: p_ikb for brand b at time period k. The generator then outputs
“generated” sales data s^g for n customers. The discriminator, having access to
the data provider’s private data, is provided with a sample of n customers’
sales data: s^r. Thus, the discriminator receives mini-batch samples of both the
“real” data and the “generated” data. The discriminator, as a binary classifier,
then predicts labels for the “real” and “generated” data, and the gradients of
the misclassification losses, J_d(θ_d) for the discriminator and J_g(θ_g) for
the generator, are used to update the respective parameters.
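The shapes and losses in one training iteration can be sketched as follows, using only the standard library. This is an illustration, not the training code: it assumes the standard minimax GAN objective, takes the discriminator's output probabilities as given instead of implementing the networks, and uses hypothetical names and sizes (`generator_input`, m = 10, a constant price of 5.0).

```python
import math
import random


def generator_input(n, k, m, prices, rng):
    """Build n image-equivalent inputs: k rows per customer, each row holding
    m uniform noise values followed by that period's public brand prices."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(m)] + list(prices[i][t])
             for t in range(k)]
            for i in range(n)]


def discriminator_loss(d_real, d_fake):
    """Misclassification (binary cross-entropy) loss: real = 1, generated = 0.
    Inputs are discriminator probabilities strictly between 0 and 1."""
    n = len(d_real)
    return -sum(math.log(r) + math.log(1.0 - f)
                for r, f in zip(d_real, d_fake)) / n


def generator_loss(d_fake):
    """Minimax generator objective: mean of log(1 - D(G(z, p)))."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)


rng = random.Random(0)
n, k, m, b = 4, 52, 10, 5  # batch size, periods, noise dimension, brands
prices = [[[5.0] * b for _ in range(k)] for _ in range(n)]
batch = generator_input(n, k, m, prices, rng)
print(len(batch), len(batch[0]), len(batch[0][0]))
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))
```

Each input row has m + b entries, which is what lets the generator condition its output on the public prices while the private sales data stays on the discriminator side.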
Figure 3.16: GAN Training with Loss Function Gradients
This figure shows each iteration of the training process for the conditional GAN
with a training batch of n customers who chose from b brands. Note that the
actual, private data s^r is only accessible by the firm’s private discriminator.
z_ikm is the m-dimensional random noise draw for customer i at time period k,
p_ikb is the price for brand b at time period k that is observed by customer i,
and s^r_ikb and s^g_ikb are the “real” sales and “generated” sales,
respectively, for customer i at time period k for brand b. The data and gradient
flows highlighted in red are private, i.e., visible only to the data-providing
firm, and those in blue are public and visible to the researcher.
[Figure: flow diagram. The generator takes random noise z and public prices p
and outputs generated sales s^g. The discriminator receives the firm’s private
sales s^r with public prices p, and the generated sales s^g with public prices
p, and predicts labels (1 = real, 0 = generated); its loss function is the
misclassification cost. Annotated gradients, for i ∈ [1, n]:
Discriminator loss gradient: -\nabla_{\theta_d} \frac{1}{n} \sum_{i=1}^{n} [\log D(s^r_i, p_i, \theta_d) + \log(1 - D(G(z_i, p_i, \theta_g), \theta_d))]
Generator loss gradient: -\nabla_{\theta_g} \frac{1}{n} \sum_{i=1}^{n} \log(1 - D(G(z_i, p_i, \theta_g), \theta_d))]
3.8.6 Relationship between Hyperparameters and Accuracy
We consider in this section the relationship between the number of neurons and
the accuracy of the GAN’s replication of the original data distribution for the
Monte Carlo data.
The values reported in Figure 3.17 are the minimum MAPD values across fifty
seeds for each generator type, GAN (Het.) and GAN (No Het.), and each number of
neurons.24 We find that the best performance is for the GAN with heterogeneity
and 512 neurons, which has an information loss (MAPD) of 1.2, i.e., a total
difference of 1.2 percent between the regression coefficients estimated on the
true data and on the GAN-generated data. We also find a decreasing relationship
between the number of neurons and information loss.
Figure 3.17: Information Loss vs. Neuron Dimensionality.
[Figure: line plot of information loss (MAPD) against neuron dimensionality for
GAN (Het.) and GAN (No Het.).]
To test whether the MAPD values differ across the generators, GAN (Het.) and
GAN (No Het.), and across different numbers of neurons, we conduct two-tailed
t-tests for the difference. We first conduct a cross-generator-type t-test for
each number of neurons. We find that for all numbers of neurons, we reject the
null that GAN (Het.) and GAN (No Het.) have the same MAPD (p < 0.01). We next
consider whether, for a given generator type, GAN (Het.) or GAN (No Het.), the
MAPDs differ statistically across the different numbers of neurons. We find that
for both GAN (Het.) and GAN (No Het.), the MAPD differs significantly among 128,
256, and 512 neurons. However, the MAPD values beyond this point are not
significantly different from that at 512 neurons. This finding indicates
diminishing returns in accuracy for increasing GAN complexity. Furthermore, we
also find that the generator trained to incorporate heterogeneity, GAN (Het.),
has consistently lower information loss than the generator trained without
heterogeneity, GAN (No Het.).
24We follow a common convention from the computer science literature where the
neuron counts are multiples of 8, such as in increments of 128 or 256.
3.8.7 Model Architecture and Accuracy
In this section, we evaluate the accuracy of GANs with different model
architectures. We find that, compared to the baseline, the MAPD measures for the
other model variants are not substantially different: the MAPD ranges from
0.0056 to 0.0215, against a baseline model MAPD of 0.011. We use GANs with the
following model architecture as the baseline model:
1. Generator model:
(a) Activation function: Leaky ReLU
(b) Batch normalization
(c) Random noise distribution: Uniform
(d) Number of neurons: 512
2. Discriminator model:
(a) Activation function: ReLU
(b) Number of neurons: 512
3. Training procedure: With Heterogeneity
For the activation functions for both the generator model and discriminator
model in the GAN, we consider ReLU and Leaky ReLU activation functions. The
rectified linear unit activation function, also commonly known as ReLU activa-
tion, is defined as:
ReLU(x) = max(0, x)
Thus, ReLU introduces a non-linearity at x = 0. The leaky rectified linear unit
activation function, also commonly known as Leaky ReLU activation, is defined
as:
LeakyReLU(x) = max(x, 0.2x)
The Leaky ReLU activation potentially mitigates the sparse gradients problem of
ReLU, as the output of the activation is not set to zero when x is negative.
Thus, we explore both Leaky ReLU and ReLU as the activation functions. We also
introduce batch normalization in the intermediate layer of the generator model,
as batch normalization can potentially help alleviate mode collapse problems
during training (Xiang and Li, 2017). Finally, we explore both uniform noise and
Gaussian noise for the random noise that the generator model uses to generate
samples. The uniform noise is drawn randomly between -1 and 1, whereas the
Gaussian noise is drawn from a standard normal distribution.
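The two activations and the two noise distributions described above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and the gradient value assigned at x = 0 is a convention chosen here, not something the text specifies.

```python
import random


def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, slope=0.2):
    """LeakyReLU(x) = max(x, slope * x), with slope 0.2 as in the text."""
    return max(x, slope * x)


def relu_grad(x):
    """Gradient of ReLU; zero for non-positive inputs (the sparse-gradient case)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.2):
    """Gradient of Leaky ReLU; stays at `slope` for non-positive inputs."""
    return 1.0 if x > 0 else slope


def sample_noise(size, kind="uniform", rng=random):
    """Draw generator noise: uniform on [-1, 1] or standard normal."""
    if kind == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(size)]
    if kind == "gaussian":
        return [rng.gauss(0.0, 1.0) for _ in range(size)]
    raise ValueError("kind must be 'uniform' or 'gaussian'")


for x in (-1.0, 3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
print(sample_noise(3, "uniform", random.Random(0)))
```

For a negative input, ReLU passes back a zero gradient while Leaky ReLU passes back 0.2, which is the difference the sparse gradients argument relies on.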