MACHINE LEARNING METHODS FOR MARKETING MANAGERS AND POLICY MAKERS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Piyush Anand May 2021 © 2021 Piyush Anand ALL RIGHTS RESERVED MACHINE LEARNING METHODS FOR MARKETING MANAGERS AND POLICY MAKERS Piyush Anand, Ph.D. Cornell University 2021 This dissertation is a combination of three papers that use machine learning methods to investigate research questions of interest to marketing managers and regulators. In the first chapter, the authors investigate if firms that face crises have sys- tematically different corporate culture, and whether employee reviews provide early warnings. The authors examine these questions in the Wells Fargo con- sumer banking crisis context, and compare the culture discussions in online re- views posted by employees of Wells Fargo to those of other banks. They measure two important dimensions of corporate culture – control of employees and stabil- ity of processes, and focus on competition-based goals and outcomes. They find that sentiment of culture discussions on competition goals, and on rules and sta- bility at Wells Fargo is more negative than other banks, and this is visible as far back as ten quarters prior to the crisis reveal. These are the same causes of crisis identified in a definitive post-mortem report. Additionally, they identify another bank which could potentially be at crisis, and also find similar results for other consumer-harming crises at General Motors in 2014, Chipotle in 2015, and Mylan in 2016. In the second chapter, the authors use social media images data to estimate the impact of taxes on underage vaping. Various states in the US have enacted taxes to discourage E-cigarette usage, especially since underage usage has grown significantly. Data is difficult for underage consumption since it is illegal for them to purchase these products, and estimating tax impact has not been limited due to these constraints. The authors use publicly available user-posted images on social media from Jan 2016 - Dec 2018 to measure the impact of greater taxes on under- age posting behavior. These posts are a rough proxy for normalization, and poten- tially for consumption among underage population. Age and other demographics are detected using an ensemble of image analysis methods - Mask R-CNN (He et al., 2017) and Aggregated Residual Neural Networks (Xie et al., 2017). The authors also develop methods to estimate disguised posting of usage images, given their purported utilization by underage users. With the generalized synthetic controls (Xu, 2017), the authors find that only the states with higher taxes - Pennsylvania and California-saw a decline in underage e-cigarette posts. California’s decline is preceded by an increase in disguised posting, and Pennsylvania’s decline is ac- companied by increased engagement for the underage posts. The authors also estimate impact of taxes on posting by race and gender. In the third chapter, the authors examine generative adversarial networks (GANs) as a privacy protecting approach to customer data transfer. As cus- tomer privacy becomes increasingly important to marketers, the authors inves- tigate GANs ability to transfer a generative model, instead of data, thereby avoid- ing the process of sampling and anonymizing customer data for release for use in various analytic use cases. The authors find that GANs excel in preserving de- sired characteristics of original data and protecting privacy as compared to bench- marks. With real world data, the authors find that GANs achieve double the accu- racy as compared to the best benchmark. Additionally, they demonstrate GANs in different marketing contexts of pricing for optimal profits, and customer target- ing, and show that a individual GAN can handle multiple problems. Finally, they demonstrate volume and velocity advantages of GANs in handling larger data and real-time data streams. BIOGRAPHICAL SKETCH Piyush Anand’s research examines how machine learning methods can enrich managers’ and regulators’ understanding of consumer and firm behavior. His dissertation focuses on topics of corporate culture and consumer harm crises us- ing text analysis, health tax policy impact using image analysis, and customer data privacy protection using machine learning methods. Prior to joining the Ph.D. program at Cornell University, Piyush was a Category Manager at Ama- zon. He obtained a Post Graduate Diploma in Management from Indian Institute of Management Ahmedabad, and a Bachelor of Technology from Indian Institute Technology Guwahati. iii ACKNOWLEDGEMENTS My dissertation has benefited tremendously from the valuable feedback and sup- port of my advisor, Dr. Vrinda Kadiyali. Throughout the different stages of my PhD, she encouraged me to learn the different skills needed and to explore re- search topics of managerial and regulatory importance. I am grateful to her for her guidance during my doctoral studies. I thank my committee members, Dr. Bharath Hariharan, Dr. David Mimno, and Dr. Shawn Mankad for their feedback during my PhD. I am especially grate- ful to Dr. Bharath Hariharan for his suggestions for the second chapter of my dissertation. I thank Dr. Clarence Lee and Dr. Manoj Thomas for their insightful suggestions during the course of our co-authored projects. I would also like to ex- press my gratitude to the marketing faculty and colleagues for their helpful feed- back, the NLP and computer vision community at Cornell for the opportunity to present my work and for their suggestions. I am thankful for the recognition and financial supports of ISMS Doctoral Dissertation Award, Shankar-Spiegel Doc- toral Dissertation Proposal Award, Dyckman Research Grant, and Byron E. Grote, MS ’77, Ph.D. ’81 Johnson Professional Scholarship. I am indebted to my family for their support and encouragement - my parents, Mrs. Asha Anand and Dr. Manoj Anand, my wife, Aanchal Raisahib, and my siblings, Rahul Anand and Aarti Anand Ohri. I dedicate my dissertation to my late grandfather - Prabh Dyal Anand, from whom I learned the values of patience, hard work, and perseverance. iv TABLE OF CONTENTS Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Employee Reviews as Leading Indicators of Consumer Harm Crises: A Text Mining Approach 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4 Independent Directors of the Board of Wells Sales Practices Investi- gation Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5 Culture and Textual Analysis Model . . . . . . . . . . . . . . . . . . 14 1.5.1 A framework for measuring dimensions of culture . . . . . 14 1.5.2 Text measure of culture . . . . . . . . . . . . . . . . . . . . . 17 1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.6.1 Wells Employees’ Discussion of Corporate Culture . . . . . 23 1.6.2 Risk Assessment of Other Banks . . . . . . . . . . . . . . . . 33 1.6.3 Generalizability to Other Consumer Facing Crises . . . . . . 34 1.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 1.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 1.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.9.1 Wells List of Firms . . . . . . . . . . . . . . . . . . . . . . . . 50 1.9.2 Seed Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 1.9.3 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . 53 1.9.4 Word2Vec Hierarchical Softmax . . . . . . . . . . . . . . . . 53 1.9.5 List of Other Consumer Facing Crises Firms . . . . . . . . . 54 2 Smoke and Mirrors: Impact of E-Cigarette Taxes on Underage Social Me- dia Posting 56 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2 Electronic Cigarettes and State-Wide Taxes in the US . . . . . . . . . 59 2.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.5.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . 68 v 2.5.2 Detecting Demographics in Images . . . . . . . . . . . . . . 71 2.5.3 Detecting Disguising in Images . . . . . . . . . . . . . . . . . 80 2.5.4 Estimating the Effect of Tax Policies . . . . . . . . . . . . . . 90 2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 2.6.1 Impact of Taxes on Underage Vaping Posts . . . . . . . . . . 94 2.6.2 Impact of Taxes on Underage Disguising . . . . . . . . . . . 98 2.6.3 Impact of Taxes by Race and Gender . . . . . . . . . . . . . . 103 2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 2.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 2.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 2.9.1 Number of Social Media Posts by Users and Proportion of Posts Related to Vaping . . . . . . . . . . . . . . . . . . . . . 112 2.9.2 Detecting Disguising in Images . . . . . . . . . . . . . . . . . 113 2.9.3 Supplementary Analysis with Difference-in-Differences . . 114 2.9.4 Engagement Results . . . . . . . . . . . . . . . . . . . . . . . 123 2.9.5 Impact of Taxes by Race and Gender . . . . . . . . . . . . . . 139 2.9.6 Other Common Objects in Posts . . . . . . . . . . . . . . . . 152 3 Using Deep Learning to Overcome Privacy and Scalability Issues in Cus- tomer Data Transfer 155 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 3.2 Existing Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 3.3 Methodology: Extant Approach and Benchmarks . . . . . . . . . . 164 3.3.1 Extant versus Proposed Data Transfer Paradigm . . . . . . . 164 3.3.2 Benchmark Methodology . . . . . . . . . . . . . . . . . . . . 168 3.3.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 171 3.4 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 3.4.1 Generative Adversarial Networks . . . . . . . . . . . . . . . 184 3.4.2 Design of Neural Networks . . . . . . . . . . . . . . . . . . . 185 3.4.3 The Picture-Data Analogy and Extension to Heterogeneity . 187 3.4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 3.5 Empirical Context and Results . . . . . . . . . . . . . . . . . . . . . . 193 3.5.1 Accuracy - Privacy Trade-off . . . . . . . . . . . . . . . . . . 193 3.5.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 3.5.3 Generalizability to Marketing Problems . . . . . . . . . . . . 208 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 3.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 3.8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 vi 3.8.1 Inference Model and Information Loss . . . . . . . . . . . . . 224 3.8.2 Monte Carlo Experiment 1 Data . . . . . . . . . . . . . . . . 226 3.8.3 Monte Carlo Experiment 2 Data - Customer Targeting . . . . 227 3.8.4 Monte Carlo Experiment 3 Data - Tackling Multiple Market- ing Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 3.8.5 GAN Design and Training . . . . . . . . . . . . . . . . . . . . 229 3.8.6 Relationship between Hyperparameters and Accuracy . . . 229 3.8.7 Model Architecture and Accuracy . . . . . . . . . . . . . . . 233 vii LIST OF TABLES 1.1 Number of Employee Reviews (ranked by assets as of 03/31/2018) 10 1.2 Most Similar Words for Culture Topics in Pros and Cons Sections of Employee Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.3 Employee Reviews: Discussions of Culture Topics . . . . . . . . . . 26 1.4 Semi-Elasticities for Wells in Discussions of Culture . . . . . . . . . 28 1.5 Employee Reviews: Discussions of Culture Topics Over Time . . . 31 1.6 Semi-Elasticities for Wells in Discussions of Culture Over Time . . 32 1.7 Generalizability to Other Crises . . . . . . . . . . . . . . . . . . . . 37 1.8 Generalizability to Other Crises – Elasticities for Overall Culture . 40 2.1 Number of Posts by Hashtag . . . . . . . . . . . . . . . . . . . . . . 64 2.2 Information Observed for Each Post . . . . . . . . . . . . . . . . . . 67 2.3 Summary Statistics for State-Week Level Variables . . . . . . . . . . 93 2.4 State-Wide Effect of Taxes: Underage . . . . . . . . . . . . . . . . . 117 2.5 State-Wide Effect of Taxes: Other Demographics . . . . . . . . . . . 118 2.6 Robustness Results for Table 2.4: Model Specifications . . . . . . . 120 2.7 State-Wide Effect of Taxes: Disguising . . . . . . . . . . . . . . . . . 121 2.8 Robustness Results for Table 2.7: Model Specifications . . . . . . . 122 2.9 Disguising and Common Objects . . . . . . . . . . . . . . . . . . . . 153 2.10 Underage and Common Objects . . . . . . . . . . . . . . . . . . . . 153 2.11 Underage, Disguising, and Common Objects . . . . . . . . . . . . 154 3.1 Description of Benchmark Methods . . . . . . . . . . . . . . . . . . 171 3.2 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 196 3.3 Distribution Metrics (Lower the Better) . . . . . . . . . . . . . . . . 198 3.4 Data Volume and Estimation Time . . . . . . . . . . . . . . . . . . . 204 3.5 Data Volume and Generator Compression . . . . . . . . . . . . . . 205 3.6 Price Markups from Eq. 3.9 for True Data and Benchmarks . . . . . 209 3.7 Optimal Profit Ratio from Eq. 3.10 for Benchmark Methods w.r.t True Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210 3.8 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 212 3.9 Summary Statistics for Monte Carlo Data (N=10,400) . . . . . . . . 214 viii LIST OF FIGURES 1.1 Competing Values Framework . . . . . . . . . . . . . . . . . . . . . 15 1.2 Word2Vec Skip-gram Model . . . . . . . . . . . . . . . . . . . . . . 19 2.1 Number of Vaping Related Posts by State and Year in the US . . . 65 2.2 Building Blocks for CNNs . . . . . . . . . . . . . . . . . . . . . . . . 70 2.3 ResNet-18 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 74 2.4 Residual Block of ResNet . . . . . . . . . . . . . . . . . . . . . . . . 75 2.5 ResNeXt and ResNet Blocks . . . . . . . . . . . . . . . . . . . . . . . 77 2.6 Distribution of Age in Training Data . . . . . . . . . . . . . . . . . . 78 2.7 Example of Object Detection for Juul and Emoji . . . . . . . . . . . 81 2.8 Faster R-CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . 84 2.9 Mask R-CNN Architecture . . . . . . . . . . . . . . . . . . . . . . . 87 2.10 Example of Annotated Data for Disguising . . . . . . . . . . . . . . 88 2.11 Effect of Taxes on Underage Posts . . . . . . . . . . . . . . . . . . . 94 2.12 Effect of Taxes on Underage Disguising . . . . . . . . . . . . . . . . 99 2.13 Number of Posts and Users based on Proportion Vaping Posts Cutoff.112 2.14 Emoji List for Detection . . . . . . . . . . . . . . . . . . . . . . . . . 113 2.15 Effect of Taxes on Likes: Underage . . . . . . . . . . . . . . . . . . . 124 2.16 Effect of Taxes on Comments: Underage . . . . . . . . . . . . . . . 127 2.17 Effect of Taxes on Likes: Underage Disguised . . . . . . . . . . . . 130 2.18 Effect of Taxes on Comments: Underage Disguised . . . . . . . . . 133 2.19 Effect of Taxes on Solo Faces in Posts . . . . . . . . . . . . . . . . . 136 2.20 Effect of Taxes on Gender: Female . . . . . . . . . . . . . . . . . . . 140 2.21 Effect of Taxes on Race: Asian . . . . . . . . . . . . . . . . . . . . . 143 2.22 Effect of Taxes on Race: Black . . . . . . . . . . . . . . . . . . . . . . 146 2.23 Effect of Taxes on Race: White . . . . . . . . . . . . . . . . . . . . . 149 3.1 Data Transfer Paradigm Comparison . . . . . . . . . . . . . . . . . 169 3.2 Benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 3.3 Proposed using GAN . . . . . . . . . . . . . . . . . . . . . . . . . . 169 3.4 Design of Generator Neural Network . . . . . . . . . . . . . . . . . 186 3.5 Design of Discriminator Neural Network . . . . . . . . . . . . . . . 187 3.6 Picture Data Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . 188 3.7 A “Picture” in Panel Data Context: Without Heterogeneity. . . . . 189 3.8 A “Picture” in Panel Data Context: With Heterogeneity. . . . . . . 189 3.9 Distributional Accuracy for Synthetic Data . . . . . . . . . . . . . . 195 3.10 Performance on Monte Carlo Data. . . . . . . . . . . . . . . . . . . 199 ix 3.11 Information Loss and Loss of Privacy on Real Data . . . . . . . . . 202 3.12 Generator Size and GAN Complexity . . . . . . . . . . . . . . . . . 206 3.13 Streaming Data and Information Loss . . . . . . . . . . . . . . . . . 208 3.14 Performance for Customer Targeting . . . . . . . . . . . . . . . . . 213 3.15 Performance for Tackling Multiple Problems . . . . . . . . . . . . . 216 3.16 GAN Training with Loss Function Gradients . . . . . . . . . . . . . 230 3.17 Information Loss vs. Neuron Dimensionality. . . . . . . . . . . . . 231 x CHAPTER 1 EMPLOYEE REVIEWS AS LEADING INDICATORS OF CONSUMER HARM CRISES: A TEXT MINING APPROACH 1.1 Introduction Do firms that face crises have systematically different corporate culture? Can on- line reviews posted by employees in public forums, provide early warning of po- tentially dysfunctional culture? We examine these questions for corporate crises that cause consumer harm- we do this first in the context of the Wells Fargo con- sumer banking crisis. In this crisis, Wells Fargo (“Wells” hereon) created millions of unauthorized bank and credit card accounts without their customers’ knowl- edge (Arnold 2016; Ochs 2016). These revelations came to light in September 2016. Since then, much has been written in the press about how Wells’ corporate cul- ture was directly responsible for this crisis. According to the business press since, some aspects of this allegedly problematic culture included aggressive sales tar- gets, lack of oversight by the leadership of different divisions in the bank, and strong leaders in divisions who demanded loyalty1. Since September 2016, Wells has paid up to $3 billion in federal fines alone, and state fines on top of that, and has lost significant firm value. This chapter is: Anand, Piyush, Vrinda Kadiyali, and Vishal Narayan (2020). “Employee Reviews as Leading Indicators of Consumer Harm Crises: A Text Mining Approach.” 1Wells continues to struggle with finding fixes. For example, see Charles Scharf Puts Stamp on Wells Fargo With Overhaul of Reporting Lines, Wall Street Journal, Feb 11 2020 1 In the aftermath of the consumer banking division crisis, Wells conducted its own investigation in to what led to the crisis, knowing that regulators were closely watching its process and conclusions and hence having every incentive to report accurately. Resultantly, a definitive investigative report was produced by its inde- pendent board of directors on the improper sales practices (“the Report” hence- forth).2 Unfortunately, even as discussions of the Wells crisis and its extensive and extended aftermath on Wells’ profits have been discussed in the press, there has been some recent concern that similar crises might be brewing in other banks.3 Expectedly, this is worrying the regulatory authorities because of its impact on consumers and investors. We measure corporate culture at Wells before the crisis came to light in the press, to see whether this measured corporate culture is different from its peers, and whether it is concordant with the findings of the definitive post-mortem in- vestigation. We also want to understand if similar culture problems currently af- flict other banks. While corporate culture has been discussed in business press and in academic research, academic measurement has traditionally been via employee surveys, using firm 10K documents, etc. There are no public-use Wells employee surveys available from before and after the crisis, nor are there similar surveys available for other banks currently. Therefore, following Bhandari et al. (2017) and Corritore, Goldberg, and Srivastava (2019), we use textual data from anony- 2Independent Directors of the Board of Wells Fargo Company, Wells Fargo, April 10 2017 3See, for e.g., Economist, June 14, 2018, “Other American banks may have misbehaved as Wells Fargo did. Which ones?”. Recently, American Express was accused of using similar tactics in its small business division – “AmEx Staff Misled Small-Business Owners to Boost Card Sign-Ups”, Wall Street Journal, March 1 2020 2 mous employee reviews from a leading jobs website in the US.4 We are especially interested in estimating whether these employee reviews can provide any indica- tions of dysfunctional culture before crises become public.5 The presence of the Report provides the “ground truth” to see if employee reviews are valuable, and differentiates our work from other research in this area. We use insights from an organizational framework - competing values framework (CVF hereon; Cameron et al. 2014) - to guide our measurement of corporate culture. This framework categorizes corporate culture along two key dimensions - the extent of control of employees and stability of processes, and the extent to which the firm is organized to focus on internal structure or external market/ competition-oriented goals. By all accounts, Wells had a problematic culture before the crisis was reported in the media. Therefore, we cannot use causal inference methods to isolate which aspects of corporate culture caused the crisis. Instead, we compare Wells’ culture to that of its bank rivals who did not face a corporate crisis like Wells (or at least, did not have a crisis that became public until July 2018 when our data period ends). We restrict ourselves to large national commercial banks as reported by the Federal Reserve – banks with no foreign ownership and at least 100 branches in the US as of March 31, 2018.6 Data for smaller banks is sparser. We obtain 34,306 4The literature on employee voice discussed how Employees are often the “silent” stakehold- ers of organizations, despite employees being closest to the firm and having a significant effect on firm (Chang, Oh, and Park 2017; Huang et al. 2015; Moniz 2017; Moniz and Jong 2014). 5As the previous paragraph indicates, the corporate culture at Wells could well have been viewed positively by those rewarded for the achievement of aggressive sales targets, or even non- cheating employees who found attractive the more free-wheeling culture with few oversights. It is likely there were other employees with different views. In our measurement, we capture aggregate positive and negative sentiment across all employees. 6Source: Federal Reserve Statistical Release - Large Commercial Banks, release date: March 31, 2018 3 reviews for 32 large national commercial US banks before the Wells crisis becomes public, spanning the period May 2008 – Sep 2016.7 We then compare these measures of culture at Wells to those identified as the cause of crisis by the Report. We find that employees at Wells discuss fewer culture strengths and more culture weaknesses than employees at other large national commercial banks. Importantly, these culture drawbacks overlap with those identified as key causes of the crisis in the Report. Next, we analyze em- ployee reviews for the top 10 large national commercial banks in US to see how close culture measures are to those in the pre-crisis employee reviews of Wells. We find troubling similarity between pre-crisis corporate culture of Wells and today’s corporate culture in another bank.8 Additionally, we examine the generalizability of our results outside of the banking industry by measuring culture discussions in employee reviews for three other consumer-facing crises of General Motors in 2014 (faulty ignition), Chipotle in 2015 (food contamination), and Mylan in 2016 (exorbitant pricing of a pharma product), starting from ten quarters before these crises became public. This anal- ysis shows similar results and provides convergent evidence on the usefulness of employee reviews to identify corporate culture dysfunction in not just Wells and the banking industry, but in other industries too. The substantive contribution of our paper is to demonstrate across several 7See Table 1 for details on these banks 8We do not reveal the name of this bank in this paper 4 crises, that employee reviews can provide systematic signals of corporate crises before these crises become public. Previous studies in marketing have studied what happens to marketing outcomes like brand strength, consumer engagement, etc. after product crises (Borah and Tellis 2016; Van Heerde, Helsen, and Dekimpe 2007; Zhong and Schweidel 2020), financial impact (Chen, Ganesan, and Liu 2009) and the determinants of firm responses to such crises (Liu, Liu, and Luo 2016). To the best of our knowledge, we are the first to demonstrate the association of nega- tive corporate culture with the occurrence of crises i.e. as leading indicators before the crises are revealed in the press. Previous scholars in marketing have used CVF (Deshpande and Farley 2004; Lukas, Whitwell, and Heide 2013) to study firm cul- ture. Our methodological contribution is combining CVF and current text mining methods on a new source of publicly available data in marketing- employee re- views. Employee reviews are free from potential biases in employee survey data (e.g., demand effects, order effects, social desirability bias). Our results of identi- fying elements of problematic culture are likely to be useful both to managers and regulators to reduce consumer harm. Managers of firms facing consumer harm crises can use signals from reviews as a basis for investment in improvement of culture. Regulators can use these publicly available signals to identify which firms might be at risk of causing consumer harm, and accordingly direct their monitor- ing efforts. The rest of the paper proceeds as follows. In section 2, we discuss related literature. In section 3, we discuss the data. Section 4 details the text mining model. Section 5 presents results, and we conclude in section 6. 5 1.2 Literature Review Given the interdisciplinary nature of our research question, we discuss several areas of research that are related to our work. Consider first the definition of culture. Economists have defined culture as comprising shared language, shared knowledge and established rules of behav- ior in ways that are unique to a company (Cremer 1993). Graham et al. (2017) find in their survey of executives that executives characterize culture as a “beliefs system”, “coordination mechanism”, and “invisible hand”. According to O’Reilly (1989), an organizations scholar, culture is/ comprises control systems and norma- tive order. Therefore, culture can be efficiency improving but it can also become too ingrained in suboptimal ways. This definition accounts for both formal and informal culture-defining practices. We follow this definition. With this definition in mind, we first discuss the literature on the impact of corporate culture on various outcomes. O’Reilly, Chatman, and Caldwell (1991) find better organizational performance when there is a fit between personal and organizational values. Green et al. (2019) suggest good corporate culture is con- nected to better stock returns. Culture has also been found to be a driver of radical innovation (Tellis, Prabhu, and Chandy 2009), of merger success (Chang, Oh, and Park 2017), of financial reporting risk (Ji, Rozenbaum, and Welch 2017), etc. In Graham et al.’s (2017) survey, senior executives believe corporate culture is a top- three driver of firm value, and 92% believe better culture would lead to higher 6 value. Guiso, Sapienza, and Zingales (2015b) find that an increase in integrity is associated with increases in Tobin’s q and increase in profits. The papers above provide important background motivation. However, our paper differs from these papers on a number of dimensions. First, we are inter- ested in trying to isolate the link between (dysfunctional) corporate culture and (lower) corporate performance - specifically, the occurrence of the crisis. Our fo- cus on negative outcome- a major crisis- is also quite new in the literature. The closest paper to ours is Ji, Rozenbaum, and Welch (2017), who use employee rat- ings of culture and values on an employee reviews website, and find that lower ratings for culture and values is associated with reporting fraud. However, we differ from Ji, Rozenbaum, and Welch (2017) on five major points. First, we use CVF to measure the nature of discussions of culture from employee reviews, as opposed to the ratings on a 5-point scale. This approach allows us to examine culture in an organizational theory framework which is tied to agency theory and competition. Second, we exploit the demarcated nature of employee reviews to get separate measures for positive and negative discussions of culture. Third, our substantive interest is in understanding the role of corporate culture in consumer facing crises, and not in accounting frauds. Fourth, we do text analysis to extract semantic measures of discussions in text. Fifth, the existence of the Report allows us to validate our measures as being potentially causal. Expectedly, our work is also related to literature on firm crises. These papers focus on effects of firm crises. Van Heerde, Helsen, and Dekimpe (2007) found 7 a crisis firm faces reduced effectiveness for its marketing activities. Chen, Gane- san, and Liu (2009) find that a product-harm crisis leads to a decrease in brand equity, consumer preferences, firm’s reputation, and market share. Borah and Tellis (2016) study the impact of product recalls on online discussions and social media. Zhong and Schweidel (2020) use a Dirichlet process – hidden Markov model to capture change in discussions on social media for recent brand crises. As crisis events unfold, different stakeholders get affected negatively, and their responses vary from regulatory fines and penalties, decreased brand loyalty from consumers, lower employee morale, increased negative mentions in media, and lower financial performance (Liu, Liu, and Luo 2016; Pearson and Clair 1998; Wei, Ouyang, and Chen 2017; Zhao, Zhao, and Helsen 2011). Unlike these papers, we want to identify cultural leading indicators of crises rather than the effect of crises. Importantly, this allows us to identify another at-risk bank. Furthermore, we use CVF to infer correlates of crises, which can serve as leading indicators. Another relevant stream of literature is measuring corporate culture. Weber and Camerer (2003) and Burks and Krupa (2012) use laboratory experiments to “treat” culture and observe outcomes (difficulty after merger and ethics respec- tively). As mentioned earlier, Guiso, Sapienza, and Zingales (2015a, b) use “Great places to work” survey of employees and corporations’ own description of cul- ture to measure differences between differences between employees’ perception of culture and the company’s self-stated culture. Our approach of text mining employee reviews is close to Bhandari et al. (2017) who text-mined 10K state- ment of companies to extract culture traits using Latent Dirichlet Allocation. In 8 terms of text mining methods, we are closest to Timoshenko and Hauser (2019) who use Word2Vec (Mikolov et al. 2013) to capture semantic word embeddings that are subsequently used as features in a hybrid deep learning model to capture consumer needs from user generated content. We use Word2Vec in this paper to extract discussions of culture from employee reviews. 1.3 Data As discussed earlier, we collect publicly available employee reviews from a lead- ing online jobs and reviews website in the U.S. We collect employee reviews for Wells before the crisis became public. We collect all reviews from January 2008 to July 2018, for Wells and for 31 other banks. For comparability, and to ensure there are enough reviews, we confined ourselves to the large commercial banks with domestic ownership. As of March 2018, these banks had at least $300 Mil- lion in consolidated assets and at least 100 branches in the U.S.9 Table 1.1 lists the number of reviews by year for each of these banks. Of the 46,385 reviews, 34,306 were posted before September 7, 2016, the date on which the crisis at Wells was revealed (Arnold 2016; Ochs 2016). 7 banks (including Wells) account for 89.9% of reviews. Banks with greater assets elicit more reviews, so our analysis controls for assets. 9The financial details of these banks from the Federal Reserve are listed in the Appendix 1.9.1 9 10 Table 1.1: Number of Employee Reviews (ranked by assets as of 03/31/2018) This table shows the total number of reviews by bank and year (beginning May 1, 2008, ending on June 30, 2018) Bank 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 Grand Total JPMorgan Chase 192 164 317 399 618 852 1403 2020 1875 1279 599 9718 Bank of America 240 218 345 460 566 924 1389 2081 1918 1214 537 9892 Wells Fargo 224 169 352 463 720 891 1548 1999 1874 1366 550 10156 Citibank 145 119 173 158 197 322 445 621 722 571 253 3726 US Bank 66 54 73 101 148 229 375 480 510 369 172 2577 PNC Bank 13 14 71 67 122 194 300 467 464 364 160 2236 Capital One 74 74 84 90 171 243 372 686 687 574 340 3395 KeyBank 17 9 29 30 49 61 64 111 116 118 63 667 Citizens Bank 9 11 23 36 39 53 125 217 211 178 81 983 Huntington bank 5 6 9 12 18 33 69 112 115 97 49 525 Zions Bank 4 2 14 11 15 22 31 64 54 44 21 282 Peoples United Bank 2 3 3 4 9 20 22 27 37 24 15 166 First Tennessee Bank 2 7 6 8 20 12 17 22 29 27 25 175 BOK Bank 2 1 1 2 2 6 9 12 20 18 25 98 First National Bank of Pennsylvania 2 1 1 13 6 16 9 6 54 Associated Bank 2 3 2 7 9 13 15 30 16 35 8 140 Sterling National Bank 1 2 1 4 10 13 11 5 47 Valley National Bank 6 1 7 5 11 20 27 15 11 103 Webster Bank 1 5 6 9 12 14 10 5 62 TCF National Bank 4 3 14 20 22 45 63 74 74 43 20 382 MB Financial 2 1 3 6 5 9 9 13 12 12 9 81 First National Bank of Omaha 1 3 7 8 5 23 21 25 20 14 127 Old National Bank 1 1 1 3 1 5 3 6 15 3 3 42 Washington Federal Bank 2 4 6 29 22 35 21 8 127 Trustmark National Bank 6 3 6 11 11 5 2 44 Fulton Bank 1 1 4 2 5 11 18 9 1 52 Centerstate Bank 1 2 1 3 2 9 NBT Bank 1 2 1 4 5 10 3 26 Park National Bank 1 1 4 2 5 2 15 Woodforest National Bank 3 6 6 9 16 32 37 78 78 67 29 361 DBA First Convenience Bank 1 2 10 9 17 25 34 12 7 117 Grand Total 1009 867 1537 1904 2798 4005 6418 9270 9034 6523 3020 46385 For each review, we observe the date when the review was posted, whether the employee posting the review was currently working at the bank at the time of posting the review, and whether they disclosed their job title in the review or not, i.e. anonymously. In our data, we find that 59% of the reviews are posted by employees who were working at the firm at the time of posting the review, i.e., current employees, while 41% of the reviews are posted by employees who no longer worked at the firm, i.e., former employees. We also observe whether reviewers disclose their job title- 14% of the reviews do not disclose the job title of the employee posting the review, i.e., anonymously, while 86% of the reviews disclose the employees’ job title. There can be differences in posting content for anonymous and non-anonymous reviews, as well as current and former employ- ees, so we control for these variables. In the text-based portion of a review, the website has three prompts – pros, cons, and advice to management. Following Lee and Bradlow (2011), we use this demarcation to label as positive the sentiment of topics extracted from pros, and as negative the sentiment of topics extracted from cons. The mean length of employee reviews is 50.80 words (SD = 33.21). The pros section has a mean of 17.33 words (SD = 13.57), and the cons section has a mean of 20.15 words (SD = 16.96). Since the depth of discussion of any topics and number of topics is likely affected by the length of the review, we control for the length of the reviews in our analysis. Finally, research on employee reviews from this website (Corritore, Goldberg, 11 and Srivastava 2019; Marinescu and Posner 2019) has reported evidence against any systematic selection biases due to non-random sampling of employees who post reviews on this website. Another bias could arise if Wells successfully pres- sured employees to post positive reviews. Our finding that employees of Wells post fewer positive reviews than those of comparable firms suggests that any such pressure was perhaps ineffective. 1.4 Independent Directors of the Board of Wells Sales Practices Investigation Report In the aftermath of the Wells crisis, the independent directors of the board of Wells conducted a sales practices investigation. The findings of the investigation were made public on April 10, 2017 in a 110-page long report that lists the root-causes for the improper sales practices at Wells. An excerpt from the report summary is as follows: ”The root cause of sales practice failures was the distortion of the Community Bank’s sales culture and performance management sys- tem, which, when combined with aggressive sales management, cre- ated pressure on employees to sell unwanted or unneeded products to customers and, in some cases, to open unauthorized accounts.” 12 The report discusses the sales practices and culture at Wells, and how it led to the crisis. The report highlights two aspects of corporate culture that led to these poor sales practices. First, there was a high weight placed on performance as measured by sales, since these were seen as directly linked to higher market share and higher profits. Second, the decentralized structure of the bank with lack of formal procedures was a major reason why the sales practices went unnoticed. Even when there were reported incidents of improper sales practices, they were considered as independent incidents and were not seen as manifestation of the same underlying issue. A more recent report was released by Wells on Jan 30, 2019. This report, a 104 pages long document titled “Learning from the past, transforming for the future”, discusses Wells learning from the crisis and how it intends to prevent such inci- dents from happening again in the future. Both reports highlight that the root cause of the crisis was the combination of high sales focus with a decentralized business model, and a failure of control functions that allowed the improper sales practices to go unchecked. The findings of these reports serve as a benchmark for our analysis. 13 1.5 Culture and Textual Analysis Model 1.5.1 A framework for measuring dimensions of culture The key conceptual framework for measuring culture is the Competing Values Framework (CVF). This framework categorizes culture on two dimensions (see Figure 1.1). The first dimension is the degree of control of processes- one end of the spectrum represents firms that are flexible and dynamic, the other end of the spectrum represents stability and order.10 Among other things, this dimension captures the extent to which the firm formally deals with or needs to deal with agency.The second dimension is the degree of market or internal focus- one end of the spectrum represents firms that focus on integration and uniformity of various activities inside the company. The other end of the spectrum represents firms that are more focused on externally oriented, specifically, market-based drivers of activities.11 Based on these two dimensions, four major culture types are obtained. First, consider a company with a higher focus on stability and internal consistency. This culture type places importance on predictability, conformity, and internal pro- 10Among other things, this dimension captures the extent to which the firm formally deals with or needs to deal with agency, including those caused by incentives set by the firm. This is directly relevant to the Wells’ case. 11This dimension captures, among other things, the extent of competitive pressures faced by a firm. These competitive pressures can come from economic conditions in the industry (e.g. declin- ing revenues or high fixed costs) and compensation schemes of CEOs and managers (Aggarwal and Samwick 1999). The Report suggests managers incentive compensation schemes were tied to achieving aggressive competitive targets. 14 Figure 1.1: Competing Values Framework cesses, and has a lower focus on external markets (lower left quadrant). This corporate culture, termed “control”, captures Weber’s theory / organization of a firm for the industrial age. In years past, conglomerates like GE or Westing- house were predominantly structured this way. Next consider a company with high process/employee stability and order and high focus on markets/high fo- cus on internal consistency (lower right quadrant). This corporate culture, which is competition-oriented and termed “compete” in this framework is related to Williamson’s (Williamson 1981) framework where a firm’s key role is to transact with outside entities. Additionally, for these firms, the goal is competitive strength and shareholder value. Many companies of the old economy are described well by this- PG, Coca Cola, IBM, etc. Third, consider a company with flexibility and dynamism in em- 15 ployee/process control and internal focus (upper left quadrant). This corporate culture is called “collaborative”. These are characterized by loyalty to leaders, tradition, commitment and cohesion. These features can potentially substitute for formal employee control/ agency solutions. This culture is common in several family firms and several Chinese firms (Chen 2001). Fourth, consider a firm with flexible employee/process control, and with an emphasis on market transactions (upper right quadrant). This culture is most salient for the information age where companies face rapidly changing industry landscape and have work patterns that are best supported by temporary teams that form and dissolve over the course of projects. These companies are also responsive to market pressure (e.g., because of increasing returns technologies) and are focused on how to beat these pressures. Two aspect of the framework are important to highlight for our application. First, the framework is descriptive and not normative; there is no guidance on what constitutes optimal culture. In our application, we measure the association of culture and crisis by comparing crisis and non-crisis firms. Second, a company can have multiple types of culture, in different divisions, and at different levels within a division. Cameron et al. (2014) argue that CVF has a high degree of congruence with other well-known constructs of values, the way people think and their assump- tions, such as McKenney and Keen (1974), Mitzroff and Killman (1978), and Myers (1962). CVF has been used as well as adapted in management and organizations literature to assess organizational culture (Hartnell, Ou, and Kinicki 2011; Lavine 16 2014; Panayotopoulou, Bourantas, and Papalexandris 2003). Using CVF, Desh- pande and Farley (2004) found that collaborate and control cultures have a signif- icantly negative impact on firm performance, while create and compete cultures have a significantly positive impact on firm performance. Lukas, Whitwell, and Heide (2013) found in the context of product design and overprovision in mar- kets that higher focus on create and compete cultures in firms leads to potential mismatch between customer needs and firms product capability decisions. Our paper differs from existing CVF literature in the following ways. First, we have the Report that establishes the causes of the crisis, and therefore we can verify the cultural causes of crises. Second, conditional on measuring problematic culture at Wells, we use it as a yardstick to see if other companies are similarly at risk. In the absence of theory-driven hypotheses about problematic culture, our empirical approach to creating a yardstick for problematic culture is novel. 1.5.2 Text measure of culture We are interested in estimating the nature and extent of discussions of four dimen- sions of corporate culture: create, collaborate, control, and compete, which are captured by a set of seed words as listed in the Appendix 1.9.2. Adding or drop- ping a few randomly selected seed words does not change our results. Counts of seed words as measure of discussion is not ideal since they are not extensively used in online reviews (Puranam, Narayan, and Kadiyali 2017). Furthermore, us- 17 ing seed words by themselves to estimate extent of discussion also suffers from researcher subjectivity bias, as the word lists may not exhaustively cover the na- ture of discussion of these topics. Word2Vec captures semantic similarity of words using the contexts in which they appear. We take two steps to measure culture discussions using Word2Vec. First, we train the Word2Vec model on the reviews data. The model learns N-dimensional vector representations, subsequently referred to as embeddings, for each of word in the corpus. To construct the embeddings for the culture topics, we take the av- erage of the embeddings for each of the seed words for the culture topics. Second, we estimate the extent of discussion of culture topics in the employee reviews as the cosine distance between the culture topics obtained from the first step and the words in the text, aggregated for each employee review. We describe these two steps below. We discuss pre-processing of text in the Appendix 1.9.3. In the first step, the Word2Vec model estimates embeddings for words based on the contexts they occur in. Consider the example the phrase: “Manager pushes sales targets, high stress job.” If we select a context window of 2 words and se- lect the focal word as “targets”, then the model observes that the word “targets” occurs with the context: “pushes”, “sales”, “high”, “stress”. That is, 2 words pre- ceding the focal word and 2 words post the focal word are the context when we set the context window as 2.12 The model then estimates embeddings such that the probability of predicting this context given the focal word is maximized for all possible focal word and context combinations observed in the corpus. This 18 prediction task is the skip-gram model for Word2Vec (also see Figure 2). Figure 1.2: Word2Vec Skip-gram Model This figure illustrates the skip-gram model for Word2Vec using the example “Manager pushes sales targets, high stress job”. The focal word is ”targets”, and the model predicts the probability of observing the context words around the focal word. The skip-gram Word2Vec model maximizes the following objective function: 1 ∑T ∑ logp(wt+ j|wt) (1.1)T t=1 −J≤ j≤J, j,0 19 where T is the total number of focal words, J is the context window size, and wi is the embedding for word i. The probability of observing a context word i given focal word j is: ′ exp(vw >vwi) p(wi|w jj) = ∑V (1.2) j=1 exp(v ′ w >v )j wi where vw j represents the input vector for word w j and ′ vw represents the outputj vector for word w jout of a vocabulary of size V The model thus has 2 × N × V number of parameters to be estimated, where N is the number of neurons in the hidden layer. The denominator in the softmax poses a computational challenge as the vocabulary size is often large, and the computational cost of the gradient of p(wi|w j) is directly proportional to V. To deal with this, we use the hierarchical softmax approximation for the softmax (Mikolov et al. 2013). It has the advantage that the computation cost is proportional to log2V instead of V with the softmax.12 The objective function for the skip-gram with hierarchical softmax Word2Vec is maximized using stochastic gradient descent. In order to measure culture senti- ment, we exploit the demarcated pros and cons sections structure of the reviews. We train two Word2Vec models on the pros and cons sections of the reviews sep- arately. The seed words are the basis for forming culture topic embeddings. We construct culture topic embeddings as the average of the embeddings for the seed 12We discuss the details of hierarchical softmax in the Appendix section 1.9.4 20 words corresponding to that culture topic. Thus, we get positive embedding for a culture topic c as vcp which is obtained from the Word2Vec model trained on the pros section of the reviews, and we get the negative embedding for a culture topic c as v from the cons section of the reviews. 13cn Based on the word embeddings estimated from these models, we measure the discussion of a culture topic in terms of cosine distance between the culture topic embeddings and the text in employee review pros (or cons), using the cosine sim- ilarity metric. More formally, we measure positive (and negative) discussions of culture topics in reviews as follows: 1 ∑ vw >vc d (r v ) j sc,s , cs = (1.3)ls w j∈ ||vr w j || ∗ ||vcs || where dc,s is one of two the cosine distance measures – positive and negative discussions based on the sentiment s for a culture topic c,ls is the length (number of words) of the section of the review: rs and vw j the word embedding for a word w j. The support for these measures is [-1,1], with the discussion in a review being perfectly similar to the culture topic for a score of 1, orthogonal to the culture topic for a score of 0, and perfectly dissimilar to the culture topic for a score of -1.14 Thus, for the reviews we construct eight measures of culture discussions – 13The model hyper-parameters for the two Word2Vec models that are optimized by maximizing the held-out log-likelihood on 5% of the corpus are: dimension of the embeddings, size of context window, and number of iterations over the corpus. We changed the held-out corpus to 2%, 5%, and 10% during Word2Vec training and found that optimal model hyper-parameters do not change. 14We find empirically in our reviews data that these measures are in the range [0.053, 0.685], implying that our discussion measures vary from low similarity to high similarity. 21 positive discussion and negative discussion for each of the four culture topics. For both positive and negative sentiments, we also construct an overall measure of culture discussion. Our overall measure of discussion of culture in review r is a weighted average of discussion measures of the four culture topics for that review in a section s, given by: ∑∑c,s Wc,sdc,sdc,s = (1.4) c,s Wc,s where dc,s is the measured discussion of culture topic c for a section s – pros and cons, and Wc,s is the mean discussion for sentiment s of topic c across all re- views. An important point to note is that our measures of discussion are based on similarities to culture topics, i.e. how similar the content of a review is to a culture topic, and not on the magnitude (e.g. high or low) of discussion of the culture topic in the review. Furthermore, longer reviews could include lower proportion of culture topics if there is discussion of other topics. Therefore, we control for the length of a review (in words) in our subsequent analysis in order to account for volume of discussion in a review. 1.6 Results The results section is organized in three parts. First, we discuss the analysis of the employee reviews for Wells relative to other banks. Second, to see if they might 22 potentially be at risk for (future) crises, we assess how close our estimates of other banks’ culture topics are to Wells’. Finally, we study three consumer facing crises outside the banking industry. 1.6.1 Wells Employees’ Discussion of Corporate Culture We discuss the results of the Word2Vec model. In Table 1.6.1, we present the top 20 words that are most similar to the culture topics in the pros and cons section of the reviews respectively. We first want to understand whether Wells’ culture is systematically different from its rivals in the ten quarters prior to the crisis reveal in the press. Therefore, we run the following regressions for each of the culture topics (including overall culture topic) and sentiment for the employee reviews in the 10 quarters leading up to the crisis unveil: ∑ dr, j,s = α j,s + βδWells + ρ δqtr + γRr + r, j,s (1.5) qtr 23 24 Table 1.2: Most Similar Words for Culture Topics in Pros and Cons Sections of Employee Reviews This table lists the 20 most similar words (using cosine similarity) for the culture topics embeddings as estimated from the Word2Vec models trained on the pros (and cons) sections of the reviews. A culture topic embedding is estimated as the average of the embeddings of its seed words as listed in the appendix. Create Collaborate Control Compete Pros Cons Pros Cons Pros Cons Pros Cons technically entrepreneurial inviting appreciative procedures onerous ridiculous aggressive agenda creative emphasized humans written impede targets pressure inspirational innovative demonstrated participation conservatively principles evaluation sales innovation innovation seeks unified workflow flows exceeding quotes transformation evolve participation resolutions straightforward bloated timelines cooker insightful stifle eachother mistreatment micromanagement adapting reachable campaigns acceptance unified appreciative instill mature prioritization unrealistic aggravating entrepreneurial stifled enthusiasm interpersonal overly protocol struggle lofty bold stifles differences acknowledgment timings structuring hitting obscene innovative averse advocate foster predictable regs achievable exceeded contributors explore principles recognizing defined rigidity strictly unachievable tune progressive disabilities outgoing consistent needlessly hustle ridiculous implementing creativity treating sought structured robust sells payoff emphasized adapting admirable faith consistent bureaucracy production obsession exceptionally leverage sincere affecting rigid abrupt producer upsell dynamic horizontal moral sole equitable workflow target goals prides ease fostered sympathy staffs interference producing baseline agile methodology communicative smartest informal synergy scores stressing savvy deeply consensus degrading swag unforgiving objective appealing evolving necessity enthusiastic synergy affected formalized quotas berated where dr, j,s for a review r and sentiment s, and j is one of the four culture topics or the overall culture topics. δWells is the dummy variable for whether the review is for Wells. δqtr is the quarter fixed effect. Rr are the review level characteristics of log of length of the review, whether it is by a current employee or not, whether it is by an anonymous employee or not, and whether it is by a current and anonymous employee. r, j,s is the idiosyncratic error term. We clustered the standard errors at firm-year level to account for correlations in the error term across reviews of a firm in a given year. Adding bank specific fixed effects does not change the results. Table 1.3 shows the results. We find higher discussions of both negative and positive discussions of competition-focused goals – compete culture (β=0.024, p<0.01 for negative discussions; β=0.002, p<0.01 for positive discussions). For rules and procedures – compete culture, we find lower negative discussions: β=0.0043, p<0.01, while positive discussions are not significantly different from zero. For create culture, we find that both positive and negative discussions are lower (β=0.008, p<0.01 for negative discussions; β=0.007, p<0.01 for positive dis- cussions). Finally, for overall culture discussions we find that Wells has lower pos- itive discussion: β=0.002, p<0.05, and higher negative discussion: β=0.004, p<0.01. 25 26 Table 1.3: Employee Reviews: Discussions of Culture Topics This table reports the output from Eq. 1.5. The dependent variables are positive (and negative) discussions in employee reviews of the four culture topics - create, collaborate, control, and compete, as well as the overall culture topic. Current employee and anonymous employees are dummy variables, and we also include their interaction. Robust standard errors (in parenthesis) are clustered at firm year level. Positive Discussions Negative Discussions VARIABLES Create Collaborate Control Compete Overall Create Collaborate Control Compete Overall Wells -0.007*** -0.002* -0.001 0.002*** -0.002** -0.008*** -0.001 -0.004*** 0.024*** 0.004*** (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) Log (Length of Review) -0.006*** -0.014*** -0.016*** -0.006*** -0.011*** -0.006*** -0.002*** -0.004*** -0.001 -0.003*** (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) Current Employee 0.009*** 0.004*** 0.003*** 0.003*** 0.005*** 0.004*** -0.005*** 0.002*** 0.001 0.001 (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) Anonymous Employee 0.007*** 0.003** 0.003*** 0.001 0.003*** 0.005*** 0.004*** 0.003*** -0.001 0.002*** (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.001) (0.002) (0.001) Current and Anonymous Employee -0.003 0.001 -0.001 -0.001 -0.001 0.001 0.001 -0.001 -0.002 -0.001 (0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.002) (0.001) (0.002) (0.001) Constant 0.262*** 0.355*** 0.330*** 0.315*** 0.319*** 0.238*** 0.298*** 0.290*** 0.300*** 0.284*** (0.003) (0.003) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.002) (0.001) Observations 20,099 20,099 20,099 20,099 20,099 18,826 18,826 18,826 18,826 18,826 R-squared 0.021 0.042 0.067 0.018 0.050 0.023 0.010 0.010 0.048 0.012 Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. To enable comparison of coefficients of different cultures, we compute the semi-elasticities (see Table 1.4) for δWells from Equation 1.5. The standard errors for the semi-elasticities are calculated using the delta method (Gupta et al. 1996). From the net-sentiment (positive – negative) semi-elasticities, we find that em- ployees at Wells negatively review the compete culture (semi-elasticity of -0.070, p¡0.01), and positively review the control culture (semi-elasticity of 0.013, p¡0.01), whereas other semi-elasticities are not significant. Recall from the Report that these are the two main crisis causes identified in the report- too much emphasis on competitive sales goals, and too little oversight and control of poor practices. It is interesting that employees view positively the control culture, that we know from the report was too loose and dysfunctional for the company. It is possible some employees who indulged in fraudulent sales practices wrote these reviews, or employees who didn’t but still viewed positively the looser oversight. If these control dysfunctions caused the crisis, as the Report suggests they did, the result can be interpreted as employees viewing some of these dysfunctional culture as- pects as positive (despite being harmful for consumers). However, note that over- all, across all culture topics, the semi-elasticity is negative and significant (-0.020, p¡0.01), implying Wells employees review more culture cons than pros. 27 Table 1.4: Semi-Elasticities for Wells in Discussions of Culture This table reports the semi-elasticities for the dummy variables for Wells from Table 1.3. Robust standard errors (in parenthesis) are clustered at firm year level. Sentiment Culture Positive Negative Net Create -0.029*** -0.037*** 0.008 (0.005) (0.004) Collaborate -0.007* -0.001 -0.006 (0.003) (0.002) Control -0.002 -0.015*** 0.013*** (0.004) (0.002) Compete 0.008*** 0.078*** -0.070*** (0.002) (0.004) Overall -0.006** 0.014*** -0.020*** (0.003) (0.002) Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. Next, we want to understand how long before the crisis reveal Wells’ culture differed from its rivals. Therefore, we run the following regressions for each of the culture topics (including overall culture topic) and sentiment, interacting Wells dummy with quarter dummy for the employee reviews in the 10 quarters leading up to the crisis unveil: ∑ ∑ dr, j,s = α j,s + β δqtr × δWells + ρ δqtr + γRr + r, j,s (1.6) qtr qtr Table 1.5 shows the regression results. Again, to compare across different culture topics over time, we focus on the semi-elasticities. We report the semi- 28 elasticities for δqtr × δWells for the two sentiments with the four culture topics as well as the overall culture topics in Table 1.6. Here we find that the most consis- tent results are up to ten quarters ahead of crisis reveal – employees quite consis- tently negatively review the compete culture. The net sentiment (positive discus- sion – negative discussion) semi-elasticities for compete culture are at least -0.053, p<0.01 or lower, and consistently significant at p<0.01 across the ten quarters. For rules and procedures – control culture, the net sentiment (positive discussion – negative discussion) semi-elasticities are positive and significant up until three quarters ahead of crisis. For other culture topics, we do not find any trends in the quarters ahead of crisis. Finally, for the overall culture topic we find that employ- ees reviewed negatively across the ten quarters ahead of crisis (with the exception of quarter 4). This suggests that employees discussed more negative aspects of compete culture, where as they reviewed positively the control culture up until three quarters ahead of crisis.15 This window of 2.5 years before the crisis reveal show the potential usefulness for managers and regulators to using employee re- views as leading indicator of firm crises. Summarizing, we find that employees at Wells exhibit more negative views of the overall culture at Wells. Across the four culture topics, we see that this is true especially for the compete culture at Wells. Interestingly, employees viewed rules and procedures – control culture more positively, however these views disappear in the three quarters ahead of crisis. This is consistent with the findings of the 15We did not observe any substantial changes in the number of reviews for any bank over time. Thus, these results are not driven by a sudden increase or decrease in number of reviews. 29 Report which identified the lack of formal controls at Wells that let the improper and aggressive sales practices continue uncorrected, which subsequently caused the crisis. Since the Report recognizes these as key causes for the crisis, we have greater confidence in the ability of employee reviews to capture critical weak- nesses of corporate culture that potentially caused the crisis. Employee reviews are publicly available for both own and rival companies, and over time, they can be used by managers to benchmark with their competitors and with firms in other industries that faced crises. Regulators can also use these reviews to monitor both individual firm and industry-level risks. 30 31 Table 1.5: Employee Reviews: Discussions of Culture Topics Over Time This table reports the output from Eq. 1.6. The dependent variables are positive (and negative) discussions in employee reviews of the four culture topics - create, collaborate, control, and compete, as well as the overall culture topic. Controls (not reported in the table for brevity) include log of the length of review, current employee dummy, anonymous employee dummy, and their interaction. Robust standard errors (in parenthesis) are clustered at firm year level Positive Discussions Negative Discussions Wells Create Collaborate Control Compete Overall Create Collaborate Control Compete Overall Pre 10 Qtr -0.006* 0.002 -0.001 0.001 -0.001 -0.010*** -0.003** -0.004*** 0.024*** 0.003*** (0.003) (0.002) (0.001) (0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) Pre 9 Qtr -0.005*** -0.004* 0.001 0.004*** -0.001 -0.007*** 0.003*** -0.003*** 0.028*** 0.006*** (0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001) Pre 8 Qtr -0.010*** -0.006*** -0.003** 0.003*** -0.004*** -0.009*** 0.003*** -0.006*** 0.030*** 0.006*** (0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.002) (0.000) Pre 7 Qtr -0.012*** -0.010*** -0.005*** 0.001 -0.006*** -0.008*** 0.002 -0.004*** 0.026*** 0.005*** (0.002) (0.002) (0.002) (0.001) (0.001) (0.002) (0.0020) (0.001) (0.002) (0.001) Pre 6 Qtr -0.010*** -0.001 0.002 0.003** -0.001 -0.011*** -0.001 -0.007*** 0.027*** 0.003*** (0.003) (0.001) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.002) (0.001) Pre 5 Qtr -0.006*** -0.001 0.001 0.004*** -0.001 -0.008*** -0.001 -0.003*** 0.025*** 0.004*** (0.002) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.000) Pre 4 Qtr -0.006** 0.001 0.003*** 0.004*** 0.001 -0.006*** -0.001 -0.004*** 0.020*** 0.003*** (0.003) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001) Pre 3 Qtr -0.009*** -0.002 -0.001 0.002* -0.002 -0.011*** -0.001** -0.004*** 0.028*** 0.004*** (0.002) (0.002) (0.003) (0.001) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) Pre 2 Qtr -0.007*** -0.004** -0.004*** 0.001 -0.003*** -0.009*** -0.001 -0.006*** 0.018*** 0.001 (0.003) (0.002) (0.001) (0.001) (0.001) (0.002) (0.001) (0.001) (0.001) (0.001) Pre 1 Qtr -0.005 -0.001 -0.002*** 0.001 -0.002 -0.006*** -0.001 -0.002** 0.021*** 0.004*** (0.003) (0.002) (0.001) (0.000) (0.001) (0.002) (0.001) (0.001) (0.003) (0.001) Observations 20,099 20,099 20,099 20,099 20,099 18,826 18,826 18,826 18,826 18,826 R-squared 0.021 0.042 0.068 0.019 0.050 0.023 0.010 0.010 0.049 0.013 Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. 32 Table 1.6: Semi-Elasticities for Wells in Discussions of Culture Over Time This table reports the semi-elasticities for the joint dummy variables for Wells and quarter coefficient from Table 1.5 for culture topics. Robust standard errors (in parenthesis) are clustered at firm year level. Create Collaborate Control Compete Overall Culture Wells Positive Negative Net Positive Negative Net Positive Negative Net Positive Negative Net Positive Negative Net Pre 10 Qtr -0.023* -0.046*** 0.023* 0.006 -0.011** 0.017* -0.002 -0.014*** 0.011*** 0.001 0.077*** -0.075*** -0.003 0.0010*** -0.013** (0.012) (0.006) (0.007) (0.004) (0.004) (0.003) (0.005) (0.005) (0.006) (0.002) Pre 9 Qtr -0.020*** -0.029*** 0.009 -0.013* 0.010*** -0.023*** 0.005 -0.011*** 0.015*** 0.014*** 0.088*** -0.074*** -0.003 0.023*** -0.025*** (0.007) (0.009) (0.007) (0.003) (0.005) (0.003) (0.004) (0.004) (0.004) (0.002) Pre 8 Qtr -0.039*** -0.040*** 0.001 -0.018*** 0.011*** -0.029*** -0.011** -0.021*** 0.011* 0.010*** 0.096*** -0.086*** -0.013*** 0.021*** -0.034*** (0.010) (0.008) (0.006) (0.002) (0.005) (0.003) (0.003) (0.007) (0.004) (0.002) Pre 7 Qtr -0.047*** -0.033*** -0.014 -0.031*** 0.006 -0.036*** -0.017*** -0.013*** -0.004 0.004 0.084*** -0.080*** -0.021*** 0.019*** -0.040*** (0.009) (0.008) (0.005) (0.006) (0.006) (0.004) (0.003) (0.005) (0.004) (0.003) Pre 6 Qtr -0.039*** -0.048*** 0.009 -0.001 -0.001 0.001 0.005 -0.026*** 0.031*** 0.008** 0.086*** -0.078*** -0.004 0.012*** -0.016*** (0.010) (0.008) (0.004) (0.002) (0.004) (0.004) (0.004) (0.005) (0.004) (0.003) Pre 5 Qtr -0.021*** -0.038*** 0.016 -0.002 -0.001 -0.001 0.003 -0.012*** 0.014*** 0.012*** 0.079*** -0.068*** -0.001 0.015*** -0.016*** (0.008) (0.007) (0.007) (0.003) (0.004) (0.003) (0.003) (0.005) (0.005) (0.002) Pre 4 Qtr -0.022** -0.024*** 0.002 0.004 -0.004 0.008 0.011*** -0.015*** 0.026*** 0.012*** 0.065*** -0.053*** 0.003 0.011*** -0.008 (0.011) (0.007) (0.006) (0.003) (0.003) (0.004) (0.003) (0.004) (0.004) (0.002) Pre 3 Qtr -0.037*** -0.048*** 0.012 -0.005 -0.005** 0.001 -0.003 -0.014*** 0.010 0.006* 0.089*** -0.083*** -0.008 0.015*** -0.022*** (0.008) (0.006) (0.006) (0.002) (0.009) (0.004) (0.003) (0.007) (0.005) (0.003) Pre 2 Qtr -0.028*** -0.039*** 0.011 -0.012** -0.005 -0.008 -0.013*** -0.022*** 0.001* 0.003 0.058*** -0.055*** -0.011*** 0.004 -0.016*** (0.010) (0.008) (0.006) (0.003) (0.003) (0.004) (0.002) (0.005) (0.004) (0.003) Pre 1 Qtr -0.019 -0.028*** 0.010 -0.003 -0.001 -0.002 -0.007*** -0.008** 0.001 0.001 0.068*** -0.067*** -0.006 0.014*** -0.020*** (0.013) (0.007) (0.006) (0.004) (0.002) (0.003) (0.001) (0.008) (0.004) (0.004) Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. 1.6.2 Risk Assessment of Other Banks Next, we analyze if the culture discussions by employees of any bank in our data, are similar to the culture discussions by Wells employees. For this purpose, we compare culture discussions of other bank employees for the four quarters after September 2016, with Wells’ culture discussion before September 2016. We em- ploy the Kolmogorov–Smirnov (KS) test to estimate this distance. The KS test pro- vides an upper bound for the difference in the cumulative distribution function for two distributions. More formally, for two samples P and Q, the two-sample KS test is given by: KS (P,Q) = maxi|CPi −CQi | (1.7) Where CP is the cumulative distribution function for the distribution P. Thus, with the Wells pre-crisis employee reviews as the benchmark, we estimate the KS- test for employee reviews for the banks for the duration 4 quarters post crisis. In the KS test, a lower deviation value implies that the two distributions are closer. For the purposes of this analysis, we restrict ourselves to the top 10 largest banks in US and estimate the cumulative significant total deviation of employee reviews for the culture topics discussion in the pros and cons sections as com- pared to Wells employee reviews for the three quarters prior to the crisis reveal. We find that one bank has reviews with zero cumulative deviation of culture dis- 33 cussions (p < 0.05), that is, closest to Wells’ reviews.16 To the extent that the Report helps us identify these culture measures as causing the crisis, this similarity of cul- tures should be worrying for both managers and regulators. While the above text analysis is useful to uncover troubling cultural similarity between pre-crisis Wells Fargo and current banks, it cannot reveal the causes of this problematic culture at this bank any more than it did at Wells. This is outside of the scope of our paper. 1.6.3 Generalizability to Other Consumer Facing Crises As discussed previously, the Wells crisis Report provides the ground truth, a yard- stick to verify whether employee reviews can indeed capture potentially causal culture dysfunctions. We now turn to employee reviews from the three other cri- sis firms. Since we are more confident about the causal effects of dysfunctional culture on corporate crises, we test whether these results generalize to other con- sumer facing crises. We measure culture differences between crises and non-crises companies in three other consumer facing crises that were unveiled during 2014- 16 – General Motors, Chipotle Mexican Grill, and Mylan.17 We briefly describe these crises. In the General Motors crisis in 2014 (GM from hereon), GM had failed to iden- tify and report faulty ignition switches in 2.6 Million of its cars manufactured 16For liability reasons we do not disclose the name of this bank 17We found these crises by searching for keywords: “consumers”, ”firm”, and ”crisis” on Wall Street Journal and New York Times for all articles for the time period: Jan 2014 – Dec 2016. 34 during 2000-14. GM self-initiated a recall of these cars on Feb 7, 2014, which was followed by a regulator-initiated investigation on Feb 26, 2014.18 In the Chipo- tle Mexican Grill crisis of E-coli infection in 2015 (Chipotle from hereon), the first outbreak happened in Aug, 2015, with more than 55 people infected by Jan, 2016. As a result, the FDA launched a criminal investigation against Chipotle on Jan 6, 2016.19 In the Mylan crisis in 2016, Mylan had increased the price for its drug that stops life threatening allergy reactions – Epipen in May 2016 for the 17th time since 2007 –up 568% to $608 per dose from its price of $94 per dose in 2007. My- lan CEO had to appear in front of the US House Committee on Oversight and Government Reform, to explain its reasons for the price increases.20 These crises are very different from Wells crisis; they share the common feature of substantial consumer harm that prompted regulatory investigation. We randomly sample 20 firms with the same SIC industry code from COM- PUSTAT to find comparable firms for these three crisis-facing firms. The firms are listed in the Appendix 1.9.5. There are 39,651 reviews across these firms for the time period of Jan, 2008 - April, 2018. Thus, using the culture word embeddings in the pros and cons sections, we calculate the similarity of the reviews to the cul- ture topics. We estimate probit regressions to test if the culture discussions are significant predictors of these crises. More formally, the regression is specified as 18See, CNN Business, Feb 13, 2015, “51 deaths linked to GM ignition switch flaw”; NHTSA May 2014 Recall Notice to GM 19See, USA Today, Oct 31, 2015,“Chipotles close in Ore., Wash., after 22 sick from E. coli”; NBC News, Jan 6, 2016, “Chipotle Says It Faces Criminal Investigation in Food Illness Case” 20See, WSJ, Aug 24, 2016, “Mylan Faces Scrutiny Over EpiPen Price Increases”; FHCOGR, Sep 21, 2016, “Full Committee Hearing: “Reviewing the Rising Price of EpiPens” 35 follows for the each of the ten quarters leading up to the crises unveil: ∑∑ y∗r,t, f = α + βc,sDr,c,s + γrLr + ρX f + Ind.F.E. + r,t, f (1.8) c s Where yr,t, f is the indicator that the review r at a quarter t is for a crisis firm. Dr,c,s is the measured discussion of one of the culture topics c and sentiment s, Rr is the log length of review, and X f re firm level characteristics of log of assets and profitability in the previous quarter. We include industry fixed effects, and cluster the standard errors at the firm level. For ease of comparison of coefficients, we once again convert them to semi-elasticities. Table 1.7 shows the semi-elasticities. Similar to the findings from Wells, we find that employees review negative the compete culture at these crisis firms; these effects are visible three quarters ahead of the crisis. Like in Wells, employee reviews write positively about the control culture at crisis firms 9- ten quarters ahead of the crisis. Unlike Wells these effects are no longer significant after that. Furthermore, employees express negatively about compete culture. These views are visible starting from six quarters ahead of crises, and are consistently negative and significant starting from three quarters ahead of crises. Combining the analysis from Wells and these three crises, the common element is employees negative discussion of compete culture, and that employee reviews of culture can be leading indicators of upcoming crises. 36 37 Table 1.7: Generalizability to Other Crises This table lists elasticities from the probit regressions estimated for the crises of Mylan, Chipotle, and GM. Controls include log (lagged assets), lagged profitability, log (length of review), and industry fixed effects. Robust standard errors (in parenthesis) are clustered at firm level. Quarter Prior to Crisis Variables 10 9 8 7 6 5 4 3 2 1 Create Positive 1.066 3.760*** 0.928 -0.892 1.115 2.004 4.444*** 0.076 1.966* 2.137*** (1.634) (0.869) (0.797) (1.734) (1.187) (1.377) (1.137) (1.823) (1.187) (0.747) Negative 3.156*** 5.536*** 1.335* 2.654** 2.391 0.141 0.403 4.643*** 3.062** 2.376** (1.090) (1.230) (0.796) (1.099) (2.216) (1.325) (2.185) (1.494) (1.226) (1.021) Net -2.090 -1.777 -0.406* -3.546 -1.275 1.864 4.042 -4.567* -1.096 -0.238 Collaborate Positive -0.257 -2.808** -3.172*** -0.742 -0.154 -1.080 -2.348 -0.004 1.670* 2.177** (1.339) (1.357) (1.201) (1.111) (1.046) (1.022) (1.559) (1.137) (0.887) (1.059) Negative -2.198 -3.364 -2.041 -0.097 -4.297** -0.750 -4.532** -0.053 -1.747 0.614 (1.974) (2.229) (1.799) (0.696) (1.684) (1.425) (2.114) (1.976) (1.174) (0.771) Net -2.455 0.556 -1.131 -0.645 4.143** -0.331 2.184 0.050 3.416** 1.563 Control Positive 6.052*** 2.015** 0.222 -4.060*** -2.402*** -1.332 0.153 -0.218 -0.669 -1.128** (1.747) (0.925) (2.568) (1.021) (0.660) (1.050) (0.892) (0.875) (0.905) (0.478) Negative -2.543 -1.991 0.093 -2.872** -3.133 1.215 1.248 -0.213 -3.376*** -2.631*** (2.254) (1.411) (1.035) (1.258) (2.080) (1.455) (2.558) (1.525) (0.607) (0.653) Net 8.595*** 4.006*** 0.129 -1.188 0.732 -2.547 -1.010 -0.005 2.707*** 1.503 Compete Positive -7.183*** 1.520 0.310 3.202 -1.017 -4.244** -3.454* -2.314 -5.453** -6.393*** (2.026) (2.031) (1.913) (2.272) (1.607) (2.083) (1.928) (1.646) (2.331) (2.112) Negative -0.820 -1.354 2.073** 3.797** 2.755*** 2.316** 1.338 1.753 2.269** 1.743* (1.396) (1.663) (0.962) (1.638) (0.835) (0.915) (2.522) (1.224) (1.089) (1.030) Net -6.363*** 2.874 -1.763 -0.595 -3.771** -6.560*** -4.793 -4.066** -7.722*** -8.136*** Observations 524 463 751 693 925 1,074 1,227 1,061 1,451 1,959 Pseudo R2 0.076 0.094 0.086 0.098 0.056 0.085 0.081 0.095 0.071 0.053 Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. We are also interested in whether overall culture is predictive of these crises, we reformulate Eq 8 to include overall culture topic. More formally, we estimate the following regression for the 10 quarters prior to the crises unveil: ∑∑ y∗r,t, f = α + βc,sOCr,s + γrLr + ρX f + Ind.F.E. + r,t, f (1.9) c s where OCr,s is the measured overall culture in a review r with sentiment s. We directly report the elasticities of overall culture for the purposes of brevity. We find that the net sentiment elasticity associated with overall culture is negative starting five quarters ahead of the crises. It is significant for quarters 5, 3, and 1 ahead of crises (β=-6.282, p<0.01; β=-8.310, p<0.01;and β=-2.672, p<0.01 respectively), and consistent in sign, i.e. negative, for quarters 4 and 2 ahead of crises. Summarizing, compared to non-crisis firms in the same industry, employee reviews from automotive, restaurants, and pharmaceuticals crises all show neg- ative views of culture several quarters ahead of the crisis revelation. In all these crises, employees express negative views on “compete” or competition-oriented goals. In the Wells’ crisis, per the Report, and the consistent evidence in em- ployee reviews, we know this culture (along with dysfunctional “control” or over- sight functions) caused the crisis. While we do not have an externally validated source of causes for the other three crises, we can cautiously claim that aggressive competition-oriented goals might cause these crises- at the least, employees nega- tive views of focus on competition-oriented goals appear to be leading indicators 38 of corporate crises. We also do not find evidence that financial variables appear to cause these crises. This information can be useful for regulators as they can mea- sure corporate culture using employee reviews and can also subsequently flag po- tentially “at-risk” firms. Finally, investors and activist shareholders investors can use these measures of culture to identify and fix firms with problematic cultures. 39 40 Table 1.8: Generalizability to Other Crises – Elasticities for Overall Culture This table lists the elasticities for overall culture topic (and financial) variables for the crises of Mylan, Chipotle, and GM. Controls include log of the length of review and industry fixed effects. Robust standard errors (in parenthesis) are clustered at firm level. Quarter Prior to Crisis Variables 10 9 8 7 6 5 4 3 2 1 Overall Culture Positive 0.076 3.626*** -1.598** -2.250* -1.000 -2.988*** 0.095 -1.650** -0.718 -0.737 (0.967) (0.639) (0.813) (1.264) (0.776) (0.896) (0.441) (0.685) (0.796) (0.558) Negative -1.898 1.467 2.157 2.830 -2.292** 3.295** -1.169 6.660*** 0.724 1.936** (2.149) (1.170) (1.524) (2.309) (0.989) (1.516) (0.820) (1.724) (1.513) (0.862) Net 1.974 2.160 -3.755** -5.080* 1.292 -6.282*** -1.074 -8.310*** -1.442 -2.6723*** Firm Financials Assets -0.107 -0.087 -0.061 -0.147 -0.083 -0.037 -0.149 -0.200 -0.053 -0.027 (0.122) (0.140) (0.122) (0.126) (0.126) (0.131) (0.151) (0.177) (0.148) (0.137) Profitability 1.701 0.881 0.559 2.764 0.179 -0.027 4.012 5.909 0.983 0.092 (1.522) (1.469) (0.586) (2.183) (0.358) (0.480) (3.301) (3.721) (0.942) (0.806) Observations 524 463 751 693 925 1,074 1,227 1,061 1,451 1,959 Pseudo R2 0.046 0.069 0.078 0.082 0.042 0.075 0.062 0.089 0.049 0.031 Statistical Statistical significance levels: ∗ ∗ p < 0.01, ∗p < 0.05. 1.7 Conclusion In this paper we empirically measure the corporate culture at Wells before the cri- sis became publicly known, and examine whether other banks might currently be at-risk based on their corporate culture. We also show the generalizability of our findings to three other consumer facing crises. Unlike traditional corporate cul- ture research which uses survey-based methods and ask participants specifically for information on culture, we use publicly available anonymous employee re- views on a leading job reviews website in the US to extract discussions of culture. The presence of the post-mortem Report serves the key role of corroborating and verifying whether employee reviews at Wells, written before the crisis occurred, capture the causes of crisis. We find employee reviews of corporate reviews are leading indicators of corporate crisis, with employees reviews overall more nega- tive than at competitor firms, and especially with a negative view of the compete or profit-and-sales goal culture. Our work makes substantive contributions, as well as managerial and regula- tory contributions. Employee reviews provide a free source of information that managers can use to assess corporate culture, and benchmark with competitors. For regulators, employee reviews can provide valuable information in the internal workings and corporate culture at firms, which can be pivotal in flagging at-risk firms and subsequently prevent consumer harm. There are several limitations of our work. First, we have studied consumer 41 harm crises. We have not considered crises such as accounting frauds, sexual ha- rassment, or significant environmental violations. Therefore, we are unable to generalize what types of dysfunctional cultures might cause these other crises. Second, we are not able to causally separate dissatisfied employees and problem- atic cultures, and we cannot identify other potential causes of problematic cultures either since we do not have data from when Wells and the other three crisis firms had non-problematic cultures. There are several avenues for further research. Measures of culture imple- mented here can be used to study firm responses a variety of natural experiments like changes in laws (e.g. requirements on diversity of the board) and other exter- nal shocks (e.g. the impact of elections on employee satisfaction). With more research on measures of corporate culture and under different conditions and changes, we hope will emerge more theory-based and empirically generalizable understanding of corporate culture (see Puranam et al. 2018). 1.8 References Aggarwal, Rajesh K., and Andrew A. Samwick (1999). ”Executive compen- sation, strategic competition, and relative performance evaluation: Theory and evidence.” The Journal of Finance 54.6: 1999-2043. Anginer, Deniz, Asli Demirguc-Kunt, Harry Huizinga, and Kebin Ma (2018). 42 ”Corporate governance of banks and financial stability.” Journal of Financial Eco- nomics 130.2: 327-346. Arnold, Chris (2016). ”Former Wells Fargo employees describe toxic sales cul- ture, even at HQ.” NPR. October 4. Bhandari, Avishek, Babak Mammadov, Maya Thevenot, and S. Hamidreza Vakilzadeh. (2017). ”The Invisible Hand: Corporate Culture and Its Implications for Earnings Management.” Borah, Abhishek, and Gerard J. Tellis (2016). ”Halo (spillover) effects in social media: do product recalls of one brand hurt or help rival brands?.” Journal of Marketing Research 53.2: 143-160. Burks, Stephen V., and Erin L. Krupka (2012). ”A multimethod approach to identifying norms and normative expectations within a corporate hierarchy: Evi- dence from the financial services industry.” Management Science 58.1: 203-217. Cameron, Kim S., Robert E. Quinn, Jeff DeGraff, and Anjan V. Thakor. Com- peting values leadership. Edward Elgar Publishing, 2014. Chang, Sea-Jin, Ji Yeol Jimmy Oh, and Kwangwoo Park (2017). ”The power of silent voices: Employee satisfaction and acquirer stock performance.” Chen, Ming-Jer (2001). Inside Chinese Business: A Guide for Managers World- wide. Harvard Business Press. 43 Chen, Yubo, Shankar Ganesan, and Yong Liu (2009). ”Does a firm’s product- recall strategy affect its financial value? An examination of strategic alternatives during product-harm crises.” Journal of Marketing 73.6: 214-226. Corritore, Matthew, Amir Goldberg, and Sameer B. Srivastava (2019). ”Duality in diversity: How intrapersonal and interpersonal cultural heterogeneity relate to firm performance.” Administrative Science Quarterly Crémer, Jacques (1993). ”Corporate culture and shared knowledge.” Industrial and Corporate Change 2.3: 351-386. Deshpandé, Rohit, and John U. Farley (2004). ”Organizational culture, mar- ket orientation, innovativeness, and firm performance: an international research odyssey.” International Journal of Research in Marketing 21.1: 3-22. Fiordelisi, Franco, and Ornella Ricci (2014). ”Corporate culture and CEO turnover.” Journal of Corporate Finance 28: 66-82. Graham, John R., Campbell R. Harvey, Jillian Popadak, and Shivaram Rajgopal (2017). Corporate culture: Evidence from the field. No. w23255. National Bureau of Economic Research. Green, T. Clifton, Ruoyan Huang, Quan Wen, and Dexin Zhou (2019). ”Crowd- sourced employer reviews and stock returns.” Journal of Financial Economics 134.1: 236-251. Guiso, Luigi, Paola Sapienza, and Luigi Zingales (2015a). ”Corporate culture, 44 societal culture, and institutions.” American Economic Review 105.5: 336-39. Guiso, Luigi, Paola Sapienza, and Luigi Zingales (2015b). ”The value of corpo- rate culture.” Journal of Financial Economics 117.1: 60-76. Gupta, Sachin, Pradeep Chintagunta, Anil Kaul, and Dick R. Wittink (1996). ”Do household scanner data provide representative inferences from brand choices: A comparison with store data.” Journal of Marketing Research 33.4: 383- 398. Hartnell, Chad A., Amy Yi Ou, and Angelo Kinicki (2011). ”Organizational culture and organizational effectiveness: a meta-analytic investigation of the com- peting values framework’s theoretical suppositions.” Journal of Applied Psychol- ogy 96.4: 677. Huang, Minjie, Pingshu Li, Felix Meschke, and James P. Guthrie (2015). ”Fam- ily firms, employee satisfaction, and corporate performance.” Journal of Corpo- rate Finance 34: 108-127. Ji, Yuan, Oded Rozenbaum, and Kyle T. Welch (2017). ”Corporate culture and financial reporting risk: Looking through the glassdoor.” Available at SSRN 2945745. Lavine, Marc (2014). ”Paradoxical leadership and the competing values frame- work.” The Journal of Applied Behavioral Science 50.2: 189-205. Lee, Thomas Y., and Eric T. Bradlow (2011). ”Automated marketing research 45 using online customer reviews.” Journal of Marketing Research 48.5: 881-894. Liu, Angela Xia, Yong Liu, and Ting Luo (2016). ”What drives a firm’s choice of prod- uct recall remedy? The impact of remedy cost, product hazard, and the CEO.” Journal of Marketing 80.3: 79-95. Lukas, Bryan A., Gregory J. Whitwell, and Jan B. Heide (2013). ”Why do cus- tomers get more than they need? How organizational culture shapes product ca- pability decisions.” Journal of Marketing 77.1: 1-12. Marinescu, Ioana Elena, and Eric A. Posner (2019). ”Why Has Antitrust Law Failed Workers?.” Available at SSRN 3335174. McKenney, James L., and Peter GW Keen (1974). ”How managers’ minds work.” Harvard Business Review 52.3 : 79-90. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean (2013). ”Distributed representations of words and phrases and their composition- ality.” In Advances in neural information processing systems, pp. 3111-3119. Mitzroff, I. I., and R. H. Killman (1978). ”Methodological Approaches to Social Science: Integrating divergent concepts and theories.” Moniz, Andy (2017). ”Inferring employees’ social media perceptions of corpo- rate culture and the link to firm value.” Available at SSRN 2768091. Moniz, Andy, and Franciska de Jong (2014). ”Sentiment analysis and the im- pact of employee satisfaction on firm earnings.” European Conference on Infor- 46 mation Retrieval. Springer, Cham. Myers, Isabel Briggs (1962). ”The Myers-Briggs Type Indicator: Manual (1962).” Ochs, Susan M (2016). ”The leadership blind spots at Wells Fargo.” Harvard Business Review 10. O’Reilly, Charles (1989). ”Corporations, culture, and commitment: Motivation and social control in organizations.” California Management Review 31.4: 9-25. O’Reilly III, Charles A., Jennifer Chatman, and David F. Caldwell (1991). ”People and organizational culture: A profile comparison approach to assessing person-organization fit.” Academy of Management Journal 34.3: 487-516. Panayotopoulou, Leda, Dimitris Bourantas, and Nancy Papalexandris (2003). ”Strategic human resource management and its effects on firm performance: an implementation of the competing values framework.” International Journal of Human Resource Management 14.4: 680-699. Pearson, Christine M., and Judith A. Clair (1998). ”Reframing crisis manage- ment.” Academy of Management Review 23.1: 59-76. Puranam, Dinesh, Vishal Narayan, and Vrinda Kadiyali (2017). ”The effect of calorie posting regulation on consumer opinion: A flexible latent Dirichlet alloca- tion model with informative priors.” Marketing Science 36.5: 726-746. 47 Puranam, Phanish, Yash Raj Shrestha, Vivianna Fang He, and Georg von Krogh (2018). ”Algorithmic induction through machine learning: Using predic- tion to theorize”, working paper, SSRN. Tellis, Gerard J., Jaideep C. Prabhu, and Rajesh K. Chandy (2009). ”Radical in- novation across nations: The preeminence of corporate culture.” Journal of Mar- keting 73.1: 3-23. Timoshenko, Artem, and John R. Hauser (2019). ”Identifying customer needs from user-generated content.” Marketing Science 38.1: 1-20. Van Heerde, Harald, Kristiaan Helsen, and Marnik G. Dekimpe (2007). ”The impact of a product-harm crisis on marketing effectiveness.” Marketing Science 26.2: 230-245. Weber, Roberto A., and Colin F. Camerer (2003). ”Cultural conflict and merger failure: An experimental approach.” Management Science 49.4: 400-415. Wei, Jiuchang, Zhe Ouyang, and Haipeng Chen (2017). ”Well known or well liked? The effects of corporate reputation on firm value at the onset of a corporate crisis.” Strategic Management Journal 38.10: 2103-2120. Williamson, Oliver E (1981). ”The economics of organization: The transaction cost approach.” American Journal of Sociology 87.3: 548-577. Zhao, Yi, Ying Zhao, and Kristiaan Helsen (2011). ”Consumer learning in a tur- bulent market environment: Modeling consumer choice dynamics after a product- 48 harm crisis.” Journal of Marketing Research 48.2: 255-267. Zhong, Ning, and David A. Schweidel (2020). ”Capturing changes in social media content: a multiple latent changepoint topic model.” Marketing Science 49 1.9 Appendix 1.9.1 Wells List of Firms The table below lists the set of large national commercial banks in US based on consolidated assets as of March 31, 2018. Bank Name Consolidated Assets Domestic Domestic (Bil $) Assets (Bil $) Branches JP Morgan Chase 2,198 1,676 5,115 Bank of America 1,765 1,661 4,484 Wells Fargo 1,716 1,662 5,913 Citibank 1,406 821 706 US Bank 452 442 3,138 PNC Bank 368 364 2,515 Capital One 289 289 620 KeyBank 135 135 1,217 Citizens Bank 122 122 797 Huntington National Bank 104 104 1,018 Zions Bank 66 66 435 Peoples United Bank 43 43 403 First National Bank of Tennessee 40 40 347 BOK Bank 33 33 124 First National Bank of Pennsylvania 31 31 416 Associated Bank 31 31 225 Sterling National Bank 30 30 126 Valley National Bank 29 29 239 Webster Bank 26 26 166 TCF National Bank 23 23 331 MB Bank 20 20 106 First National Bank of Omaha 19 19 127 OLD National Bank 17 17 197 Washington Federal Bank 15 15 237 Trustmark National Bank 13 13 202 Fulton Bank 11 11 110 Community Bank 10 10 224 Centerstate Bank 10 10 129 NBT Bank 9.16 9.16 152 Park National Bank 7.46 7.46 108 Woodforest National Bank 5.7 5.7 729 DBA First Convenience Bank 1.89 1.89 307 Source: Federal Reserve Statistical Release - Large Commercial Banks, release date: March 2018 50 1.9.2 Seed Words We derive the seed words from the descriptions of the four culture types in the competing values framework as words that qualitatively describe the culture type of interest. Create Collaborate Control Compete entrepreneurial family control market dynamic mentor controlling competition innovative mentors coordinate competitive agile parent coordination profits initiative loyalty consistent profit freedom loyal efficient sales flexibility tradition schedule results individuality commitment structure driven entrepreneur cohesion efficiency tough dynamism cohesive stability demanding innovation morale stable succeed creative teamwork smooth external experimentation team predictability strong adaptation consensus predictable hard fad participation centralized success fast-failure participate formalized tasks risk-taking concern structured decisive leading-edge sensitivity procedures productivity bold sensitive rules deadlines future friendly policies performance sharing micromanagement targets facilitator paperwork aggressive everyone red-tape difficult employees roadblocks perform people directive business community layoffs goals goal customer customers pressure numbers products product money overtime We note that the number of seed words for the four culture topics vary - 20 51 seed words for create culture, 26 seed words for collaborate culture, 25 seed words for control culture, and 36 seed words for compete culture. We argue that these seed words are drawn directly from the culture descriptions from Cameron et al. (2014), and it should not be a concern in our context given that these seed words are not frequently used in the reviews data. An alternate to this seed word list is the list used by Bhandari et al. (2017). They base their seed words following the approach of Fiordelisi and Ricci (2014), who measure culture discussions from firms’ 10-K filings using a bag of words approach. Comparing our seed words list to that of Bhandari et al. (2017), we find the following overlap: 11 out of 20 of our seed words for create culture, 6 out of 26 of our seed words for collaborate culture, 5 out of 25 of seed words for control culture, and 20 out of 36 of our seed for compete culture overlap with their seed words list. We argue that Bhandari et al. (2017) context is of formal financial disclosures (different from employee reviews on a jobs website), and thus the difference in seeds words is not a potential concern. 52 1.9.3 Data Pre-Processing We now describe the pre-processing steps we conduct on the reviews corpus to training the Word2Vec models. First, we convert all text to lower case so that we treat words such as “Company” and “company” in the same manner. Second, since we are interested in the nature and meaning of discussions in text, that is, how are words related to each other in their meaning, we remove commonly oc- curring stop words, such as “is”, “the”, “of” that do not contribute to the meaning in the discussions.21 Finally, we exclude reviews that do not have any textual data. 1.9.4 Word2Vec Hierarchical Softmax For the hierarchical softmax, the vocabulary is defined as a binary Huffman tree, and a word can be reached as a random walk in the tree, with frequent words being closer to the root of the tree. Suppose we are interested in reaching a word w in the tree. Let L(w) be the length of the path to reach the word w from the root, n(w, j) be the jth node on this path, and ch(n) be any arbitrary child of node n. Thus, j = 1 is the first node on the path, and by definition,n(w, 1) is the root of the tree. Similarly, at the end of the path, j = L(w) is the word w itself, thus n(w, L(w)) = w. Thus the probability of observing a context word w given an input word wi is given by: 21Stop words source: CoreNLP, Stanford (2016). 53 L∏(w)−1 | ′p(w wi) = σ([[n(w, j + 1) = ch(n(w, j))]]vn(w, j)>vwi) (1.10) j=1 Where [[n(w, j + 1) = ch(n(w, j))]] is the delta function which takes value 1 if that particular child node of n(w, j) is the node is n(w, j + 1), and takes value 0 otherwise. σ(x) is the sigmoid function that is given by: 1 σ(x) = − (1.11)1 + exp( x) The hierarchical softmax alleviates the computational burden of the softmax by reducing the computational cost to log2V , and words have one representation, instead of two representations with the softmax specification. 1.9.5 List of Other Consumer Facing Crises Firms The table below lists the randomly selected firms for the crises of General Motors, Chipotle, and Mylan. There is difference in the number of firms, as they are lim- ited by the number of firms in the same industry: based on the same COMPUSTAT SIC industry code. 54 General Motors Chipotle Mexican Grill Mylan Inc Daimler AG Ark Restaurants Corp. Abbott Laboratories Federal Signal Corp. Bj’s Restaurants Inc. Astrazeneca PLC Ford Motor Co. Buffalo Wild Wings Inc. Avanir Pharmaceuticals Inc. General Motors Co. Chipotle Mexican Grill Inc. Baxter International Inc. Honda Motor Co. Ltd. Dennys Corp. Depomed Inc. Lci Industries Domino’s Pizza Inc. Dr Reddy’s Laboratories Ltd. Navistar International Corp. Frisch’s Restaurants Inc. Endo International PLC Nissan Motor Co Ltd. Jack In The Box Inc. Glaxosmithkline PLC Oshkosh Corp. Jamba Inc. Hospira Inc. Paccar Inc. Kona Grill Inc. Ionis Pharmaceuticals Inc. Spartan Motors Inc. Mcdonald’s Corp. Jazz Pharmaceuticals PLC Subaru Corp. O’Charley’s Inc. Lifevantage Corp. Tata Motors Ltd. Panera Bread Co. Mannatech Inc. Tesla Inc. Red Robin Gourmet Burgers Map Pharmaceuticals Inc. Tower International Inc. Ruby Tuesday Inc. Mylan NV Toyota Motor Corp. Ruths Hospitality Group Inc. Novo Nordisk A/S Volvo AB Sodexo Nu Skin Enterprises Sonic Corp. Par Pharmaceuticals Hldgs. Texas Roadhouse Inc. Regeneron Pharmaceuticals Wendy’s Co. Taro Pharmaceuticl Inds. Ltd. Teva Pharmaceuticals 55 CHAPTER 2 SMOKE AND MIRRORS: IMPACT OF E-CIGARETTE TAXES ON UNDERAGE SOCIAL MEDIA POSTING 2.1 Introduction E-cigarettes are battery operated devices that vaporize liquids to deliver nicotine infused aerosols, and their usage is referred to as vaping.1 Vaping has increased dramatically in the last few years,2 and the US Surgeon General report on e- cigarettes raises concerns that vaping among youth and young adults has reached alarming levels.3 This is worrisome to regulators as these products contain addic- tive nicotine, posing severe health consequences, and is reversing decades-long trend of declining underage smoking. Reliable data on underage consumption is not available since their consumption is illegal due to minimum age sales laws. Other data sources such as large-scale national surveys are expensive and time consuming to conduct, and researchers have traditionally relied on smaller-scale survey Barrington-Trimis et al. (2016); Soneji et al. (2017). We propose studying images of vaping posted on a leading social media site. This posting behavior is a rough proxy for influencing and normalizing behavior, and possibly for con- sumption behavior among underage population. This chapter is: Anand, Piyush, and Vrinda Kadiyali (2020). “Smoke and Mirrors: Impact of E-Cigarette Taxes on Underage Social Media Posting”. 1See for e.g. National Institute on Drug Abuse, 2020, Vaping Devices (Electronic Cigarettes) 2Barshad, A., April 7, 2018. The juul is too cool. New York Times 3U.S. Department of Health and Human Services, 2016, E-cigarette Use Among Youth and Young Adults - A Report of the Surgeon General 56 We are especially interested in examining the effect of tax policies on posting behavior. We investigate state-wide regulations of California, Kansas, Pennsyl- vania, and West Virginia that impose taxes on e-cigarette products, and estimate their impact on vaping posting behavior from publicly available user-posted im- ages on a leading social media website in the US. We detect posting user’s age, gender, race, as well as disguising (emoji overlay) in images, an important con- founding factor,4 using an ensemble of image analysis from methods in computer vision – Mask R-CNN He et al. (2017) and Aggregated Residual Neural Networks Xie et al. (2017). Managers and regulators are likely to be concerned with under- age vaping posts as well as their disguising on social media since this deters their efforts to denormalize youth vaping. Additionally, tax impact on posting by gen- der and race will be of importance to regulators and managers given the‘ concerns of unequal health outcomes for minority groups. Our dataset consists of 388,593 scraped social media images that span the du- ration of Jan 2016 – Dec 2018. From these images we extract social media postings posters’ age, gender, race, and disguising (emoji overlay). To detect these de- mographics, we use Aggregated Residual Neural Networks Xie et al. (2017) on UTKFace (Zhang and Qi, 2017) and FairFace (Kärkkäinen and Joo, 2019) datasets, and achieve a test-set accuracy of 93.2% for underage detection, 97.8% for gender detection, and 86.37% average precision for race detection. To detect disguising, we use Mask R-CNN He et al. (2017) on an RA-annotated dataset of 5,040 images, 4See for e.g. Daily Mail, Revealed: How teenagers use secret emoji code to deal Class A drugs on Snapchat and Instagram as gangs target ’digital savvy’ school pupils, July, 2017; Drug Addic- tion Now, Emojis Give Youth a New Way to Communicate About Substance Abuse, Oct 2018 57 and achieved test-set accuracy of 95.15%. We estimate the causal effect of tax policies on posting behavior using the generalized synthetic controls method Xu (2017).5 We find that the states with higher taxes - Pennsylvania and California were effective in deterring underage vaping posts on social media. California’s decline in underage posting is preceded by increased disguised posts, and Penn- sylvania’s decline in underage posting is accompanied by increased engagement. Kansas and West Virginia, with lower taxes, did not see lower underage vaping image posting behavior. Kansas had an increase in posts with race: Black. Our conjecture for this unusual result is a possible confound – in 2015, a year prior to vaping tax in 2016, Kansas increased its cigarette tax significantly, by 63%. The cigarette industry has historically advertised heavily to Blacks, resulting in greater rates of consumption. We would need consumption data for both e- and regular cigarettes to verify our conjecture. Regardless, from a regulatory point of view, increased posting by Blacks post-tax is likely worrisome. We advance image literature - we estimate disguising behavior from images and construct a labeled dataset for disguising. Our work builds on a growing literature in marketing that uses images to address questions of interest to mar- keters Burnap et al. (2019); Dew et al. (2019); Liu et al. (2020); Zhang et al. (2017). Managers might find useful our methods of detecting and flagging inappropriate usage of their products, since heightened regulatory oversight poses significant risk to business viability. Regulators can monitor whether the taxes are effective in denormalizing youth vaping on social media. 5We find similar results with difference-in-difference estimators - see Appendix 2.9.3. 58 The rest of this chapter is organized as follows. In Section 2.2, we discuss the e-cigarette industry and state tax policies. In Section 2.3, we discuss relevant literature. In Section 2.4, we discuss the data. In Section 2.5, we discuss the image analysis and causal inference methods. In Section 2.6 presents the results, and we conclude in Section 2.7. 2.2 Electronic Cigarettes and State-Wide Taxes in the US Electronic cigarettes were a $3.6 Billion industry in 2018 in the US,6 with JUUL, that was launched in 2015, accounting for $1 Billion in sales in 2018.7 Regulators are concerned with its rising popularity as vaping among teenagers and youth has reached alarming levels, and research has yet to reach a consensus on whether va- por products are as harmful as (or more harmful) cigarettes. From the perspective of the managers and investors of firms in this industry, this heightened regulatory concern poses significant risk to their businesses’ viability as regulators have re- stricted their product sales, and are debating the possibility of banning e-cigarettes entirely. Monitoring taxes imposed to restrict usage of e-cigarettes would be of criti- cal interest to both regulators and managers. Many states have passed legislation that treat vaping as equivalent to smoking and imposes additional restrictions. 6Statista, Electronic cigarettes (e-cigarettes) dollar sales in the United States from 2014 to 2018 (in billion U.S. dollars) 7Reuters, Altria says Juul sales skyrocket to $1 billion in 2018, Jan 31, 2019 59 Of the key states in US, states of California, Kansas, Pennsylvania, and West Vir- ginia passed taxes on e-cigarettes.8 California amended the definition of the term “smoking” to include vaporization-based devices and increased taxes by 27.3% of wholesale cost in April 2017.9 Kansas enforced a tax of $0.20/milliliter of e-liquid for e-cigarettes starting July 2016, with somewhat mixed design and implemen- tation.10 Pennsylvania enforced a tax of 40% of wholesale price starting August 2016.11 West Virginia added a $.075/milliliter of e-liquid for e-cigarettes starting July 2016.12 A typical pack of four Juul pods contains 2.8 milliliter of e-liquid13 and thus became subject to $0.56 tax in Kansas and $0.21 in West Virginia. While we do not have wholesale prices for Juul pods, assuming a conservative 75% margin for retailers leads to a wholesale price of $4 per pack that retails at $16. This cor- responds to conservative estimates of per pack tax of $1.09 increase in California and $1.60 tax in Pennsylvania. Thus, we note that Pennsylvania and California introduced much higher taxes than the states of Kansas and West Virginia. We investigate the impact of these tax policies on vaping posting behavior on social media. Specifically, we estimate the prevalence of underage vaping, 8Note that Minnesota enacted taxes on e-cigarettes in 2013, and in 2015 the states of Delaware, Louisiana, and North Carolina also enacted taxes. We are unable to analyze these policies given the limitation of time span since most of our data are on / after Q2, 2015. We exclude these four states in all our analysis. 9CNBC: Feds give big tobacco new headache as California taxes proving hazardous to cigarette sales, July 28, 2017 10Kansas Health Institute (2016), E-Cigarette Policy, Regulation and Marketing (February 2016). Kansas later revisited its taxes and reduced the tax to 5 cents / ml, offering credit to retailers who might have already paid these taxes and delayed effective date to July 1, 2017. See: E-Cigarette Tax Fix Moves Forward In Kansas, KCUR, April 2017. For our analysis purposes, we consider July 1, 2016 as the tax enactment date. 11Pennsylvania Department of Revenue (2016): Other Tobacco Products Tax 12West Virginia State Tax Department: E-cigarette Liquids Excise Tax FAQ 13Juul: Discover More About JUULpods & Flavors 60 identify the demographics of individuals in posts, and estimate the extent of dis- guising from social media images. We plan to extend our study to include the recent stricter regulations enacted by the states of New York, Massachusetts, and Michigan in September 2019, and also estimate the interaction of these regulations with the COVID health crisis of 2020. From a managerial perspective, firms can use near real-time images data to detect inappropriate usage of their vape prod- ucts. Regulators will also be interested in the impact of regulation on social media posting behavior as they monitor denormalization of vaping among youth. Deep learning methods from computer vision literature are particularly suited to extract these variables which traditional structured data lack, and large-scale surveys are often expensive to conduct. This motivates the need to study e-cigarettes using image analysis. 2.3 Literature Review Three areas of research are related to our work. The first stream of literature rel- evant to our work is the growing literature in marketing that uses images and image analysis methods. These include developing new product designs and pre- dicting customer aesthetic appeals Burnap et al. (2019), logo creation Dew et al. (2019), presence of human faces in images and engagement on Twitter Li and Xie (2020), brand features from social media images Liu et al. (2020), and image fea- tures and rental demand on Airbnb Zhang et al. (2017). We differ from the above 61 papers as we combine image analysis methods with causal inference. Further- more, our methodological contribution is detecting disguising from images. The second stream of relevant literature is on health policies and regulations.14 Some relevant papers include the impact of calorie posting regulations on discus- sions of health in online reviews Puranam et al. (2017), price elasticities for tobacco products that incorporate addiction Gordon and Sun (2015), physician payment disclosures and prescription behavior Guo et al. (2020), and recreational marijuana legalization and online cannabis related search Wang et al. (2019). We differ from the above papers as our substantial interest is in underage/ illegal consumption of vaping products, and we use data from images posted in social media. The third stream of literature is on e-cigarettes consumption and regulation,15 where research has studied optimal taxation policies Allcott and Rafkin (2020), effects of e-cigarettes taxation policies on cigarette consumption Chen and Rao (2020), minimum age sales laws and cigarette consumption Dave et al. (2019), Minnesota e-cigarette tax Pesko and Warman (2017); Saffer et al. (2018), and e- cigarette advertising Tuchman (2019). We differ from the above papers as follows: we estimate underage posting and disguising from social media images, which cannot be estimated using consumption data or textual data. Furthermore, con- 14There is also literature in machine learning that has studied health topics. For example, Twit- ter posts and discussions of Juul Allem et al. (2018), detecting obesity from satellite images Ma- harana and Nsoesie (2018), detecting binge drinking from social media ElTayeby et al. (2017), and predicting age from facial images Rothe et al. (2015). 15There is also literature in public health that has studied e-cigarettes and its health effects. For example, e-cigarettes as a cessation device for smoking Barbeau et al. (2013); Brown et al. (2014), pulmonary toxicity of e-cigarettes Chun et al. (2017), youth usage e-cigarette usage and subsequent smoking habits Barrington-Trimis et al. (2016); Miech et al. (2017); Soneji et al. (2017). 62 sumption data is non-existent given the illegality of underage consumption. 2.4 Data As discussed earlier, we scraped publicly available images about vaping from January 2015 – January 2019 from a leading social media website in the US in which users post with images. We adopted the following procedure to determine which posts are about vaping - we first scraped all posts with the hashtag “juul”,16 and from these posts, we identified the 10 most commonly occurring other hash- tags to identify other vaping related posts.17 This is because of scraping limitations – there were 37,841 unique hashtags that we found when we scraped hashtag juul. These 10 hashtags occurred in 10.37% of the posts, whereas the average hashtag occurred in 0.0026% of the posts. Furthermore, scraping these 10 hashtags was costly and time consuming despite parallelizing on three online virtual servers. Since we are interested in estimating effects of state-wide tax policies in the US, we restrict the scraping to US based posts based on the location tagged in the posts. The resulting sample has 785,431 US-based posts across these 11 hashtags. Table 2.1 lists the hashtags and number of posts scraped. Figure 2.1 shows the (log) number of posts by state and year. We find that in 2015, the number of posts were much higher in California relative to other states. By 2018, we see that number of posts across states has increased, suggesting wider 16We start with the hashtag “juul” since Juul is the largest company in this industry. 17We plan to add more hashtags in future research. 63 Table 2.1: Number of Posts by Hashtag This table lists the number of social media posts that were scraped for the duration of Jan 2015 – Jan 2019. Hashtag Number of Posts vape 540,009 vapenation 81,409 vapelife 64,935 vapeshop 33,297 vapelyfe 19,185 juul 18,282 vapecommunity 9,965 vapeporn 9,282 vapefam 7,767 juulvapor 693 juulpods 607 spread of posting practices across the US. We exclude 2015 data and Jan 2019 data due to data sparsity, which leaves us with a total of 750,819 posts.18 To remove posts that are potentially spam posts, or posted by vape shops, we next scrape all the user’s profiles and calculate the 18We note that the number of posts in 2015 and Jan 2019 is 4.5% of the total data, and sparse for many states. Furthermore, for 2019 we do not have data for entire Q1 2019 (due to scraping constraints). 64 Figure 2.1: Number of Vaping Related Posts by State and Year in the US This figure shows the log number of social media posts that were scraped by year and state 65 66 Table 2.2: Information Observed for Each Post This table lists the information that is scraped for each of the social media posts Variable Description Post ID Unique identifier for the post User ID Unique identifier for the user who made the post Timestamp Time when the post was posted by the user Latitude Latitude of the post’s location Longitude Longitude of the post’s location Num Likes Number of likes as of the scraping date Num Comments Number of comments as of the scraping date Caption Text associated with the post vaping related posts as a percentage of their total posts. We do not consider those users’ posts that have more than 25% of vaping related posts.19 Thus, we are left with 388,593 posts after excluding those whose users have posted less than 25% of their posts that are vaping related. Table 2.2 shows the information observed for each of these posts. We observe when the post was posted, the user’s total number of posts, latitude and longitude of location, numerical count of likes and comments for post, text caption, and an associated image. We estimate a post’s 67 state location based on its latitude and longitude. 2.5 Methodology First, we discuss the intuition behind convolutional neural networks, which forms the basis for the deep learning methods for image analysis. Second, we discuss the methods to estimate demographics and detect disguising in images - Aggregated Residual Neural Networks Xie et al. (2017) and Mask R-CNN He et al. (2017). Finally, we discuss generalized synthetic controls Xu (2017) for causal inference. 2.5.1 Convolutional Neural Networks We first describe the intuition behind convolutional neural networks (CNNs henceforth) and the building blocks of a CNN - convolutional layers and pool- ing layers. A convolutional layer is characterized by several filters, or kernels, that are ap- plied in a sliding manner (i.e. from one pixel to the next). The weights of these kernels are learned during the training process. Figure 2.2 shows an illustration of a convolutional layer on a greyscale image.20 The image has dimensions of 5 × 5, 19Figure 2.13 in the Appendix 2.9.1 shows number of users and posts at different cutoffs. 20This example is for illustrative purposes with a greyscale image (thus 1 channel). In practice, images have three channels corresponding to red, blue, and green colors, and are of much larger dimensions (a 1 Megapixel image is usually 1024 × 1024 dimension). The conceptual process re- 68 and each pixel is represented as a pixel value (unit normalized). The kernel is a 3×3 matrix with parameters of weights w1 : w9 and an overall offset, or bias b, which are learned during training. The weights of the kernel (w1 : w9) and image pixel values are multiplied element-wise (in a sum-product manner), and then added with a kernel bias b to obtain output feature map values. In practice, con- volutional layers have several kernels (ranging from 64 to 512), and the intuition is that these kernels learn different local correlations. For example, if an image of face is passed through a convolutional layer, these kernels can learn to detect local correlations such as face boundaries, nose, ears, and other facial features. A pooling layer is used to introduce down-sampling, i.e. it reduces the di- mensionality of the input feature map by retaining a single statistic of the local window from the input feature map, such as maximum value. Figure 2.2 shows an illustration of max-pooling layer. There are several variants of pooling layers, such as max-pooling (maximum value in the window), average pooling (average value in the window), among others. The intuition for pooling layers is that these statistics (maximum/average) are approximately sufficient for CNNs to perform tasks such as image classification based. CNNs are especially useful to analyze images, as kernels and pooling layers capture spatial correlations in images, that is nearby pixels in an image are con- nected (i.e. local connectivity, for example if a pixel in image is one of the many pixels of a “dog”, then its adjacent pixels are also likely to be those of a “dog”). mains the same regardless of image size and color. 69 Figure 2.2: Building Blocks for CNNs Convolutional Layer A 3 × 3 kernel with weights w1 : w9 and bias b is applied in a sliding manner on an image with dimensions 5 × 5. The resulting feature map (left) is obtained by element by element multiplication of image pixel value and kernel weight with a bias added. Notice that the top left element in the feature map on the right corresponds to the kernel operation with the image (solid orange box on the image on the left). Other elements in the feature map are obtained by sliding the kernel over the image. Max Pooling Layer A 2 × 2 max pooling operation takes the maximum value in the window as the output for the feature map, and is applied in a sliding manner over the input feature map. Notice that the top left element in the output feature map (right) corresponds to the maximum value in the 2 × 2 window (solid blue box on the input feature map on the left). Other elements in the output feature map are obtained by sliding the window over the input feature map. 70 Furthermore, CNNs can also handle translations of objects in images (i.e. transla- tion invariance, for example if the “dog” is moved to a different location or scale, i.e. different in size in the image, then the feature values for the “dog” also adjust accordingly). Typically, a CNN has a sequence of several convolutional layers, pooling lay- ers, and non-linear activations such as ReLU to construct feature maps. The scope (or coverage) of the feature maps changes in a CNN with layers – feature maps obtained from the shallower layers, i.e. first few layers capture more local corre- lations than feature maps obtained from the deeper layers, i.e. layers towards the end of a CNN capture more global correlations. This is because, for instance, a 3×3 filter in the first layer has a coverage of 3 × 3 pixels in the original image, whereas a 3×3 filter in the second layer has a coverage of 5×5 pixels in the original image. Finally, these feature maps can then be used in classifiers such as fully-connected layers or support vector machines for tasks such as object detection and image classification. 2.5.2 Detecting Demographics in Images We now describe the approach to estimate age, gender, and race from faces in images. We train three models to detect these demographics as classification tasks: 71 1. Age: Underage, i.e. less than 21 years old, or not 2. Gender: Female or male 3. Race: Asian, Black, White, and Others We use a class of convolutional neural networks: Aggregated Residual Neural Nets (Xie et al., 2017, ResNeXts henceforth) to classify demographics in images. Residual Neural Nets (He et al., 2016, ResNets henceforth), tackle the degradation problem associated with deep learning models – as the number of layers in a neu- ral network increases, optimizing the non-linear layers becomes harder. One main reason for this issue is vanishing gradients - with more layers the gradients of the loss function can go towards zero for the initial layers, and the initial layers are unable to update their parameters. ResNets overcome this limitation by convert- ing the learning objective of the neural network layers from learning functional mappings to optimize residuals by adding identity connections, i.e. input feature maps are added to the output after every few convolutional layers. ResNeXts builds on this further with an additional parameter of cardinality, which we dis- cuss next. Model Architecture We first briefly discuss a ResNet-18 model, i.e. a ResNet with 18 layers, and subsequently discuss how ResNeXts build on it. The network takes in as input an image and applies 64 convolutional filters of size 7 × 7 to obtain 64 feature maps. 72 This is followed by 16 convolutional layers with feature maps varying from 64 to 512.21 Figure 2.3 shows the ResNet-18 architecture. ResNet-18 has residual blocks which comprise of 2 convolutional layers, given by F(x) = C2[ReLU(C1(x))], where x is the input to the residual block, C1 and C2 are the two convolutional layers in the residual block. ReLU is the rectified linear unit activation function, given by: ReLU(x) = max(0, x) (2.1) The rectified linear unit activation function enhances the capability of the neu- ral network to learn nonlinearities in the data. We use this activation function after each layer in the network, excluding the last fully connected layer. A resid- ual block is followed by an identity mapping as shown in Figure 2.4 (i.e., adding the input at the end of the residual block), and is given by: y = F(x) + x (2.2) where y is the input to the subsequent residual block, and F(x) is the residual mapping that is learned during training. F(x) approximates H(x) − x,where H(x) is the underlying functional mapping. The reason for choosing residual mapping F(x) is that it is easier to optimize non-linear convolutional layers to push the 21ResNet-18 applies max-pool layer after the first convolutional layer, and average pool layer after the last convolutional layer. The model architecture, number of filters, and number of layers is from He et al. (2016) 73 Figure 2.3: ResNet-18 Architecture This figure shows the architecture for the ResNet-18 model. The orange box represents the input image of dimensions 224×224 pixels that is fed to the network. The convolutional layers have the syntax as “a×b conv,c where a×b is the size of the convolutional filter in pixels, and c represents the number of features maps, i.e. filters used for that layer. A “×2” next to a convolutional block represents that there are two of such convolutional layers. Solid black lines represent the residual connections that maintain the dimension- ality of the previous layer, whereas dotted black lines represent the residual connections that double the dimensionality from the previous layer. The final layer is the fully con- nected layer whose output is dependent on the task. For instance, for age classification the output will be 2, whereas for the race classification the output will be 4. 74 Figure 2.4: Residual Block of ResNet This figure shows a residual block for the ResNet-18 model. It comprises of two convolu- tional layers, with ReLU activation after the first layer. The input to the residual block, x, is added back after the two convolutional layers as an identity mapping. residuals (H(x) − x) to zero rather than expect the non-linear layers to learn the exact functional mappings. He et al. (2016) also empirically demonstrate the ad- vantage of the residual mapping approach across several image tasks (ImageNet ILSVRC 2015 and COCO 2015 competitions). We also apply batch normalization before every layer to prevent scale effects, such that the scale of the data does not affect model training. Therefore, ResNet- 18 consists of 17 convolutional layers and a fully connected layer at the end. This fully connected layer gives a c-dimensional output, where c is the number of classes for our data. That is, for age and gender classification task we have c = 2, and for race classification task we have c = 4. Model predictions are obtained by applying a softmax operation to the output of the fully connected layer. Specifi- 75 cally, the softmax operation applied on model output xi for a class i with a total of c classes is given by: ∑ exiso f tmax(xi) = c x (2.3)j j=1 e We now describe the ResNeXt model Xie et al. (2017) which builds further on ResNets. Xie et al. (2017) propose an architectural modification to ResNets that allows for the split-transform-merge procedure. They empirically demonstrate that adding parallel paths in the residual blocks increases the model’s capacity / flexibility by reducing the risk of over-adapting model architecture to a specific dataset, and allows for better training as the split paths are lower dimensional fea- tures that require less computational cost to optimize. More formally, with arbi- trary transformation functions Ti and cardinality C, the aggregate transformation is given by: ∑c F(x) = Ti(x) (2.4) i=1 With ResNeXt models, it is straightforward to switch to ResNet models if we set the cardinality to 1. Figure 2.5 shows a block of ResNet and its ResNeXt equiv- alent. There are 32 parallel paths, i.e. Ti, which are summed up at the end to maintain dimensionality. The identity shortcut mapping is subsequently added to fit the residual neural networks framework. We use the ResNeXt model with 101 layers and cardinality of 32 and bottle- 76 Figure 2.5: ResNeXt and ResNet Blocks This figure shows the block for residual neural networks: ResNet Block on the left, and the block for aggregate residual network: ResNeXt Block on the right. The ResNeXt block has a cardinality of 32, i.e. there are 32 parallel paths of smaller dimensionality which are aggregated and subsequently summed with the identity mapping. neck size of 8. The final two layers in the demographic detection models are fully connected layers with sizes of 2048 and 512 respectively, and the output of the model is obtained using a softmax. Training We first describe the training data used to train the demographic models. We use the UTKFace and FairFace datasets. The UTKFace dataset consists of 23,708 images which are labeled with age in years, gender of male or female. Figure 2.6 77 Figure 2.6: Distribution of Age in Training Data This figure shows the distribution of age in the training data of UTKFace. The dotted gray line represents the threshold for underage, i.e. points to the left of the dotted gray line represent underage faces with an age of < 21 years old. shows the distribution of training data with different age labels. The dataset consists of 20.54% of the data represents underage faces, i.e. with age of less than 20. There are 47.7% female faces in the UTKFace dataset. We chose the FairFace dataset for race labels since it has balanced distribution in the dataset 78 across different the different races – 14.1% for race: Black, 19.1% for race: White, 26.6% for race: Asian, and 41.2% across other races – Latino, Indian, and Middle Eastern. For the age and gender classification tasks, we use the UTKFace dataset which comprises of 21,263 training images and 2,445 test images. For the race classification task, we sub-sample 14,863 training images and 1,407 test images from the FairFace dataset.22 We next describe the loss function. We use the cross-entropy loss function to train the model. The loss for a batch of size n with a total of c object classes is given by: 1 ∑n ∑c loss = yi jlog(ŷn i j ) (2.5) i=1 j=1 where yi j is equal to 1 if image i belongs to class j, and 0 otherwise. ŷi j is the model predicted output for image i and class j, with values in the range of [0, 1]. This loss function is then back-propagated through the layers using stochastic gradient descent to learn the model parameters. More formally, the following update procedure is used: 4loss θr ← θr + η 4 (2.6)θr where θr are the parameters of the network and η is the learning rate. We 22Note that the FairFace dataset comprises of 100,000 images, however we use a random sub- set for computational feasibility reasons – the computation time and GPU requirements increase substantially with larger training data. 79 use ADAM optimizer Kingma and Ba (2014) which is stochastic gradient descent with momentum. We train the three models for 100 epochs, i.e. iterations over the data, with a mini-batch size of 50 images. We start with randomly initialized model parameters and use an initial learning rate of 0.01. The learning rate sched- ule is such that it decreased by a factor of 10 after every 8 consecutive epochs without improvement in accuracy on the test dataset. On the test dataset, the age model has a precision of 93.2% in detecting underage faces, the gender model has a precision of 97.8% in detecting female faces, and the race model has an average precision of 86.37% for detecting the four race classes. 2.5.3 Detecting Disguising in Images We now turn to object detection and segmentation methods from computer vision literature to detect disguising behavior in images. We use Mask R-CNN He et al. (2017) to detect disguising behavior in the form of emoji overlay in images.23 There are several media discussions of emoji usage on social media that motivate our task for detecting emoji overlay in images.24 Note that the disguising measure obtained by detecting emoji in images provides a lower bound, as there can be other forms of disguising such as hiding Juul behind persons or other objects.25 23As a robustness, we conducted an RA-based validation to check if the 85 emojis are used in social media images for disguising. The results of this paper are substantially similar when we restrict to only the 20 emojis (of the 85 emojis) most commonly used for disguising. 24See for e.g. Daily Mail, Revealed: How teenagers use secret emoji code to deal Class A drugs on Snapchat and Instagram as gangs target ’digital savvy’ school pupils, July, 2017; Drug Addic- tion Now, Emojis Give Youth a New Way to Communicate About Substance Abuse, Oct 2018 25We can also identify other commonly co-occurring objects with emoji and use these objects as 80 Figure 2.7: Example of Object Detection for Juul and Emoji The figure on the left is an image with a Juul device, and the figure on the right uses an emoji to disguise Juul Figure 2.7 shows an example of disguising using emoji overlay in images. We first briefly describe object detection methods. There are two steps in these methods – region proposal generation and object classification. In the first step, the region proposal block gives candidate regions, i.e., a part of the image that po- tentially contains an “object”. In the second step, the object detection block takes the object candidates and classifies these objects among different object classes. Previously, several region proposal models have been used in the literature, with measures of disguising. Note however that such analysis will only be correlation based and not a definitive measure of disguising. We report the commonly objects commonly co-occurring with emojis in the Appendix 2.9.6 81 selective search Uijlings et al. (2013) being a commonly used method. Selective search uses low level image features of color, texture, size, and fill to combine pix- els in an image to generate candidate object regions using pixel similarity. We refer the readers to Uijlings et al. (2013) for further description of the selective search method. Having a separate region proposal block is computationally expensive. Fur- thermore, it fails to take advantage of correlations in the region proposal gener- ation task and the object detection task. That is, if a region in image is an object candidate, then its features will be useful for the object detection task. Recent CNN based methods capitalize on this and share convolutional layers. This has been employed in the recent such as Faster R-CNN Ren et al. (2015) and Mask R-CNN, that use shared convolutional layers as inputs to both region proposal block and object detection block. This eliminates the need for a separate region proposal method, thereby reducing computation time for training object detec- tion methods. Furthermore, sharing convolutional features for region proposal and object detection leads to higher accuracy. We use Mask R-CNN in this paper to detect disguising in images. Model Architecture and Training We first describe the Faster R-CNN Ren et al. (2015) approach and then dis- cuss how Mask R-CNN builds on it. Faster R-CNN builds on Fast R-CNN which uses VGG-16 neural network Simonyan and Zisserman (2014) as the backbone to 82 compute convolutional features. VGG-16 is a 16-layer neural network, with 13 convolutional layers and 3 fully connected layers. Faster R-CNN has two blocks: the region proposal generation block and the object detection block, which share the convolution layers. Figure 2.8 shows the Faster R-CNN architecture. The shared convolutional layers are used by both the region proposal generation block (middle block in Figure 2.8: region proposal block) and the object detection block (final block in Figure 2.8: object detection block). To generate region proposals, we slide a 3x3 filter convolutional layer, pre- dicting k = 9 “anchors”, i.e. centers of region proposals at each location.26 This is followed by two fully connected layers which predict the bounding boxes co- ordinates and an “objectness” score, i.e., whether the anchor box contains an object or not. The detection criteria for a region proposal block is intersection over union (IOU), i.e. the amount of overlap for the ground truth boxes and predicted boxes. Region proposals with IOU >0.5 are taken as positive training samples, and those with IOU < 0.5 are taken as negative training samples.27 256 regions proposals are generated for each image, 128 positive and 128 negative proposals. These region proposals, i.e. bounding box co-ordinates and “objectness” scores are then passed to the object detection block. ROI Pooling layer is used to quantize the features in the regions for subsequent classification. ROI Pooling layer divides the regions 26k = 9anchor boxes are calculated at each slide of the convolutional filter since we predict at 3 different aspect ratios (1 : 1, 1 : 2, 2 : 1) and 3 different scales (128 × 128, 256 × 256, 512 × 512) 27We follow the standard practice in computer vision literature of intersection over union > 0.5 as the threshold for a positive detection during evaluation. 83 Figure 2.8: Faster R-CNN Architecture This figure shows the architecture for Faster R-CNN. The shared convolution layers pro- vide features which are used by both the region proposal model and the object detection model. The region proposal model applies a convolutional layer that outputs interme- diate features. These features are then fed to two fully connected layers which give for k regions 2k class scores (object or no-object) and 4k co-ordinates for region proposals. These region proposals are then fed to the object detection model which applies a ROI Pooling layer on the region proposals on the features of the shared convolutional layers. This is followed by two fully connected layers to generate intermediate features: region of interest feature vector. Two fully connected layers then output the predicted class and bounding box co-ordinates for the region of interest. 84 in a 7x7 grid and takes the maximum value from the features to output a 7x7 fea- ture vector. This feature vector is then applied to two separate fully connected layers to get bounding box co-ordinates and softmax to get the object class with the maximum probability. Mask R-CNN builds on Faster R-CNN, as it also outputs masks for the regions of interest. A mask (pixel mapping of objects in the image) is useful for object segmentation, i.e. extracting regions (pixel level) of the image where the object is present. He et al. (2017) argue that ROI Pooling layer of Faster R-CNN is ap- propriate for object classification task, however the quantization of features done in the ROI Pooling layer leads to sub-optimal segmentation. They suggest ROI Align instead. ROI Align avoids any quantization of features as it performs bi- linear interpolation of features. We refer the readers to the He et al. (2017) paper for a detailed description of the bi-linear interpolation operations performed in the ROI Align layer. Mask R-CNN keeps the architecture of Faster R-CNN the same up until the object detection block, i.e. the shared convolutional layers and region proposal block is exactly the same. Figure 2.9 shows the Mask R-CNN architecture. For purposes of brevity, the architecture shown is after the region proposals are gen- erated.28 Different from Faster R-CNN, Mask R-CNN uses 5 scales and 3 aspect ratios, for a total of 15 anchor boxes per location. After the ROI Align operation, a convolutional layer is used to generate the intermediate region of interest feature 28This is done because Mask R-CNN and Faster R-CNN have the same architecture until region proposal generation block. 85 maps. These are then fed to two separate fully connected layers to predict object class and bounding box co-ordinates. There are also two additional convolutional layers that predict the object mask of dimension 14x14. The non-linear activation for the hidden layers is ReLU. The shared convolutional layer block comprises of the convolutional layers of the ResNeXt-101 block as described Section 2.5. Training We first describe the training dataset for detecting emojis in images. The emojis for detection are given in Figure 2.14 in the Appendix 2.9.2. There are a total of 85 emojis in our library. We manually sub-sampled 1000 images from our data with a Juul present based on visual inspection. We then superimposed the emojis over the Juul device in the images. From those 1000 images, we randomly sample 50 as training images for a total of 4,250 training images. For these images, we constructed the masks, i.e. annotated regions in the image where the emoji is present using the open source VIA image annotation tool Dutta and Zisserman (2019). We appointed an RA to do these tasks. Figure 2.10 shows an image from our training data with annotations. The Mask R-CNN output is similar to the annotation format: which emoji is detected, i.e. object class, the bounding box for the emoji, and the pixels in the image where the emoji is present, i.e. mask. We next describe the loss function for Mask R-CNN. The loss consists of three components - classification loss, bounding box regression loss, and mask loss. The classification loss: Lcls is cross-entropy loss (as defined in Section 2.5 ) for whether 86 Figure 2.9: Mask R-CNN Architecture This figure shows the architecture (post region proposal generation) for Mask R- CNN. The shared convolution layers provide features which are used by both the region proposal model and the object detection model. The proposal from the region proposal model (same as Faster R-CNN) are fed to the ROI Align layer to subsequently through a convolutional layer to obtain intermediate features – region of interest feature vector. Two fully connected layers then output the pre- dicted class and bounding box co-ordinates for the region of interest. Masks are generated using two subsequent convolutional layers. 87 Figure 2.10: Example of Annotated Data for Disguising The figure shows a heart emoji used to disguise Juul usage. The annotated data consists of a bounding box (pink box), object class (emoji “0”), and mask (pixels shaded to annotate heart emoji). the predicted object class is the true object class. The bounding box regression loss: Lreg quantifies the loss in predicted bounding box and the ground truth bounding box. The mask loss: Lmask the per-pixel loss for whether the pixel mask matched the ground truth masks – average cross entropy loss (as defined in Eq. 5). More formally, the overall loss function for Mask R-CNN is given as: 1 ∑ ∗ 1 ∑ ∗ 1 ∑∑L = L ∗ N cls (pi, pi ) + piLN reg (ti, ti ) + 1N jk Lmask(δ jk, δ jk) (2.7) cls i reg i mask i j where, for a box i, pi is the predicted probability of detecting an object, p∗i is the ground truth whether the box contains an object or not. ti are the predicted coordinates, and t∗i are the ground truth coordinates. 29 A coordinate consists of 29We refer the readers to the Mask R-CNN paper for details on the transformations for the coordinates. 88 four variables: x, y,w, and h, where x and y are the two-dimensional co-ordinates for the center of the bounding box, and w and h are the width and height respec- tively for the bounding box, and these are subsequently scaled around center of the bounding box and log-transformed. The bounding box regression loss Lreg is: ∑ ′ Lreg(ti, t∗i ) = smooth v L1(ti − tvi ) (2.8) v∈x,y,w,h smoothL1(x) = 0.5 × x2, |x| < 1 smoothL1(x) = |x| − 0.5, |x| ≥ 1 The mask loss is a pixel by pixel average cross entropy loss. It takes as input the feature value for pixel j in the 14 × 14 predicted mask with δ jk being predicted mask value (whether pixel j is predicted mask for object class k), and δ∗jk is the ground truth mask value. We train the model using back-propagation with stochastic gradient descent (as defined in Section 2.5) for 20,000 iterations with a mini-batch size of 1,000 re- gions (2 images, 500 region proposals per image). We follow the same learning rate schedule as He et al. (2017), starting with a learning rate of 0.01, weight de- cay of 0.0001, and momentum of 0.9. To test the accuracy of the method, we held out 100 randomly selected images with emoji and 100 randomly selected images without emoji in them. We tested whether the method is able to predict which images have emoji in them. Our model achieved precision of 92.46% and recall of 98% for an F1 score of 95.15%. Note that we do not have a direct benchmark 89 for comparing performance since there is no emoji detection dataset or challenge to the best of our knowledge. The closest benchmark task we found is the object detection task for 200 object categories in the ILSVRC 2017 challenge, where the best performing model has a mean average precision of 73.14%. 30 2.5.4 Estimating the Effect of Tax Policies We use the generalized synthetic control method Xu (2017) to estimate the effect of state taxes on vaping posting behavior. This method unifies the synthetic control method with the interactive fixed effects (IFE) model. Counterfactuals for treated units are imputed in three steps. In the first step, the IFE model is estimated on the control data. In the second step, the factor loadings for each treated unit are estimated by linearly projecting treated units’ pretreatment data on the factor space. Finally, these estimated factor loadings and factors are used to impute treated counterfactuals. The generalized synthetic control method has been used in marketing literature Guo et al. (2020); Pattabhiramaiah et al. (2019), and builds further on the synthetic control method Abadie et al. (2010), which has been used extensively in marketing literature Li (2019); Tirunillai and Tellis (2017). Suppose there are N states, with T states that passed taxes - the treatment states, and C states that did not pass any taxes. Let Yit be the dependent outcome variable for state i at time period t. We assume a parametric form for Yit given by 30See ILSVRC 2017: ImageNet Large Scale Visual Recognition Challenge 2017 results, accessible from: http://image-net.org/challenges/LSVRC/2017/results#det 90 a linear factor model as: Yit = δitDit + βXit + λiFt + it (2.9) where Dit is a dummy variable that takes value of 1 when the state i is treated at time period t, Xit are the set of state-time varying covariates of per-capita con- sumption of cigarettes in the preceding year, (log) population in the previous quarter, and (log) per-capita income in the previous quarter. λiFt is the linearly additive factor component of the model.31 The IFE model parameters (β, λi, Ft) are estimated by minimizing the sum of squares of the fitted values subject to the constraints for factor loadings and factors and to be unit normalized and or- thogonal.32 We then estimate the factor loadings for treated units in the second step by minimizing the mean squared error for the predicted outcomes in pre- treatment period. In the final step, we impute the treated counterfactuals in the post-treatment period using the factor loadings estimated in step 2 and the model parameters estimated in step 1. In our analysis, the states that did not pass taxes are considered as control units.33 In this manner, treated counterfactuals are imputed for the post-treatment pe- 31These factors allow for capturing varying unobserved heterogeneities, such as unit-specific linear trends or quadratic trends, auto-regressive components, and more generally heterogeneities of unobserved random variables that can be decomposed multiplicative form of a time factor and state loading. This is especially useful in our context to control for unobserved changes in states over time. 32We chose from zero - three number of factors, using cross-validation with leave-one-out pro- cedure to minimize the mean squared percentage error. 33We also exclude the states of Minnesota, Louisiana, North Carolina, and Delaware which had previously enacted taxes. 91 riod, and the average treatment effect across all states at time period t is given as: 1 ∑ ÂTTt = [Yit(1) − Ŷit(0)] (2.10)T i∈T where Yit(1) is the observed treated outcome for state i at time period t, and Ŷit(0) is the imputed treated counterfactual. Confidence intervals are computed using the percentile confidence intervals with parametric bootstrap for 1000 boot- straps. 2.6 Results We first report the summary statistics of the state-quarter level data. Table 2.3 reports the summary statistics for the observed covariates and image extracted variables respectively. We use the ResNeXt model as described in Section 2.5 to estimate underage, gender (female), and race in images, and use Mask R-CNN model as described in Section 2.5 to detect disguising. We construct the depen- dent variables for each week as moving average in the past month, similar to the approach followed by Hollenbeck (2018). At the state-week level, there are on av- erage 3.75 underage faces and a total of 111.62 faces (both adult and underage). The extent of underage disguising is 0.28 on average. There are on average 66.03 female faces. The average for race in the data are 18.84, 21.57, and 64.94 for Asian, 92 Table 2.3: Summary Statistics for State-Week Level Variables This table shows the summary statistics for state level variables of cigarette consumption per capita, (log) population, (log) per-capita income. Variables extracted from images: underage, underage disguising, female, race:Asian, race:Black, and race:White are win- sorized within 5% and 95% level. Variable Mean Std. Dev. Min Max Underage 3.74866 5.04659 0 19 Underage Disguising 0.27832 0.59836 0 2 Female 66.03131 82.64968 0 317 Race: Asian 18.84004 23.63845 0 90 Race: Black 21.57946 27.87771 0 107 Race: White 64.93459 80.45354 0 311 Cigarette Consumption 3.71763 0.43070 2.58776 4.57780 (Log) Population 15.29676 0.97821 13.26772 17.49185 (Log) Per-Capita Income 10.79572 0.16077 10.47622 11.22916 Black, and White vaping posts. The rest of the results section is organized in three parts. First, we report the average treatment effects of taxes for underage vaping from social media images. Second, we report the results for underage disguising in social media images af- ter regulation. Third, we test whether there was a differential effect of taxes on demographics of race and gender. 93 Figure 2.11: Effect of Taxes on Underage Posts This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for number of underage vaping posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Underage: All Treated States 4 0 −4 −8 0 50 100 Time relative to Treatment 2.6.1 Impact of Taxes on Underage Vaping Posts We estimate the effect of taxes using the generalized synthetic controls. Figure 2.11 shows the results for the average treatment effects for each of the four states of California, Kansas, Pennsylvania, and West Virginia. 94 Coefficient Underage: CA 10 0 −10 −20 −50 0 50 Time relative to Treatment Underage: KS 15 10 5 0 −5 −10 0 9550 100 Time relative to Treatment Coefficient Coefficient Underage: PA 10 5 0 −5 −10 0 50 100 Time relative to Treatment Underage: WV 10 5 0 −5 0 9650 100 Time relative to Treatment Coefficient Coefficient We find a decrease in underage vaping posts across the treated states in the weeks 30-90 post tax introduction. We next turn to each individual state and esti- mate the treatment effects. We find that California and Pennsylvania had a decline in underage vaping posts, however California had a delay of about a year after taxes – the effect is significant after week 70. Note that California saw a decline of 5-12 underage vaping posts per week, as compared to an average of 17.17 under- age posts per week in the pretreatment period. This corresponds to a reduction of 29.12% post taxes introduction. Pennsylvania saw a decline of 1.25 – 7 underage vaping posts per week during the weeks of 20 – 55, as compared to an average of 5.39 underage vaping posts per week, corresponding to a decline of 23.19% post tax introduction.34 The timing of these effects is similar to other papers in mar- keting and addiction products literature Tuchman (2019). We do not observe such decline in underage faces for the states of Kansas and West Virginia after passing taxes. As the next step, we measure impact of taxes for engagement measures for posts that contain underage faces. These are the following three measures: aver- age number of likes for these posts; average number of comments on these posts; and proportion of posts with solo faces, i.e. they contain only one face. Figures 2.15, 2.16, and 2.19 in the Appendix 2.9.4 shows the average treatment effects es- timated. We find that Pennsylvania and West Virginia had an increase in number of likes for posts with underage faces, and that this effect is significant after week 34Supplementary analysis for effectiveness of taxes with difference-in-differences approach is in the Appendix 2.9.3. We find consistent evidence that California and Pennsylvania had a decline in underage vaping posts. 97 60 post tax introduction. We do not find significant effects for number of com- ments on underage posts. This evidence suggests that while Pennsylvania saw a decline in underage vaping posts, however their engagement on social media in- creased post tax introduction. The West Virginia finding of increased engagement is puzzling. Thus, we find that state taxes of California and Pennsylvania were effective in deterring underage youth social media posting related to vaping, whereas Kansas and West Virginia did not see such a decline. The Pennsylvania tax effectiveness is weakened by the finding of increased engagement of the remaining posts. Reg- ulators will be interested in these findings as higher engagement and influence of vaping posts on social media can counter regulatory efforts to denormalize youth vaping. 2.6.2 Impact of Taxes on Underage Disguising We estimate the effect of taxes on underage disguised posts in a manner similar to that of Section 2.6.1. Figure 2.12 shows the results. 98 Figure 2.12: Effect of Taxes on Underage Disguising This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for number of underage vaping posts with disguising. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Underage Disguising: All Treated States 2 1 0 −1 0 50 100 Time relative to Treatment 99 Coefficient Underage Disguising: CA 2 0 −2 −50 0 50 Time relative to Treatment Underage Disguising: KS 3 2 1 0 −1 −2 0 10500 100 Time relative to Treatment Coefficient Coefficient Underage Disguising: PA 2.5 0.0 −2.5 0 50 100 Time relative to Treatment Underage Disguising: WV 2 1 0 −1 −2 0 10510 100 Time relative to Treatment Coefficient Coefficient We find that California had an increase in disguising post tax passing, however this effect goes away after week 25. The effect is as strong as a 100% increase in un- derage disguising compared to the pretreatment levels (1.75 underage disguised posts per week). We do not observe similar effects for the other states.35 As the next step, we measure impact of tax for engagement measures for posts with disguising. These are the following three measures: proportion of posts with underage faces, i.e. posts with disguising that contain an underage face; average number of likes for these posts; and average number of comments on these posts, however we do not find significant effects. Figures 2.17 and 2.18 in the Appendix 2.9.4 show the average treatment effects. Combined with the findings of underage vaping (Section 2.6.1), these findings are important for regulators as our results suggest that while underage users tried to disguise or deceive about e-cigarette usage in their social media posts in Cali- fornia, the effect subsequently subsided and California taxes did lead to a decline in underage vaping posts. Importantly, managers will also be interested in moni- toring disguising as they aim to curtail unintended usage of their vaping products to ensure regulatory compliance. 35We then conduct additional analysis with difference-in-difference approach, and report the results in the Appendix 2.9.3. We find convergent evidence of increased disguising in social media posts for California. 102 2.6.3 Impact of Taxes by Race and Gender We estimate the effect of taxes on race and gender of posting in the same man- ner as Section 2.6.1. Appendix 2.9.5 shows the results for the average treatment effects for each of the four states of California, Kansas, Pennsylvania, and West Virginia. Our preliminary analysis suggests that Pennsylvania had a decline in female vaping posts for the weeks 20-55 post tax passing. We do not find such effects for other states. The research on differential gendered effects of vaping is currently nascent and unclear. Nonetheless, managers and regulators might find it useful to know trends of posting by gender, as research on gendered effects of vaping becomes clearer. For race, we find that Kansas saw a decline for race: Asian and White, whereas it had an increase for race: Black. This is an unusual result given the expectation of decrease in consumption with tax increase. A possible confound is that in 2015, a year prior to vaping tax in 2016, Kansas increased its cigarette tax significantly, by 63%. The cigarette industry has historically advertised heavily to Blacks, result- ing in greater rates of consumption. We would need consumption data for both e- and regular cigarettes to verify our conjecture. Regardless, from a regulatory point of view, increased posting by Blacks post-tax is likely worrisome. Health outcomes already disfavor minority groups and these go against regulatory ef- forts to reduce health disparities.36 For managers, these results are concerning as they highlight the future possibility of bad publicity or litigation resulting from 36For health outcome inequalities by race, see: CDC Health Disparities and Inequalities Report - United States, 2013 103 differential impact on minority groups. 2.7 Discussion Vaping has become a concern for health regulators as nicotine addiction from vaping causes pulmonary harm to youth and also increases the probability of sub- sequent smoking. Lack of data on underage consumption makes it harder to mea- sure the efficacy of taxation and other regulation. We utilize social media images as a rough proxy of normalization of vaping and potentially of consumption. We find that the states with higher taxes: California and Pennsylvania had a decline in number of underage vaping posts. California’s decline in underage vaping posts was preceded by an increase in disguising activity, and Pennsylvania’s decline in underage vaping posts was accompanied by increased engagement for the under- age posts. There also appear to be differential effects by race in Kansas. Our work has both regulatory and managerial implications. These findings are worrisome for regulators aiming to denormalize vaping among underage users, and among minorities who already have less favorable health outcomes. From a managerial perspective, our methods can be used by firm to detect inappropriate usage of their products and gauge regulatory risk as firms must carefully navi- gate the regulatory landscape. Continued regulatory violations could potentially endanger their viability as a business. We plan to extend our work on the following three fronts. First, we plan to ex- 104 tend our data to estimate how long the tax effects persist. Second, we plan to study the recent regulations passed in September 2019 by the state of Massachusetts that imposes a four month ban on all vaping products, and that by the states of Michi- gan and New York that bans sales of flavored e-cigarettes. Finally, we plan to estimate the interaction effects of these regulations with the 2020 COVID health crisis. 105 2.8 References Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of california’s tobacco control program. Journal of the American statistical Association, 105(490):493–505. Allcott, H. and Rafkin, C. (2020). Optimal regulation of e-cigarettes: Theory and evidence. Technical report, National Bureau of Economic Research. Allem, J.-P., Dharmapuri, L., Unger, J. B., and Cruz, T. B. (2018). Characterizing juul-related posts on twitter. Drug and alcohol dependence, 190:1–5. Barbeau, A. M., Burda, J., and Siegel, M. (2013). Perceived efficacy of e-cigarettes versus nicotine replacement therapy among successful e-cigarette users: a qual- itative approach. Addiction Science & Clinical Practice, 8(1):5. Barrington-Trimis, J. L., Urman, R., Berhane, K., Unger, J. B., Cruz, T. B., Pentz, M. A., Samet, J. M., Leventhal, A. M., and McConnell, R. (2016). E-cigarettes and future cigarette use. Pediatrics, 138(1):e20160379. Brown, J., Beard, E., Kotz, D., Michie, S., and West, R. (2014). Real-world effec- tiveness of e-cigarettes when used to aid smoking cessation: a cross-sectional population study. Addiction, 109(9):1531–1540. Burnap, A., Hauser, J. R., and Timoshenko, A. (2019). Design and evaluation 106 of product aesthetics: a human-machine hybrid approach. Available at SSRN 3421771. Callaway, B. and Sant’Anna, P. H. (2018). Difference-in-differences with multiple time periods and an application on the minimum wage and employment. arXiv preprint arXiv:1803.09015. Chen, J. and Rao, V. R. (2020). A dynamic model of rational addiction with stock- piling and learning: An empirical examination of e-cigarettes. Management Sci- ence. Chun, L. F., Moazed, F., Calfee, C. S., Matthay, M. A., and Gotts, J. E. (2017). Pul- monary toxicity of e-cigarettes. American Journal of Physiology-Lung Cellular and Molecular Physiology, 313(2):L193–L206. Dave, D., Feng, B., and Pesko, M. F. (2019). The effects of e-cigarette minimum legal sale age laws on youth substance use. Health economics, 28(3):419–436. Dew, R., Ansari, A., and Toubia, O. (2019). Letting logos speak: Leveraging mul- tiview representation learning for data-driven logo design. Available at SSRN 3406857. Dutta, A. and Zisserman, A. (2019). The via annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2276–2279. ElTayeby, O., Eaglin, T., Abdullah, M., Burlinson, D., Dou, W., and Yao, L. (2017). Detecting drinking-related contents on social media by classifying heteroge- 107 neous data types. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pages 364–373. Springer. Gordon, B. R. and Sun, B. (2015). A dynamic model of rational addiction: Evalu- ating cigarette taxes. Marketing Science, 34(3):452–470. Guo, T., Sriram, S., and Manchanda, P. (2020). Let the sunshine in: The impact of industry payment disclosure on physician prescription behavior. Marketing Science, 39(3):516–539. He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. Hollenbeck, B. (2018). Online reputation mechanisms and the decreasing value of chain affiliation. Journal of Marketing Research, 55(5):636–654. Kärkkäinen, K. and Joo, J. (2019). Fairface: Face attribute dataset for balanced race, gender, and age. arXiv preprint arXiv:1908.04913. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Li, K. T. (2019). Statistical inference for average treatment effects estimated by synthetic control methods. Journal of the American Statistical Association, pages 1–16. 108 Li, Y. and Xie, Y. (2020). Is a picture worth a thousand words? an empirical study of image content and social media engagement. Journal of Marketing Research, 57(1):1–19. Liu, L., Dzyabura, D., and Mizik, N. (2020). Visual listening in: Extracting brand image portrayed on social media. Marketing Science, 39(4):669–686. Maharana, A. and Nsoesie, E. O. (2018). Use of deep learning to examine the asso- ciation of the built environment with prevalence of neighborhood adult obesity. JAMA network open, 1(4):e181535–e181535. Miech, R., Patrick, M. E., O’malley, P. M., and Johnston, L. D. (2017). E-cigarette use as a predictor of cigarette smoking: results from a 1-year follow-up of a national sample of 12th grade students. Tobacco control, 26(e2):e106–e111. Pattabhiramaiah, A., Sriram, S., and Manchanda, P. (2019). Paywalls: Monetizing online content. Journal of Marketing, 83(2):19–36. Pesko, M. and Warman, C. (2017). The effect of prices and taxes on youth cigarette and e-cigarette use: Economic substitutes or complements? Available at SSRN 3077468. Puranam, D., Narayan, V., and Kadiyali, V. (2017). The effect of calorie posting regulation on consumer opinion: a flexible latent dirichlet allocation model with informative priors. Marketing Science, 36(5):726–746. Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time 109 object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99. Rothe, R., Timofte, R., and Van Gool, L. (2015). Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15. Saffer, H., Dench, D., Dave, D., and Grossman, M. (2018). E-cigarettes and adult smoking. Technical report, National Bureau of Economic Research. Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Soneji, S., Barrington-Trimis, J. L., Wills, T. A., Leventhal, A. M., Unger, J. B., Gib- son, L. A., Yang, J., Primack, B. A., Andrews, J. A., Miech, R. A., et al. (2017). As- sociation between initial use of e-cigarettes and subsequent cigarette smoking among adolescents and young adults: a systematic review and meta-analysis. JAMA pediatrics, 171(8):788–797. Tirunillai, S. and Tellis, G. J. (2017). Does offline tv advertising affect online chat- ter? quasi-experimental analysis using synthetic control. Marketing Science, 36(6):862–878. Tuchman, A. E. (2019). Advertising and demand for addictive goods: The effects of e-cigarette advertising. Marketing Science, 38(6):994–1022. Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders, A. W. (2013). 110 Selective search for object recognition. International journal of computer vision, 104(2):154–171. Wang, P., Xiong, G., and Yang, J. (2019). Frontiers: Asymmetric effects of recre- ational cannabis legalization. Marketing Science. Xiang, S. and Li, H. (2017). On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971. Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500. Xu, Y. (2017). Generalized synthetic control method: Causal inference with inter- active fixed effects models. Political Analysis, 25(1):57–76. Zhang, S., Lee, D. D., Singh, P. V., and Srinivasan, K. (2017). How much is an image worth? airbnb property demand estimation leveraging large scale im- age analytics. Airbnb Property Demand Estimation Leveraging Large Scale Image Analytics (May 25, 2017). Zhang, Zhifei, S. Y. and Qi, H. (2017). Age progression/regression by condi- tional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. 111 2.9 Appendix 2.9.1 Number of Social Media Posts by Users and Proportion of Posts Related to Vaping Figure 2.13 shows the number of posts and users based on proportion vaping posts cutoff. Figure 2.13: Number of Posts and Users based on Proportion Vaping Posts Cutoff. This figure shows the number of posts and users by cutoff for posts 112 2.9.2 Detecting Disguising in Images Figure 2.14 shows the emojis used in the object detection method of Section 2.5 Figure 2.14: Emoji List for Detection 113 2.9.3 Supplementary Analysis with Difference-in-Differences We now discuss the staggered difference in differences approach of Callaway and Sant’Anna (2018) (SDID henceforth). The traditional approach of difference in differences with propensity score matching is not suitable in our context since we have varying treatment periods across different states. Callaway and Sant’Anna (2018) show the efficacy of SDID in tackling variation in treatment periods, and also demonstrate its applicability by estimating the impact of federal minimum wage regulations during 2007-2010 on teen employment. We first discuss the notation. Suppose there are T time periods, Dt be the bi- nary indicator if a unit is treated in time period t, Gg is the binary indicator for whether the unit is first treated in time period g, and C is the binary indicator for whether the unit is in the control group, i.e. it is never treated. The authors propose generalized propensity score method to match treated units with control units. The generalized propensity score pg is given by: pg(X) = P(Gg = 1|X,C +Gg = 1) where pg(X) is the probability that a unit with observed covariates X is treated at time period g. The generalized propensity score is estimated using a logit model. The outcome variable at time t is Yt, with Yt(1) and Yt(0) as the poten- tial outcome for the treated and control units. Thus, the average treatment effect, ATT (g, t), for units first treated at time period g at a time period t(t ≥ g) is given 114 by: ATT (g, t) = E(Yt(1) − Yt(0)|Gg = 1) Apart from the standard assumptions required for difference in differences, SDID requires a modified parallel trends assumption for identification. SDID makes conditional parallel trends assumption, i.e. parallel trends assumption holds conditional on covariates, i.e. E(Yt(0) − Yt−1(0)|X,Gg = 1) = E(Yt(0) − Yt−1(0)|X,C = 1) With these assumptions, the average treatment effect for a unit treated in time period g for a time t is then a weighted average of the ’long differences’ of the outcome variable. The weights are the generalized propensity scores. Thus, the average treatment effect, ATT (g, t), is given non-parametrically given as:   pg(X)C   ATT (g t) E  Gg − 1−  pg(X)C  , = (Y − Y )E[G ] p (X)C  t g−1g E[ g 1−pg(X)C ] In our context, the observed state-quarter level covariates consist of (lagged) log of population, log of per-capita income, average cost of cigarettes, proportion of non-adults (less than 18 years old), and proportion of online search related to vaping. These variables are similar to the ones used by Abadie et al. (2010) for evaluating California Tobacco regulation. The pre-period is 2016 Q1 – till regula- 115 tion for each of the states. We report the results for four quarters post taxes for the outcome variables of proportion of underage posts, proportion of female posts, proportion of posts with race: Asian, proportion of posts with race: Black, proportion of posts with race: White, and proportion of posts with disguising (overall disguising). Table 2.4 reports the results for underage vaping posts. Table 2.6 reports the average treatment effects of taxes on underage vaping in social media posts with additional covariates of Youth Behavioral Risk Survey, and with lagged outcome variables. Table 2.5 reports the results for gender and race. Table 2.7 reports the average treatment effects of taxes on disguising in social media posts. Table 2.8 reports the average treatment effects of taxes on disguising in social media posts with additional covariates of Youth Behavioral Risk Survey, and with lagged outcome variables. 116 Table 2.4: State-Wide Effect of Taxes: Underage This table reports the average treatment effect of taxes on underage vaping in social me- dia posts for the treated states of California, Kansas, Pennsylvania, and West Virginia. Quarters Q1, Q2, Q3, and Q4 are quarters 1, 2, 3, and 4 post treatment. Controls include lagged log of population, log of per-capita income, average cost of cigarettes, per-capita consumption of cigarettes, proportion population that is non-adult, and online search vol- ume for vaping related searches. Standard errors (in parenthesis) are clustered at the state level. Q1 Q2 Q3 Q4 California -0.01289** -0.0134** -0.00272 -0.01341*** (0.00647) (0.00683) (0.00716) (0.00596) Kansas -0.00637 -0.05086*** -0.03876*** -0.05068*** (0.01444) (0.00516) (0.01044) (0.00557) Pennsylvania -0.00028 0.00465 -0.01068*** 0.00367* (0.00842) (0.00689) (0.00356) (0.00204) West Virginia -0.01578*** -0.00086 0.01298 0.00813 (0.00518) (0.0079) (0.01897) (0.0105) Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 117 Table 2.5: State-Wide Effect of Taxes: Other Demographics This table reports the average treatment effect of taxes on other demographics of gen- der (female), race – Asian, race – Black, and race - White. Quarters Q1, Q2, Q3, and Q4 are quarters 1, 2, 3, and 4 post treatment. Controls include lagged log of population, log of per-capita income, average cost of cigarettes, per-capita consumption of cigarettes, proportion population that is non-adult, and online search volume for vaping related searches. Standard errors (in parenthesis) are clustered at the state level. Gender Q1 Q2 Q3 Q4 California 0.01389 0.00171 0.00467 -0.00441 (0.01888) (0.01336) (0.00337) (0.01591) Kansas -0.06566*** -0.08426** -0.04711*** -0.02098 (0.02737) (0.03996) (0.01576) (0.01981) Pennsylvania 0.00134 0.05355*** 0.00168 -0.01485 (0.02497) (0.01898) (0.01203) (0.02819) West Virginia -0.01385 -0.00442 0.10064*** 0.09348*** (0.05503) (0.00966) (0.02833) (0.0323) Race-Asian Q1 Q2 Q3 Q4 California 0.01389 0.00171 0.00467 -0.00441 (0.01888) (0.01336) (0.00337) (0.01591) Kansas -0.06566*** -0.08426** -0.04711*** -0.02098 (0.02737) (0.03996) (0.01576) (0.01981) Pennsylvania 0.00134 0.05355*** 0.00168 -0.01485 (0.02497) (0.01898) (0.01203) (0.02819) West Virginia -0.01385 -0.00442 0.10064*** 0.09348*** 118 (0.05503) (0.00966) (0.02833) (0.0323) Race-Black Q1 Q2 Q3 Q4 California 0.01167 -0.00017 0.01317 0.02706*** (0.01819) (0.01587) (0.01019) (0.00691) Kansas 0.08646*** 0.1848*** 0.01704 0.07787*** (0.0249) (0.0482) (0.03286) (0.03556) Pennsylvania 0.01503 -0.00506 -0.01845 0.01469 (0.01242) (0.01647) (0.01586) (0.01241) West Virginia -0.0441 0.01229 -0.06886*** -0.03161 (0.07026) (0.01237) (0.02884) (0.03065) Race-White Q1 Q2 Q3 Q4 California -0.01228 0.01546 0.01151 -0.01182*** (0.01699) (0.01053) (0.01741) (0.00498) Kansas 0.00204 -0.15076*** 0.06235 -0.04311 (0.02775) (0.04512) (0.04192) (0.04105) Pennsylvania -0.01881** -0.01003 0.04818*** -0.01311 (0.00918) (0.01672) (0.01732) (0.01368) West Virginia 0.13592*** 0.26979*** 0.32453*** 0.1401*** (0.04457) (0.0376) (0.01746) (0.04665) Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 119 Table 2.6: Robustness Results for Table 2.4: Model Specifications These tables report the average treatment effect of taxes on underage vaping in social media posts for the treated states of California, Kansas, Pennsylvania, and West Virginia. Quarters Q1, Q2, Q3, and Q4 are quarters 1, 2, 3, and 4 post treatment. Standard errors (in parenthesis) are clustered at the state level. With YBRSS Variables - Additional controls include youth cigarette usage, e-cigarette usage, and marijuana usage. Note that YBRSS did not collect data for Kansas during the years of 2015 or 2017. Q1 Q2 Q3 Q4 California 0.00454 -0.00894*** -0.00039 0.00348 (0.00447) (0.00288) (0.00476) (0.00481) Kansas Pennsylvania -0.00385 -0.0061 -0.02091*** 0.00472*** (0.00346) (0.00746) (0.00645) (0.0019) West Virginia 0.01563 -0.01159 0.00764 0.01324*** (0.00999) (0.01386) (0.00981) (0.0015) With Lagged Outcome Variables - Controls are lagged outcome variables. Q1 Q2 Q3 Q4 California -0.00084 -0.01713*** -0.00348 -0.00644* (0.00404) (0.00345) (0.00367) (0.00388) Kansas -0.05059* -0.05003*** -0.047*** -0.04677*** (0.02671) (0.00867) (0.00512) (0.01129) Pennsylvania -0.02155* 0.0029 -0.01127*** 0.00221 (0.01192) (0.00417) (0.00339) (0.00498) West Virginia -0.0258 -01.0200267 0.01144*** 0.01357 (0.02721) (0.00815) (0.00472) (0.01017) Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 Table 2.7: State-Wide Effect of Taxes: Disguising This table reports the average treatment effect of taxes on disguising behavior in social media posts. Quarters Q1, Q2, Q3, and Q4 are quarters 1, 2, 3, and 4 post treatment. Con- trols include lagged log of population, log of per-capita income, average cost of cigarettes, per-capita consumption of cigarettes, proportion population that is non-adult, and online search volume for vaping related searches. Standard errors (in parenthesis) are clustered at the state level. Q1 Q2 Q3 Q4 California 0.00351 0.01063*** 0.01061 0.01116*** (0.00826) (0.00476) (0.00685) (0.00321) Kansas 0.07775*** 0.05825*** 0.06205*** 0.05966*** (0.0135) (0.00919) (0.01726) (0.01702) Pennsylvania -0.0056 -0.00658 0.00197 -0.00314 (0.01104) (0.00844) (0.0057) (0.00745) West Virginia 0.02403 0.02367 -0.00185 0.03642 (0.05309) (0.03105) (0.0396) (0.04608) Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 121 Table 2.8: Robustness Results for Table 2.7: Model Specifications These tables report the average treatment effect of taxes on disguising in social media posts for the treated states of California, Kansas, Pennsylvania, and West Virginia. Quar- ters Q1, Q2, Q3, and Q4 are quarters 1, 2, 3, and 4 post treatment. Standard errors (in parenthesis) are clustered at the state level. With YBRSS Variables - Additional controls include youth cigarette usage, e-cigarette usage, and marijuana usage. Note that YBRSS did not collect data for Kansas during the years of 2015 or 2017. Q1 Q2 Q3 Q4 California 0.004 0.01477 0.00498 -0.00058 (0.00555) (0.01246) (0.01384) (0.0178) Kansas Pennsylvania 0.00511 -0.0003 -0.00256 0.00265 (0.0083) (0.00512) (0.0095) (0.00462) West Virginia -0.02193 -0.00358 -0.05324*** -0.03171*** (0.01254) (0.01076) (0.01018) (0.01277) With Lagged Outcome Variables - Controls are lagged outcome variables. Q1 Q2 Q3 Q4 California 0.01202** 0.01525*** 0.01765*** 0.01154* (0.00571) (0.00545) (0.00592) (0.00692) Kansas 0.05291*** 0.05031*** 0.04655*** 0.0438*** (0.00681) (0.00606) (0.00655) (0.00658) Pennsylvania 0.00811 0.01728 0.0068 0.01089 (0.01111) (0.01372) (0.01221) (0.00901) 122 West Virginia -0.02168 -0.00946 -0.08358*** -0.02417*** (0.01434) (0.00783) (0.02585) (0.00573) Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 2.9.4 Engagement Results Figure 2.15 reports the results for effect of taxes for likes on underage vaping posts. We find that there is an overall increase post week 75, and that Pennsylvania and West Virginia saw an increase in number of likes starting week 60 after tax intro- duction. Figure 2.16 shows results for number of comments for underage vaping posts. We do not observe any significant effects. Figure 2.17 reports the results for effect of taxes for likes on underage vaping posts with disguising. We do not find substantial increase post tax introduction. Figure 2.18 reports the results for effect of taxes for comments on underage vaping posts with disguising. We do not find substantial increase post taxes. Figure 2.19 reports the results for effect of taxes for solo faces in vaping posts. We do not find substantial increase post taxes. 123 Figure 2.15: Effect of Taxes on Likes: Underage This figure reports the results of the generalized synthetic control method for gap be- tween treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of likes for underage vaping posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylva- nia, and West Virginia Underage Likes: All Treated States 40 20 0 −20 0 50 100 Time relative to Treatment 124 Coefficient Underage Likes: CA 60 30 0 −30 −50 0 50 Time relative to Treatment Underage Likes: KS 80 40 0 −40 0 12550 100 Time relative to Treatment Coefficient Coefficient Underage Likes: PA 50 0 −50 0 50 100 Time relative to Treatment Underage Likes: WV 50 0 0 12560 100 Time relative to Treatment Coefficient Coefficient Figure 2.16: Effect of Taxes on Comments: Underage This figure reports the results of the generalized synthetic control method for gap be- tween treated and counterfactual (and 90% confidence intervals) for the outcome vari- able of number of comments for underage vaping posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Underage Comments: All Treated States 0.5 0.0 −0.5 0 50 100 Time relative to Treatment 127 Coefficient Underage Comments: CA 1.0 0.5 0.0 −0.5 −1.0 −50 0 50 Time relative to Treatment Underage Comments: KS 1 0 −1 −2 0 12580 100 Time relative to Treatment Coefficient Coefficient Underage Comments: PA 1 0 −1 −2 0 50 100 Time relative to Treatment Underage Comments: WV 2 1 0 −1 0 12590 100 Time relative to Treatment Coefficient Coefficient Figure 2.17: Effect of Taxes on Likes: Underage Disguised This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of likes for underage disguised vaping posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Disguising Likes: All Treated States 100 0 −100 0 50 100 Time relative to Treatment 130 Coefficient Disguising Likes: CA 200 0 −200 −50 0 50 Time relative to Treatment Disguising Likes: KS 400 200 0 −200 0 13150 100 Time relative to Treatment Coefficient Coefficient Disguising Likes: PA 400 200 0 −200 0 50 100 Time relative to Treatment Disguising Likes: WV 200 0 −200 0 13250 100 Time relative to Treatment Coefficient Coefficient Figure 2.18: Effect of Taxes on Comments: Underage Disguised This figure reports the results of the generalized synthetic control method for gap be- tween treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of comments for underage disguised vaping posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Underage Comments: All Treated States 5.0 2.5 0.0 −2.5 −5.0 0 50 100 Time relative to Treatment 133 Coefficient Disguising Comments: CA 10 5 0 −5 −10 −50 0 50 Time relative to Treatment Disguising Comments: KS 10 0 0 13540 100 Time relative to Treatment Coefficient Coefficient Disguising Comments: PA 10 0 −10 0 50 100 Time relative to Treatment Disguising Comments: WV 10 5 0 −5 −10 0 13550 100 Time relative to Treatment Coefficient Coefficient Figure 2.19: Effect of Taxes on Solo Faces in Posts This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of posts with solo faces. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Solo Faces: All Treated States 0.2 0.0 −0.2 0 50 100 Time relative to Treatment 136 Coefficient Solo Faces: CA 0.2 0.0 −0.2 −50 0 50 Time relative to Treatment Solo Faces: KS 0.75 0.50 0.25 0.00 −0.25 0 13750 100 Time relative to Treatment Coefficient Coefficient Solo Faces: PA 0.0 −0.5 −1.0 0 50 100 Time relative to Treatment Solo Faces: WV 0.6 0.3 0.0 −0.3 −0.6 0 13850 100 Time relative to Treatment Coefficient Coefficient 2.9.5 Impact of Taxes by Race and Gender Figure 2.20 reports the results for effect of taxes for female faces in vaping posts. We find that Pennsylvania had a decline in female vaping related posts in the weeks of 20-55 post tax introduction. Figure 2.21 reports the results for effect of taxes for number of posts with race: Asian. We find that there is an overall decline post tax introduction, and that this is seen in the Pennsylvania and Kansas. Figure 2.22 reports the results for effect of taxes for number of posts with race: Black. We find that there is an overall decline post tax introduction, and that this is seen as a decrease in Pennsylvania, and an increase Kansas. Figure 2.23 reports the results for effect of taxes for number of posts with race: White. We find that there is an overall decline post tax introduction, and that this is seen in the Pennsylvania and Kansas. 139 Figure 2.20: Effect of Taxes on Gender: Female This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of female vaping related posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Female: All Treated States 100 0 −100 0 50 100 Time relative to Treatment 140 Coefficient Female: CA 400 200 0 −50 0 50 Time relative to Treatment Female: KS 600 400 200 0 −200 −400 0 14150 100 Time relative to Treatment Coefficient Coefficient Female: PA 100 0 −100 −200 0 50 100 Time relative to Treatment Female: WV 150 100 50 0 −50 −100 0 14250 100 Time relative to Treatment Coefficient Coefficient Figure 2.21: Effect of Taxes on Race: Asian This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of Asian vaping related posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Race Asian: All Treated States 10 0 −10 −20 −30 0 50 100 Time relative to Treatment 143 Coefficient Race Asian: CA 30 0 −30 −60 −50 0 50 Time relative to Treatment Race Asian: KS 50 25 0 −25 0 14540 100 Time relative to Treatment Coefficient Coefficient Race Asian: PA 20 0 −20 −40 0 50 100 Time relative to Treatment Race Asian: WV 40 20 0 −20 0 14550 100 Time relative to Treatment Coefficient Coefficient Figure 2.22: Effect of Taxes on Race: Black This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of Black vaping related posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Race Black: All Treated States 20 10 0 −10 −20 −30 0 50 100 Time relative to Treatment 146 Coefficient Race Black: CA 60 30 0 −30 −50 0 50 Time relative to Treatment Race Black: KS 40 20 0 −20 −40 0 14570 100 Time relative to Treatment Coefficient Coefficient Race Black: PA 50 25 0 −25 −50 0 50 100 Time relative to Treatment Race Black: WV 40 20 0 −20 0 14580 100 Time relative to Treatment Coefficient Coefficient Figure 2.23: Effect of Taxes on Race: White This figure reports the results of the generalized synthetic control method for gap between treated and counterfactual (and 90% confidence intervals) for the outcome variable of number of White vaping related posts. The first plot shows the overall treatment effects, and the remaining four plots show the results for California, Kansas, Pennsylvania, and West Virginia Race White: All Treated States 50 0 −50 −100 0 50 100 Time relative to Treatment 149 Coefficient Race White: CA 100 0 −100 −50 0 50 Time relative to Treatment Race White: KS 200 100 0 −100 0 15050 100 Time relative to Treatment Coefficient Coefficient Race White: PA 100 0 −100 0 50 100 Time relative to Treatment Race White: WV 100 50 0 −50 −100 0 15150 100 Time relative to Treatment Coefficient Coefficient 2.9.6 Other Common Objects in Posts We use a Mask R-CNN model with architecture similar to that described in Section 2.5, that has been pre-trained on the ImageNet dataset. This allows us to detect 80 most common classes from the ImageNet dataset. We then take this pre-trained model to our social media dataset. The 5 most common objects (from ImageNet objects) in our data are: bottle, dining table, cup, cellphone, and book. We apply the pre-trained model on our data to estimate the following three tables. Table 2.9 reports the proportion of posts with disguis- ing that contain these common objects and compare with those in posts without disguising. Table 2.10 reports the proportion of posts with underage faces that contain these common objects and compare with those in posts without underage faces. Finally, Table 2.11 reports the comparison of proportion of posts with dis- guising that contain these common objects to proportion of posts with underage faces that contain these common objects. 152 Table 2.9: Disguising and Common Objects This table reports the difference in proportion of posts with common objects in posts that have disguising and in the posts that do not have disguising. N = 22049 N = 22045 Object In % Posts with Disguising In % Posts without Disguising Difference bottle 31.444% 29.996% 1.447%*** dining table 10.041% 8.269% 1.772%*** cup 9.660% 7.321% 2.339%*** cell phone 8.105% 9.634% -1.53%*** book 7.461% 7.013% 0.448%* Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 Table 2.10: Underage and Common Objects This table reports the difference in proportion of posts with common objects in posts that have underage faces and in the posts that do not have underage faces. N = 9,910 N = 9,910 Object In % Posts with Underage Faces In % Posts with Non-Underage Faces Difference bottle 41.463% 34.299% 7.164%*** dining table 10.121% 9.213% 0.908%** cup 8.234% 9.203% -0.969%** cell phone 10.464% 10.444% 0.02% book 10.989% 7.598% 3.391%*** Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 153 Table 2.11: Underage, Disguising, and Common Objects This table reports the difference in proportion of posts with common objects in posts that have underage faces and in the posts that have disguising. N = 9,910 N = 22049 Object In % Posts with Underage Faces In % Posts with Disguising Difference (Underage - Disguising) bottle 41.463% 31.444% 10.019%*** dining table 10.121% 10.041% 0.08% cup 8.234% 9.660% -1.426%*** cell phone 10.464% 8.105% 2.36%*** book 10.989% 7.461% 3.528%*** Statistical significance levels:∗ ∗ ∗p < 0.01, ∗ ∗ p < 0.05, ∗p < 0.1 154 CHAPTER 3 USING DEEP LEARNING TO OVERCOME PRIVACY AND SCALABILITY ISSUES IN CUSTOMER DATA TRANSFER 3.1 Introduction Firms’ sensitive customer data are highly sought after by researchers who use statistical and econometric models for causal and predictive analyses. The chal- lenges to obtaining this data entail both privacy and scalability issues. Marketers, for example, who need to build pricing and targeting models for consumer pack- aged goods, require access to sales data at either the household or store level, as well as the corresponding prices of given brands. While prices are publicly ob- servable in stores and through promotion and advertisements, customer privacy concerns, legal restrictions, or firms’ concerns regarding disclosure of valuable information to competitors, are impediments to external sharing of sales data. Therefore, traditional methods of external data release, e.g., through a third-party vendor such as The AC Nielsen Company, require a high transaction cost due This chapter is: Lee, Clarence, and Piyush Anand (2020). “Using Deep Learning to Overcome Privacy and Scalability Issues in Customer Data Transfer”. Appendix 3.8.7 has been shortened for brevity and due to document length limitations. Researcher(s) own analyses calculated (or derived) based in part on data from Nielsen Con- sumer LLC and marketing databases provided through the NielsenIQ Datasets at the Kilts Center for Marketing Data Center at The University of Chicago Booth School of Business. The conclu- sions drawn from the NielsenIQ data are those of the researcher(s) and do not reflect the views of NielsenIQ. NielsenIQ is not responsible for, had no role in, and was not involved in analyzing and preparing the results reported herein. 155 to prohibitive nondisclosure agreements (NDA) and restricted data usage agree- ments (DUA). Central to the NDAs and DUAs is the original data provider’s need to control the privacy and accuracy of the data released. The current paradigm widely used to facilitate this exchange process is transfer of samples of customer data, which are anonymized by transferring small samples that are either obfuscated or aggre- gated. On the one hand, the larger the amount of data firms release to researchers, the more accurate are the price elasticity and targeting estimates. Firms therefore have incentives to release more data that are as unobfuscated as possible. How- ever, firms incur fewer privacy risks the smaller the data sample release and the higher the degree of obfuscation. This trade-off between accuracy and privacy in data disclosures has been extensively discussed in prior literature: Real-world situations drive the data provider to exert control along this trade-off Duncan and Stokes (2004). Exacerbating the transaction cost of this process is the actual transfer of the data itself. In the age of big data and digital commerce, the four V’s of big data gain significance: volume, velocity, variety, and veracity (Chintagunta et al., 2016; Ansari and Li, 2018). In this paper, we focus on the volume and velocity aspects of big data, as they can present nontrivial obstacles to the data transfer itself. While researchers seeking to maximize the accuracy and generalizability of the data have the incentive to acquire as much data from providers as possible, transferring and housing large amounts of customer data can require nontrivial technical know- 156 how and significant data storage costs. Furthermore, the velocity of data that refreshes into a data provider’s databases, often a matter of seconds, can vastly outpace the speed of a single data exchange. Therefore, the need arises for an approach to customer data transfer that can potentially address these issues. Recent developments in deep learning offer the possibility of training a gener- ative model that can mimic data generating distributions with an unprecedented degree of accuracy (Goodfellow et al., 2013). Generative adversarial networks (GANs henceforth) provide a flexible framework that can train two neural net- works – a discriminator model (discriminator henceforth) and a generator model (generator henceforth) – simultaneously by pitting them against each other. GANs involve training a generative model that generates synthetic data and simulta- neously training a discriminator model able to distinguish between the real and generated synthetic data, resulting in the generator mimicking the firm-side data generating distribution with a high degree of accuracy. This obviates the need to share private and sensitive data with the generator, and also allows for updating the generator as additional real-time data arrive. In contrast to estimation tech- niques that first estimate a model on the firm’s side then subsequently transfer some form of “data” such as actual data or synthetic data to the researcher, the decoupled nature of this training algorithm has both privacy and scalability ad- vantages. We propose an approach for preserving customer privacy that involves trans- fer of a generator (from GANs) as opposed to the aforementioned traditional ap- 157 proaches. This also provides improved privacy protection: no private and sensi- tive customer data leaves the servers of the firm as only the discriminator, that is housed inside the firm’s firewalls, has access to the private data. Furthermore, we find that our proposed method has scalability advantages. The volume and velocity aspects of big data, require any analysis to be sufficiently flexible to handle large volumes of newly arriving data; accordingly, this method’s data exchange cost, measured in computational and logistical time, does not grow proportionate to data size. Further, marketers will be interested in our proposed approach to tackle marketing problems. We show two things along these lines. First, as a proof of concept, we show that two marketing problems: price markups for optimal profits and customer targeting, can be effectively tackled using our proposed approach. Second, we also show that a firm need not train multiple GANs to tackle different problems. That is, a single GAN trained on the firm data can be used to solve two marketing problems of price markups for optimal profits and also customer targeting. Thus, in this paper, we build on the privacy literature in marketing and additionally analyze data scalability and ability to tackle marketing problems. We therefore explore the following research questions: • Accuracy: How well do GANs mimic the data generating process (DGP)? • Privacy: How well do GANs preserve privacy in the event that the trans- ferred generator is compromised? • Scalability: How do GANs accommodate the volume and velocity aspects of big data? 158 • Applicability: How well do GANs perform on marketing problems of price markups for optimal profits and customer targeting? Can one GAN perform both tasks? We find that GANs perform exceptionally well against benchmark methods in terms of accuracy of replicating the original data, as evaluated via the stan- dard accuracy-privacy framework from prior literature. GANs outperform bench- mark methods in terms of mimicking the true data, both in density plots and as measured using Kolmogorov-Smirnov test, Jensen-Shannon divergence, and Kullback-Leibler divergence.1 Furthermore, by modifying the training algorithm to incorporate customer heterogeneity, the firm can control the accuracy-privacy trade-off. In both cases, we find that GANs have lower information loss and lower loss in privacy as compared to benchmarks. We also validate our findings on the Nielsen household-level data and find that our accuracy-privacy results hold. Next, we note that GANs, leveraging the “on-line” nature of stochastic gradi- ent descent (SGD), are designed to handle both volume and velocity. In terms of data volume, we find that the SGD framework scales well with respect to the size of the data set, due to its distributed nature allowing for out-of-the-box parallel CPU or GPU computing. The order of magnitude of the per iteration computa- tion time does not grow according to the data size, but grows instead according to factors under the researcher’s control, such as mini-batch size, number of train- 1We discuss in the Appendix 3.8.6 that increasing the GAN complexity (number of neurons) reduces information loss. However, the improvements in information loss have a point of dimin- ishing returns after an optimum value of number of neurons. 159 ing iterations, and GAN complexity. We find that training time per iteration only increases marginally as we increase data volume from one thousand rows of data to ten million rows of data, and stays the same order of magnitude. Furthermore, GANs tackle the velocity aspects as the decoupled estimation nature of GANs requires that only the gradients of the objective function, as opposed to a costly transfer of an entire data set. We find that transferring a generator instead of data is cheaper, due to lower file size, especially when the data volume grows large.2 This allows for a lightweight automated exchange method between the two par- ties, such as the use of an application program interface (API), to “stream” the latest gradients to the generator in the exchange process. We find that the infor- mation loss converges faster when we stream the gradients, as opposed to redo- ing the entire training with the new data. This light-weight, automated exchange method also has logistical benefits. The traditional data transfer approach from the synthetic data literature requires the involvement of trained data scientists for each synthetic data set generated subsequent to the inflow of a substantial amount of new data. This process can be both error prone and costly to firms as well as researchers. The automated exchange process potentially alleviates this problem.3 Finally, we find that GANs perform well on the two tasks of optimal price markups and customer targeting, as compared to benchmarks. The optimal prof- its obtained from GANs are a minimum of 99.96% of the true profits, as com- 2Note that the data provider can also train a generator on its own end and transfer the trained generator to the researcher. Our proposed approach is indifferent to either approaches that the data provider chooses. 3We note that quantifying the nature of API calls (volume, network bandwidth, server require- ments, among others) is not the primary focus of our paper, and argue that GANs can be trained using API calls with “sufficient” network bandwidth 160 pared to the closest benchmark: Top Coding with optimal profits at a minimum of 69.91%. Customer targeting model accuracy from GANs is an order of magni- tude higher than the nearest benchmark, Swap 20. We also find that GANs can handle both these tasks simultaneously, i.e., a single GAN can tackle both these problems and outperforms benchmarks with a loss of accuracy of 1.39% as com- pared to the closest benchmark: Rounding with an loss of 2.07%. Throughout these three contexts, GANs also outperform benchmarks on the accuracy-privacy trade-off. These results extend from GANs ability to mimic the data generation process closely while providing higher privacy protection. To the best of our knowledge, our investigation is among the small but grow- ing literature in marketing to utilize GANs. In contrast to studies in the computer science literature, we explore how GANs can be used to address the trade-off be- tween accuracy and privacy. Our most important contribution is to demonstrate a proof of concept in which a simple GAN can generate privacy-preserving clones of the real data that is capable of tackling multiple marketing problems. 3.2 Existing Literature Existing work in the privacy literature in marketing and economics focuses on protection of data under the paradigm of transfer of true data between parties. Security is afforded by masking true data via a predetermined mechanism and accepting the trade off between privacy and usefulness, as for targetability (Gold- 161 farb and Tucker, 2011). Past work in marketing and statistics literature on syn- thetic data protection has discussed, for example, such data masking mechanisms as (i) aggregation (Steenburgh et al., 2003; Christen et al., 1997; Tenn, 2006; Link, 1995), (ii) swapping (Reiter, 2010), (iii) truncation/rounding (Schneider et al., 2018), (iv) and random noise addition (Reiter, 2005). These varied benchmark methods and associated performance metrics are used by Schneider et al. (2018) to evaluate their proposed data protection schemes for point-of-sale data. Following the tradition of synthetic data transfer (Abowd et al., 2012; Hu et al., 2014; Schnei- der and Abowd, 2015), in which the provider generates synthetic data for transfer to the user, Schneider et al. (2018) proposes a Bayesian generalized linear model (GLM) for generating protected synthetic data ex-post data creation. Our work differs from extant synthetic data protection literature in that GANs can generate synthetic data for purposes of predictive modeling and inference via the “light weight” transfer of a generator instead of synthetic data in the transfer process. Contributions of this paper entail the examination of the desirable properties of this paradigm shift, which are data volume scalability, transfer-file compression, and data streaming capabilities. A growing stream of literature on utilizing machine learning in marketing has developed in response to the call for integrating methods from computer science and statistics to address the “V’s” of big data: volume, velocity, variety and verac- ity (Chintagunta et al., 2016; Ansari and Li, 2018). For example, Liu et al. (2016) leverage a combination of cloud computing, text mining, and machine learning to handle massive volumes of online social platform data to forecast sales, and Tim- 162 oshenko and Hauser (2019) use a convolutional neural network to identify cus- tomer needs from user-generated content. The latter neural network, estimated using stochastic gradient descent (SGD), scales well on volume of data as well as computing requirements. Without being restricted to large computer cloud clusters, model training can, with the proper settings, be performed on a lap- top. Rafieian and Yoganarasimhan (2018) use Extreme Gradient Boosting method that enables scalability in the prediction of click-through rates for mobile adver- tisements. Puranam et al. (2017) use a scalable Bayesian topic model to estimate the impact of New York City calorie posting regulation on discussions of health- related topic in restaurant reviews. A fully automated system designed by Culotta and Cutler (2016) to estimate brand ratings from near real-time keyword Twitter data addresses the velocity of big data. We build upon this stream of literature by demonstrating that, when GANs are implemented on the backbone of SGD-type training, the latter’s scalability properties carry over to considerations of volume and velocity associated with implementing algorithms for privacy protection. Lastly, this paper builds on the small but growing literature in marketing that employs GANs. Burnap et al. (2019) use an ensemble of deep learning methods to predict aesthetic appeal of automotive designs as a means of augmenting aesthetic design process. They use GANs in order to generate product aesthetic propos- als. Malik and Singh (2019) discuss different deep learning methods in computer vision and note that GANs have enabled realistic image generation. Our work differs from that reported in this literature in that we demonstrate that GANs can provide scalable and privacy preserving approach that can be used to solve mul- 163 tiple marketing problems. 3.3 Methodology: Extant Approach and Benchmarks We first compare the difference between the extant data transfer paradigm and our proposed data transfer paradigm. We then evaluate our methodology using benchmark methods. 3.3.1 Extant versus Proposed Data Transfer Paradigm In this section, we first examine the extant data transfer paradigm and its asso- ciated obstacles. This is illustrated in Figure 3.1. We then demonstrate how our proposed data transfer approach may alleviate these obstacles. Current approaches involving data transfer from a firm to researchers often require the researchers to sign legally binding contracts such as non-disclosure agreements (NDA) and data usage agreements (DUA) in order to access the data. Once the researchers sign these contracts, the firm then sets up mechanisms to transfer the data to the researchers. There are three broad decisions that the data provider makes. First, whether to provide the full data for all of its customers or for a sub-sampled set of its customers. The second decision that the data provider makes is whether to provide data from the true data, i.e. its actual data, or to 164 provide “synthetic” data, such as data generated using synthetic data generation method (see, for example Schneider et al. (2018)). The third decision is the level of obfuscation or aggregation done to the data in order to protect privacy. These include doing top coding, i.e. truncating at a certain percentile, rounding, i.e. rounding the data to a certain digit, swapping, i.e. randomly swapping sales data in a certain set of observations. The data provider can also chose to aggregate data at a certain level, for example at product lines level or markets level. Inherent to the third decision is the firm’s attempt to trade-off accuracy of data shared with researchers and its need to protect privacy of the data shared. These data are then transferred to the researchers, who apply research methods such as reduced form analysis, structural econometrics, or machine learning methods, for results comprising a combination of inference, prediction, and counterfactuals. There are three major concerns with this approach. First, data privacy is a con- cern: Once the data leave the confines of the firm, the firm has very little control over the data protection. The data are vulnerable to hacking, which would cre- ate a significant privacy breach for the firm. Second, there is the generalizability issue since the transferred data are often much smaller compared to the firm’s en- tire customer base. Third, the data transfer process is slow and time consuming, and this increases the firm’s transaction costs each time the research methods are trained on new data. Our approach eliminates the need for any real customer data or synthetic data to leave the firm. Instead, we propose transferring a generator to the researcher. 165 The generator is trained in an adversarial framework, and the discriminator sits inside the firm’s walls.4 Note that the generator, which sits with the researchers, never accesses the private data. Only the discriminator can access the private data, and the generator is trained using the gradients of the discriminator’s loss function. Thus the generator can generate data up to the size of the full population of a given firm’s customers, and can be retrained using a semi-automated interface (API) such that little or no manual intervention is needed. Of the existing approaches, an important one is that of Schneider et al. (2018). We note that while the Schneider et al. (2018) approach has been demonstrated on stores point of sales data, it can conceptually be extended to consumer level data. However, a key difference from our approach is that Schneider et al. (2018) requires prior knowledge of data generating process that is embedded in the syn- thetic data generation process itself. We argue that our approach is data gener- ating process agnostic (the GAN model is not explicitly trained using a specific inference model), and its only objective is to “mimic” the data-generation process of the true data.5 Thus, our approach allows us to tackle the three primary concerns of tradi- tional data transfer approaches. First, our approach offers higher privacy pro- tection because no customer data leaves the control of the firm. We empirically 4Note that the term “adversarial” comes from the name of the deep learning model: Genera- tive Adversarial Networks. The “adversaries” in this context are the generator and discriminator which “compete” with each other, i.e. the generator creates data in an attempt to fool the discrim- inator into classifying it as real data, and the discriminator has to classify the true data as different from the fake data. 5See section 3.3.3 for discussion of how we measure effectiveness in approximating the data- generating process of the true data 166 demonstrate that, should the generator on the researcher’s side be hacked, our ap- proach’s privacy protection remains superior to that of the benchmark methods. Second, the generalizability concern is potentially alleviated because the gener- ator can generate data up to the size of the firm’s customer population. Third, with new streaming customer data, the use of a semi-automated application pro- gramming interface (API) significantly reduces the transaction costs for the firm, as well as reducing the time needed to update the generator controlled by the researchers. Thus, our proposed approach has privacy advantages, reduced transactional costs for both firms and researchers, and scalability advantages when compared to traditional data transfer approaches. An important point to note is that the firm can choose when and how the researcher gets the generator. There are two possible approaches of training the generator in our paradigm, both of which will lead to the same results. In the first approach, the firm trains both the generator and the discriminator on its end, and hands over the generator to the researcher once the generator is trained. In this situation, the researcher starts with a pre- trained generator, and can update it as and when new data arrives with the firm. In the second approach, the researcher starts with an uninitialized generator at its end, and train the generator from scratch by making API calls to the discriminator residing inside the firm’s walls. In this situation, the researcher makes API calls for each of the training iteration as it updates the generator parameters. Note that the generator obtained after training in either of the approaches would be the same, and the firm can chose whether it wishes to pass on a trained generator to 167 the researcher, or ask the researcher to train a generator from scratch. 3.3.2 Benchmark Methodology In this section we describe our methodology for evaluating our proposed GANs and benchmarks. We do so along the following seven dimensions: 1. Data characteristics: To what extend do the probability distribution statistics (e.g. probability density function, KL divergence) and other distributional characteristics differ? 2. Information loss: To what extent do results differ from model-based analy- ses, such as price elasticity coefficient estimates from regressions? 3. Privacy: How well does the proposed model protect customer privacy com- pared to benchmark methods. 4. Volume: How well does the proposed model’s training speed and informa- tion transfer size scale with the volume of data. 5. Velocity: How does continual estimation compare to restart estimation of the model with the arrival of new customer data? 6. Generalizability to Marketing Problems: • Optimal Price Markups: How high are the optimal profits as compared to those obtained from true data? 168 Figure 3.1: Data Transfer Paradigm Comparison Figure 3.2: Benchmark. Figure 3.3: Proposed using GAN 169 • Customer Targeting: How accurate are the targeting models as com- pared to those trained on the true data? • Tackling Multiple Marketing Problems With One GAN: Can a single GAN trained on the full firm data be used generate synthetic data that can solve multiple marketing problems? We utilize methods commonly used in existing literature for data protection as benchmarks against which to compare our proposed approach (Table 3.1), ranging from aggregation (i.e., at market-level) to obfuscation (e.g., adding random noise). Schneider et al. (2018) find data protection schemes to generally entail a trade-off between accuracy and privacy; the goal of the seven benchmark methods, which include using true data, is to track juxtaposition of the respective metrics along these two dimensions. We modify these benchmark methods, which Schneider et al. (2018) apply to store-level point-of-sales data, to utilize household level sales and pricing data, while preserving the panel structure of the data set. Similar to Schneider et al. (2018), we protect only the sales variables of the individual households, with brand prices being public and observable in stores. 170 Table 3.1: Description of Benchmark Methods Benchmark Method Description 1 “True” or unprotected data Original household-level sales data without any protection. 2 Random noise Observations are binned into deciles based on sales, and random noise is added to the sales in each decile. 3 Rounding Sales are rounded to the nearest hundredth place. 4 Top coding Sales greater than the 95th percentile are truncated. 5 20% swapping 20% of observations are divided into two groups and their sales data exchanged. 6 50% swapping 50% of observations are divided into two groups and their sales data exchanged. 7 Market Level For each week, sales are summed and prices averaged across households to the market level. 3.3.3 Performance Metrics Comparison of Data Characteristics We utilize three measures commonly employed in the statistics and marketing literature: KL divergence; Jensen-Shannon divergence; and the Kolmogorov- 171 Smirnov statistic. We do so in order to measure the distance between the real data and the synthetic data generated from GANs and benchmarks. The KL and Jensen-Shannon divergences provide, respectively, asymmetric and symmetric distance measures of the distribution of the true data relative to the synthetic data generated by a protection method. We also calculate the Kolmogorov-Smirnov (KS) statistic as a quantitative estimate of the maximum difference in two cumulative distribution functions. The KS statistic has an addi- tional advantage that it exists regardless of the support of the two distributions (Toubia and Netzer, 2016). The Kullback-Leibler divergence (Kullback and Leibler, 1951) is a measure of relative-entropy between two probability distributions, P and Q. For discrete probability distributions, we have: ∑ DKL(P|| P(i) Q) = P(i) log (3.1) Q(i) i The KL divergence for distributions P and Q measures how much extra infor- mation is needed to arrive at Q as the posterior, when P is the prior, distribution. The closer the KL divergence to 0, the more “similar” the distributions P and Q.6 In order to see its ties to maximum log-likelihood estimation, we can write DKL(P||Q) = LL(P, P) − LL(P,Q), where LL(P,Q) = EP[logQ] is the log-likelihood 6Note that the KL divergence is not symmetric, as the amount of information needed to go from distribution P to Q need not be the same as the amount of information needed to go from distribution Q to P, whereas the Jensen-Shannon divergence is a symmetric measure. 172 of observing the data from P given the parameters of the distribution Q Eguchi and Copas (2006). Thus, minimizing the KL divergence: DKL(P||Q) is equivalent to obtaining the maximum likelihood estimates for the distribution Q. The Jensen-Shannon divergence (Lin, 1991), a symmetric measure of the infor- mation difference between two distributions, can be formulated in terms of the KL divergence. In the information sciences literature, it has been used to measure distances between distributions, and also provide the upper and lower bounds for the Bayes probability of error7. The JSD for discrete distributions P and Q, with average distribution A = 0.5(P+Q), is given by: JS D(P|| 1 || 1Q) = DKL(P A) + DKL(Q||A) (3.2)2 2 Finally, we use the Kolmogorov-Smirnov test as a quantitative estimate of the maximum difference in cumulative distribution functions and corresponding sig- nificance levels. The KS-test for two samples, P and Q, is given by: KS (P,Q) = max |Cp,i −Cq,i| (3.3) i where Cp is the cumulative distribution function associated with distribution P. 7See discussion in Lin (1991) on the derivation of the upper and lower bounds for the Bayes probability of error using the Jensen-Shannon divergence 173 Information Loss To calculate information loss, we first define a commonly used inference frame- work to estimate coefficients (β) from the true data. We then estimate the same coefficients using our proposed approach and benchmarks, denoted β̂. We estimate the following multiple regression framework with continuous in- dependent variables of prices P and dependent variables of sales S , we propose the following log-log regression in a standard panel data setting with entity i, brand j, and time period t: ∑ lnS i jt = µ j + µi j + β jlnPi jt + βklnPikt + i jt, (3.4) k, j Where k is the number of brands of interest, µ j is the brand specific intercept term, µi j the household-level random effects term drawn from a normal distribu- tion N(0, σµ), and  is the unobserved, independent error term. This log-log re- gression framework has been used widely in marketing and economics (Leeflang and Wittink, 2000), modeling continuous dependent variables such as store sales, worker wages, and customer demand. With the above inference model, we measure mean absolute percentage dif- ference (MAPD, Christen et al. (1997)) as a measure of information loss. MAPD provides an estimate of how good are the coefficient estimates from our proposed approach and benchmarks as compared to those obtained from the true data, as it 174 quantifies the difference between the regression estimates. More formally, MAPD for J number of coefficients of interest is given by: 1 ∑J MAPD = J ∣∣∣∣ β̂ j − β j ∣∣∣∣ × 100%, (3.5)β j=1 j where β̂ j is the estimated coefficient of interest on protected, and β j the esti- mated coefficient on real data and J refers to the number of relevant coefficients to be analyzed using a statistical modeling technique (e.g., regression).8 The afore- mentioned metric is not bound to the specific inference model defined above and can be applied more generally to estimates from other reduced-form or structural models. Loss of Privacy In the manner of Schneider et al. (2018), we employ maximum loss of privacy (MLP) as the metric for data protection. In order to compute MLP for the data, we first define the loss in privacy (LP) metric. Schneider et al. (2018) define the LP metric as the “intruder’s” confidence in the data to identify an entity. Thus, we employ the LP measure for a customer i (from n customers and across T time periods) as follows: 8We use the brands’ own price elasticities as the coefficients of interest in the subsequent sec- tions when MAPD is reported. We discuss the inference model and MAPD in detail in the Ap- pendix 3.8.1. 175 √ ∑n − 1 ∑T LPi = 1 + n [ P(Ŷit = IDi′ |S it, P 2it)] . (3.6) ′ Ti =1 t=1 Where P(Ŷit = IDi′ |S it, Pit) is the probability of identifying an observation Yit as belonging to a customer IDi′ given the observed sales S it and prices P∑it, normal- ized by the probability for customer 1: P(Ŷit = ID1|S , P 1 Tit it)). Thus, T t=1 P(Ŷit = IDi′ |S it, Pit) is the mean probability (mean across all time periods) of identifying a customer i in the data. We compute P(Ŷit = IDi′ |S it, Pit) as follows: (P(Ŷ ) J Jit = IDi′ |S ∑ ∑ln it, Pit) = a | i ′ jlnS i jt + bi′ jlnPi jt, P(Ŷit = ID1 S it, Pit) j=1 j=1 i = 1, ..., n; i′ = 2, ..., n; t = 1, ...,T (3.7) Note that Eqs. 6 and 7 for loss in privacy can be extended to include further un- protected variables such as marketing variables, customer visit counts, and other similar variables depending on the data context. Thus, with this metric, we then define the MLP metric. MLP can measure the maximum loss of privacy across all customers in the data set - it serves as the measure of the privacy for the least privacy protected customer in the data.9 9To account for out-of-sample fit, we calculate the above metrics using a leave-one-out cross- validation procedure, as specified by Schneider et al. (2018). Furthermore, we use the MLP of the true data as the upper-bound on the MLPs for all other methods. 176 MLP = max{LP1, ..., LPn}. (3.8) Trade-off between Information Loss and Privacy Protection The risk-utility curve introduced by Duncan and Stokes (2004), describes the fun- damental trade-off between the risk of confidential data disclosure and the util- ity of a data set for analysis. From this stream of literature, we know that firms and regulators collect data with the underlying promise that the data will be kept confidential. In order to honor this confidentiality pledge, firms need to share data such that the risk of disclosure is minimized. De-identification, i.e. removing identifiers such as names, addresses, phone numbers, etc from the data is not suffi- cient to reduce disclosure risks to acceptable levels, as “data snoopers”, i.e. entities with authorized access to the data but goals of uncovering individuals in the data, can link the data to other dataset which have names and identifiers associated with them, and that with such “linkage”, data can be re-identified. They argue that masking strategies, such as data coarsening, top-coding, aggregating, etc al- low for reduced disclosure risk as the data becomes less identifiable, however the data utility, i.e. quality of inference from this masked data, also reduces, since the perturbations, or noise, added to the data impact the inference that can be drawn from the data. This inherent trade-off between disclosure risk and data utility is the essence of the stream of literature that looks at accuracy-privacy trade-off. Using this concept to quantify the trade-offs between the two measures of ac- 177 curacy and loss of privacy, we compare the performance of our generator against those of the benchmark methods. Similar to Schneider et al. (2018), we plot var- ious methods’ information loss (utility of data) against the loss of privacy (risk of disclosure). We further explore how incorporating heterogeneity informs the privacy trade-off. Data Volume Scalability: Training Speed In this section, we examine scalability in terms of training time when protection is provided in terms of numbers of rows of data (N). One challenge in comparing speed of training using SGD is that the training algorithm can accommodate an arbitrary number of iterations. We therefore run the training algorithm well past the number of iterations at which the loss function becomes stationary from visual inspection. We then measure total run time and run time per iteration to examine how run time scales to volume of data. Data Volume Scalability: Information Size An additional benefit of using a GAN is that the size of information passed be- tween parties in big data settings is significantly less when only a generative model is being transferred, and not actual data. By incorporating the data gen- erating process, the generator effectively serves as a data compression algorithm. Size measured by information transfer is a function of GAN complexity measured 178 by number of neurons, as opposed to the size of the data set. Data Velocity Scalability We examine here how the “on-line” nature of SGD can be exploited to train the GAN as new data stream into the provider. First we train a GAN to convergence, subsequently referred to as the baseline model. Then, we explore how the SGD responds to a single burst of new data by simulating a small new training dataset from the data generating distribution. We then run two versions of the proposed model. In the first, the new data are “streamed” into the baseline model, and in the second, the training is “restarted” by retraining on the combination of new and old training data. We then compare the point at which both training methods regain the same level of information loss in the presence of new data. Generalizability to Marketing Problems: Price Markups and Optimal Profits Marketing managers are interested in estimating price markups for their products in order to obtain optimal profits based on their customers behavior. We now discuss how we evaluate price markups and optimal profits from our proposed approach and compare with those obtained from benchmark methods, following the approach given in Schneider et al. (2018). We use the Monte Carlo data and compute the price elasticities using Eq. 3.4. Thus, we first estimate the price markups for the original data and data from 179 benchmark methods as: C P ii = (3.9)1 + βi Where Pi is the price markup for a brand i, and Ci is the cost for the brand i. As the next step, we compute the optimal profit ratio using the following equation: ∏∏∗ β ∗i i + 1 β( i + 1 βi= ∗ )β (3.10) i βi + 1 β ∗ i + 1 βi ∏ Where ∗i is the profit obtained for brand i using the price obtained from E∏q. 3.9 with the price elasticities obtained using the benchmark methods, i.e. β∗i . i is the profit obtained for brand i using the price elasticity obtained from the true data, i.e. βi. We use the ratio of the optimal profits obtained from benchmark methods and that obtain∏ed from true data for each of the brands, and report the∗ optimal profit ratio, i.e. ∏i . This metric helps us estimate the relative loss in opti- i mal profit from using the benchmark methods, as opposed to the profits obtained from the true data. Generalizability to Marketing Problems: Customer Targeting We now discuss an application of GANs to customer targeting models. We borrow the purchase model from Park and Park (2016) and briefly discuss the setup. Park 180 and Park (2016) use click-stream data of an online retailer to predict purchase based on online visits and marketing efforts, by modeling the visit behavior and purchase behavior in their proposed model. As a proof of concept, we estimate their purchase behavior model.10 Similar to our Monte Carlo data, we construct purchase behavior for 200 cus- tomers and 52 weeks using the purchase probability model from Park and Park (2016). More formally: ui j = bi j + βv,i jVi j + βm,i, jMi, j + βppδpp,i j + i j (3.11) Pi j = 1 i f ui j > 0 (3.12) Where Pi j is whether a customer i makes a purchase in week j, which we ob- serve when the utility ui j is greater than zero. In the customer’s utility function, Vi j is the log of the visits made by the customers to store so far, Mi, j is a dummy for whether a customer was marketed in the week j or not, and δpp,i j is the dummy for whether the customer made a purchase in the week preceding week j. i j is the random error term drawn from type I extreme value distribution. We borrow the random coefficients from Park and Park (2016) and discuss them in further detail in the Appendix 3.8.3. We consider the data constructed in this manner as the true 10We model the visit probabilities of customers as draws from a random uniform distribution, and give further details in the Appendix 3.8.3 181 data, with purchase variable (whether a customer purchased in a current week) as the private, protected data, and the other variables as the public data. Thus, with the true data and the benchmark methods, we estimate the fol- lowing targeting model and calculate the MAPD for true data coefficients and benchmark data coefficients as a measure of targeting accuracy: ebi j+βv,i jVi j+βm,i, j Mi, j+βppδpp,i j P(Yi j = 1) = (3.13)1 + ebi j+βv,i jVi j+βm,i, j Mi, j+βppδpp,i j Generalizability to Marketing Problems: Tackling Multiple Marketing Prob- lems With One GAN We now discuss a context to evaluate whether a single GANs can handle multiple marketing problems. As a proof concept, we construct a Monte Carlo data for customer purchases when the firms set prices, and chose combinations of other marketing instruments of product feature and product display. In this setting, a customer in a given week observes publicly available prices and marketing variables for the five brands and subsequently makes purchases across the five brands. Consistent with our procedure before, we follow the log- log model as the data generating process. The data generating process specifica- tion is along the lines of Schneider et al. (2018), as they model purchase behavior of consumers based on observed prices and marketing mix variables along the lines of the market response model of SCAN*PRO. More formally: 182 lnS i jt = µ j + µi j + β jlnPi jt + ln(δ f j)Fi jt + ln(δd j)Di jt + ln(δ f d j)FDi jt + i jt, (3.14) Where S i jt is the sales made by a customer i for a brand j in a week t, µ j is the brand specific random effect, µi j is the customer-brand random effect, and Pi jt is the price observed by the customer i for brand j and time t. Fi jt, Di jt, and FDi jt are dummy variables for whether the brand j was featured, displayed, and both featured and displayed to the customer i during time t. The price distribution and coefficients are the same as those described in the Appendix 3.8.2 and the Appendix 3.8.4. We consider the data constructed in this manner as the true data, with sales variable (how much a customer purchased in a current week) as the private, protected data, and the other variables as the public data. Through this exercise, we aim to measure the effectiveness of GANs, as com- pared to benchmarks, in capturing both price elasticities as well as marketing variables of interest such as brand features and brand display. This also helps us evaluate whether a single GANs can solve multiple marketing problems. 183 3.4 Proposed Model 3.4.1 Generative Adversarial Networks In this section we describe the GAN method. The generator takes in as input the draws of random variable z and public data x and outputs generated data G(z|x; θg), where θg are generator’s parameters which are learned during the train- ing process. The discriminator take in as input both the real, private data y and generated data G(z|x; θg) and attempts to distinguish between the real and the gen- erated data in a binary classification task. The discriminators parameters are θd, which are learned during the training process. Following the design of Mirza and Osindero (2014), Conditional Generative Adversarial Networks have the follow- ing objective function: [ ] [ ] min max V(D,G) = Ey∼pdata(y) logD(y|x; θd) + Ez∼Pz(z) log(1 − D(G(z|x; θg)|x; θd)) , (3.15)G D The objective function has theoretical links to both Kullback-Leibler and Jensen-Shannon Divergence (Goodfellow et al., 2014), and the underlying intu- ition is that the training procedure minimizes the distance between the distribu- tion of the real and distribution of the generated data. Goodfellow et al. (2014) also provide theoretical guarantees that pg, i.e. generated data distribution, converges to pd, i.e. the true data distribution.11 11Goodfellow et al. (2014) derive theoretical guarantees for convergence in section 4.1 and 4.2 of 184 GANs have been traditionally used in the computer vision literature, where the generator learns the mapping θg from random noise to the space of real im- ages, as the discriminator predicts images as being real or fake in this min-max game. GANs have been shown to be able to generate realistic images of faces (Radford et al., 2015; Chen et al., 2016), and several other categories of images such as home interiors, animals, and vehicles (Kim and Bengio, 2016; Wang and Liu, 2016). In section 3.4.3, we discuss how we extend GANs to train on customer level data. 3.4.2 Design of Neural Networks We next discuss the design of the generator and discriminator neural networks. We base the design of both generator and discriminator on the original condi- tional GAN from Mirza and Osindero (2014). These are neural networks with only “fully-connected” hidden layers, and are represented in Figures 3.4 and 3.5. As proof of concept, we employ only one hidden layer for both neural net- works. We ReLU activation in the discriminator and Leaky ReLU activation in the generator. We also use batch normalization in the generator.12 From previ- ous studies, we understand that neural networks perform well in classification their paper, and argue that the generated data distribution converges to the true data distribution when the discriminator is allowed to reach its optimum at each iteration. We rely on this theoretical guarantee for convergence, and note that in our experiments we set the number of iterations to 100,000 as we found that the objective function stopped improving sufficiently prior to this number of iterations. 12We discuss the data flows in GANs and robustness to different network architectures in the Appendix 3.8.5 and 3.8.7 respectively. 185 Figure 3.4: Design of Generator Neural Network Inputs Hidden Layer Output (Random Noise and Prices) (Fully Connected) (Generated Data) Dimensions: N X (Z+K) Dimensions: (Z+K) X H Dimensions: N X K N observations of Z Number of neurons: H N observations of K dimensional random Each neuron has (K+K) generated sales noise and prices weights and 1 bias term Activation: Leaky ReLu(x) and data generation tasks due to the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989): Given enough neurons in the hidden layers, a neural network can approximate any function. Essentially, this property allows the gen- erator to represent the true DGP of original unprotected data in a semi-parametric fashion: such that the more neurons included in each hidden layer, the better able the network is to capture the true DGP. As the number of parameters grows, how- ever, the model increases an estimation “burden” on the data as well, which can potentially yield a nonlinear relationship between the number of neurons and ac- curacy of the generator. We thus explore the efficacy of hyperparameters, such as the number of neurons, in influencing performance on accuracy and privacy metrics in the Appendix 3.8.6. 186 Figure 3.5: Design of Discriminator Neural Network Inputs Hidden Layer Output (Sales and Prices) (Fully Connected) (Probability of Real Data) Dimensions: N X (K+K) Dimensions: (K+K) X H Dimensions: N X 1 N observations of sales Number of neurons: H N probabilities: whether and prices Each neuron has (K+K) the data is from real data weights and 1 bias term or from generated data Activation: ReLu(x): max(0,x) 3.4.3 The Picture-Data Analogy and Extension to Heterogeneity We now describe our extension of GANs to train on customer data. Conditional GANs were originally designed to mimic the data-generating process for pictures, when given a particular vector that conditions on labels. Among the most well known examples is that of the MNIST dataset of handwritten digits. Given a vector denoting a number between zero and nine, it trains a generator to produce a picture of the designated handwritten digit. In essence, it teaches the generator to write. We see a direct parallel between the numerical matrices of which pictures are composed and the panel-data format often used in marketing, economics, and statistics research. 187 Figure 3.6: Picture Data Analogy The connection between pictures and data is illustrated in Figure 3.6. Just as in the realm of computer vision, the conditional GAN “conditions” on a label and then generates a picture of a handwritten digit, the proposed GAN can condition on a matrix of unprotected data columns X and generate a data matrix Y. To carry this analogy further, we define what would constitute a “picture” in the panel data setting. Figure 3.7 presents an example of a Nielsen Scanner Panel household data set in which rows correspond to a household’s weekly observations and columns to weekly sales and advertising spend per brand. We then need to protect the column “sales”, i.e. treat it as the private data, and share the rest of the columns, i.e. treat them as public data. Effectively, we treat each weekly observation as equivalent to a “picture” in the machine vision context, such that with a random noise matrix and conditional GAN specific X’s as the row, and generate a “pic- ture” of the protected variable, sales (i.e., Y). The assumption in this analogy is 188 that the information in each row of data is sufficient to characterize the data gen- erating patterns of each week of sales. In this way, each row is independently and identically distributed across each customer and week. Figure 3.7: A “Picture” in Panel Data Context: Without Heterogeneity. Figure 3.8: A “Picture” in Panel Data Context: With Heterogeneity. Note that in the presence of considerable customer heterogeneity, such as K 189 types of customers, this type of picture data analogy becomes less effective at cap- turing existing differences across customers. Heterogeneity implies that there ex- ists unique segment averages in the X and Y variables, for each type of customer, that differentiate them from customers in the other segments. Practically speak- ing, this implies that each block of customer data (of T rows), rather than each data row, would be i.i.d. A simple way to capture this heterogeneity would be to treat as a “picture” each block of customer data, as grouped by the time series structure (see Figure 3.8). We therefore define the two variants of our proposed models as (a) without heterogeneity (No Het.), and (b) with heterogeneity (Het.), depending on how we treat the picture equivalent in the data, i.e. either each customer-week as a picture, or each block of customer data across time periods as a picture. In the generator without heterogeneity - GAN (No Het.), we define each unit of the mini-batch samples as random draws at the row level. In the generator with heterogeneity - GAN (Het.), we define each unit of the mini-batch samples as a block of cus- tomers. We use the Jensen-Shannon divergence, Kullback-Leibler divergence, and KS-test as discussed in Section 3.3.3 to get quantitative estimates of how well do GAN (Het.) and GAN (No Het.) mimick the data generating process. Accuracy and privacy results are compared and discussed in the following section. By changing the way the GAN is trained, we hypothesize that the data provider can trade off between accuracy and privacy with a single, easy switch. Our analysis treats sales as the protected data, and the rest of the data as public 190 data. This approach is consistent with existing literature (for example, see Schnei- der et al. (2018)). 3.4.4 Training We now discuss the training process for the conditional GAN. Recall from the last sub-section that there are two possible approaches we can take for the data: Het., where we consider one customer’s data for all weeks as a picture equivalent, and No Het., where we randomly sample across customers and weeks to generate a picture equivalent. For the purposes of illustration, we discuss below with the notation for Het. case.13 We estimate the parameters for the generator: θg, and discriminator: θd via stochastic gradient descent (SGD) with momentum using the ADAM optimizer (Kingma and Ba, 2014). Stochastic gradient descent updates the parameter θg (and similarly θd) based on the loss function for the generator J(θg) (J(θd) for the dis- criminator) for a mini-batch of the data of size n customers using the following update procedure: ∑n θg ←− θg − ηg.∇ 1 ( ) θJg(θg), Jg(θg) = log 1 − D(G(zi, pi, θg), θd) (3.16)n i=1 13We consider k=52 weeks as the duration, thus each customer has 52 weeks of purchase data which constitutes as a picture data for the training purposes. For the No Het. case, we randomly sample 52 customer-weeks across the entire data as rows to construct a picture equivalent. 191 ←− − 1 ∑n ( ) ( ) θd θd ηd.∇θJd(θd), Jd(θd) = (log D(sr,i, pi, θd + log 1 − D(G(zi, pi, θg), θd) ) (3.17)n i=1 Where ηg, ηd are the learning rates and n is the mini-batch size of the data (number of customers) sampled in the iteration.14 sr,i are the “real” sales observed for a customer i observed in the true data, pi are the prices observed by the cus- tomer, and G(zi, pi) are the “generated” sales for the customer which we get from the generator with random noise zi. Thus, the discriminator serves as a binary classifier, as it maximizes the objective function Jd(θd) such that it minimizes the probability of incorrectly labelling the “generated” data and real, and it maximizes the probability of correctly labelling the “real” data as real. The generator max- imizes objective function Jg(θg), which maximizes the probability of fooling the discriminator, i.e., generating data such that the “generated” data is more likely to be labelled as real. We discuss in the Appendix 3.8.5 the training process for our GAN with gradients flow used to update the parameters. This approach allows us several advantages. First, because the GAN training framework allows for the separation of generator and discriminator, the generator needs only the loss function Jg(θg), and uses the gradient ∇θJg(θg) to update its parameters. Note that the private, protected data of customer sales: sr,i is available only to the discriminator via its objective function Jd(θd). Second, open source software like Tensorflow allows for scalable parallel computing on graphics or tensor-processing units (Abadi et al., 2016). Third, the optimization is done in 14ADAM uses adaptive learning rate such that ηg, ηd hyperparameters are optimized during training. We refer the readers to the Kingma and Ba (2014) paper for a detailed description of the ADAM optimizer. 192 mini batches to update parameters, which allows for the scalability advantages of “on-line” training. We explore in detail these scalability advantages provided by stochastic gradient descent in our results. 3.5 Empirical Context and Results This section is organized in three parts. First, we demonstrate effectiveness of GANs on the accuracy and privacy protection metrics as compared to bench- mark methods using a Monte Carlo data. Furthermore, we validate the accuracy- privacy trade-off on real world data. Second, by using Monte Carlo data, we explore scalability advantages of GANs - how GANs handle volume and velocity of data. Finally, as a proof of concept, we show generalizability of GANs. That is, we demonstrate that GANs can be used to tackle marketing problems of setting prices for optimal profits and customer targeting. Furthermore, we also demon- strate that a single GAN can handle both contexts combined. 3.5.1 Accuracy - Privacy Trade-off In this section we estimate how well GANs perform on the accuracy and privacy metrics as compared to benchmark methods using a Monte Carlo data and subse- quently validate on a real world Nielsen data. 193 Monte Carlo Experiment In this Monte Carlo experiment, we generate household-level customer data for five representative brands using the data generating process specified in Section 3.3.3. The data context is thus similar to the real world Nielsen data. We take as a starting point 200 customers over a span of 52 weeks for five brands’ sales and prices. Note that brand prices are public data (i.e. accessible by both researcher and the firm), whereas sales are private data (i.e. accessible only by the firm). Table 3.2 reports summary statistics for the Monte Carlo data. We discuss further details of the Monte Carlo data in the Appendix 3.8.2. Distributional Accuracy We first examine the proposed GAN’s generated synthetic data distributional ac- curacy relative to that of the true data. Figure 3.9 compares data densities for log-sales of all brands protected by various methods. For purposes of this analy- sis, the 20% and 50% swap methods will not provide meaningful comparisons, as the distribution of the variable of interest (sales), by construction, does not change after merely swapping sales in the data.15 Across the five brands, we find that GANs fit the true data the most closely, and that the fit for GAN (Het.) is higher than for GAN (No Het.). The next best fit is 15Nor would market level data be considered, as this analysis aggregates data across customers at the weekly level and the distribution range would not be amenable to comparison using meth- ods that track individual customer level sales. 194 Figure 3.9: Distributional Accuracy for Synthetic Data Brand 1 Brand 2 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 −5.0 −2.5 0.0 −5.0 −2.5 0.0 Brand 3 Brand 4 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 −5.0 −2.5 0.0 −5.0 −2.5 0.0 log Sales Brand 5 0.6 Model GAN (Het. H:512) 0.4 GAN (No Het. H:512) Random 0.2 Rounding Top Coding 0.0 True Data −5.0 −2.5 0.0 log Sales 195 Density Density Density Table 3.2: Summary Statistics for Monte Carlo Data (N=10,400) This table shows the summary statistics for the Monte Carlo data generated using the process described in the Appendix 3.8.2. The prices are public data, i.e. observed by both researcher, and the sales are private data, i.e. visible only to the firm. Note that the generator in the GAN model has access to only the public data (prices), and the private data (sales) never leaves the walls of the firm. Variable Mean Std. Dev. Min Max Price (Brand 1) 8.69 0.93 5.06 12.24 Price (Brand 2) 5.08 0.50 2.92 7.11 Price (Brand 3) 5.48 0.50 3.51 7.34 Price (Brand 4) 5.94 0.39 4.37 7.45 Price (Brand 5) 4.15 0.77 0.90 7.40 Sales (Brand 1) 0.07 0.09 0 1.40 Sales (Brand 2) 0.11 0.15 0 3.75 Sales (Brand 3) 0.05 0.09 0 4.38 Sales (Brand 4) 0.05 0.06 0 1.21 Sales (Brand 5) 0.12 0.19 0 4.87 from top-coding, which by construction, differs from true data post the truncation point of 95th percentile. We also find that random noise and rounding fluctuate around the true data, and have a poorer fit. This finding provides preliminary evidence that GAN most closely fits the true data. In order to get quantitative and robust measures of the fit of the different ap- proaches to the true data, we estimate the distance between the true data distri- 196 bution and that of the data protected by various methods (proposed and bench- mark). In Table 3.3, we examine the corresponding distribution metrics, namely, JSD, KL divergence, and KS Statistic. Examining the JSD metric, we observe the lowest value for the GAN (Het.): 0,0213. Rounding benchmark follows second with JSD of 0.0288, closely followed by GAN (No Het.): 0.0307. This finding indi- cates that the probability distributions for GANs and true data are the close. When we consider the KL divergence metric, we find that GAN (Het.) also has the low- est value of 0.0231. Thus, GAN (Het.) probability distribution is closest to the true data distribution. This conclusion is also the case for the KS Statistic, GAN (Het.) registering the lowest score on the KS test of 0.0077, which gives the upper bound on the difference in cumulative density functions for two distributions. Note that GANs (No Het.) also beats the best performing benchmark on the KS test, with a value of 0.0322 as opposed to 0.05 for Top Coding. Thus, through these three different metrics, we find that the GAN (Het.) distribution is closest to the true data across all measures of statistical differences in distributions. This provides confirmatory empirical evidence that the GANs best mimics the true data. Balance between Accuracy and Privacy We use the information loss metrics to examine accuracy and maximum loss of protection (MLP) for the benchmarks. Although the separability of the GAN pro- vides a first layer of protection, the MLP metric gives us quantitative estimates of the loss in privacy in the situation that the transferred generator were hacked. 197 Table 3.3: Distribution Metrics (Lower the Better) Model JSD KL KS Random Noise 2.147 3.8847 0.1173 Rounding 0.0288 0.0274 0.1036 Top Coding 0.4718 0.8474 0.0500 GAN (No Het.) 0.0307 0.0419 0.0332 GAN (Het.) 0.0213 0.0231 0.0077 Employing the case of a compromised generator, we investigate the likelihood that the generated data can be traced back to the original IDs of customers. We find evidence consistent with those in the Nielsen data. Figure 3.10 shows the results. Benchmark methods for random noise, rounding, and top-coding have lower loss of information, MAPD, but higher loss in privacy protection com- pared to other benchmark methods. The 20% swap has a much lower information loss as compared to the 50% swap, which by construction has information loss, MAPD, of approximately 50. 50% swap, however, has much better privacy protec- tion than other individual customer level benchmark methods. The market-level benchmark method offers the best privacy protection, MLP of 0, by construction, but comes with high information loss of 56. Ideally, we want to be at the bottom left of the MAPD-MLP plot, with low information loss and low loss of privacy. We find that our proposed generators show consistently lower information loss and superior privacy protection than all 198 Figure 3.10: Performance on Monte Carlo Data. ● Market model ● Benchmark ● Swap 50 ● GAN (Het.) ● GAN (No Het.) 40 ● True Data ● Swap 20 Random Noise 20 ● Rounding ● ● ● No Het. (H: 512) Top Coding Het. (H: 512) ● 0 ● True Data 0.00 0.01 0.02 0.03 0.04 Maximum loss of protection (MLP) the benchmark methods. Specifically, we find lower information loss, in terms of MAPD for GAN (Het.) than for GAN (No Het.). GAN (Het.) with 512 neurons has an MAPD of 1.2, which is a 4.6 times improvement in accuracy as compared to the best benchmark method, which is top-coding, with an MAPD of 5.3. This finding is consistent with JSD and KS Statistic measures obtained in the previous section. We find, however, with lower information loss comes a trade-off regarding pri- 199 Information Loss (MAPD) vacy protection. GAN (No Het.) has significantly superior privacy protection than GAN (Het.), but with higher loss in information. Interestingly, we find that GANs (No Het.) have MLP of 0.0035, which is closest to the market-level data as compared to each of the other methods: The loss of information varies between 4.6 and 10, which is significantly superior to the information loss for 50% swap. At the cost of potential privacy loss, GAN (Het.) has much lower information loss than all other methods. Furthermore, despite this trade-off, we find that our proposed generators occupy the bottom left of the MAPD-MLP plot, thus indicat- ing that relative to the benchmark methods they offer a superior overall balance between accuracy and privacy.16 Real Data Validation: Nielsen Data We apply the proposed and benchmark methods for protecting a data set in a “real-world” setting using the 2006 Nielsen Household Panel and Retail Scan- ner data sets. Both have been studied extensively in the marketing literature and are used by marketing practitioners. Although our method should be applicable to any data transfer setting in downstream applications using any class of infer- ence models, this canonical data inform a natural proof of concept examining real world performance related to information and privacy loss. 16We explore the relationship between model parameters: number of neurons, and the accuracy of GANs in the Appendix 3.8.6. We also explore the robustness of model’s architecture such as activation functions, batch normalization, and the noise distribution used to generate data in the Appendix 3.8.7. 200 To demonstrate the applicability of our proposed method on a reasonably large data set in a real world setting, our initial analysis uses the Nielsen data set to construct a sample with at least ten thousand rows composed of data for two hundred households across fifty weeks for the year 2006. We define variables similar to those used by Hendel and Nevo (2006) and Schneider et al. (2018). Following Hendel and Nevo (2006), we examine consumer purchases in the liquid detergent category aggregated at brand level for the leading brands: Tide, Cheer, All, and Wisk, the remainder combined as Others brand. The unit of ob- servation is household-week, and we observe purchases ($ amount) of each brand by each household, and the prices ($ amount) observed during that week for each of the brands. We consider prices as the publicly available data, and treat sales as the private data that only the data-provider has access. We thus create a dataset of 200 randomly sampled households that made at least ten purchases in the year 2006. We then estimate the private data, i.e. sales from benchmark methods and from the data generated by our proposed GANs. In order to estimate accuracy, we compute coefficients from the true data and benchmarks for Eq. 3.4, estimate the MAPD metric using Eq. 3.5, and estimate the loss of privacy metric MLP using Eq. 3.8. Figure 3.11 illustrates our examination of information and privacy loss with the proposed and benchmark methods17 17We modify Top Coding (99.9 percentile instead of 95) and Random Noise (centiles instead of deciles) to increase the difficulty of the benchmark comparison, as the 95% and deciles have higher information loss and we did not want the real-data benchmark to be easier than the Monte Carlo data setting. Rounding is modified to the nearest dollar instead of nearest cent (hundredth place) or nearest 10th cent (tenth place), since in the true data the sales are often ending in 9 cents (for example $3.89 is rounded to $4.00) 201 Figure 3.11: Information Loss and Loss of Privacy on Real Data Market Level model Benchmark GAN (Het.) 75 GAN (No Het.) True Data Swap 50% 50 GAN (No Het.) Swap 20% Random Noise Rounding 25 Top Coding GAN (Het.) 0 True Data 0.0 0.5 1.0 1.5 Maximum loss of protection (MLP) We find that as compared to the best benchmark method , i.e., top-coding, with MAPD of 11%, GAN (Het.) has an MAPD of 5.6%.18 GANs therefore double the accuracy of the best benchmark method. GAN (No Het.), despite a higher MAPD of 45% compared to 5.6% for GAN (Het.), has the lowest loss in privacy among the non-aggregate benchmarks, with an MLP of 0.15 compared to 0.31 for 50% swap. Overall, we find that our proposed generators consistently outperform the benchmark data protection methods in terms of lower information loss and higher privacy protection. 18Both GAN (Het.) and GAN (No Het.) have 512 neurons each. We discuss how number of neurons affects accuracy in the Appendix 3.8.6. 202 Information Loss (MAPD) 3.5.2 Scalability In this section, we examine the scalability aspect of volume and velocity for GANs, i.e. how well do GANs scale with the volume of data in terms of model estimation time and transferred information size, as well how well do GANs handle newly arriving data, i.e. streaming data. For the purposes of this section, we use the Monte Carlo data described earlier and as summarized in Table 3.2. Estimation Time In this section, we discuss the relationship between volume of data and estima- tion time for GANs. We vary the size of data from ∼1000 rows of data (N), i.e. 10 customers (Nc) and 102 weeks (T ) per customer, to ∼10 Million rows of data, i.e. 100,000 customers and 102 weeks per customer. We report the total estima- tion time and time per iteration in Table 3.4.19. We observe that training time per iteration increases only marginally with data volume, from 6.33 milliseconds per iteration with 1,000 rows of data to 7.55 milliseconds per iteration with ten million rows of data. We note that the training time across different data volumes shows the same order of magnitude: I.e., the training time stays the same as we sub- stantially increase volume of data. This observation can be attributed to the SGD algorithm used to train the GAN, as its parameters are trained in each iteration us- 19We run the GANs with 512 neurons and mini-batch size of 128 customers in Tensorflow 1.4 on a computer with the following configuration: Intel Core i9-9000X 10 Core 3.3 GHz, 64 GB RAM, and Titan Xp GPU (Pascal), for 100,000 iterations. We use this as a training stopping point because the RMSE between the real data and the synthetically generated samples stabilizes prior to this point, implying GAN convergence. 203 ing a sample of the data, because of which the proposed generator scales well with respect to volume of data. However, we note that the training time may increase due to other factors, such as SGD mini-batch sample size and GAN complexity, both of which are controlled by the researcher and can be adjusted according to available computing equipment. Table 3.4: Data Volume and Estimation Time This table shows the estimation time for GANs based on the volume of the data. Nc is the number of customers in the data, T is the number of timer periods in the data, N is the volume of data, i.e. number of rows in the data, and equals Nc × T . Estimation time is the total time in minutes needed to estimate a GAN model, and we also report the time per iteration (in milliseconds). Rows of Nc T Estimation Time per Data (N) Time (min) Iteration (ms) ∼ 1, 000 10 102 10.55 6.33 ∼ 10, 000 100 102 10.65 6.39 ∼ 100, 000 1,000 102 10.79 6.47 ∼ 1, 000, 000 10,000 102 11.03 6.62 ∼ 10, 000, 000 100,000 102 12.58 7.55 204 Transferred Information Size Table 3.5: Data Volume and Generator Compression This table shows the transferred information size for GANs based on the volume of the data. Nc is the number of customers in the data, T is the number of timer periods in the data, N is the volume of data, i.e. number of rows in the data, and equals Nc × T . Original File Size is the size of the true data, Generator Size is the size of the model checkpoints that are transferred to a researcher in order instantiate a fully trained generator, and File Compression Factor is the ratio of Generator Size and Original File Size. Rows of Nc T Original File Generator File Compression Data (N) Size Size Factor (Original / Generator) ∼ 1, 000 10 102 180 KB 7.79 MB 0.023 ∼ 10, 000 100 102 1.8 MB 7.79 MB 0.23 ∼ 100, 000 1000 102 18 MB 7.79 MB 2.28 ∼ 1, 000, 000 10,000 102 185 MB 7.79 MB 23.75 ∼ 10, 000, 000 100,000 102 1,719.21 MB 7.79 MB 220.67 We next examine the relationship between data volume and transferred file size. Note that, as expected, the original file size that will otherwise need to be transferred (using comparable benchmark methods) grows with the rows of data. In the context in which we transfer the generator, however, the size of the file 205 Figure 3.12: Generator Size and GAN Complexity 20 15 10 5 0 128 256 512 768 1024 Number of Neurons grows only in proportion to GAN complexity. Table 3.5 shows the volume of in- formation transferred. We observe that, whereas file size is large at 7.79 MB in a small data setting of 1,000 rows, this approach excels with the transfer of larger amounts of data on the order of 100,000 rows or greater20. Figure 3.12 shows an almost linear relationship in generator size and number of neurons. This is due to the number of generator parameters increasing linearly with number of neurons. We note that in settings closer to real-world situations (e.g., more than one million rows), GANs are able to achieve a file compression factor that is at least twenty- three, i.e the transferred information with GANs is twenty three times smaller than the original data file size. These findings assure us that our proposed gener- 20The data size reported is the size of the checkpoint data that Tensorflow saves for the genera- tor parameters. The generator uses 512 number of neurons. 206 Model Size (MB) ator scales well for transferred information size with respect to data volume and GAN complexity. Data Velocity Scalability We examine information loss when the algorithm is trained on in-flowing data. Figure 3.13 presents the results of comparing the traditional “restart,” in which the GAN is trained from scratch with each new inflow of data, and the “stream- ing” method, in which the GAN is trained continuously from previously known estimates. We find that in the case of streaming rather than restart, information loss stabilizes sooner, and in the first 50 thousand iterations, the MAPD is lower. We observe less information loss in the streaming than in the restart case, in which the GAN parameters are learned from scratch. This finding results from employ- ing stochastic gradient descent as the training method for streaming, whereby training of the GAN parameters is continuous yet with more data. More gener- ally, the “online” nature of SGD can be exploited as a learning method in GANs with continuous streaming data to train GAN parameters as soon as new data presents. 207 Figure 3.13: Streaming Data and Information Loss model baseline restart 120 streaming 80 40 0 50000 100000 150000 200000 Number of Iterations 3.5.3 Generalizability to Marketing Problems In this section, we demonstrate as a proof of concept how GANs can generalize to marketing contexts of setting prices for optimal profits, customer targeting, and also demonstrate that a single GAN tackle both problems of price setting and cus- tomer targeting. We do so using a series of Monte Carlo datasets. For the purpose of this analysis, we focus on GAN (Het.) since we find from our prior results that GAN (Het.) achieves higher accuracy than GAN (No Het.). Furthermore, GAN (Het.) performs better than benchmarks on the accuracy-privacy trade-off. 208 Information Loss (MAPD) Price Markups and Optimal Profit Ratio We now discuss how GANs compare relative to benchmarks on setting price markups for optimal profits. We use the Monte Carlo dataset from before and as described in Table 3.2. Table 3.6 shows the markup percentages w.r.t. cost for each of the five brands. These markups are obtained using Eq. 3.9 which uses the price elasticities as computed from the true data and benchmark data protection methods. We find that for each of the brands, the price markups estimated from GAN (Het.) is closest to the true data, as compared to other benchmarks. Table 3.6: Price Markups from Eq. 3.9 for True Data and Benchmarks This table shows the price markups for optimal profits obtained from Eq 3.9 for the true data, GANs, and other benchmarks. We obtain these price markups for each of the five brands. Method Brand 1 Brand 2 Brand 3 Brand 4 Brand 5 True Data 229.35 226.72 98.15 162.68 110.32 GAN (Het.) 241.98 234.36 99.34 165.22 113.09 Random Noise 205.46 -558.21 -114.40 79.02 -233.61 Rounding 140.34 163.86 118.49 190.83 819.11 Swap 20 84.45 543.69 64.36 97.25 923.39 Swap 50 -332.17 -333.26 104.31 -106.79 -289.60 Top Coding 122.76 147.22 95.99 120.00 396.21 Thus, we now estimate optimal profit ratios using Eq. 3.10. Table 3.7 shows the 209 ratio of the optimal profits obtained from benchmark methods w.r.t. the optimal profits obtained from using the true data. We find here that the optimal profits obtained GAN (Het.) are consistently higher than 99.96% of those obtained from the true data for each of the five brands, and it also consistently outperforms other benchmark methods. Amongst the other benchmark methods, the closest is Top Coding, whose optimal profits vary from 69.91% for brand 5 to 99.99% for brand 3, however GAN (Het.) outperforms Top Coding across each of the five brands. Table 3.7: Optimal Profit Ratio from Eq. 3.10 for Benchmark Methods w.r.t True Data This table shows the optimal profit ratios (as % of profits obtained from the true data) using Eq 3.9 for GANs and other benchmarks. We obtain these optimal profit ratios for each of the five brands. Method Brand 1 Brand 2 Brand 3 Brand 4 Brand 5 GAN (Het.) 99.96% 99.98% 99.99% 99.99% 99.99% Random Noise 99.81% 31.41% 3.64% 90.22% 5.65% Rounding 96.20% 98.34% 99.11% 99.52% 44.63% Swap 20 84.65% 90.25% 95.64% 94.93% 40.99% Swap 50 31.96% 31.41% 99.91% 16.97% 5.65% Top Coding 93.85% 97.05% 99.99% 98.22% 69.91% This finding suggests that managers can use GANs to make pricing decisions that lead to higher profits as compared to benchmark approaches. Further, GANs fare better on the accuracy-privacy trade-off for this Monte Carlo data - see Fig- 210 ure 3.10. Thus, GANs can provide a suitable alternative to the true data as mar- keting managers using customer sales data will be interested in computing price markups and optimizing profits. Customer Targeting In order to estimate customer targeting accuracy for GANs and traditional bench- marks, we generate a Monte Carlo dataset using the process described in Section 3.3.3. The data comprises of 200 customers and 52 weeks, for a total of 10,400 observations. For each customer-week, we observe whether the customer was marketed to or not (dummy variable: Marketing), whether the customer made a purchase in the previous week (dummy variable: Previous Purchase), how many times the customer has visited the store so far: log(Visits So Far). The private data, and the outcome variable of interest, is whether the customer makes a purchase or not in the current week (dummy variable: Purchase). With this data, we estimate MAPD of coefficients from true data and benchmark data from Eq. 3.13. Note that since the outcome variable is whether a customer purchased (or not), the bench- marks methods of random noise, rounding, and top coding do not apply since they are applicable only on continuous outcome variables. Thus, we benchmark GANs with Swap 20 and Swap 50 methods using the logistic regression for mod- eling purchase behavior as described in Eq. 3.13. Figure 3.14 shows the accuracy and privacy results for GANs and benchmark methods. We find that GANs are able to achieve an accuracy of 0.1% in customer 211 Table 3.8: Summary Statistics for Monte Carlo Data (N=10,400) This table shows the summary statistics for the Monte Carlo data generated using the process described in Section 3.3.3. The following variables are public: Marketing, Previ- ous Purchase, and log(Visits So Far), i.e. observed by both researcher, and the Purchase variable is private data, i.e. visible only to the firm. Note that the generator in the GAN model has access to only the public data, and the private data never leaves the walls of the firm. Variable Mean Std. Dev. Min Max Marketing 0.10 0.30 0 1 Previous Purchase 0.24 0.43 0 1 log(Visits So Far) 2.38 0.84 0 3.83 Purchase 0.24 0.43 0 1 targeting model, and that this exceeds the benchmarks of swap 20, with an MAPD of 10.04%, and also swap 50, with an MAPD of 28.88%. Further, GANs are able to achieve higher privacy protection as compared benchmarks, with an MLP of 0.319 as compared to 0.355 for both Swap 20 and Swap 50. This finding suggests that marketing managers, who need to build customer targeting models, will obtain substantially higher accuracy at customer targeting with GANs, as compared to other benchmarks. Furthermore, GANs offer better privacy protection, thus alleviating privacy concerns of data providers who are sharing data. Thus, GANs can provide a suitable alternative to the true data as well as benchmarks to marketing managers interested in building customer tar- geting models. 212 Figure 3.14: Performance for Customer Targeting 0.3 Swap 50 model GAN (Het.) Swap 20 0.2 Swap 50 True Data 0.1 Swap 20 GAN (Het.) 0.0 True Data 0.325 0.350 0.375 Maximum loss of protection (MLP) Tackling Multiple Marketing Problems With One GAN Since the purpose of a GAN is to generate privacy protected synthetic data, we test whether data generated from GANs can be used to run a variety of inference mod- els similar to those that are possible on the true data. Thus, we test as a proof of concept whether a single GAN can handle combined marketing problems pricing and targeting. We generate a Monte Carlo dataset using the process described in Section 3.3.3. The data comprises of 200 customers and 52 weeks across 5 brands, for a total of 10,400 observations. For each customer-week, we observe across the five brands the public data: whether the brand was featured to the customer or not (dummy variable: Fea- 213 Information Loss (MAPD) Table 3.9: Summary Statistics for Monte Carlo Data (N=10,400) This table shows the summary statistics for the Monte Carlo data generated using the process described in Section 3.3.3. The data is for 200 customers for 52 weeks across 5 brands. The following variables are public for each brand: Display, Feature, Price, i.e. observed by both researcher, and the Sales variable for each brand is private data, i.e. visible only to the firm. Note that the generator in the GAN model has access to only the public data, and the private data never leaves the walls of the firm. Variable Mean Std. Dev. Min Max Display (Brand 1) 0.25 0.43 0 1 Display (Brand 2) 0.15 0.36 0 1 Display (Brand 3) 0.05 0.21 0 1 Display (Brand 4) 0.16 0.36 0 1 Display (Brand 5) 0.10 0.30 0 1 Feature (Brand 1) 0.30 0.46 0 1 Feature (Brand 2) 0.20 0.40 0 1 Feature (Brand 3) 0.15 0.36 0 1 Feature (Brand 4) 0.25 0.43 0 1 Feature (Brand 5) 0.10 0.30 0 1 log (Price Brand 1) 1.08 0.17 0.13 1.60 log (Price Brand 2) 1.59 0.21 0.07 2.20 log (Price Brand 3) 2.07 0.13 1.42 2.47 log (Price Brand 4) 2.48 0.08 2.14 2.75 log (Price Brand 5) 2.30 0.10 1.80 2.62 log (Sales Brand 1) -0.67 1.40 -5.67 5.14 log (Sales Brand 2) -2.76 1.31 -7.24 4.11 log (Sales Brand 3) -3.92 1.47 -7.98 5.57 log (Sales Brand 4) -4.65 1.48 -9.44 1.88 214 log (Sales Brand 5) -2.46 1.31 -6.79 4.69 ture), whether the brand was displayed to the customer or not (dummy variable: Display), and the price: log(Price). The private data, and the outcome variable of interest, is how much the customer purchases a certain brand during a week: log(Sales). Table 3.9 shows the summary statistics for the Monte Carlo data. We report the MAPD and MLP results from Eq. 3.14 in Figure 3.15.21. We find that GANs outperform benchmark methods - GANs have an MAPD of 0.0139, i.e. a 1.39% difference in the price elasticities and coefficients for feature and display and their interaction term. The only benchmark that comes close is rounding, with an MAPD of 0.0207, whereas other benchmarks have an MAPD an entire order of magnitude higher at about 0.2 or higher. Furthermore, we find that GANs are at the left most corner on the accuracy-privacy plot, and provide higher privacy protection as compared to benchmarks. Thus, our empirical evidence suggests that GANs can indeed incorporate multiple marketing problems with a single model, and that this outperforms other benchmarks in terms of accuracy-privacy trade-off. The finding that GANs can tackle multiple marketing problems will be of much interest to data providers as well as researchers. Data providers need to train only one GAN model on their entire dataset which can subsequently be used by researchers to draw multiple inferences such as pricing and customer targeting. 21Note that we do not consider market aggregated benchmark since the feature and display for a brand is at customer-week level, thus aggregating it across multiple customers is a weak benchmark 215 Figure 3.15: Performance for Tackling Multiple Problems Swap 50 0.5 model GAN (Het.) Top Coding Random Noise 0.4 Rounding Swap 20 Swap 50 Top Coding 0.3 True Data 0.2 Swap 20 Random Noise 0.1 GAN (Het.) Rounding 0.0 True Data 0.01 0.02 0.03 0.04 0.05 Maximum loss of protection (MLP) 3.6 Discussion In this paper we address the concerns of researchers who need access to firms’ sensitive customer data and present a novel approach that differs from traditional data transfer approaches. We address the three concerns firms and researchers have regarding data transfer: (i) our approach is effective in preserving the pri- vacy of sensitive customer data with higher accuracy; (ii) our proposed genera- tive model scales to big data; (iii) our proposed approach can be used to tackle 216 Information Loss (MAPD) multiple marketing problems. Additionally, by using the picture-data analogy to incorporate heterogeneity, we enable a given firm to control the trade-off between privacy and accuracy. Our proposed method, Generative Adversarial Networks, consists of two com- peting neural networks, a discriminator network and a generator network, the construction of which allows for separability. This decoupled nature has both privacy and scalability advantages. Privacy advantages derive from only the dis- criminator accessing the real data on the firm’s side, thereby ensuring that no real data leaves the walls of the firm. The scalability advantages derive from only the gradients of the loss function’s being passed from the discriminator to the generator. The researcher, with the generator neural network, can generate data mimicking the true data to a high degree of accuracy. We test our generative models on four datasets, household scanner panel data from AC Nielsen, three Monte Carlo customer datasets, and to validate the accu- racy of our proposed generative model in comparison to benchmarks. We find that data generated from GANs have probability distributions closest to the true data, and outperform benchmarks on the accuracy-privacy trade-off. We also evaluated GANs on marketing problems of optimal price markups for profit max- imization, customer targeting, and the ability to tackle multiple marketing prob- lems with the use of a single GAN. We find that GANs outperform benchmarks on tackling marketing problems, and also alleviate data providers logistical and computational overheads as the data providers need train only one GAN model 217 that can tackle several marketing problems. We also address the scalability concerns that are typical for big data. First, we find that our generator scales effectively with respect to data volume and velocity. We find that the training time per iteration is of the same order of magnitude for different data volumes, ranging from one thousand rows to ten million rows of data. Second, we find that the transferred information size outshines true-data transfer when the data volume is of the order of hundreds of thousands rows or more. Finally, we also demonstrate that the stochastic gradient descent (SGD) allows us to handle streaming data; i.e., because the generator training can be resumed without much loss in informational value, it scales effectively regarding new data. An important limitation of our GAN model is that we currently do not model consumer dynamics. This concern can be addressed by modifying the GANs to incorporate attention, which can enable us to capture a possible source of hetero- geneity. In conclusion, we present a novel scalable approach as a proof of concept for data transfer, which demonstrates improved privacy protection compared to benchmark methods and can be used to solve several marketing problems. In light of recent regulatory concerns over data privacy, our findings have significant implications for firms, consumers, and regulators, as privacy protection becomes increasingly important for marketers. 218 3.7 References REFERENCES Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghe- mawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). Tensorflow: A system for large-scale machine learn- ing. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283. Abowd, J., Gittings, R. K., McKinney, K., Stephens, B., Vilhuber, L., and Woodcock, S. (2012). Dynamically consistent noise infusion and partially synthetic data as confidentiality protection measures for related time series. Ansari, A. and Li, Y. (2018). Big data analytics. In Handbook of Marketing Analytics. Edward Elgar Publishing. Burnap, A., Hauser, J. R., and Timoshenko, A. (2019). Design and evaluation of product aesthetics: a human-machine hybrid approach. Available at SSRN 3421771. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). Infogan: Interpretable representation learning by information maximizing gen- erative adversarial nets. In Advances in neural information processing systems, pages 2172–2180. 219 Chintagunta, P., Hanssens, D. M., and Hauser, J. R. (2016). Editorial: Marketing science and big data. Marketing science, 35(3):341–342. Christen, M., Gupta, S., Porter, J. C., Staelin, R., and Wittink, D. R. (1997). Using market-level data to understand promotion effects in a nonlinear model. Journal of Marketing Research, pages 322–334. Culotta, A. and Cutler, J. (2016). Mining brand perceptions from twitter social networks. Marketing science, 35(3):343–362. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314. Duncan, G. T. and Stokes, S. L. (2004). Disclosure risk vs. data utility: The ru confidentiality map as applied to topcoding. Chance, 17(3):16–20. Eguchi, S. and Copas, J. (2006). Interpreting kullback–leibler divergence with the neyman–pearson lemma. Journal of Multivariate Analysis, 97(9):2034–2040. Goldfarb, A. and Tucker, C. (2011). Online display advertising: Targeting and obtrusiveness. Marketing Science, 30(3):389–404. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680. Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural net- works. arXiv preprint arXiv:1312.6211. 220 Hendel, I. and Nevo, A. (2006). Sales and consumer inventory. The RAND Journal of Economics, 37(3):543–561. Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward net- works are universal approximators. Neural networks, 2(5):359–366. Hu, J., Reiter, J. P., and Wang, Q. (2014). Disclosure risk evaluation for fully synthetic categorical data. In International Conference on Privacy in Statistical Databases, pages 185–199. Springer. Kim, T. and Bengio, Y. (2016). Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1):79–86. Leeflang, P. S. and Wittink, D. R. (2000). Building models for marketing decisions:: Past, present and future. International journal of research in marketing, 17(2-3):105– 126. Lin, J. (1991). Divergence measures based on the shannon entropy. IEEE Transac- tions on Information theory, 37(1):145–151. Link, R. (1995). Are aggregate scanner data models biased? Journal of Advertising Research, 35(5):RC8–RC8. 221 Liu, X., Singh, P. V., and Srinivasan, K. (2016). A structured analysis of unstruc- tured big data by leveraging cloud computing. Marketing science, 35(3):363–388. Malik, N. and Singh, P. V. (2019). Deep learning in computer vision: Methods, interpretation, causation and fairness. Interpretation, Causation and Fairness (May 28, 2019). Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Park, C. H. and Park, Y.-H. (2016). Investigating purchase conversion by uncover- ing online visit patterns. Marketing Science, 35(6):894–914. Puranam, D., Narayan, V., and Kadiyali, V. (2017). The effect of calorie posting regulation on consumer opinion: a flexible latent dirichlet allocation model with informative priors. Marketing Science, 36(5):726–746. Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learn- ing with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Rafieian, O. and Yoganarasimhan, H. (2018). Targeting and privacy in mobile advertising. Reiter, J. P. (2005). Estimating risks of identification disclosure in microdata. Jour- nal of the American Statistical Association, 100(472):1103–1112. Reiter, J. P. (2010). Multiple imputation for disclosure limitation: Future research challenges. Journal of Privacy and Confidentiality, 1(2). 222 Schneider, M. J. and Abowd, J. M. (2015). A new method for protecting interre- lated time series with bayesian prior distributions and synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(4):963–975. Schneider, M. J., Jagpal, S., Gupta, S., Li, S., and Yu, Y. (2018). A flexible method for protecting marketing data: An application to point-of-sale data. Marketing Science. Steenburgh, T. J., Ainslie, A., and Engebretson, P. H. (2003). Massively categorical variables: Revealing the information in zip codes. Marketing Science, 22(1):40– 57. Tenn, S. (2006). Avoiding aggregation bias in demand estimation: A multivariate promotional disaggregation approach. Quantitative Marketing and Economics, 4(4):383–405. Timoshenko, A. and Hauser, J. R. (2019). Identifying customer needs from user- generated content. Marketing Science, 38(1):1–20. Toubia, O. and Netzer, O. (2016). Idea generation, creativity, and prototypicality. Marketing science, 36(1):1–20. Wang, D. and Liu, Q. (2016). Learning to draw samples: With application to amor- tized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722. Wittink, D. R., Addona, M. J., Hawkes, W. J., and Porter, J. C. (1988). Scan* pro: The estimation, validation and use of promotional effects based on scanner data. Internal Paper, Cornell University. 223 3.8 Appendix 3.8.1 Inference Model and Information Loss We now describe information loss and the inference model. As an example of a multiple regression framework with continuous independent and dependent variables, consider the following log-log regression in a standard panel data set- ting with entity i, brand j, and time period t ∑ lnYi jt = δ j + β jlnXi jt + βklnXikt + i jt, (3.18) k, j for k number of covariates of interest, where δ j s the brand-level intercept. This log-log regression framework has been used widely in marketing and economics, modeling continuous dependent variables such as store sales, worker wages, and customer demand One context that falls under this framework is the market response model pro- posed in SCAN*PRO Leeflang and Wittink (2000), which has been used exten- sively by AC Nielsen and consumer goods manufacturers Schneider et al. (2018). The SCAN*PRO model uses a log-log framework with own and competitor pric- ing as covariates and sales as the dependent variable.22 This corresponds to a data 22We exclude promotion indicator variables in this proof of concept in order to focus on the method’s power to protect continuous independent and dependent variables. We will explore how to protect indicator variables in future research. 224 generating process similar to a nonlinear customer response function, such as the multiplicative specification proposed by Wittink et al. (1988) for store-level data. Therefore, we formally capture weekly customer sales for multiple brands: ∏ S β ji jt = αi jPi jt P βk i jt ikte , i = 1, ..., n; t = 1, ...,T, (3.19) k, j which translates to a log-log regression model: ∑ lnS i jt = µ j + µi j + β jlnPi jt + βklnPikt + i jt, (3.20) k, j where S i jt denotes household weekly purchase in dollar amount and Pi jt price in dollars for customer i, product brand j, and week t. Pikt are all prices observed by the customer for competitor brands; µ j is the brand specific intercept term and µi j the household-level random effects term drawn from a normal distribution N(0, σµ). For log sales models in which the independent variables are log-prices, β j, the price coefficient, is also the price elasticity;  is the unobserved, indepen- dent error term. Mean absolute percentage difference (MAPD), proposed by Christen et al. (1997) as a measure of the difference between the regression estimate of a market- level aggregation method and true data, allows for assessing the relative differ- ence between a protection method and true data. Essentially, MAPD measures 225 the average absolute difference of coefficient estimates across J number of coeffi- cients of interest. 1 ∑J ∣ MAPD = ∣∣∣ β̂ ∣j − β j ∣∣∣ x 100%, (3.21)J β j=1 j where β̂ j is the estimated coefficient of interest on protected, and β j the esti- mated coefficient on real data and J refers to the number of relevant coefficients to be analyzed using a statistical modeling technique (e.g., regression). We use the brands’ own price elasticities as the coefficients of interest in the subsequent sections when MAPD is reported. 3.8.2 Monte Carlo Experiment 1 Data In this section, we describe the Monte Carlo data generation process for experi- ment 1. We use this data to benchmark GANs and other data protection models. We use the following equation (from section 3.8.1) to generate the sales data S for 5 brands ( j), 200 customers(i), and 52 weeks (t): lnS i jt = µ j + δi j + β jlnPi jt + i jt We chose the values for the parameters: µ = (−0.1,−0.05,−0.09,−0.3,−0.1) is 226 inherent preference for a brand j, δ = N(0, 2) is the customer specific random effect i for a brand j. For the price elasticities, we borrow from Christen et al. (1997) the price elastitices: β = (−1.5,−1.7,−2.01,−1.98,−1.9). We draw prices for a brand j as random draws from a normal distribution as follows: P j = N(µp j, σp j), where µp = (8.68, 5.09, 5.48, 5.94, 4.15) and σp = (0.93, 0.5, 0.5, 0.39, 0.77). With the above parameter settings, we generate the customer sales data that is subsequently used as the “true” data for the Monte Carlo data. 3.8.3 Monte Carlo Experiment 2 Data - Customer Targeting We construct purchase behavior for 200 customers and 52 weeks using the pur- chase probability model from Park and Park (2016). More formally: ui j = bi j + βvisit,i jvisitsi j + βmarketing,i, jmarketingi, j + βprevpurchaseδprevpurchase,i j + i j (3.22) Pi j = 1 i f ui j > 0 (3.23) We borrow the random coefficients values from Park and Park (2016) as bi j = N(−2.547, 0.804), βvisit,i j = N(0.369, 0.406), βmarketing,i, j = N(0.310, 0.522), and 227 βprevpurchase = −0.618. We consider the data constructed in this manner as the true data. 3.8.4 Monte Carlo Experiment 3 Data - Tackling Multiple Mar- keting Problems The data generating process specification is along the lines of Schneider et al. (2018). More formally: lnS i jt = µ j + µi j + β jlnPi jt + ln(δ f j)Fi jt + ln(δd j)Di jt + ln(δ f d j)FDi jt + i jt, (3.24) The price distribution and coefficients are the same as those described in the Appendix 3.8.2. The feature, display, and feature and display coefficients are from Schneider et al. (2018). In order to simulate whether a brand was featured, dis- played, or both for a customer-brand-week, we draw randomly from a uniform distribution, with the following thresholds for feature: (0.3, 0.2, 0.15, 0.25, 0.1), and the following threshold for display: (0.25, 0.15, 0.05, 0.15, 0.1).23 Thus, with this data generating process, we construct a dataset of 200 customers for 52 weeks and 5 brands, and compare the MAPD GANs, as compared to benchmarks, for price 23We note that the choice of these thresholds is to illustrate as a proof of concept, and that these thresholds should not influence the subsequent results. 228 coefficients, and marketing variables coefficients for feature, display, and both. 3.8.5 GAN Design and Training Figure 3.16 shows the training process with gradients flow used to update the pa- rameters. For each iteration, the generator samples a mini-batch size of n, i.e., n image equivalents of customer data. Each image equivalent data i in the mini- batch consists of a k ×m random noise, where k is the number of time periods and m is the dimension of the vector sampled for each time period. It also observes the publicly available prices across brands: piKB for a brand B at time period K. The generator then outputs “generated” data sg for n customers. The discrimina- tor, having access to the data provider’s private data, is provided samples n cus- tomers sales data: sr. Thus, the discriminator receives a mini-batch samples for the “real” data and the “generated” data. The discriminator as a binary classifier then predicts labels for the “real” and “generated” data, and the misclassfication loss gradients are used to update the parameters: Jd(θd) for the discriminator, and Jg(θg) for the generator. 3.8.6 Relationship between Hyperparameters and Accuracy We consider in this section the relationship between number of neurons and accu- racy of the GAN’s replication of the original data distribution for the Monte Carlo 229 Figure 3.16: GAN Training with Loss Function Gradients This figure shows the each iteration of the training process for the conditional GAN with a training batch of n customers who chose from b brands. Note that the actual, private data sr is only accessi- ble by the firm’s private discriminator. zikm is the m-dimensional random noise draw for customer i at time period k, pikb is the price for brand p at time period k that is observed by customer i, srikb and sgirb are the “real” sales and “generated” sales respectively for a customer i at time period k for brand b. The data and gradients flow in highlighted in red are private, i.e. visible only to the data providing firm, and those in blue are public and visible to the researcher. Firm’s private data (sales 𝑠𝑟) + Public data (prices 𝑝) 𝑠𝑟𝑖11… 𝑠𝑟𝑖1𝑏𝑝𝑖11… 𝑝𝑖1𝑏 Discriminator loss gradient 𝑛 … 1−∇𝜃 ∑[log(𝐷(𝑠𝑟𝑖, 𝑝𝑖 , 𝜃𝑑 𝑛 𝑑 ) + log(1 − 𝐷(𝐺(𝑧𝑖, 𝑝𝑖, 𝜃𝑔), 𝜃𝑑)] 𝑠𝑟𝑖𝑘1… 𝑠𝑟𝑖𝑘𝑏𝑝𝑖𝑘1… 𝑝𝑖𝑘𝑏 𝑖=1 … Predicted Labels (Real or Generated) Random noise (𝑧) + 𝑖 ∈ [1, 𝑛] Public data (prices 𝑝) 1 Discriminator Loss function: … Generated data (sales 𝑠𝑔) + Misclassification cost 𝑧𝑖11… 𝑧𝑖1𝑚𝑝𝑖11… 𝑝𝑖1𝑏 Public data (prices 𝑝) 0 … 𝑠𝑟𝑖11… 𝑠𝑟𝑖1𝑏𝑝𝑖11… 𝑝𝑖1𝑏 𝑧𝑖𝑘1… 𝑧𝑖𝑘𝑚𝑝𝑖𝑘1… 𝑝𝑖𝑘𝑏 Generator … … 𝑠𝑟𝑖𝑘1… 𝑠𝑟𝑖𝑘𝑏𝑝𝑖𝑘1… 𝑝𝑖𝑘𝑏 𝑖 ∈ [1, 𝑛] … Generator loss gradient 𝑛 𝑖 ∈ [1, 𝑛] 1−∇𝜃 ∑ log(1 − 𝐷(𝐺(𝑧𝑖, 𝑝𝑖, 𝜃𝑔), 𝜃𝑑) 𝑔 𝑛 𝑖=1 data. The values reported in Figure 3.17 are the minimum MAPD values across fifty 230 Figure 3.17: Information Loss vs. Neuron Dimensionality. 10.0 GAN (Het.) GAN (No Het.) 7.5 5.0 2.5 250 500 750 1000 Neuron Dimensionality seeds for each of the generator types - GAN (Het.) and GAN (No Het.), and also number of neurons.24 We find that the best performance is for the GAN with heterogeneity and 512 neurons, as it has an information loss, MAPD, of 1.2, or total of 1.2 percent difference in the regression coefficients from the true data and GAN generated data. We also find a decreasing relationship between number of neurons and information loss. In order to test if the MAPD values differ across the generators, GAN (Het.) and GAN (No Het.), and for different values of number of neurons, we conduct two-tailed t-tests for the difference. We first conduct a cross generator-type t-test 24We follow a common convention from computer science literature where the neuron sizes are divisible by multiples of 8, such as in increments of 128 or 256. 231 Information Loss (MAPD) for each of the number of neurons. We find that for all values of number of neu- rons, we reject the null that GAN (Het.) and GAN (No Het.) have the same MAPD (p < 0.01). We next consider whether for a given generator type, GAN (Het.) or GAN (No Het.), if the MAPDs’ differ statistically for the different values of num- ber of neurons. We find that for both GAN (Het.) and GAN (No Het.), the MAPD is statistically different for number of neurons 128, 256, and 512. However, the differences in MAPD after this point are not significantly different from those of 512. This finding indicates diminishing returns in accuracy for increasing GAN complexity. Furthermore, we also find that generator trained to incorporate het- erogeneity, GAN (Het.), has consistently lower information loss than the generator trained without heterogeneity - GAN (No Het.). 232 3.8.7 Model Architecture and Accuracy In this section, we evaluate the accuracy of GANs with different model archi- tectures. We find that compared to baseline, MAPD measures for the other model variants are not substantially different, as the MAPD ranges from 0.0056 to 0.0215, with the baseline model MAPD of 0.011. We use GANs with the following model architecture as the baseline model: 1. Generator model: (a) Activation function: Leaky ReLU (b) Batch normalization (c) Random noise distribution: Uniform (d) Number of neurons: 512 2. Discriminator model: (a) Activation function: ReLU (b) Number of neurons: 512 3. Training procedure: With Heterogeneity 233 For the activation functions for both the generator model and discriminator model in the GAN, we consider ReLU and Leaky ReLU activation functions. The rectified linear unit activation function, also commonly known as ReLU activa- tion, is defined as: ReLU(x) = max(0, x) Thus, ReLU introduces a non-linearity at x = 0. The leaky rectified linear unit activation function, also commonly known as Leaky ReLU activation, is defined as: LeakyReLU(x) = max(x, 0.2x) The Leaky ReLU activation potentially helps the sparse gradients problem of ReLU, as the output of the activation is not set to zero when x is negative. Thus, we explore both Leaky ReLU and ReLU activations as the activation functions. We also introduce batch normalization in the intermediate layer of the generator model, as batch normalization can potentially help alleviate mode collapse prob- lems while training Xiang and Li (2017). Finally, we explore both uniform noise and Gaussian noise for the random noise that is used by the generator model to generate samples. The uniform noise is drawn randomly between -1 and 1, whereas the Gaussian noise is drawn from a standard normal distribution. 234