The U.S. Census Bureau Adopts Differential Privacy

John M. Abowd
Chief Scientist and Associate Director for Research and Methodology
U.S. Census Bureau
24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
London, United Kingdom
August 23, 2018

Acknowledgments and Disclaimer
• The opinions expressed in this talk are my own and not necessarily those of the U.S. Census Bureau
• The application to the Census Bureau’s 2020 publication system incorporates work by Daniel Kifer (Scientific Lead), Simson Garfinkel (Senior Scientist for Confidentiality and Data Access), Tamara Adams, Robert Ashmead, Michael Bentley, Stephen Clark, Aref Dajani, Jason Devine, Nathan Goldschlag, Michael Hay, Cynthia Hollingsworth, Michael Ikeda, Philip Leclerc, Ashwin Machanavajjhala, Gerome Miklau, Brett Moran, Edward Porter, Anne Ross, and Lars Vilhuber [link to the September 2018 Census Scientific Advisory Committee presentation]
• Parts of this talk were supported by the National Science Foundation, the Sloan Foundation, and the Census Bureau (before and after my appointment started)

A Brief History of Differential Privacy at the U.S. Census Bureau

This slide is from the August 3, 2018 Program Management Review (PMR) for the 2020 Census. 2020 PMRs are quarterly public presentations of the state of readiness of the decennial census.

Database Reconstruction

2003: Database Reconstruction

The Database Reconstruction Theorem
• Powerful result from Dinur and Nissim (2003) [link]
• Too many statistics published too accurately from a confidential database expose the entire database with near certainty
• How accurately is “too accurately”?
• Cumulative noise must be of the order √N
• At this conference, I don’t need to explain what this means

The 2010 Census of Population and Housing

2010 Census of Population: Summary
Total population           308,745,538
Household population       300,758,215
Group quarters population  7,987,323
Households                 116,716,292

2010 Census: High-level Database Schema
Variables                                 Distinct values
Habitable blocks                          10,620,683
Habitable tracts                          73,768
Sex                                       2
Age                                       115
Race/Ethnicity (OMB Categories)           126
Race/Ethnicity (SF2 Categories)           600
Relationship to person 1                  17
National histogram cells (OMB Ethnicity)  492,660

2010 Census: Published Statistics
Publication                          Released counts (including zeros)
PL94-171 Redistricting               2,771,998,263
Balance of Summary File 1            2,806,899,669
Summary File 2                       2,093,683,376
Public-use micro sample              30,874,554
Lower bound on published statistics  7,703,455,862
Statistics/person                    25

The database reconstruction theorem is the death knell for traditional data publication systems from confidential sources.

Internal Experiments Using the 2010 Census
• Confirm that the confidential micro-data from the confidential hundred percent detail file can be reconstructed quite accurately from PL94 + balance of SF1
• While there is a vulnerability, the risk of re-identification is small
• Experiments are at the person level, not household
• Experiments have led to the declaration that reconstruction of Title 13-sensitive data is an issue, no longer a risk
• Strong motivation for the adoption of differential privacy for the 2018 End-to-End Census Test and 2020 Census

Reconstruction Equation System
Collect more than 5 billion statistics from official 2010 Census tables. From the sample space at the block and tract level (2 x 115 x 2 x 63 = 28,980), write the linear equations for each sample statistic, including zeros.
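
As a rough illustration of what such an equation system looks like, here is a toy Python sketch. It assumes a hypothetical sample space of 2 sexes x 3 age groups for a single block and a made-up set of published tables. None of the names or counts come from the Census Bureau's experiments, and brute-force enumeration stands in for the solver a real reconstruction would need.

```python
from itertools import product

# The slide's block/tract sample space has 2 x 115 x 2 x 63 cells.
assert 2 * 115 * 2 * 63 == 28_980

# Hypothetical toy sample space: 2 sexes x 3 age groups = 6 cells.
SEXES = ("F", "M")
AGES = ("0-17", "18-64", "65+")
CELLS = [(s, a) for s in SEXES for a in AGES]


def tabulate(hist):
    """Return exact published counts (zeros included) for one block histogram."""
    stats = {("total",): sum(hist.values())}
    for s in SEXES:
        stats[("sex", s)] = sum(hist[(s, a)] for a in AGES)
    for a in AGES:
        stats[("age", a)] = sum(hist[(s, a)] for s in SEXES)
    # One cross-tabulated statistic: females of voting age.
    stats[("F, 18+",)] = hist[("F", "18-64")] + hist[("F", "65+")]
    return stats


def reconstruct(published):
    """Enumerate every non-negative integer histogram that reproduces the tables.

    Each published statistic is one linear equation in the unknown cell counts.
    """
    n = published[("total",)]
    solutions = []
    for counts in product(range(n + 1), repeat=len(CELLS)):
        if sum(counts) != n:
            continue
        hist = dict(zip(CELLS, counts))
        if tabulate(hist) == published:
            solutions.append(hist)
    return solutions


# Made-up confidential block with four persons.
confidential = {c: 0 for c in CELLS}
confidential[("F", "18-64")] = 2
confidential[("M", "0-17")] = 1
confidential[("M", "65+")] = 1

published = tabulate(confidential)
solutions = reconstruct(published)
for hist in solutions:
    print({cell: k for cell, k in hist.items() if k})
```

In this toy block the published tables admit exactly two histograms, one of which is the confidential one. Both place one male in the 0-17 cell and no females there, so part of the confidential table is pinned down exactly even though the system as a whole has more than one solution.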

Properties of the Solution
• This is the Dinur-Nissim reconstruction equation system for exact statistics
• Can’t be overdetermined (known to come from a real person table)
• Usually underdetermined: potentially many solutions
• But, all solutions share some exact images
• For example, block and voting age variables are the same in every solution
• Full details will be released this fall

Formal Privacy

2006: Differential Privacy

The Disclosure Avoidance System Relies on Injecting Noise with Formal Privacy Rules
• Advantages of noise injection with differential privacy:
  • Privacy operations are closed under composition
  • Privacy guarantees are robust to post-processing
  • Privacy guarantees are future-proof
  • Privacy guarantees are provable and tunable
  • Privacy guarantees are public and explainable
  • Protects against database reconstruction attacks
• Disadvantages:
  • Entire country must be processed at once for best accuracy
  • Every use of the private data must be tallied in the privacy-loss budget
[Diagram labels: Global Confidentiality Protection Process; Disclosure Avoidance System; ε]

Additional Technical Details
• Central differential privacy implementation with a controlled total privacy-loss budget
• Relevant definition is bounded ε-differential privacy (total population of the United States is public)
• Semantic privacy guarantee is [-2ε, 2ε] by properties of bounded differential privacy
• Other semantic guarantees, as they affect implemented invariants, will be published later this year
• All algorithms, code, and parameter values will be released with the test files for the 2018 End-to-End Census Test

2020 Census of Population and Households

The Top-Down Algorithm
• National table of US population: 2 x 126 x 17 x 115
• Spend ε1 privacy-loss budget
• National table with all 500,000 cells filled, structural zeros imposed, with accuracy allowed by ε1: 2 x 126 x 17 x 115
Table dimensions:
• Sex: Male / Female
• Race + Hispanic: 126 possible values
• Relationship to Householder: 17
• Age: 0-114
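
The slides do not say which noise distribution is used when ε1 is spent on the national table. Purely as an illustration, the sketch below assumes the Laplace mechanism with L1 sensitivity 2, which is the standard choice for a full histogram under bounded differential privacy. The cell counts, the value of epsilon_1, and the function names are all hypothetical.

```python
import random

# Dimensions of the national table as listed on the slide.
N_SEX, N_RACE_HISP, N_RELATIONSHIP, N_AGE = 2, 126, 17, 115
N_CELLS = N_SEX * N_RACE_HISP * N_RELATIONSHIP * N_AGE  # 492,660, the "500,000 cells"


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(counts, epsilon, rng):
    """Assumed mechanism: add independent Laplace noise to every cell, zeros included.

    Under bounded differential privacy a neighboring database replaces one
    record, so one cell drops by 1 and another rises by 1: L1 sensitivity 2.
    """
    scale = 2.0 / epsilon
    return [c + laplace(scale, rng) for c in counts]


rng = random.Random(0)
toy_counts = [120, 0, 35, 7]  # hypothetical cell counts, not Census data
epsilon_1 = 0.5               # hypothetical share of the total budget
measured = noisy_histogram(toy_counts, epsilon_1, rng)
print(N_CELLS, [round(x, 2) for x in measured])
```

Noisy cells produced this way can be fractional or negative, so they cannot be read directly as person records. Turning them into non-negative integer counts is post-processing, which by the robustness property listed earlier spends no additional privacy-loss budget.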
Reconstruct individual micro-data without geography: 330,000,000 records

State-level
• Target state-level tables required for best accuracy for PL-94 and SF-1
• Spend ε2 privacy-loss budget
• State-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
• Construct best-fitting individual micro-data with state geography
• 330,000,000 records now including state identifiers

County-level
• Target county-level tables required for best accuracy for PL-94 and SF-1
• Spend ε3 privacy-loss budget
• County-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
• Construct best-fitting individual micro-data with state and county geography
• 330,000,000 records now including state and county identifiers

Pre-Decisional

Census tract-level
• Target tract-level tables required for best accuracy for PL-94 and SF-1
• Spend ε4 privacy-loss budget
• Tract-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
• Construct best-fitting individual micro-data with state, county, and tract geography
• 330,000,000 records now including state, county, and tract identifiers

Block-level
• Target block-level tables required for best accuracy for PL-94 and SF-1
• Spend ε5 privacy-loss budget
• Block-level tables for only certain queries; structural zeros imposed; dimensions chosen to produce best accuracy for PL-94 and SF-1
• Construct best-fitting individual micro-data with state, county, tract and block geography
• 330,000,000 records now including state, county, tract, and block identifiers

Tabulation micro-data
• Construct best-fitting individual micro-data with state, county, tract and block geography
• 330,000,000 records now including state, county, tract, and block identifiers
• Micro-data used for tabulating PL-94, SF-1
• How accurate are the tabulation micro-data?

Disclosure Avoidance Certificate
• Certifies that the disclosure avoidance system passed all tests
• Reports the accuracy of the micro-data used for tabulation
• Requires εA

Operational Decisions
• Set total privacy-loss budget: ε
• Ensure that ε1 + ε2 + ε3 + ε4 + ε5 + εA = ε
• Within each stage, allocate privacy-loss budget between:
  • PL-94
  • Parts of SF-1 not in PL-94
• These are policy levers provided by the system
• Levers are set by the Census Bureau’s Data Stewardship Executive Policy Committee

Examples from the 1940 Census of Population

Managing the Tradeoff

You know what I look like already

How to Think about the Social Choice Problem
• The marginal social benefit is the sum of all persons’ willingness-to-pay for data accuracy with increased privacy loss
• The marginal rate of transformation is the slope of the privacy-loss v. accuracy graphs we have been examining
• This is exactly the same problem being addressed by Google in RAPPOR, Apple in iOS 11, and Microsoft in Windows 10 telemetry

[Figure: Production Possibilities for Privacy-loss v. Accuracy Tradeoff. x-axis: Privacy-loss Budget (0.0 to 6.0); y-axis: Data Accuracy (0.0 to 1.0). Curves: Estimated Marginal Social Benefit Curve; Estimated Production Technology. Marked point: Social Optimum, MSB = MSC.]

But the Choice Problem for Redistricting Tabulations Is More Challenging
• In the redistricting application, the fitness-for-use is based on
  • Supreme Court one-person one-vote decision (All legislative districts must have approximately equal populations; there is judicially approved variation)
  • Is statistical disclosure limitation a “statistical method” (permitted by Utah v. Evans) or “sampling” (prohibited by the Census Act, confirmed in Commerce v. House of Representatives)?
  • Voting Rights Act, Section 2: requires majority-minority districts at all levels, when certain criteria are met
• The privacy interest is based on
  • Title 13 requirement not to publish exact identifying information
  • The public policy implications of uses of detailed race and ethnicity
• Other use cases: See Federal Register Notice 83 FR 34111 (comments due September 17, 2018)

[Figure: Production Possibilities for Alternative Mechanisms. x-axis: Privacy-loss Budget (0.0 to 6.0); y-axis: Data Accuracy (0.0 to 1.0). Curves: Proposed 2020 Census differential privacy implementation with use-case based accuracy improvements; Simple differential privacy implementation with no accuracy improvements; Randomized response: method used by Google, Apple and Microsoft.]

[Figure: Production Possibilities for Alternative Mechanisms, same axes and curves, with annotations: Where social scientists act like MSC = MSB; Where computer scientists act like MSC = MSB.]

[Figure: Production Possibilities for Alternative Mechanisms, same axes, with Estimated Marginal Social Benefit Curves labeled More accuracy favoring and More privacy favoring. Social Optima, MSB = MSC: Blue tangency (3.5, 94%); Green tangency (1.0, 60%).]

Some Other Tools for Managing the Tradeoff
• Machanavajjhala, He and Hay (SIGMOD 2017) Differential Privacy in the Wild: A Tutorial on Current Practices & Open Challenges.
• Harvard Data Privacy Tools Project
• Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David O'Brien, and Salil Vadhan. Forthcoming. Differential Privacy: A Primer for a Non-technical Audience
• Kobbi Nissim and Alexandra Wood. 2018. Is Privacy Privacy? Philosophical Transactions of the Royal Society A.

Thank you.
John.Maron.Abowd@census.gov
johnabowd.com

Selected References
• Dinur, Irit and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202-210. DOI: 10.1145/773153.773173.
• Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Halevi, S. & Rabin, T. (Eds.), Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings. Springer Berlin Heidelberg, 265-284. DOI: 10.1007/11681878_14.
• Dwork, Cynthia. 2006. Differential Privacy. 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006). Springer Verlag, 4052, 1-12. ISBN: 3-540-35907-9.
• Dwork, Cynthia and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9, Nos. 3–4, 211–407. DOI: 10.1561/0400000042.
• Dwork, Cynthia, Frank McSherry and Kunal Talwar. 2007. The price of privacy and the limits of LP decoding. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (STOC '07). ACM, New York, NY, USA, 85-94. DOI: 10.1145/1250790.1250804.
• Machanavajjhala, Ashwin, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory Meets Practice on the Map. International Conference on Data Engineering (ICDE) 2008: 277-286. DOI: 10.1109/ICDE.2008.4497436.
• Dwork, Cynthia and Moni Naor. 2010. On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy. Journal of Privacy and Confidentiality: Vol. 2: Iss. 1, Article 8. Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.
• Kifer, Daniel and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD '11). ACM, New York, NY, USA, 193-204. DOI: 10.1145/1989323.1989345.
• Erlingsson, Úlfar, Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 1054-1067. DOI: 10.1145/2660267.2660348.
• Abowd, John M. and Ian M. Schmutte. 2017. Revisiting the economics of privacy: Population statistics and confidentiality protection as public goods. Labor Dynamics Institute, Cornell University, at https://digitalcommons.ilr.cornell.edu/ldi/37/
• Abowd, John M. and Ian M. Schmutte. Forthcoming. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. American Economic Review, at https://arxiv.org/abs/1808.06303
• Apple, Inc. 2016. Apple previews iOS 10, the biggest iOS release ever. Press Release (June 13). URL=http://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html.
• Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting Telemetry Data Privately. NIPS 2017.