Sloan Foundation: The Economics of Socially-Efficient Privacy and Confidentiality Management for Statistical Agencies

Permanent URI for this collection

https://hdl.handle.net/1813/45819

Final and Cumulative Annual Report for Alfred P. Sloan Foundation Grant G-2015-13903 “The Economics of Socially-Efficient Privacy and Confidentiality Management for Statistical Agencies”
Abowd, John M.; Schmutte, Ian M.; Vilhuber, Lars (2019-05)
Goal: To study the economics of socially efficient protocols for managing research databases containing private information. Metrics: 1. At least four peer-reviewed articles that are published in journals read by economists, statisticians, and other social scientists. 2. A library of socially efficient algorithms that other researchers can readily implement. 3. A policy handbook or brief to inform key statistical agencies on managing the tradeoffs between enabling data access and maintaining privacy. 4. At least one graduate equipped with unique research and computational skills.
What Is a Privacy-Loss Budget and How Is It Used to Design Privacy Protection for a Confidential Database?
Abowd, John M. (2018-02-01)
Webinar for Privacy Day 2018, sponsored by the ASA Committee on Privacy and Confidentiality. For statistical agencies, the Big Bang event in disclosure avoidance occurred in 2003 when Irit Dinur and Kobbi Nissim, two well-known cryptographers, turned their attention to properties of safe systems for data publication from confidential sources. The paradigm-shifting message was a very strong result showing that most of the confidentiality protection systems used by statistical agencies around the world, collectively known as statistical disclosure limitation, were not designed to defend against a database reconstruction attack. Such an attack recreates increasingly accurate record-level images of the confidential data as an agency publishes more and more accurate statistics from the same database. Why are we still talking about this theorem fifteen years later? What is required to modernize our disclosure limitation systems? The answer is recognizing that the database reconstruction theorem identified a real constraint on agency publication systems: there is only a finite amount of information in any confidential database. We can’t repeal that constraint. But it doesn’t help with the public-good mission of statistical agencies to publish data that are suitable for their intended uses. The hard work is incorporating the required privacy-loss budget constraint into the decision-making processes of statistical agencies. This means balancing the interests of data accuracy and privacy loss. A leading example of this process is the need for accurate redistricting data, to enforce the Voting Rights Act, and the protection of sensitive racial and ethnic information in the detailed data required for this activity. Wrestling with this tradeoff stares down the database reconstruction theorem and uses the formal privacy results that it inspired to specify the technologies. Specifying the decision framework for selecting a point on that technology has proven much more challenging. We still have a lot of work to do.
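As a rough illustration of the privacy-loss budget idea described in this abstract, the sketch below tracks a total budget ε and charges it each time a noisy count is published. It is not taken from the webinar: the names `PrivacyLossBudget` and `noisy_count` are hypothetical, and the choice of the Laplace mechanism with basic sequential composition (the ε values of successive releases simply add up) is an assumption made for the example.

```python
import random


class PrivacyLossBudget:
    """Hypothetical accountant: tracks epsilon spent under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        # Refuse any release that would push cumulative privacy loss past the budget.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy-loss budget exhausted")
        self.spent += epsilon


def noisy_count(true_count, epsilon, budget, rng):
    """Publish a count with Laplace noise of scale 1/epsilon (a count has sensitivity 1)."""
    budget.charge(epsilon)
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    budget = PrivacyLossBudget(total_epsilon=1.0)
    rng = random.Random(2018)
    print(noisy_count(100, 0.5, budget, rng))  # larger epsilon: less noise, more privacy loss
    print(noisy_count(100, 0.5, budget, rng))
    print("remaining budget:", budget.remaining())
```

The finite budget mirrors the constraint in the abstract: each additional statistic, or each more accurate one, uses up more of a fixed total. Floating-point Laplace sampling of this kind is for illustration only and should not be treated as a secure implementation.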
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods
Abowd, John; Schmutte, Ian M. (2017-04-17)
We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner’s problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner’s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.
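For readers unfamiliar with the two guarantees named in this abstract, the standard textbook definitions are sketched below. The notation (mechanism $M$, neighboring databases $x, x'$, query set $Q$) is assumed here for illustration and may differ from the notation used in the paper itself.

```latex
% (epsilon, delta)-differential privacy: for all neighboring databases x, x'
% (differing in one record) and all measurable output sets S,
\[
  \Pr[M(x) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(x') \in S] + \delta .
\]
% (alpha, beta)-accuracy for a query set Q: with probability at least 1 - beta,
% every published answer is within alpha of the true answer,
\[
  \Pr\Bigl[\,\max_{q \in Q}\,\bigl|\,q(x) - M(x)_q\,\bigr| \le \alpha \Bigr] \;\ge\; 1 - \beta .
\]
```

Smaller ε and δ mean stronger privacy protection, and smaller α means more accurate answers; the production possibilities frontier mentioned in the abstract describes which combinations of the two are attainable.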
Making Confidential Data Part of Reproducible Research
Vilhuber, Lars; Lagoze, Carl (2017-08-21)
Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical Privacy
Vilhuber, Lars; Schmutte, Ian M. (2017-09-20)
These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical and subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.
Proceedings from the Synthetic LBD International Seminar
Vilhuber, Lars; Kinney, Saki; Schmutte, Ian M. (2017-09-22)
On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop.
Utility Cost of Formal Privacy for Releasing National Employer-Employee Statistics
Haney, Samuel; Machanavajjhala, Ashwin; Abowd, John M; Graham, Matthew; Kutzbach, Mark; Vilhuber, Lars (2017-05-14)
National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ϵ≥1, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional
Confidentiality Protection and Physical Safeguards
Vilhuber, Lars (2017-02-09)
Confidentiality protection is a multi-layered concept, involving statistical (cryptographic) methods and physical safeguards. When providing access to researchers (both internal to the agency and external academic), a tension arises between the level of trust vis-à-vis the researcher, the statistical disclosure limitation applied to the data visible to the researcher, and the physical access mechanisms used by the researcher. In this presentation, I (attempt to) review systems used by national and private research organizations around the world, putting them into the relevant legal and societal context.
Proceedings from the 2016 NSF–Sloan Workshop on Practical Privacy
Vilhuber, Lars; Schmutte, Ian (2017-01-22)
On October 14, 2016, we hosted a workshop that brought together economists, survey statisticians, and computer scientists with expertise in the field of privacy-preserving methods: Census Bureau staff working on implementing cutting-edge methods in the Bureau’s flagship public-use products mingled with academic researchers from a variety of universities. The four products discussed as part of the workshop were 1. the American Community Survey (ACS); 2. Longitudinal Employer-Household Data (LEHD), in particular the LEHD Origin-Destination Employment Statistics (LODES); 3. the 2020 Decennial Census; and 4. the 2017 Economic Census. The goals of the workshop were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.
How Will Statistical Agencies Operate When All Data Are Private?
Abowd, John M. (Journal of Privacy and Confidentiality, 2016-09-06)
The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.
