Presentations by the Labor Dynamics Institute

This collection contains presentations made by faculty, staff, members, and guests of the Labor Dynamics Institute.

Recent Submissions

  • Item
    Utility of two synthetic data sets mediated through a validation server: Experience with the Cornell Synthetic Data Server
    Vilhuber, Lars (2019-08-13)
    The Synthetic Data Server (SDS) at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 7 years, 4 synthetic datasets were made available on the server, and over 120 users have accessed the server over that time period. This paper reports on outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics.
  • Item
    Stepping-up: The Census Bureau Tries to Be a Good Data Steward in the 21st Century
    Abowd, John M. (2019-03-04)
    The Fundamental Law of Information Reconstruction, a.k.a. the Database Reconstruction Theorem, exposes a vulnerability in the way statistical agencies have traditionally published data. But it also exposes the same vulnerability for the way Amazon, Apple, Facebook, Google, Microsoft, Netflix, and other Internet giants publish data. We are all in this data-rich world together. And we all need to find solutions to the problem of how to publish information from these data while still providing meaningful privacy and confidentiality protections to the providers. Fortunately for the American public, the Census Bureau's curation of their data is already regulated by a very strict law that mandates publication for statistical purposes only and in a manner that does not expose the data of any respondent--person, household or business--in a way that identifies that respondent as the source of specific data items. The Census Bureau has consistently interpreted that stricture on publishing identifiable data as governed by the laws of probability. An external user of Census Bureau publications should not be able to assert with reasonable certainty that particular data values were directly supplied by an identified respondent. Traditional methods of disclosure avoidance now fail because they are not able to formalize and quantify that risk. Moreover, when traditional methods are assessed using current tools, the relative certainty with which specific values can be associated with identifiable individuals turns out to be orders of magnitude greater than anticipated at the time the data were released. In light of these developments, the Census Bureau has committed to an open and transparent modernization of its data publishing systems using formal methods like differential privacy. The intention is to demonstrate that statistical data, fit for their intended uses, can be produced when the entire publication system is subject to a formal privacy-loss budget. 
    To date, the team developing these systems--many of whom are in this room--has demonstrated that bounded ε-differential privacy can be implemented for the data publications from the 2020 Census used to re-draw every legislative district in the nation (PL94-171 tables). That team has also developed methods for quantifying and displaying the system-wide trade-offs between the accuracy of those data and the privacy-loss budget assigned to the tabulations. Considering that work began in mid-2016 and that no organization anywhere in the world has yet deployed a full, central differential privacy system, this is already a monumental achievement. But it is only the tip of the iceberg in terms of the statistical products historically produced from a decennial census. Demographic profiles, based on the detailed tables traditionally published in summary files following the publication of redistricting data, have far more diverse uses than the redistricting data. Summarizing those use cases in a set of queries that can be answered with a reasonable privacy-loss budget is the next challenge. Internet giants, businesses and statistical agencies around the world should also step up to these challenges. We can learn from, and help, each other enormously.
  • Item
    The Reproducibility of Economics Research: A Case Study
    Kingi, Hautahi; Vilhuber, Lars; Herbert, Sylverie; Stanchi, Flavio (Presented at the BITSS Annual Meeting 2018 and available at the Open Science Foundation website, 2018-12-10)
    Published reproductions or replications of economics research are rare. However, recent years have seen increased recognition of the important role of replication in the scientific endeavor. We describe and present the results of a large reproduction exercise in which we assess the reproducibility of research articles published in the American Economic Journal: Applied Economics over the last decade. Of 162 eligible replication attempts, 69 (42.6%) successfully replicated the article's analysis. A further 68 (42%) were at least partially successful. A total of 98 out of 303 (32.3%) relied on confidential or proprietary data, and were thus not reproducible by this project. We also conduct several bibliometric analyses of reproducible vs. non-reproducible articles.
  • Item
    Why the Economics Profession Cannot Cede the Discussion of Privacy Protection to Computer Scientists
    Abowd, John M.; Schmutte, Ian M.; Sexton, William N.; Vilhuber, Lars (Presented at the Allied Social Science Association Meeting 2019, 2019-01-05)
    When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, which is a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
  • Item
    Reproducibility Confidentiality Data Access
    Vilhuber, Lars (Presented at the 2018 ADRF Network Research Conference and available at the University of Pennsylvania Scholarly Commons, 2018-11)
    The recent concern about the reproducibility of research results has not yet been robustly incorporated into methods of providing and accessing administrative data, casting doubts on the validity of research based on such data. Reproducibility depends on disaggregating and exposing the multiple components of the research - data, software, workflows, and provenance - to other researchers and providing adequate metadata to make these components usable. The key worry is access: the authors of a study that uses administrative data often cannot themselves deposit the data with the journal, thereby impairing easy access to those data and consequently impeding reproducibility. This suggests a critical role for administrative data centers. We argue that data held by ADRF do have attributes that lend themselves to reproducibility exercises, though this may, at present, not always be communicated correctly. We describe how ADRF can and should promote reproducibility through a number of components.
  • Item
    Presentation: The U.S. Census Bureau Adopts Differential Privacy
    Abowd, John M. (KDD '18 Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 2018-08)
    The U.S. Census Bureau announced, via its Scientific Advisory Committee, that it would protect the publications of the 2018 End-to-End Census Test (E2E) using differential privacy. The E2E test is a dress rehearsal for the 2020 Census, the constitutionally mandated enumeration of the population used to reapportion the House of Representatives and redraw every legislative district in the country. Systems that perform successfully in the E2E test are then used in the production of the 2020 Census. This presentation was given at KDD'18.
  • Item
    What Is a Privacy-Loss Budget and How Is It Used to Design Privacy Protection for a Confidential Database?
    Abowd, John M. (2018-02-01)
    Webinar for Privacy Day 2018, Sponsored by the ASA Committee on Privacy and Confidentiality. For statistical agencies, the Big Bang event in disclosure avoidance occurred in 2003 when Irit Dinur and Kobbi Nissim, two well-known cryptographers, turned their attention to properties of safe systems for data publication from confidential sources. And the paradigm-shifting message was a very strong result showing that most of the confidentiality protection systems used by statistical agencies around the world, collectively known as statistical disclosure limitation, were not designed to defend against a database reconstruction attack. Such an attack recreates increasingly accurate record-level images of the confidential data as an agency publishes more and more accurate statistics from the same database. Why are we still talking about this theorem fifteen years later? What is required to modernize our disclosure limitation systems? The answer is recognizing that the database reconstruction theorem identified a real constraint on agency publication systems—there is only a finite amount of information in any confidential database. We can’t repeal that constraint. But it doesn’t help with the public-good mission of statistical agencies to publish data that are suitable for their intended uses. The hard work is incorporating the required privacy-loss budget constraint into the decision-making processes of statistical agencies. This means balancing the interests of data accuracy and privacy loss. A leading example of this process is the need for accurate redistricting data, to enforce the Voting Rights Act, and the protection of sensitive racial and ethnic information in the detailed data required for this activity. Wrestling with this tradeoff stares-down the database reconstruction theorem, and uses the formal privacy results that it inspired to specify the technologies. 
    Specifying the decision framework for selecting a point on that technology has proven much more challenging. We still have a lot of work to do.
  • Item
    Confidentiality Protection and Physical Safeguards (LatAm version)
    Vilhuber, Lars (2017-06-07)
    Confidentiality protection is a multi-layered concept, involving statistical (cryptographic) methods and physical safeguards. When providing access to researchers (both internal to the agency and external academic), a tension arises between the level of trust vis-à-vis the researcher, the statistical disclosure limitation applied to the data visible to the researcher, and the physical access mechanisms used by the researcher. In this presentation, I (attempt to) review systems used by national and private research organizations around the world, putting them into the relevant legal and societal context.
  • Item
    Excerpt: Usage and outcomes of the Synthetic Data Server
    Vilhuber, Lars; Abowd, John M. (2017-05-09)
    This is an excerpt from a prior presentation at the Society of Labor Economists (2016). The Synthetic Data Server (SDS) at Cornell University was set up to provide early access to new synthetic data products by the U.S. Census Bureau. These datasets are made available to interested researchers in a controlled environment, prior to a more generalized release. Over the past 5 years, 4 synthetic datasets were made available on the server, and over 100 users have accessed the server over that time period. This paper reports on interim outcomes of the activity: results of validation requests from a user perspective, functioning of the feedback loop due to validation and user input, and the role of the SDS as an access gateway to and educational tool for other mechanisms of accessing detailed person, household, establishment, and firm statistics.
  • Item
    Confidentiality of the SynLBD
    Vilhuber, Lars; Kinney, Saki (2017-05-09)
    We describe the confidentiality protection provided by the SynLBD. The presentation was originally prepared by Saki Kinney for the World Statistics Congress 2013.