NCRN Meeting Spring 2016: The Advantages and Disadvantages of Statistical Disclosure Limitation for Program Evaluation
Abowd, John; Schmutte, Ian
This paper formalizes the manner in which statistical disclosure limitation (SDL) hinders empirical research in economics. We also highlight a hitherto unappreciated advantage of SDL, formal privacy models, and synthetic data systems: they can serve as a defense against model overfitting and false-discovery bias. More specifically, a synthetic data validation system can – and we argue should – be used in conjunction with systems in which researchers register their research design ahead of analysis. The key insight is that privacy-protected data can be used for model development while minimizing the risk of model overfitting. To demonstrate these points, we develop a model in which the statistical agency collects data from a population but publishes a version in which the data have been intentionally distorted by some SDL process. We say the SDL process is ignorable if inferences based on the published data are indistinguishable from inferences based on the unprotected data. SDL is rarely ignorable. If the researcher has knowledge of the SDL model, she can conduct an SDL-aware analysis that explicitly corrects for the effects of SDL. If, as is often the case, the SDL model is unknown, we describe circumstances under which it can still be learned.
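As a rough illustration of non-ignorable SDL and an SDL-aware correction, the sketch below assumes one particular SDL process that the abstract does not specify: independent additive Gaussian noise, with a known variance, applied to a single regressor. The variable names, parameter values, and the `slope` helper are all hypothetical and are not taken from the paper. Under that assumption, a naive regression on the published data is attenuated toward zero, while subtracting the known noise variance from the regressor's variance approximately recovers the confidential-data estimate.

```python
import random


def slope(xs, ys, noise_var=0.0):
    """Bivariate regression slope of ys on xs.

    noise_var is the (assumed known) variance of additive SDL noise in xs;
    subtracting it from the sample variance is the classical
    measurement-error correction. With noise_var=0 this is ordinary OLS.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    return cov_xy / (var_x - noise_var)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 20000
beta = 2.0               # hypothetical true effect
noise_var = 1.0          # hypothetical SDL noise variance, assumed published

# Confidential data held by the statistical agency.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]

# Published data: the regressor is distorted by additive noise (assumed SDL).
z = [xi + rng.gauss(0.0, noise_var ** 0.5) for xi in x]

confidential = slope(x, y)           # what the unprotected data would give
naive = slope(z, y)                  # treats SDL as ignorable; attenuated
sdl_aware = slope(z, y, noise_var)   # corrects using the known noise variance

print(f"confidential: {confidential:.2f}")
print(f"naive:        {naive:.2f}")
print(f"SDL-aware:    {sdl_aware:.2f}")
```

In this setup the naive slope converges to beta * Var(x) / (Var(x) + noise_var), which is half the true value here, so the SDL is not ignorable for this inference. The correction depends entirely on knowing the noise variance, which corresponds to the abstract's case in which the researcher has knowledge of the SDL model.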
Presented at the NCRN Meeting Spring 2016 in Washington, DC on May 9-10, 2016; see http://www.ncrn.info/event/ncrn-spring-2016-meeting . Presented at the Society of Labor Economists Annual Meeting in Seattle, WA on May 6, 2016; see http://www.sole-jole.org/2016ProgramOutline.html
NSF Grants 1507241 (NCRN Coordinating Office) and 1131848 (to Cornell University), as well as funding from the Alfred P. Sloan Foundation
statistical disclosure limitation; economics; privacy models; synthetic data systems