Expanding the Role of Synthetic Data at the U.S. Census Bureau
Jarmin, Ron; Louis, Thomas A.; Miranda, Javier
National statistical offices (NSOs) create official statistics from data collected directly from survey respondents, from government administrative records, and from other third-party sources. The raw source data, regardless of origin, is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of data users to extract as much information as possible from rich microdata. Traditional disclosure protection techniques applied to resolve this tension have resulted in official data products that come nowhere close to fully utilizing the information content of the underlying microdata. Typically, these products take the form of basic, aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these are increasingly at risk of re-identification because of the ever larger amounts of information about individuals and firms available in the public domain. One potential approach for overcoming these risks is to release products based on synthetic or partially synthetic data, where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata, rather than making the actual underlying microdata available. We discuss recent Census Bureau work to develop and deploy such products. We also discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.
Presented at World Statistical Congress 2013.
confidentiality; synthetic data; official statistics