Lab 12: Confidentiality Protection through Synthetic Data

  • The lab uses pumsbxak (AK 1% PUMS from Census 2000), the same data as used in Lab 11. All data and programs are also available on the SSG at /ssgprojects/courses/info7470/.
  • The lab can be performed using Stata, but example programs are only provided for SAS.
  1. Keep the person records for all in-scope employed individuals (you can use the data selector example, 02.pums2000-missing.sas).
  2. Select wage and salary income and recoded education (from 02.pums2000-missing.sas) and up to five other relevant variables.
  3. MI step: Replace the allocated wage and salary income and recoded education with multiply-imputed values (build 5 implicates). Save the resulting dataset(s).
  4. Synthetic step: Using the same model you built for the previous step, synthesize wage and salary income and recoded education for all observations. Make 5 synthetic data sets. A synthetic data set consists of all of the data from the input data set except for wage and salary income and recoded education. For the latter two variables every value is replaced by a draw from the posterior predictive distribution.
  5. Compute the transition matrix between observed and synthetic (recoded) education.
  6. Assess the effect of missing data on the precision of estimating the mean earnings of high school graduates, and the effect of using synthetic data on the same statistic.
  7. Compare the completed data built in the MI step with the synthetic data built in the synthetic step. The correct combining formula for these synthetic data is T=U+B/5, where T is the total variance, U is the average within-implicate variance, and B is the between-implicate variance of the synthetic implicates. Recall that the correct formula for the completed data files from multiple imputation of only missing data is T=U+(1+1/5)*B.
  8. Assess the degree to which synthetic education is informative about any individual's observed education.
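
The two combining formulas in step 7 and the transition matrix in step 5 can be illustrated with a short sketch. It is written in Python purely to show the arithmetic; it is not the SAS template and not a lab solution. The function names (combine, transition_matrix), the toy numbers, and the education labels "HS" and "BA" are all hypothetical. The sketch also assumes that B is the sample variance of the per-implicate estimates with an m-1 divisor, which the text above does not spell out.

```python
from collections import Counter
from statistics import mean, variance


def combine(estimates, variances, synthetic):
    """Combine per-implicate results into a point estimate and total variance.

    estimates: point estimate from each implicate (e.g., mean earnings)
    variances: squared standard error of that estimate in each implicate
    synthetic: True  -> T = U + B/m          (synthetic data formula)
               False -> T = U + (1 + 1/m)*B  (completed-data MI formula)
    """
    m = len(estimates)
    q_bar = mean(estimates)      # average of the m point estimates
    u_bar = mean(variances)      # U: average within-implicate variance
    b = variance(estimates)      # B: between-implicate variance (assumed m-1 divisor)
    if synthetic:
        total = u_bar + b / m
    else:
        total = u_bar + (1 + 1 / m) * b
    return q_bar, total


def transition_matrix(observed, synthetic, levels):
    """Row-normalized cross-tabulation: P(synthetic level | observed level)."""
    counts = Counter(zip(observed, synthetic))
    matrix = {}
    for obs in levels:
        row_total = sum(counts[(obs, syn)] for syn in levels)
        matrix[obs] = {
            syn: (counts[(obs, syn)] / row_total if row_total else 0.0)
            for syn in levels
        }
    return matrix


# Toy inputs, invented for illustration only (m = 5 implicates).
toy_estimates = [10.0, 12.0, 11.0, 13.0, 14.0]
toy_variances = [1.0, 1.0, 1.0, 1.0, 1.0]

print("MI:       ", combine(toy_estimates, toy_variances, synthetic=False))
print("Synthetic:", combine(toy_estimates, toy_variances, synthetic=True))

# Toy education values for four hypothetical records.
toy_observed = ["HS", "HS", "BA", "BA"]
toy_synth = ["HS", "BA", "BA", "BA"]
print(transition_matrix(toy_observed, toy_synth, levels=["HS", "BA"]))
```

With the same U and B, the synthetic formula gives a smaller total variance than the completed-data formula, because B is divided by m rather than multiplied by (1+1/m). In the transition matrix, each row sums to 1 whenever that observed level occurs. A diagonal entry near 1 would mean that the synthetic value usually reproduces the observed one, which is one possible starting point for step 8.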

Files to submit

  • Submit a single program that creates both the MI and synthetic datasets, as well as creates the necessary statistics for your interpretation (lab12.{sas,do}). The program is expected to run error free, starting only with provided data. Clearly label any output listing (i.e., create proper table headings and footnotes, as if you were inserting this into a term paper). Your program should clearly identify the author (i.e., you!). You can include the data selector program by including the line
    %include "02.pums2000-missing.sas";
    at the top of your program. A template program is provided here and on the SSG.
  • Submit a document (Word, PDF, txt) that provides the interpretation of the MI and synthetic data results (lab12_results).

Submitting labs

Maximum group size: for programs, up to 3 students, if so declared; otherwise, individual submissions only. Each student should still submit all required elements individually. Submissions are made on the Course Management Site. The documents can be submitted here. Due date: May 3, 2011, 3:59 PM. (Note: the site is used only for submissions, not for the other functionality you will find there.)