Lab 8: Data Imputation Methods

  • The objective of this lab is to create and use a probabilistic crosswalk.
  • The lab and all of its files are also accessible on the SSG at /ssgprojects/courses/info7470. Completing the lab requires SAS or Stata; both are available on the SSG.
  • Data:
    Input data: naicsmiss.sas7bdat and naicsmiss.dta.
    The variables in naicsmiss are
    • sic (sometimes incomplete; i.e., expressed to only 2 or 3 digits)
    • naics (always missing).
    Crosswalk: sic_naics.sas7bdat and/or sic_naics.dta.
    The variables in sic_naics.sas7bdat are
    • es_sic = the 4-digit 1987 SIC code;
    • naics_impute = the 6-digit 2002 NAICS code;
    • emp = employment in the indicated (SIC, NAICS) pair;
    • sum_sic = employment in the indicated SIC;
    • pct_emp = emp/sum_sic;
    • low_limit = lower limit for random comparison to pct_emp in imputation;
    • up_limit = upper limit for random comparison to pct_emp in imputation.
    (Note: incomplete employment data in the crosswalk is indicated by a value of 1 for sum_sic and fractional values for emp. You can ignore this for the lab.)
  • The exercise: Write a SAS (or Stata) program that performs a single probabilistic imputation of naics using the data in sic_naics. This is a straightforward application of the information in the sic_naics crosswalk. Be careful with the incomplete SIC codes: for these cases you will have to build the correct conditional probability model for the imputation. Provide the program code, documenting at each step what you are doing and why (use SAS/Stata comments). The program code, when executed, should run without errors.
  • If you run your program a second time, you should not get the same answer. Explain why not. Your answer should be uploaded as a separate text document.
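
The logic of the exercise can be illustrated with a minimal sketch. It is written in Python only to show the idea; the lab itself must be done in SAS or Stata. Everything here rests on assumptions: the toy crosswalk rows are made up, the function names candidates and impute_naics are hypothetical, and the treatment of incomplete SIC codes (pooling employment over all 4-digit SICs that share the observed prefix) is one plausible reading of "conditional probability model", not necessarily the intended one. The cumulative intervals built in impute_naics play a role similar to what low_limit and up_limit appear to do in sic_naics, which is also an assumption.

```python
import random

# Made-up toy crosswalk rows: (es_sic, naics_impute, emp).
# These values are illustrative only and do not come from sic_naics.
CROSSWALK = [
    ("5411", "445110", 80.0),
    ("5411", "445120", 20.0),
    ("5421", "445210", 30.0),
    ("5421", "445220", 10.0),
    ("5812", "722110", 60.0),
]


def candidates(sic_prefix, crosswalk):
    """Return [(naics, probability)] conditional on the SIC starting with sic_prefix.

    A complete 4-digit SIC matches only itself; a 2- or 3-digit prefix pools
    employment over every 4-digit SIC that begins with it (an assumed model).
    """
    totals = {}
    for es_sic, naics, emp in crosswalk:
        if es_sic.startswith(sic_prefix):
            totals[naics] = totals.get(naics, 0.0) + emp
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(naics, emp / grand_total) for naics, emp in sorted(totals.items())]


def impute_naics(sic_prefix, crosswalk, u):
    """Pick the NAICS whose cumulative interval [low, up) contains the draw u.

    u is a uniform draw on [0, 1) supplied by the caller.
    Returns None when no crosswalk row matches the prefix.
    """
    low = 0.0
    chosen = None
    for naics, p in candidates(sic_prefix, crosswalk):
        up = low + p
        chosen = naics
        if low <= u < up:
            return naics
        low = up
    # Fallback for floating-point rounding when u is very close to 1.
    return chosen


if __name__ == "__main__":
    # One imputation for a complete code and one for an incomplete code.
    for sic in ("5411", "54"):
        print(sic, impute_naics(sic, CROSSWALK, random.random()))
```

Passing the uniform draw u in as an argument keeps the selection step deterministic for a given draw, which makes it easy to check by hand: with the toy rows above, the prefix "54" pools 140 units of employment across four NAICS codes, so each code's probability is its employment divided by 140.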

Submitting labs

Maximum group size: up to 3 students for programs, if so declared; otherwise, individual submissions only. Each student should still submit all required elements individually. Submit the documents on the Course Management Site. (Note: the site is used only for submissions, not for the other functionality you will find there.) Due date: April 5, 2011, 3:59PM.