Estimation methods in the presence of outcome and mediator misclassification
In health and social science association studies, binary variables may be subject to misclassification, resulting in substantial bias in effect estimates. While existing work in this area largely focuses on correcting misclassification bias through validation studies, I instead consider cases where a gold standard measure is not available, making validation studies impossible. In this dissertation, I propose statistical methods to recover unbiased parameter estimates in association studies with binary outcome misclassification, in multi-stage decision-making frameworks with noisy labels, and in mediation analyses with misclassified binary mediator variables.

In the first project, I develop a Markov chain Monte Carlo (MCMC) algorithm and an Expectation-Maximization (EM) algorithm for association studies with misclassified binary outcomes to estimate both (1) the unbiased association between the predictor and the true outcome of interest and (2) the rate at which the observed outcomes were misclassified. In addition, I develop a "label switching correction" algorithm to select the appropriate parameter set from the two that maximize the observed-data likelihood, relying only on the assumption that the sum of the outcome sensitivity and specificity exceeds one. I create an R software package, COMBO, to implement the proposed methods. Through simulation studies, I show that estimates from the proposed MCMC and EM algorithms are less biased than those from models that ignore outcome misclassification or assume perfect sensitivity or specificity. In an example using data from the 2020 Medical Expenditure Panel Survey (MEPS), I apply the proposed methods to show that misdiagnosis of heart attacks can be modeled as a function of patient gender.

In the second project, I extend the misclassified binary outcome model to study a specialized decision-making structure within the Virginia pretrial system.
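The first project's EM approach can be illustrated with a simplified sketch. This is my own illustrative Python code, not the COMBO implementation: it assumes a single covariate, a logistic model for the true outcome, and nondifferential misclassification, and it includes the label switching correction based on the sensitivity-plus-specificity rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logistic(X, gamma, beta, n_steps=25):
    """Newton-Raphson for a logistic model with fractional responses gamma."""
    for _ in range(n_steps):
        p = sigmoid(X @ beta)
        grad = X.T @ (gamma - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def em_misclassified_outcome(X, y, n_iter=300):
    """EM for logistic regression when the binary outcome y is misclassified."""
    beta = fit_weighted_logistic(X, y, np.zeros(X.shape[1]))  # naive start
    sens, spec = 0.8, 0.8                                     # crude start
    for _ in range(n_iter):
        # E-step: posterior probability that the TRUE outcome is 1.
        p = sigmoid(X @ beta)
        lik1 = np.where(y == 1, sens, 1 - sens)   # P(y_obs | y* = 1)
        lik0 = np.where(y == 1, 1 - spec, spec)   # P(y_obs | y* = 0)
        gamma = p * lik1 / (p * lik1 + (1 - p) * lik0)
        # M-step: update misclassification rates and regression coefficients.
        sens = np.sum(gamma * y) / np.sum(gamma)
        spec = np.sum((1 - gamma) * (1 - y)) / np.sum(1 - gamma)
        beta = fit_weighted_logistic(X, gamma, beta)
    # Label switching correction: the likelihood is also maximized by the
    # label-swapped solution, so keep the one with sens + spec > 1.
    if sens + spec < 1:
        sens, spec = 1 - spec, 1 - sens
        beta = -beta
    return beta, sens, spec

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_true = rng.binomial(1, sigmoid(-0.5 + 1.0 * x))   # true outcomes
sens_t, spec_t = 0.85, 0.90
y_obs = np.where(y_true == 1,
                 rng.binomial(1, sens_t, n),        # kept with prob sens
                 rng.binomial(1, 1 - spec_t, n))    # flipped with prob 1 - spec
beta_hat, sens_hat, spec_hat = em_misclassified_outcome(X, y_obs)
```

In this sketch, the E-step treats the true outcome as a latent variable and computes its posterior probability; the M-step then refits the misclassification rates and a logistic regression with those posteriors as fractional responses.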
This system is characterized by a two-stage, sequential, and dependent decision-making framework. First, a pretrial risk assessment algorithm (the Virginia Pretrial Risk Assessment Instrument, or VPRAI) is used to assess each defendant's likelihood of "pretrial failure," the event in which a defendant either fails to appear for court or reoffends. Judicial officers, in turn, use these assessments to determine whether to release or detain defendants before trial. There is concern that both the risk assessment algorithm's recommendations and judges' pretrial decisions are biased against minority groups. I develop Bayesian and frequentist methods to investigate the association between various risk factors and pretrial failure, while simultaneously estimating misclassification rates of pretrial risk assessments and of judicial decisions as a function of defendant race. Using data from the Virginia Department of Criminal Justice Services, I estimate that the VPRAI has near-perfect specificity, but its sensitivity differs by defendant race. Judicial decisions also display evidence of bias; I estimate wrongful detention rates of 39.7% and 51.4% among white and Black defendants, respectively.

In the third project, I consider mediation analysis settings where a binary mediator is misclassified. I develop a suite of analysis techniques, including an ordinary least squares (OLS) bias correction, a predictive value weighting method, and an EM algorithm, to recover unbiased parameter estimates and to estimate misclassification rates for the mediator variable. Through simulation studies, I show that as misclassification rates increase, the proposed methods outperform methods that ignore misclassification in terms of root mean squared error. I apply these methods to evaluate the role of gestational hypertension as a mediator in the association between maternal age and preterm delivery.
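The attenuation that a misclassified binary regressor induces in a linear model, and the style of closed-form correction the third project builds on, can be shown with a simplified sketch. This is my own illustrative Python code, not the dissertation's method: it has no covariates, and it takes the sensitivity, specificity, and true mediator prevalence as known, whereas the proposed methods estimate the misclassification rates from data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
theta1 = 2.0                 # true effect of the binary mediator
sens, spec = 0.80, 0.90      # taken as known for this illustration
pi = 0.30                    # P(true mediator = 1), also taken as known

m_true = rng.binomial(1, pi, n)
y = 1.0 + theta1 * m_true + rng.normal(size=n)
# Observed mediator: kept with prob sens if 1, flipped with prob 1 - spec if 0.
m_obs = np.where(m_true == 1,
                 rng.binomial(1, sens, n),
                 rng.binomial(1, 1 - spec, n))

# Naive estimate: difference in mean outcome across the OBSERVED mediator.
naive = y[m_obs == 1].mean() - y[m_obs == 0].mean()

# Bias correction: the naive slope equals theta1 * (PPV + NPV - 1), where the
# predictive values follow from sens, spec, and the mediator prevalence.
p_obs1 = sens * pi + (1 - spec) * (1 - pi)   # P(observed mediator = 1)
ppv = sens * pi / p_obs1
npv = spec * (1 - pi) / (1 - p_obs1)
corrected = naive / (ppv + npv - 1)
```

With these settings the naive contrast is attenuated by a factor of roughly PPV + NPV - 1 ≈ 0.69, and dividing it back out recovers an estimate near the true effect of 2.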