PREDICTIVE MODELING FOR  
DEPRESSION WITH CO-MORBIDITIES  
– RESULTS FROM KOREA NATIONAL  
HEALTH INSURANCE SERVICES DATA 
 
 
 
A Thesis 
Presented to the Faculty of the Weill Cornell Graduate School 
of Medical Sciences 
in Partial Fulfillment of the Requirements for the Degree of 
Master of Science 
 
 
by 
Min-hyung Kim 
August 2016 
  
 
 
 
 
 
 
 
 
 
 
© 2016 Min-hyung Kim 
 
 
 
 
ABSTRACT 
 
Depression, despite its high prevalence, remains severely under-diagnosed across
the healthcare system. This demands the development of data-driven approaches that
can help screen patients who are at a high risk of depression.
In this work, depression risk prediction models that incorporate disease
co-morbidities were built on data from the one-million-person, twelve-year
longitudinal cohort of the Korean National Health Insurance Services (KNHIS), using
multiple supervised machine-learning approaches, including decision tree, boosted
trees, random forest, and support vector machine. A traditional logistic regression
model and an Elastic Net regression model were then employed in order to balance
predictive performance and interpretability.
Among the supervised machine-learning approaches, boosted trees, random forest,
and support vector machine achieved an Area Under the Curve of the Receiver
Operating Characteristic (AUROC) of 0.793, 0.739, and 0.660, respectively. The
Elastic Net regression model achieved an AUROC of 0.7818, compared with 0.6992 for
a traditional logistic regression model without co-morbidity analysis. In addition,
the Elastic Net regression model yielded co-morbidity-adjusted Odds Ratios (ORs),
which may be more accurate independent estimates for each predictor variable.
In conclusion, including co-morbidity analysis with the Elastic Net regression
model gave depression risk prediction performance comparable to that of the
supervised machine-learning methods, while providing better interpretability.
 
 
BIOGRAPHICAL SKETCH 
 
Min-hyung Kim, M.D., M.S., graduated from Seoul Science High School in 2003 and
from Seoul National University College of Medicine in 2011, and completed the Master
of Science in Healthcare Informatics program at Weill Cornell Medical College in 2016.
 
  
 
 
DEDICATION 
 
Special thanks to my mentors – Dr. Sue Kyung Park, Dr. Joo Han Oh,  
Dr. Sang Min Park at Seoul National University College of Medicine, 
 Dr. Young-Su Ju at Hallym University College of Medicine,  
and Dr. Yong Auh at Weill Cornell Medical College –  
for their rich advice and support in my study at Weill Cornell. 
 
Special thanks to my Weill Cornell professors – especially  
Dr. Jyotishman Pathak, Dr. Samprit Banerjee, Dr. Jessica Ancker,  
Dr. Arian Hyeyoung Jung, Dr. Mark Unruh, and Dr. Fei Wang –  
who guided me through their great teaching. 
 
Special thanks to my friends and alumni – especially  
Dr. Young Su Park, Dr. Yongseok Ju, Dr. Hyu Hyunseok Kang, Dr. Sungwhan F. Oh, 
Dr. Secheol Oh, Dr. Chul Kim, Dr. Byoung-Il Bae, Dr. Hojoong Kwak,  
Dr. Boram Kim, Dr. Hyun-Sik Yang, and Dr. Alex Taekyong Lee –  
for their help and encouragement for my study in the US.
 
Special love to my grandparents, parents, and my little brother,  
who encouraged me to pursue advanced studies. 
 
 
 
ACKNOWLEDGMENTS 
 
I thank Dr. Samprit Banerjee, Dr. Sang Min Park, and Dr. Jyotishman Pathak for
their advice on completing this work.
I thank Dr. Sang Min Park at Seoul National University College of Medicine and the
Korea National Health Insurance Service for allowing me to use the data of the
Korea National Health Insurance Service - National Sample Cohort (NHIS-NSC)
2002-2013.
I also thank Kyuwoong Kim and Jooyoung Chang at Seoul National University
College of Medicine for their assistance with data management and their
collaboration on this research.
Finally, I thank the Korean Government Scholarship Program and the Korea National
Institute for International Education, Ministry of Education, for providing financial
support for my study at Weill Cornell.
 
 
 
  
 
 
TABLE OF CONTENTS 
 
ABSTRACT
BIOGRAPHICAL SKETCH
DEDICATION
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER ONE. INTRODUCTION
CHAPTER TWO. STUDY SETTING AND DATA
CHAPTER THREE. ANALYTIC APPROACH
CHAPTER FOUR. RESULTS
CHAPTER FIVE. DISCUSSION
BIBLIOGRAPHY
 
 
  
 
 
LIST OF FIGURES 
 
Figure 1. The plot obtained from cross-validation of Elastic Net, showing
the change of the Area Under the Curve (AUC) of the Receiver Operating
Characteristic (ROC) with different λ (in log scale) with α of 0.75.
The numeric values above the plot indicate the number of variables
selected, which ranges between 28 and 61. In other words, 28 (when
log(λ) is -6.577986) is the minimum number of variables that guarantees
the maximum AUC.
Figure 2. A conditional inference tree with 73 terminal nodes built on the
training data. With an appropriate threshold, the decision tree model
achieved a sensitivity of 0.497, specificity of 0.872, Positive Predictive
Value of 0.0935, Negative Predictive Value of 0.985, Accuracy of
0.865, and F measure of 0.156.
Figure 3. Receiver Operating Characteristic curves of the supervised
machine-learning approaches, including decision tree, boosted trees,
random forest, and support vector machine. (ada: boosted trees model
built with R software package ada [19]. rf: random forest built with R
software package randomForest [20]. ksvm: support vector machine built
with R software package kernlab [22].)
Figure 4. (a) Odds Ratio (OR) plot of the traditional logistic regression
model without co-morbidity analysis. (b) Odds Ratio (OR) plot of the
final logistic regression model with co-morbidity analysis. The point
values indicate the adjusted Odds Ratios, horizontal lines indicate the
95% confidence intervals, and the asterisks indicate the level of
statistical significance (***: p < 0.001, **: p < 0.01, *: p < 0.05). The
variables included in both the traditional model without co-morbidity
analysis (a) and the final model with co-morbidity analysis (b) are
highlighted in yellow. The adjusted Odds Ratios can differ if variable
selection is different. For example, the adjusted OR of being female is
2.07 from the traditional logistic regression model without
co-morbidity analysis, but is 1.63 from the final logistic regression
model with co-morbidity analysis.
Figure 5. Receiver Operating Characteristic (ROC) curve of the traditional
logistic regression model without co-morbidity analysis (red) and the
final logistic regression model with co-morbidity analysis (blue) on
the test data, which was unseen during the training phase. The Area
Under the Curve (AUC) of the ROC increased from 0.6992 (red) to
0.7818 (blue).
 
 
 
  
 
 
LIST OF TABLES 
 
Table 1. ICD-10 codes for depressive disorder, bipolar disorder, and
schizophrenia in the Chronic Conditions Data Warehouse (CCW)
Condition Algorithms (rev. 01/2016) by the Centers for Medicare &
Medicaid Services (CMS).
Table 2. Univariate and bivariate statistics of selected demographic,
socio-economic, disability registry, and co-morbidity variables for the
depression case group (N=28,256) and the complement comparison group
(N=1,085,400). For categorical variables, the observed frequencies of
the categories and percentages (twelve-year prevalence) are
reported, and for numerical variables, means (and standard deviations)
are reported. P-values are from chi-square tests for categorical
variables and t-tests for numerical variables.
Table 3. Selected performance measures of the 30-predictor co-morbidity
model, including sensitivities, specificities, Positive Predictive Values
(PPV), Negative Predictive Values (NPV), Accuracies, and F
measures for nine distinct threshold points on the blue curve of the
Receiver Operating Characteristic (ROC) shown in Figure 5. The
performance measures were evaluated with the test data, which was
unseen during the training phase.
 
 
 
 
CHAPTER ONE. 
INTRODUCTION 
 
Depression is a highly prevalent disease with a large societal burden. Major
depressive disorder has a one-year prevalence of 6% and a lifetime prevalence of
17% [1], while persistent depressive disorder (dysthymia) has a one-year prevalence
of 2% and a lifetime prevalence of 3% [2]. The estimated societal burden of unipolar
depression was 83 billion dollars per year in the US alone in 2007 [3].
However, despite this burden, depression is widely under-diagnosed across the
health care system in all care settings. A meta-analysis in 2009 concluded that the
weighted sensitivity of primary care physicians’ diagnosis of depression was only
about half (41.3-59.0%) without the assistance of screening tools [4]. This leads to
under-diagnosis or delayed diagnosis of depression, because many depressed
patients initially present to primary care clinics with somatic symptoms. In
general, 69-73% of patients with depression presented to their primary care
physicians with somatic symptoms, such as pain, fatigue, and sleep problems [5].
Data-driven risk prediction models can be beneficial by rapidly identifying
high-risk patients who need further evaluation. Risk prediction models can be
implemented in Electronic Health Record (EHR) systems in order to provide clinical
decision support. They can also be applied to health insurance claims data in order
to classify high-risk patients, and can be used for accountable care strategies [6].
Previous work on prediction modeling of depression includes a regression-based
depression risk prediction model based on Electronic Health Record data,
developed at Stanford University, which reported an area under the receiver
operating characteristic curve (AUROC) of 0.80 for current classification, 0.712 for
6-month prediction, and 0.701 for 12-month prediction [7]. Another depression risk
prediction model, based on clinical trial data and developed at the University of
Southern California, reported an AUROC of 0.81 for current classification, as well
as a sensitivity of 0.65 and a specificity of 0.81 at the institution’s optimized
threshold [8].
However, neither of these approaches explicitly used co-morbid medical
conditions as independent predictors in the depression prediction model. Many
medical conditions can affect depression [9], and depression can also affect certain
medical conditions [10]. Therefore, incorporating co-morbidity analysis can improve
the performance of risk prediction models.
Hence, the main research question addressed in this study was whether
co-morbidity analysis can improve the performance of prediction models for
depression risk.
In this work, depression risk prediction models that incorporate disease
co-morbidities were built with multiple supervised machine-learning approaches,
including decision tree, boosted trees, random forest, and support vector machine.
Then, regularized regression methods were employed in order to balance
predictive performance and interpretability.
 
  
 
 
CHAPTER TWO. 
STUDY SETTING AND DATA 
 
In this study, co-morbidity analysis and risk prediction modeling were based on
twelve-year longitudinal data on one million individuals from the Korea National
Health Insurance Services (KNHIS) [11]. The sample cohort (N = 1,025,340) was
established in 2002 from 2.2% of the 46,605,433 individuals in the National Health
Information Database (NHID), in order to provide public health researchers and
policy makers with representative information regarding the utilization of health
insurance and health examinations [12]. The data include demographic profiles,
health insurance claims data (including in-patient, out-patient, and pharmacy
claims), the death registry, the disability registry, and national health check-up
data. Combining 18 age groups, 2 genders, and 41 income groups gave a total of
1,476 strata, and systematic stratified random sampling with proportional
allocation [13] was performed within each stratum, using the individual’s total
annual medical expenses as the target variable. During the follow-up years, the
annual drop-out by death was 0.5% (ranging from 4,929 to 5,229 individuals). Each
year, a representative sample of newborns (ranging from 7,872 to 9,581), sampled
across 82 strata (2 genders by 41 parental income groups), was added to ensure the
representativeness of the data.
The diagnosis codes in KNHIS are based on the Korean Classification of
Diseases, Sixth Revision (KCD-6), which is compatible with the International
Classification of Diseases, Tenth Revision (ICD-10). These diagnoses were classified
with the Chronic Conditions Data Warehouse (CCW) Condition Algorithms (rev.
01/2016) by the Centers for Medicare & Medicaid Services (CMS) [14]. The CCW
condition category algorithms are claims-based algorithms that indicate whether
treatment for a condition appears to have taken place; they include 27 chronic
condition categories and 33 other chronic or potentially disabling condition
categories. Table 1 shows the ICD-10 codes for depressive disorder, bipolar
disorder, and schizophrenia in the CMS-CCW algorithm. These ICD-10 codes were
used in the operational definition of the depression case group in this study.
Subjects in the case group had two or more encounters with depression diagnosis
codes, but fewer than two encounters with either bipolar disorder or schizophrenia
diagnosis codes. The inclusion and exclusion criteria are based on the Diagnostic
and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [15].
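
The case definition above can be illustrated with a minimal Python sketch. It is not the code used in this study: the category labels ("depression", "bipolar", "schizophrenia") and the function name are hypothetical, and the sketch assumes that the "fewer than two encounters" exclusion is applied to bipolar disorder and to schizophrenia separately, which the text does not state explicitly.

```python
from collections import Counter


def is_depression_case(encounter_categories, min_depression=2, max_exclusion=1):
    """Return True if a subject meets the sketched case definition.

    encounter_categories: one condition label per encounter, e.g.
    "depression", "bipolar", "schizophrenia" (hypothetical labels).
    """
    counts = Counter(encounter_categories)
    # Inclusion: two or more encounters with a depression diagnosis code.
    has_depression = counts["depression"] >= min_depression
    # Exclusion (assumption): two or more encounters with either
    # bipolar disorder or schizophrenia, counted per condition.
    excluded = (counts["bipolar"] > max_exclusion
                or counts["schizophrenia"] > max_exclusion)
    return has_depression and not excluded
```

For example, a subject with two depression encounters and one bipolar encounter would still count as a case under this reading, whereas a second bipolar encounter would exclude the subject.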
 
 
Table 1. ICD-10 codes for depressive disorder, bipolar disorder, and schizophrenia
in the Chronic Conditions Data Warehouse (CCW) Condition Algorithms (rev.
01/2016) by the Centers for Medicare & Medicaid Services (CMS).
 
  
 
 
The univariate and bivariate statistics of selected demographic and co-morbidity
variables for the depression case group (N=28,256) and the complement
comparison group (N=1,085,400), based on the operational definition of the case
group described above, are shown in Table 2. Compared with the comparison group,
the case group had a significantly higher percentage of females (68.1%), higher age
(mean 48, standard deviation 19), higher income decile (mean 5.8, standard
deviation 2.5), and higher percentages of limb disability (3.3%), neurologic
disability (0.9%), visual disability (0.7%), and hearing disability (0.6%), but a
significantly lower percentage of social security beneficiaries (1.0%). The two
groups showed no significant difference in the percentage of residents in the Seoul
metropolitan area or in cognitive disability. Most of the co-morbidity variables
showed a statistically significant difference between the case group and the
comparison group, except cerebral palsy (p = 0.121). The largest difference in
co-morbidity, measured by the ratio of percentages (twelve-year prevalence), was
for personality disorder (0.9% vs 0.1%), followed by anxiety disorder (31.8% vs
4.9%), dementia & Alzheimer’s disease (2.0% vs 0.4%), and osteoporosis (5.2% vs
1.3%).
 
  
 
 
 
 
 
 
 
 
 
Table 2. Univariate and bivariate statistics of selected demographic,
socio-economic, disability registry, and co-morbidity variables for the depression
case group (N=28,256) and the complement comparison group (N=1,085,400). For
categorical variables, the observed frequencies of the categories and percentages
(twelve-year prevalence) are reported, and for numerical variables, means (and
standard deviations) are reported. P-values are from chi-square tests for
categorical variables and t-tests for numerical variables.
 
 
CHAPTER THREE. 
ANALYTIC APPROACH 
 
For the supervised machine-learning approaches, 10% of the data (N = 111,366) was
set aside as test data, another 10% (N = 111,366) was used as validation data for
parameter tuning, and the remaining 80% (N = 890,924) was used as training data,
in order to simulate prediction performance on unseen data. Predictive models for
depression based on demographic and co-morbidity features were then trained with
decision tree, boosted trees, random forest, and support vector machine
algorithms.
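
The three-way split described above can be sketched as follows. This is an illustrative Python sketch rather than the code used in the study; the function names, the rounding rule, and the fixed seed are assumptions made for the example.

```python
import random


def split_sizes(n, test_frac=0.10, val_frac=0.10):
    """Return (n_train, n_val, n_test) for a train/validation/test split."""
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    return n - n_test - n_val, n_val, n_test


def split_indices(n, test_frac=0.10, val_frac=0.10, seed=0):
    """Randomly partition the row indices 0..n-1 into train, validation, test."""
    n_train, n_val, n_test = split_sizes(n, test_frac, val_frac)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test
```

With n = 1,113,656 (the case group plus the comparison group), `split_sizes` reproduces the partition sizes reported above: 890,924 for training and 111,366 each for validation and test.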
Tree-based algorithms recursively partition the data until the subsets are
sufficiently homogeneous or a stopping criterion has been met [16]. The best split
at each partitioning step is chosen on the basis of either purity measures, such as
entropy and information gain, or statistical significance testing. For the decision
tree, the conditional inference tree algorithm based on significance testing in the
R software package party [17] was used, with tuning parameters of minimum split 20,
minimum bucket 7, and maximum depth 30.
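
As an illustration of the purity-based alternative mentioned above, the following Python sketch computes entropy and the information gain of a binary split. Note that this is not the criterion used for the tree in this study, which relied on the significance tests of the conditional inference tree; the function names are chosen for the example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted
```

A split that separates the two classes perfectly removes all of the parent node's entropy, whereas a split that leaves the class proportions unchanged in both children has an information gain of zero.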
Boosting is a machine-learning ensemble meta-algorithm for reducing bias and
variance in supervised learning [18]. In ensemble methods, a set of weak learners is
combined to create a strong learner. In boosting, models are built sequentially,
each fitted to the residuals or the incorrectly classified observations of the
previous iteration by giving greater weight to those problematic observations. For
boosted trees, 50 trees with tuning parameters of minimum split 20, maximum depth
30, and complexity parameter 0.01 were trained with the R software package ada [19].
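
The re-weighting idea behind boosting can be sketched with the classic discrete AdaBoost update. This Python sketch is illustrative only: the text does not state which boosting variant was used, and the function name and its interface are assumptions. The weights are assumed to sum to one.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete AdaBoost round: return (alpha, new normalized weights).

    weights: current observation weights, assumed to sum to 1.
    misclassified: booleans, True where the current weak learner was wrong.
    """
    err = sum(w for w, m in zip(weights, misclassified) if m)
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    # Weight of this weak learner in the final ensemble vote.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight misclassified observations, down-weight the others.
    updated = [w * math.exp(alpha if m else -alpha)
               for w, m in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

After the update, the misclassified observations together carry half of the total weight, so the next weak learner is pushed to concentrate on them.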
 
Random forest is another ensemble machine-learning approach to supervised
learning [15]. The algorithm builds each tree on a bootstrap sample of the
observations (bootstrap aggregation, or bagging) and randomly samples the candidate
variables at each node, in order to de-correlate the trees. For the random forest,
500 trees with 7 candidate variables per split were trained based on the Gini index
with the R software package randomForest [20].
A support vector machine is a machine-learning method that identifies an optimal
hyperplane for separating the classes in the multi-dimensional feature space [21].
A kernel function is employed to map the data into a higher-dimensional space, in
order to make the classes more linearly separable. A classification support vector
machine was trained with a radial basis function kernel, using the automatic sigma
parameter estimation provided by the R software package kernlab [22].
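
For reference, the radial basis function kernel can be written in a few lines of Python. The sketch follows the parameterization exp(-sigma * ||x - y||^2), which is the form used by kernlab's Gaussian kernel; the function name is chosen for this example, and the automatic estimation of sigma is not reproduced here.

```python
import math


def rbf_kernel(x, y, sigma):
    """Radial basis function kernel: exp(-sigma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * sq_dist)
```

The kernel equals 1 for identical points and decays toward 0 as the points move apart, with larger sigma giving a faster decay.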
Next, the diagnosis of depression, as operationally defined above, was modeled
with logistic regression using socio-economic and co-morbid predictors. Among the
available socio-economic variables and co-morbid conditions in the KNHIS data, the
variables for the final logistic regression model were selected with Elastic
Net [23]. The performance of the final logistic regression model with co-morbidity
analysis was compared with that of the traditional logistic regression model
without co-morbidity analysis.
When the number of predictors is large relative to the sample size, traditional
variable selection methods may predict poorly on external datasets because they
overfit random error or noise, and their goodness of fit [24], significance [25], and
degrees of freedom [26] have been criticized as not reflecting reality. In order to
overcome this problem, regularization and shrinkage methods for regression have
been developed [27]. Elastic Net is a regularization method for regression and
classification models that combines the Least Absolute Shrinkage and Selection
Operator (LASSO) penalty (L1) and the ridge penalty (L2) [23]. The LASSO (L1)
penalty performs variable selection and dimension reduction by shrinking some
coefficients to exactly zero, while the ridge (L2) penalty shrinks the coefficients
of correlated variables toward each other. The Elastic Net penalty is a function of
two parameters, λ and α (0 ≤ α ≤ 1): λ sets the overall level of penalty, while α is
the weight of the L1 penalty and (1 – α) is the weight of the L2 penalty. Hence, in
this work, variable selection and penalization of collinear predictors were
performed with Elastic Net to develop the final logistic regression model.
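
The roles of λ and α can be made concrete with a short Python sketch of the penalty term. The factor of 1/2 on the ridge term follows the convention of the glmnet package and is an assumption about the exact parameterization, which the text does not spell out; the function name is hypothetical, and the intercept is assumed to be excluded from the coefficient list.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic Net penalty: lam * ((1 - alpha) / 2 * L2^2 + alpha * L1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in coefficients)      # LASSO part
    l2_squared = sum(b * b for b in coefficients)  # ridge part
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)
```

Setting α = 1 gives the pure LASSO penalty and α = 0 gives the pure ridge penalty; intermediate values, such as the α of 0.75 used in this work, mix the two.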
A robust way to determine the best combination of λ and α is k-fold
cross-validation. To test the performance of the predictive model, 10% of the data
(N = 111,366) was set aside as test data, and 90% of the data (N = 1,002,290) was
used as training data. 10-fold cross-validation was employed on the training data:
the observations are randomly divided into 10 folds, or partitions. One of the 10
folds is reserved as the internal validation data (N = 100,229), and the remaining
folds constitute the internal training data (N = 902,061), on which the statistical
models are fitted. After the models are fitted, that is, after the coefficients are
estimated, they are validated against the reserved fold. This process is repeated
10 times, so that every fold serves once as the validation set. This method is
preferred especially when the prediction models need to perform prediction on
external datasets, that is, data outside the overall dataset used in the research.
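
The fold construction described above can be sketched in Python as follows. This is an illustration of generic k-fold partitioning, not the internal procedure of the software used in the study; the function name and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        # All other folds together form the internal training data.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

With n = 1,002,290 and k = 10, every fold contains 100,229 observations and each internal training set contains 902,061, matching the sizes given above.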
The variables for the traditional logistic regression model without co-morbidity
analysis were derived by stepwise backward selection using Akaike’s Information
Criterion (AIC) [28]. The selected variables for the traditional logistic
regression model include sex, age, income decile, and disability registration. The
variables for the final logistic regression model were selected with Elastic Net
on the training data, as described above. The selected value of α was 0.75, and the
optimized value of λ was 0.001390648 (log(λ) = -6.577986), although other α values,
including 0.25, 0.5, and 1, did not change the results much. Figure 1 shows the plot
obtained from the Elastic Net cross-validation: the change of the Area Under the
Curve (AUC) of the ROC with different λ (in log scale) for a model with α of 0.75.
This gives a minimum of 28 variables needed for building an optimized model. Two
more variables, acute myocardial infarction and dementia, were added to the final
model because, even though these conditions are separated by the CMS-CCW algorithm,
they lie on a spectrum with ischemic heart disease and Alzheimer’s disease,
respectively. Therefore, 30 variables were selected for the final logistic
regression model: sex, age, income decile, acquired hypothyroidism, acute
myocardial infarction, Attention Deficit Hyperactivity Disorder (ADHD) and conduct
disorder, Alzheimer's disease, anemia, anxiety disorder, arthritis, atrial
fibrillation, brain injury, chronic kidney disorder, colorectal cancer, chronic
obstructive pulmonary disease (COPD), dementia, diabetes, epilepsy, glaucoma,
hearing impairment, hyperlipidemia, ischemic heart disease, liver disease (except
viral hepatitis), migraine and chronic headache, mobility impairments,
osteoporosis, peripheral vascular disease, personality disorders, stroke and
transient ischemic attack (TIA), and viral hepatitis.
  
 
 
Figure 1. The plot obtained from cross-validation of Elastic Net, showing the
change of the Area Under the Curve (AUC) of the Receiver Operating Characteristic
(ROC) with different λ (in log scale) with α of 0.75. The numeric values above the
plot indicate the number of variables selected, which ranges between 28 and 61. In
other words, 28 (when log(λ) is -6.577986) is the minimum number of variables that
guarantees the maximum AUC.
 
 
 
In order to obtain a more robust Receiver Operating Characteristic (ROC) curve
that also reflects prediction performance on external datasets, another layer of
validation was applied on the test data (N = 111,366), which had been set aside and
was unseen during the training phase. With the variables selected via Elastic Net,
a final logistic regression model with co-morbidity analysis was built, and its ROC
curve was obtained on the test data. This ROC curve was then compared with that of
the traditional logistic regression model without co-morbidity analysis. R version
3.1.3 [29] with the R software packages glmnet [30] and pROC [31] was used for this
study.
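
The area under the ROC curve used for these comparisons can be illustrated with a minimal Python sketch. It uses the rank-based interpretation of the AUC, namely the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case, with ties counted as one half. This is a pairwise illustration for small inputs, not the implementation of the pROC package, and the function name is an assumption.

```python
def auroc(scores, labels):
    """Area under the ROC curve from predicted scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count case/non-case pairs ranked correctly; ties count as one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to scores that do not discriminate between the groups, and 1.0 to perfect separation. The pairwise loop is quadratic in the sample size, so a sorting-based computation would be preferable for a cohort of this size.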
 
 
 
 
  
 
 
CHAPTER FOUR. 
RESULTS 
 
 
A conditional inference tree with 73 terminal nodes built on the training data is
shown in Figure 2. With an appropriate threshold, the decision tree model
achieved a sensitivity of 0.497, specificity of 0.872, Positive Predictive Value of
0.0935, Negative Predictive Value of 0.985, Accuracy of 0.865, and F measure of
0.156.
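
The performance measures reported above all derive from the four cells of a confusion matrix at a given threshold. The following Python sketch shows the definitions; it assumes that the F measure is the harmonic mean of PPV and sensitivity (the F1 score), which the text does not state explicitly, and the function name is chosen for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-dependent performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }
```

Because PPV depends on the prevalence of the outcome, a rare outcome such as the depression case definition in this cohort can yield a low PPV even when specificity is high.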
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
Figure 2. A conditional inference tree with 73 terminal nodes built on the training
data. With an appropriate threshold, the decision tree model achieved a sensitivity
of 0.497, specificity of 0.872, Positive Predictive Value of 0.0935, Negative
Predictive Value of 0.985, Accuracy of 0.865, and F measure of 0.156.
 
 
 
 
  
 
Figure 2.  
1) Anxiety <= 0; criterion = 1, statistic = 26896.224 
  2) Age <= 42; criterion = 1, statistic = 4068.764 
    3) Age <= 7; criterion = 1, statistic = 1360.102 
      4) ADHD_Conduct <= 0; criterion = 1, statistic = 1041.09 
        5) Age <= 2; criterion = 1, statistic = 318.724 
          6)*  weights = 79905  
        5) Age > 2 
          7) Female <= 0; criterion = 1, statistic = 26.68 
            8) Epilepsy <= 0; criterion = 1, statistic = 27.449 
              9)*  weights = 24129  
            8) Epilepsy > 0 
              10)*  weights = 295  
          7) Female > 0 
            11)*  weights = 22862  
      4) ADHD_Conduct > 0 
        12) Age <= 2; criterion = 1, statistic = 47.248 
          13)*  weights = 1390  
        12) Age > 2 
          14) Liver_except_viral <= 0; criterion = 0.984, statistic = 13.512 
            15)*  weights = 1451  
          14) Liver_except_viral > 0 
            16)*  weights = 84  
    3) Age > 7 
      17) Female <= 0; criterion = 1, statistic = 962.568 
        18)*  weights = 215405  
      17) Female > 0 
        19) Migraine_ChronicHeadache <= 0; criterion = 1, statistic = 351.285 
          20) Liver_except_viral <= 0; criterion = 1, statistic = 304.994 
            21)*  weights = 146761  
          20) Liver_except_viral > 0 
            22) Hyperlipidemia <= 0; criterion = 0.999, statistic = 142.916 
              23)*  weights = 13252  
            22) Hyperlipidemia > 0 
              24)*  weights = 1470  
        19) Migraine_ChronicHeadache > 0 
          25) Hyperlipidemia <= 0; criterion = 1, statistic = 52.031 
            26) IschemicHeart <= 0; criterion = 1, statistic = 43.501 
              27)*  weights = 24059  
            26) IschemicHeart > 0 
              28)*  weights = 562  
          25) Hyperlipidemia > 0 
            29)*  weights = 1999  
  2) Age > 42 
    30) Migraine_ChronicHeadache <= 0; criterion = 1, statistic = 1123.736 
      31) Arthritis <= 0; criterion = 1, statistic = 486.759 
        32) Female <= 0; criterion = 1, statistic = 135.843 
          33) StrokeTIA <= 0; criterion = 1, statistic = 75.902 
            34) BenignProstatic <= 0; criterion = 1, statistic = 62.162 
              35)*  weights = 46398  
            34) BenignProstatic > 0 
              36)*  weights = 6183  
          33) StrokeTIA > 0 
            37) IschemicHeart <= 0; criterion = 0.987, statistic = 34.764 
              38)*  weights = 5333  
            37) IschemicHeart > 0 
              39) Glaucoma <= 0; criterion = 0.984, statistic = 13.552 
                40)*  weights = 931  
              39) Glaucoma > 0 
                41)*  weights = 102  
        32) Female > 0 
          42) ChronicKidney <= 0; criterion = 1, statistic = 41.349 
            43) Osteoporosis <= 0; criterion = 1, statistic = 35.408 
              44) StrokeTIA <= 0; criterion = 1, statistic = 41.302 
                45) Asthma <= 0; criterion = 0.999, statistic = 34.791 
                  46) IschemicHeart <= 0; criterion = 0.999, statistic = 43.574 
                    47)*  weights = 23909  
                  46) IschemicHeart > 0 
                    48)*  weights = 1324  
                45) Asthma > 0 
                  49) PeripheralVascularDisease <= 0; criterion = 0.983, statistic = 13.497 
                    50)*  weights = 3552  
                  49) PeripheralVascularDisease > 0 
                    51)*  weights = 122  
              44) StrokeTIA > 0 
                52)*  weights = 3170  
            43) Osteoporosis > 0 
              53)*  weights = 1395  
          42) ChronicKidney > 0 
            54) Liver_except_viral <= 0; criterion = 0.996, statistic = 19.838 
              55) HeartFailure <= 0; criterion = 0.966, statistic = 12.089 
                56)*  weights = 1615  
              55) HeartFailure > 0 
                57)*  weights = 164  
            54) Liver_except_viral > 0 
              58)*  weights = 284  
      31) Arthritis > 0 
        59) Cataract <= 0; criterion = 1, statistic = 158.759 
          60) Female <= 0; criterion = 1, statistic = 117.624 
            61)*  weights = 22723  
          60) Female > 0 
            62) Hyperlipidemia <= 0; criterion = 1, statistic = 71.826 
              63)*  weights = 27163  
            62) Hyperlipidemia > 0 
              64)*  weights = 7894  
        59) Cataract > 0 
          65) Female <= 0; criterion = 1, statistic = 28.402 
            66) BenignProstatic <= 0; criterion = 0.999, statistic = 18.62 
              67)*  weights = 4557  
            66) BenignProstatic > 0 
              68)*  weights = 2118  
          65) Female > 0 
            69) Asthma <= 0; criterion = 1, statistic = 22.132 
              70) Dementia <= 0; criterion = 0.996, statistic = 16.032 
                71) Hyperlipidemia <= 0; criterion = 0.996, statistic = 16.309 
                  72)*  weights = 7359  
                71) Hyperlipidemia > 0 
                  73)*  weights = 2327  
              70) Dementia > 0 
                74)*  weights = 146  
            69) Asthma > 0 
              75) Income <= 7; criterion = 0.997, statistic = 16.843 
                76)*  weights = 2513  
              75) Income > 7 
                77)*  weights = 1308  
    30) Migraine_ChronicHeadache > 0 
      78) Female <= 0; criterion = 1, statistic = 110.777 
        79) Epilepsy <= 0; criterion = 1, statistic = 39.041 
          80) Liver_except_viral <= 0; criterion = 1, statistic = 22.325 
            81) Anemia <= 0; criterion = 1, statistic = 29.207 
              82) DFAB_MINOR_TRUE <= 0; criterion = 0.996, statistic = 16.155 
                83) COPD <= 0; criterion = 0.976, statistic = 19.611 
                  84)*  weights = 5607  
                83) COPD > 0 
                  85)*  weights = 938  
              82) DFAB_MINOR_TRUE > 0 
                86)*  weights = 354  
            81) Anemia > 0 
              87)*  weights = 319  
          80) Liver_except_viral > 0 
            88) StrokeTIA <= 0; criterion = 0.972, statistic = 20.781 
              89)*  weights = 2250  
            88) StrokeTIA > 0 
              90)*  weights = 542  
        79) Epilepsy > 0 
          91)*  weights = 161  
      78) Female > 0 
        92) Arthritis <= 0; criterion = 1, statistic = 49.827 
          93) ViralHepatitis <= 0; criterion = 0.973, statistic = 23.728 
            94)*  weights = 5231  
          93) ViralHepatitis > 0 
            95)*  weights = 281  
        92) Arthritis > 0 
          96) IschemicHeart <= 0; criterion = 1, statistic = 38.004 
            97) Liver_except_viral <= 0; criterion = 1, statistic = 24.682 
              98) StrokeTIA <= 0; criterion = 0.982, statistic = 13.323 
                99)*  weights = 8090  
              98) StrokeTIA > 0 
                100)*  weights = 1817  
            97) Liver_except_viral > 0 
              101) MobilityImpairments <= 0; criterion = 0.967, statistic = 12.183 
                102)*  weights = 2480  
              101) MobilityImpairments > 0 
                103)*  weights = 54  
          96) IschemicHeart > 0 
            104)*  weights = 2438  
1) Anxiety > 0 
  105) Age <= 37; criterion = 1, statistic = 305.218 
    106) Age <= 7; criterion = 1, statistic = 105.772 
      107)*  weights = 1578  
    106) Age > 7 
      108) Liver_except_viral <= 0; criterion = 1, statistic = 51.925 
        109) ADHD_Conduct <= 0; criterion = 1, statistic = 44.258 
          110) Age <= 22; criterion = 1, statistic = 28.168 
            111)*  weights = 4159  
          110) Age > 22 
            112) Female <= 0; criterion = 0.998, statistic = 17.112 
              113)*  weights = 2398  
            112) Female > 0 
              114)*  weights = 4538  
        109) ADHD_Conduct > 0 
          115)*  weights = 120  
      108) Liver_except_viral > 0 
        116) PersonalityDisorders <= 0; criterion = 0.955, statistic = 11.565 
          117)*  weights = 2709  
        116) PersonalityDisorders > 0 
          118)*  weights = 39  
  105) Age > 37 
    119) Migraine_ChronicHeadache <= 0; criterion = 1, statistic = 134.844 
      120) Income <= 4; criterion = 1, statistic = 55.794 
        121) Liver_except_viral <= 0; criterion = 1, statistic = 25.2 
          122) Osteoporosis <= 0; criterion = 0.974, statistic = 12.656 
            123)*  weights = 4347  
          122) Osteoporosis > 0 
            124)*  weights = 222  
        121) Liver_except_viral > 0 
          125)*  weights = 1271  
      120) Income > 4 
        126) Cataract <= 0; criterion = 1, statistic = 37.512 
          127) Liver_except_viral <= 0; criterion = 1, statistic = 26.664 
            128) StrokeTIA <= 0; criterion = 1, statistic = 24.121 
              129) Income <= 6; criterion = 0.973, statistic = 12.576 
                130)*  weights = 2206  
              129) Income > 6 
                131)*  weights = 3779  
            128) StrokeTIA > 0 
              132)*  weights = 863  
          127) Liver_except_viral > 0 
            133)*  weights = 2141  
        126) Cataract > 0 
          134)*  weights = 3123  
    119) Migraine_ChronicHeadache > 0 
      135) StrokeTIA <= 0; criterion = 1, statistic = 26.293 
        136) ChronicKidney <= 0; criterion = 1, statistic = 23.634 
          137) Osteoporosis <= 0; criterion = 0.983, statistic = 13.408 
            138) COPD <= 0; criterion = 0.992, statistic = 14.872 
              139)*  weights = 4888  
            138) COPD > 0 
              140)*  weights = 1022  
          137) Osteoporosis > 0 
            141)*  weights = 523  
 
        136) ChronicKidney > 0 
          142)*  weights = 653  
      135) StrokeTIA > 0 
        143) Dementia <= 0; criterion = 0.96, statistic = 11.796 
          144)*  weights = 2140  
        143) Dementia > 0 
          145)*  weights = 64 
 
Receiver Operating Characteristic (ROC) curves of the supervised machine-learning 
approaches, including boost trees, random forest, and support vector machine, are 
shown in Figure 3. The AUROCs achieved on the test set were 0.793, 0.739, and 0.660 
for boost trees, random forest, and support vector machine, respectively.
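As a rough illustration of what an AUROC value measures, the following minimal 
Python sketch computes it as the probability that a randomly chosen case receives a 
higher predicted score than a randomly chosen non-case, counting ties as one half. 
The function name and the toy labels and scores are hypothetical; they are not taken 
from the KNHIS data, and the sketch is not the R code used in this work.

```python
def auroc(labels, scores):
    """AUROC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = depression case, 0 = non-case.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
toy_auc = auroc(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

The pairwise loop is quadratic in the number of observations, so it is suitable only 
for small examples; rank-based formulas give the same value on large test sets.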
The boost trees model achieved a training error of 0.024 and an Out-of-Bag (OOB) 
error of 0.023 at 6 iterations. The variables actually used in tree construction 
include, in order of decreasing usage frequency, Age, Anxiety, Dysthymia, 
Adjustment Disorder, Arthritis, Migraine/Chronic Headache, Female, Epilepsy, Liver 
Disease (except viral), Dementia, Stroke/Transient Ischemic Attack, 
Fibromyalgia/Pain/Fatigue, Hyperlipidemia, Benign Prostatic Hyperplasia, Cataract, 
Glaucoma, and Ischemic Heart Disease.
The random forest achieved an OOB error-rate estimate of 0.024. Variables with a 
mean decrease in Gini index of more than 100 include Anxiety, Age, Income, 
Dysthymia, Migraine/Headache, Arthritis, Fibromyalgia/Pain/Fatigue, Hyperlipidemia, 
Diabetes, and Asthma.
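To make the "decrease in Gini index" concrete, the sketch below computes the 
impurity decrease produced by a single binary split of a two-class node. The counts 
are hypothetical, and weighting the decrease by node size is an assumption made here 
for illustration; the exact scaling used by a given random forest implementation 
may differ.

```python
def gini(pos, neg):
    """Gini impurity of a two-class node with the given class counts."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(left, right):
    """Size-weighted impurity decrease of a split.

    left and right are (pos, neg) counts in the two child nodes.
    """
    lp, ln = left
    rp, rn = right
    n_left, n_right = lp + ln, rp + rn
    parent = gini(lp + rp, ln + rn)
    return ((n_left + n_right) * parent
            - n_left * gini(lp, ln)
            - n_right * gini(rp, rn))


# Hypothetical split: the left child is enriched for cases.
informative = gini_decrease((30, 10), (10, 50))
# Hypothetical split that leaves the class mix unchanged in both children.
uninformative = gini_decrease((20, 30), (20, 30))
print(round(informative, 3), round(uninformative, 3))
```

Summing such decreases over every split that uses a variable, and averaging over 
trees, yields a variable importance score of the kind reported for the random forest.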
For the support vector machine, the probability classification model achieved a 
training error of 0.037 with a cost parameter of 1, a Gaussian radial basis kernel 
function sigma parameter of 0.014, and 2848 support vectors.
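The role of the sigma parameter can be sketched as follows, assuming the Gaussian 
radial basis kernel is parameterized as k(x, x') = exp(-sigma * ||x - x'||^2), which 
is the form documented for the kernlab package. The feature vectors are hypothetical 
binary co-morbidity indicators, and any feature scaling applied before model 
fitting is ignored.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel exp(-sigma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


sigma = 0.014  # sigma value reported for the fitted model

# Hypothetical indicator vectors (1 = condition present).
same = rbf_kernel([1, 0, 1], [1, 0, 1], sigma)
differ_in_two = rbf_kernel([1, 0, 1], [0, 0, 0], sigma)
print(same, round(differ_in_two, 4))
```

With a small sigma, two observations that differ in a few indicators still have a 
kernel value close to 1, so the resulting decision boundary is comparatively smooth.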
 
 
 
 
  
Figure 3. Receiver Operating Characteristic curves of supervised machine-learning 
approaches, including decision tree, boost trees, random forest, and support vector 
machine. (ada: boost trees model built with the R software package ada19. rf: random 
forest built with the R software package randomForest20. ksvm: support vector 
machine built with the R software package kernlab22.)  
 
 
 
 
The Odds Ratio (OR) plot for the traditional logistic regression model without 
co-morbidity analysis is presented in Figure 4 (a), and that for the final logistic 
regression model with co-morbidity analysis in Figure 4 (b). Notably, the adjusted 
ORs for the same variables differ between the two models. For example, the adjusted 
OR of being female is 2.07 in the traditional logistic regression model without 
co-morbidity analysis, but 1.63 in the final logistic regression model with 
co-morbidity analysis. Likewise, the adjusted OR of age is 1.03 in the traditional 
model but 1.01 in the final model, and the adjusted OR of income decile is 1.04 in 
the traditional model but 1.02 in the final model. The ORs for the disability 
registration variables in the traditional logistic regression model without 
co-morbidity analysis ranged from 1.03 (cognitive disability) to 1.42 (hearing 
disability). The ORs for the co-morbidity variables in the final logistic regression 
model with co-morbidity analysis ranged from 0.78 (acute myocardial infarction) to 
5.81 (ADHD and conduct disorder).
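For readers less familiar with how an adjusted OR and its 95% confidence interval 
relate to a logistic regression coefficient, the sketch below applies the standard 
Wald formulas OR = exp(beta) and CI = exp(beta +/- 1.96 * SE). The coefficient is 
back-calculated from the reported OR of 1.63 for being female; the standard error is 
a hypothetical value chosen only for illustration and is not a result of this study.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) for a logistic regression coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


beta_female = math.log(1.63)  # coefficient implied by the reported OR (about 0.49)
hypothetical_se = 0.05        # assumed standard error, NOT taken from the thesis
or_female, ci_low, ci_high = odds_ratio_with_ci(beta_female, hypothetical_se)
print(round(or_female, 2), round(ci_low, 2), round(ci_high, 2))
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the 
OR itself, which is why OR plots are usually drawn on a logarithmic axis.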
 
  
Figure 4 (a) Odds Ratio (OR) plot of the traditional logistic regression model 
without co-morbidity analysis. (b) Odds Ratio (OR) plot of the final logistic 
regression model with co-morbidity analysis. The point values indicate the adjusted 
Odds Ratios, horizontal lines indicate the 95% confidence intervals, and the asterisks 
indicate the level of statistical significance (***: p < 0.001, **: p < 0.01, *: p < 0.05). 
The variables included in both the traditional model without co-morbidity analysis 
(a) and the final model with co-morbidity analysis (b) are highlighted in yellow. The 
adjusted Odds Ratios can differ if variable selection is different. For example, the 
adjusted OR of being female is 2.07 from the traditional logistic regression model 
without co-morbidity analysis, but is 1.63 from the final logistic regression model 
with co-morbidity analysis. 
 
 
 
 
Figure 4 (a) 
 
 
  
Figure 4 (b) 
  
Receiver Operating Characteristic (ROC) curves of the traditional logistic 
regression model without co-morbidity analysis and the final logistic regression 
model with co-morbidity analysis on the test data, which were unseen during the 
training phase, are shown in Figure 5. The Area Under the Curve (AUC) of the ROC 
increased from 0.6992 (the traditional logistic regression model without co-morbidity 
analysis) to 0.7818 (the final logistic regression model with co-morbidity analysis). 
Selected performance measures for the 30-predictor co-morbidity model, including 
sensitivity, specificity, Positive Predictive Value, Negative Predictive Value, 
accuracy, and F measure at nine distinct threshold points on the ROC curve, are 
shown in Table 3.
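The threshold-dependent measures named above all follow from the confusion matrix 
at a chosen cut-off. The following minimal sketch shows the calculation for one 
threshold; the toy labels and scores are hypothetical, and the F measure is assumed 
here to be the balanced F1 score, which the text does not state explicitly.

```python
def threshold_metrics(labels, scores, threshold):
    """Confusion-matrix measures when score >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are returned as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # assumed F measure (beta = 1)
    }


# Hypothetical toy data: 1 = depression case, 0 = non-case.
example_labels = [1, 0, 1, 0, 0, 1]
example_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6]
example_metrics = threshold_metrics(example_labels, example_scores, threshold=0.5)
for name, value in example_metrics.items():
    print(name, round(value, 3))
```

Repeating the call over a grid of thresholds traces out the trade-off between 
sensitivity and specificity that the ROC curve summarizes.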
 
  
Figure 5. Receiver Operating Characteristic (ROC) curve of the traditional 
logistic regression model without co-morbidity analysis (red) and the final logistic 
regression model with co-morbidity analysis (blue) on the test data, which was 
unseen during the training phase. The Area Under the Curve (AUC) of the ROC 
increased from 0.6992 (red) to 0.7818 (blue). 
 
 
 
Table 3. Selected performance measures of the 30-predictor co-morbidity model, 
including sensitivities, specificities, Positive Predictive Values (PPV), Negative 
Predictive Values (NPV), Accuracies, and F measures for nine distinct threshold 
points on the blue curve of the Receiver Operating Characteristic (ROC) shown in 
Figure 5. The performance measures were evaluated with the test data, which was 
unseen during the training phase. 
 
 
 
 
  
CHAPTER FIVE. 
DISCUSSION 
 
Given that depression remains significantly under-diagnosed in all settings of the 
healthcare system4,5, data-driven prediction models can play an important role in 
screening for depression. Although previous work has shown promising results on the 
ability to predict future diagnoses of depression, such models have not explicitly 
used co-morbid medical conditions as independent predictors. 
Depression is characteristically affected by many medical conditions9, and can in 
turn affect certain medical conditions10. In 2013, DSM-5 included psychological 
factors affecting other medical conditions (PFAOMC) as a new diagnosis15. PFAOMC are 
factors that may precipitate or exacerbate a medical condition, interfere with its 
treatment, or contribute to morbidity and mortality. The mechanisms of PFAOMC 
include promotion of known risk factors (e.g., smoking), influence on the underlying 
pathophysiology (e.g., bronchospasm in asthma), and interference with treatment 
(e.g., poor compliance). Therefore, addressing the co-morbidities related to 
depression is an important step in understanding the course of depression, and the 
analysis of these co-morbidities will likely improve the performance of depression 
risk prediction models. 
Machine learning methods differ from traditional statistical models in that their 
primary hypothesis is the existence of a pattern in the set of predictor variables 
that predicts the outcome32. While traditional statistical modeling pursues a simple 
model that fits reasonably well, machine learning methods consider complex 
relationships among the variables, which is advantageous for predictive modeling. 
Where the outcome is related to multiple highly correlated features, a simple 
statistical model may not work well because its assumptions are violated. Therefore, 
machine learning approaches were employed in this work to confirm the potential 
benefit of complex models for the prediction of depression, especially with multiple 
co-morbidity features. 
Rule induction methods, or symbolic methods, such as decision tree algorithms, 
may provide interpretability, which allows justification and explanation of 
unexpected solutions to new problems33. However, such methods often provide poorer 
predictive performance than so-called black-box sub-symbolic methods, such as the 
support vector machine. On the other hand, given that machine learning approaches 
need to be combined with human experts’ interpretation in an integrated 
man-and-machine approach34 for accuracy and liability, explanation ability and 
transparency35 are essential, and the black-box nature of a machine learning system 
may be a critical drawback. 
The single conditional inference tree shown in Figure 2 offers interpretable and 
transparent decision rules, but the discrete nature of its classification rules does 
not allow flexibility in tuning the threshold. Predictive performance measured by 
AUROC was highest for boost trees (0.793), followed by random forest (0.739) and 
support vector machine (0.660), as shown in Figure 3. The support vector machine has 
a great advantage in handling a large number of variables with a relatively small 
number of observations, through mathematical optimization based on the support 
vectors lying at the class boundaries. However, in this work, the number of features, 
including the 60 co-morbid conditions, was not particularly large, and the data have 
more than 1 million observations. Therefore, the support vector machine added little 
value to predictive modeling in this work. 
Although the boost trees model achieved the highest predictive performance as 
measured by AUROC, it has poor interpretability and explanation ability. Although 
the variables actually used in tree construction by the boosting algorithm may 
suggest variable importance, the exact role of each variable is difficult to explain 
or quantify, because the algorithm builds sequential models by up-weighting the 
problematic observations. Therefore, the boost trees model is also not the best 
model for an integrated man-and-machine approach34 for accuracy and liability. 
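The re-weighting of problematic observations can be sketched with one round of the 
classical discrete AdaBoost update. This is an illustrative simplification: whether 
the boost trees model in this work used exactly this variant is an assumption, and 
the labels, predictions, and function name are hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One discrete AdaBoost round with labels coded as -1 or +1.

    Returns the learner weight alpha and the normalized observation weights.
    """
    total = sum(weights)
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h) / total
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified observations (y * h == -1) are scaled up, others down.
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]


# Hypothetical round: five equally weighted observations, the last misclassified.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_reweight([0.2] * 5, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified observation carries half of the total 
weight, so the next tree concentrates on it. A variable's contribution is therefore 
spread over many re-weighted trees, which is what makes its role hard to quantify.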
In order to balance predictive performance and interpretability, regularized 
regression methods were employed in this study. The AUC of the ROC increased 
from 0.6992 (the traditional logistic regression model without co-morbidity analysis) 
to 0.7818 (the final logistic regression model with co-morbidity analysis), after 
applying the optimized variable selection from Elastic Net (Figure 5). Because 
neither questionnaire-based screening results (e.g., Patient Health Questionnaire36) 
nor physician clinical notes are available in claims data, there is no direct 
information about patients’ moods or symptoms. Given this limitation, the gain in 
AUC can be regarded as substantial, and the inclusion of co-morbidity analysis 
could be a key component in improving the performance of depression risk 
prediction models. 
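The way Elastic Net performs variable selection can be sketched as follows, assuming 
the penalty is written as lambda * (alpha * L1 + (1 - alpha) / 2 * L2), the 
parameterization described for coordinate descent in reference 30. The update shown 
is the simplified form for a standardized predictor under squared-error loss; the 
logistic case applies the same soft-thresholding inside a weighted least-squares 
loop. All numeric values and function names are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic Net penalty: a mix of the L1 norm and the squared L2 norm."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def coordinate_update(z, lam, alpha):
    """Coefficient update for one standardized predictor (squared-error case).

    z is the unpenalized estimate computed from the partial residuals.
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


penalty = elastic_net_penalty([1.0, -2.0, 0.0], lam=0.5, alpha=0.5)
kept = coordinate_update(2.0, lam=0.5, alpha=0.5)     # strong predictor survives
dropped = coordinate_update(0.2, lam=0.5, alpha=0.5)  # weak predictor is removed
print(penalty, kept, dropped)
```

Predictors whose coefficients are shrunk exactly to zero drop out of the model, 
which is the sense in which the penalty acts as a variable selection step before 
the final logistic regression model is fitted.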
Furthermore, since the odds ratio estimates of some variables change after 
adjusting for co-morbid conditions, the adjusted ORs in the final logistic regression 
model with co-morbidity analysis could reflect estimates closer to the truth. For 
example, the adjusted OR of being female is 2.07 in the traditional logistic 
regression model without co-morbidity analysis, but 1.63 in the final logistic 
regression model with co-morbidity analysis (Figure 4). Given that females have a 
higher co-morbidity burden in general, the traditional logistic regression model 
yields a higher OR for females because it does not adjust for co-morbidities. 
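This confounding argument can be illustrated with a small worked example. The sketch 
below compares a crude OR with a Mantel-Haenszel OR stratified by the presence of a 
co-morbidity. Note that this study adjusted through logistic regression, not 
through the Mantel-Haenszel estimator, which is swapped in here only because it 
needs no model fitting. All counts are hypothetical and were chosen so that females 
are over-represented in the co-morbid stratum.

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: a, b = exposed cases/non-cases; c, d = unexposed."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled OR over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical counts: (female cases, female non-cases, male cases, male non-cases)
strata = [
    (30, 470, 40, 960),   # stratum without the co-morbidity
    (150, 850, 60, 440),  # stratum with the co-morbidity
]
crude_table = tuple(sum(column) for column in zip(*strata))
crude_or = odds_ratio(*crude_table)
adjusted_or = mantel_haenszel_or(strata)
print(round(crude_or, 2), round(adjusted_or, 2))
```

In this toy example the crude OR for being female exceeds the stratum-adjusted OR, 
mirroring the direction of the change from 2.07 to 1.63 reported in this work.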
The one million twelve-year Korea National Health Insurance Service (KNHIS) 
longitudinal data used in this study have many advantages for fitting large-scale 
statistical models. As KNHIS is the only health insurance system and covers all 
Korean citizens, the random sample cohort from KNHIS can be considered nationally 
representative health data37. Factors arising from multiple health insurance systems 
that affect the diagnosis of depression (e.g., some health insurance plans might 
have lower coverage for mental health) can be avoided in the single health insurance 
system, and therefore higher statistical power can be achieved. Therefore, the 
adjusted ORs from the logistic regression model with co-morbidity analysis may 
represent the risk associated with each variable in the population. 
Caution is needed when interpreting the epidemiologic results from this study, 
however. With its large sample size, this study is over-powered to detect small 
effects, so more emphasis should be placed on the magnitude of the estimates than on 
their statistical significance. Furthermore, the operational definition of the 
depression case group is based on the diagnosis codes in the claims data. Therefore, 
the depression risk prediction model in this study predicts the probability that a 
person visits a physician and is diagnosed with depression, which limits its ability 
to detect the underdiagnosed depressed population. However, it is notable that the 
findings are consistent with previous studies that revealed the relationship between 
co-morbidities and depression in the Korean population, including a cross-sectional 
survey study38 and the Korean Longitudinal Study of Aging39. 
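The point about over-powering can be made concrete with a Wald test on a 2x2 table. 
In the sketch below, the same small OR is evaluated at two sample sizes that differ 
by a factor of 100. The counts are hypothetical and unrelated to the KNHIS cohort.

```python
import math


def wald_z(a, b, c, d):
    """Wald z statistic for the log odds ratio of a 2x2 table."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or / se


# Hypothetical table: exposed cases, exposed non-cases, unexposed cases, unexposed non-cases.
small_study = (110, 890, 100, 900)
large_study = tuple(100 * n for n in small_study)  # same proportions, 100x the size

z_small = wald_z(*small_study)
z_large = wald_z(*large_study)
print(round(z_small, 2), round(z_large, 2))
```

The OR is identical in both tables (about 1.11), yet only the larger table exceeds 
the conventional 1.96 threshold, because the standard error shrinks with the square 
root of the sample size. This is why the magnitude of an estimate is more 
informative than its p-value in a cohort of this scale.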
In order to develop a better depression risk prediction model that can also 
address the currently underdiagnosed depressed population, reaching out to that 
population with gold standard screening tools will be necessary. Further work is 
also needed to investigate possible differences in co-morbidity patterns across 
gender, age group, and socio-economic status. The higher prevalence of depression 
among females has been discussed as being related to both biological and 
environmental factors40. Features of depression can also vary among age groups41, 
and certain age groups may have additional risks42. Socio-economic factors43 of 
depression and disparity44 in depression treatment are also very important topics 
in public health.  
Additional research is needed to optimize the chronic condition clusters, or 
categories. Although the CMS-CCW algorithm is a well-validated algorithm using ICD 
codes, optimized clusters developed using insurance claims data might differ from 
the actual clinical manifestation of depression. Even within clinical practice, 
disease classification or categorization can differ among clinical specialties and 
subspecialties. Therefore, optimization of the co-morbid condition clusters will be 
needed for better prediction models45. Furthermore, integration of medication 
prescription data will allow better operational definitions with fewer false 
positives. Further research is also needed on variable interactions (e.g., epilepsy 
in a young female may have a different effect from epilepsy in an elderly male), as 
well as on time-to-event analysis (e.g., Cox proportional hazards regression46) 
dealing with time-dependent covariates. 
Although the chronic co-morbid diseases studied in this work were limited to those 
available in CMS-CCW in order to maintain consistency in the operational definition 
of each condition, there are many other conditions related to depression, such as 
substance abuse47 and abortion48. Furthermore, the development of technologies may 
allow social media data49, genome50, exposome51, and patient-generated health 
data52 to be incorporated into the prediction model. In practical application of the 
prediction model, a thorough review of past medical history and family history and 
of detailed symptoms, as well as computational time series analysis of the trend and 
variance of laboratory tests, may allow more precise prediction. 
Although the focus of this study was on predicting the existence of depression 
based on the chronic co-morbid diseases appearing in the health insurance claims 
data, further studies will be needed to confirm whether co-morbidity analysis can 
also improve the performance of prediction models for treatment response53,54, of 
prediction models based on lexical data55, and of those based on electronic health 
records56,57. 
In conclusion, the inclusion of co-morbidity analysis with the Elastic Net 
regression model yielded depression risk prediction performance comparable to that 
of supervised machine-learning methods, while providing better interpretability. 
The co-morbidity adjusted ORs from the Elastic Net regression model may indicate 
the true independent OR of each predictor variable. Further studies will be needed 
to cover the currently underdiagnosed depressed population, as well as to optimize 
the chronic condition clusters.  
  
BIBLIOGRAPHY 
 
 
1.  Kessler RC, Berglund P, Demler O, Jin R, Merikangas KR, Walters EE. 
Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the 
National Comorbidity Survey Replication. Arch Gen Psychiatry. 2005 
Jun;62(6):593–602.  
2.  Pietrzak RH, Kinley J, Afifi TO, Enns MW, Fawcett J, Sareen J. Subsyndromal 
depression in the United States: prevalence, course, and risk for incident 
psychiatric outcomes. Psychol Med. 2013 Jul;43(7):1401–14.  
3.  Donohue JM, Pincus HA. Reducing the societal burden of depression. 
Pharmacoeconomics. 2007;25(1):7–24.  
4.  Mitchell AJ, Vaze A, Rao S. Clinical diagnosis of depression in primary care: a 
meta-analysis. Lancet Lond Engl. 2009 Aug 22;374(9690):609–19.  
5.  Tylee A, Gandhi P. The importance of somatic symptoms in depression in 
primary care. Prim Care Companion J Clin Psychiatry. 2005;7(4):167–76.  
6.  Bruce ML, Raue PJ, Reilly CF, Greenberg RL, Meyers BS, Banerjee S, et al. 
Clinical effectiveness of integrating depression care management into medicare 
home health: the Depression CAREPATH Randomized trial. JAMA Intern Med. 
2015 Jan;175(1):55–64.  
7.  Huang SH, LePendu P, Iyer SV, Tai-Seale M, Carrell D, Shah NH. Toward 
personalizing treatment for depression: predicting diagnosis and severity. J Am 
Med Inform Assoc. 2014;21(6):1069–1075.  
8.  Jin H, Wu S, Di Capua P. Development of a Clinical Forecasting Model to 
Predict Comorbid Depression Among Diabetes Patients and an Application in 
Depression Screening Policy Making. Prev Chronic Dis. 2015;12:E142.  
9.  Hirschfeld RMA. The Comorbidity of Major Depression and Anxiety 
Disorders: Recognition and Management in Primary Care. Prim Care 
Companion J Clin Psychiatry. 2001 Dec;3(6):244–54.  
10.  Fava GA, Fabbri S, Sirri L, Wise TN. Psychological factors affecting medical 
condition: a new proposal for DSM-V. Psychosomatics. 2007 Apr;48(2):103–
11.  
11.  Kim L, Kim J, Kim S. A guide for the utilization of Health Insurance Review 
and Assessment Service National Patient Samples. Epidemiol Health. 
2014;36:e2014008.  
12.  Lee J, Lee JS, Park S-H, Shin SA, Kim K. Cohort Profile: The National Health 
Insurance Service–National Sample Cohort (NHIS-NSC), South Korea. Int J 
Epidemiol. 2016 Jan 28;dyv319.  
13.  Cochran WG. Sampling techniques. 3rd ed. 1977 [cited 2016 Jul 3]. Available from: 
http://agris.fao.org/agris-search/search.do?recordID=XF2015028634 
14.  Gorina Y, Kramarow EA. Identifying Chronic Conditions in Medicare Claims 
Data: Evaluating the Chronic Condition Data Warehouse Algorithm. Health 
Serv Res. 2011 Oct 1;46(5):1610–27.  
15.  American Psychiatric Association. Diagnostic and statistical manual of mental 
disorders (DSM-5®). American Psychiatric Pub; 2013.  
16.  Kuhn M, Johnson K. Applied predictive modeling. Springer; 2013.  
17.  Hothorn T, Hornik K, Zeileis A. ctree: Conditional Inference Trees. [cited 2016 
Jul 16]; Available from: 
http://cran.nexr.com/web/packages/partykit/vignettes/ctree.pdf 
18.  Williams G. Data mining with rattle and R: the art of excavating data for 
knowledge discovery. Springer Science & Business Media; 2011.  
19.  Culp M, Johnson K, Michailidis G. ada: An r package for stochastic boosting. J 
Stat Softw. 2006;17(2):9.  
20.  Liaw A, Wiener M. Classification and regression by randomForest. R News. 
2002;2(3):18–22.  
21.  James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical 
learning. Springer; 2013.  
22.  Karatzoglou A, Smola A, Hornik K, Zeileis A. kernlab-an S4 package for kernel 
methods in R. 2004 [cited 2016 Jul 23]; Available from: 
http://epub.wu.ac.at/1048/ 
23.  Zou H, Hastie T. Regularization and variable selection via the elastic net. J R 
Stat Soc Ser B Stat Methodol. 2005;67(2):301–320.  
24.  Rencher AC, Pun FC. Inflation of R2 in best subset regression. Technometrics. 
1980;22(1):49–53.  
25.  Wilkinson L, Dallal GE. Tests of significance in forward selection regression 
with an F-to-enter stopping rule. Technometrics. 1981;23(4):377–380.  
26.  Hurvich CM, Tsai C-L. The impact of model selection on inference in linear 
regression. Am Stat. 1990;44(3):214–217.  
27.  Tibshirani R. Regression Shrinkage and Selection via the Lasso. J R Stat Soc 
Ser B Methodol. 1996;58(1):267–88.  
28.  Zucchini W. An Introduction to Model Selection. J Math Psychol. 2000 
Mar;44(1):41–61.  
29.  R Core Team. R: A language and environment for statistical computing [Internet]. 
Vienna, Austria: R Foundation for Statistical Computing; 2013. Document freely 
available on the internet at: http://www.r-project.org. 2015.  
30.  Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear 
models via coordinate descent. J Stat Softw. 2010;33(1):1.  
31.  Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: 
an open-source package for R and S+ to analyze and compare ROC curves. 
BMC Bioinformatics. 2011;12(1):1.  
32.  Waljee AK, Higgins PD. Machine learning in medicine: a primer for physicians. 
Am J Gastroenterol. 2010;105(6):1224.  
33.  Lavrač N. Machine learning for data mining in medicine. In: Joint European 
Conference on Artificial Intelligence in Medicine and Medical Decision Making 
[Internet]. Springer; 1999 [cited 2016 Jul 23]. p. 47–62. Available from: 
http://link.springer.com/chapter/10.1007/3-540-48720-4_4 
34.  Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–1930.  
35.  Kononenko I. Machine learning for medical diagnosis: history, state of the art 
and perspective. Artif Intell Med. 2001;23(1):89–109.  
36.  Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression 
severity measure. J Gen Intern Med. 2001;16(9):606–613.  
37.  Park SM, Son KY, Park J-H, Cho B. Disparities in short-term and long-term all-
cause mortality among Korean cancer patients with and without preexisting 
disabilities: a nationwide retrospective cohort study. Support Care Cancer Off J 
Multinatl Assoc Support Care Cancer. 2012 May;20(5):963–70.  
38.  Yun YH, Kim SH, Lee KM, Park SM, Kim YM. Age, sex, and comorbidities 
were considered in comparing reference data for health-related quality of life in 
the general and cancer populations. J Clin Epidemiol. 2007 Nov;60(11):1164–
75.  
39.  Kim H, Park S-M, Jang S-N, Kwon S. Depressive symptoms, chronic medical 
illness, and health care utilization: findings from the Korean Longitudinal Study 
of Ageing (KLoSA). Int Psychogeriatr IPA. 2011 Oct;23(8):1285–93.  
40.  Kessler RC. Epidemiology of women and depression. J Affect Disord. 2003 
Mar;74(1):5–13.  
41.  Benazzi F. Female Depression before and after Menopause. Psychother 
Psychosom. 2000;69(5):280–3.  
42.  Greenfield A, Banerjee S, DePasquale A, Weiss N, Sirey J. Factors Associated 
with Nutritional Risk Among Homebound Older Adults with Depressive 
Symptoms. J Frailty Aging. (In Press);  
43.  Lorant V, Croux C, Weich S, Deliège D, Mackenbach J, Ansseau M. Depression 
and socio-economic risk factors: 7-year longitudinal population study. Br J 
Psychiatry. 2007 Apr 1;190(4):293–8.  
44.  Alegría M, Chatterji P, Wells K, Cao Z, Chen C, Takeuchi D, et al. Disparity in 
Depression Treatment Among Racial and Ethnic Minority Populations in the 
United States. Psychiatr Serv. 2008 Nov 1;59(11):1264–72.  
45.  Pathak J, Wang J, Kashyap S, Basford M, Li R, Masys DR, et al. Mapping 
clinical phenotype data elements to standardized metadata repositories and 
controlled terminologies: the eMERGE Network experience. J Am Med Inform 
Assoc. 2011;18(4):376–386.  
46.  Lin DY, Wei L-J. The robust inference for the Cox proportional hazards model. 
J Am Stat Assoc. 1989;84(408):1074–1078.  
47.  Ialongo NS, Werthamer L, Kellam SG, Brown CH, Wang S, Lin Y. Proximal 
impact of two first-grade preventive interventions on the early risk behaviors 
for later substance abuse, depression, and antisocial behavior. Am J Community 
Psychol. 1999;27(5):599–641.  
48.  Cougle JR, Reardon DC, Coleman PK. Depression associated with abortion and 
childbirth: a long-term analysis of the NLSY cohort. Med Sci Monit. 
2003;9(4):CR105–CR112.  
49.  De Choudhury M, Gamon M, Counts S, Horvitz E. Predicting Depression via 
Social Media. In: ICWSM [Internet]. 2013 [cited 2016 Jul 23]. p. 2. Available 
from: http://course.duruofei.com/wp-
content/uploads/2015/05/Choudhury_Predicting-Depression-via-Social-
Media_ICWSM13.pdf 
50.  Lewis CM, Ng MY, Butler AW, Cohen-Woods S, Uher R, Pirlo K, et al. 
Genome-wide association study of major recurrent depression in the UK 
population. Am J Psychiatry. 2010;167(8):949–957.  
51.  Martin-Sanchez F, Verspoor K, others. Big data in medicine is driving big 
changes. Yearb Med Inform. 2014;9(1):14–20.  
52.  Shapiro M, Johnston D, Wald J, Mon D. Patient-generated health data. White 
Pap Prep Off Policy Plan [Internet]. 2012 [cited 2016 Jul 23]; Available from: 
http://healthitgov.ahrqdev.org/sites/default/files/rti_pghd_whitepaper_april_201
2.pdf 
53.  Gallagher PJ, Castro V, Fava M, Weilburg JB, Murphy SN, Gainer VS, et al. 
Antidepressant response in patients with major depression exposed to NSAIDs: 
a pharmacovigilance study. Am J Psychiatry. 2012 Oct;169(10):1065–72.  
54.  Chekroud AM, Zotti RJ, Shehzad Z, Gueorguieva R, Johnson MK, Trivedi MH, 
et al. Cross-trial prediction of treatment outcome in depression: a machine 
learning approach. Lancet Psychiatry. 2016;  
55.  Banitaan S, Daimi K. Using Data Mining to Predict Possible Future Depression 
Cases. Int J Public Health Sci IJPHS. 2014;3(4):231–240.  
56.  Pathak J, Simon G, Li D, Biernacka JM, Jenkins GJ, Chute CG, et al. Detecting 
Associations between Major Depressive Disorder Treatment and Essential 
Hypertension using Electronic Health Records. AMIA Jt Summits Transl Sci 
Proc AMIA Summit Transl Sci. 2014;2014:91–6.  
57.  Bobo WV, Pathak J, Kremers HM, Yawn BP, Brue SM, Stoppel CJ, et al. An 
electronic health record driven algorithm to identify incident antidepressant 
medication users. J Am Med Inform Assoc JAMIA. 2014 Oct;21(5):785–91.  
 