Outlier Detection and Multicollinearity in Sequential Variable Selection: A Least Angle Regression-Based Approach
Kirtland, Kelly Meredith
Although lasso regression has become a highly popular tool for variable selection in high-dimensional data, diagnostic methods have not kept pace. The primary difficulty of outlier detection in high-dimensional data is the inability to examine all subspaces, either simultaneously or sequentially. I explore the impact of outliers on lasso variable selection and penalty parameter estimation, and propose a tree-like outlier nominator based on the least angle regression (LARS) algorithm. The least angle regression outlier nomination (LARON) algorithm tracks variable selection paths and prediction summaries for the original data set and for subsets obtained by removing potential outliers. This provides visual insight into the effect of specific points on lasso fits while allowing a data-directed exploration of various subspaces. Simulation studies indicate that LARON is generally more powerful at detecting outliers than standard diagnostics applied post hoc to a fitted lasso model. One reason for this improvement is that observations with unusually high influence can inflate the penalty parameter and result in a severely underfit model. I examine this result both through simulations and theoretically, using a lasso homotopy adapted for online observations. Additionally, LARON is able to explore multiple subspaces, whereas post hoc diagnostics rely on a variable selection that has already occurred, possibly under the influence of an unusual observation. However, LARON underperforms random nomination when attempting to detect high-leverage, non-influential points lying in minor eigenvalue directions in high-dimensional settings. This lack of detection appears to stem from the robustness of the lasso's variable selection process to such points. A new R package implementing the LARON algorithm is presented, and its ability to detect multicollinearity in the data, even when masked by high-leverage points, is described.
This package is then used to analyze simulated data and several real data sets.
Statistics; LARS; lasso; multicollinearity; outlier nomination; sequential; variable selection
Velleman, Paul F
Wells, Martin Timothy; Hooker, Giles J.
Ph. D., Statistics
Doctor of Philosophy
Attribution-NonCommercial-ShareAlike 4.0 International
dissertation or thesis