eCommons

 

TOPICS IN MODERN REGRESSION MODELING

Abstract

In the first part of this work, we propose a novel, efficient sampling method for measurement-constrained data. Under “measurement constraints,” responses are expensive to measure and initially unavailable for most records in the dataset, but the covariates are available for the entire dataset. Our goal is to sample a relatively small portion of the dataset on which the expensive responses will be measured, such that the resulting sampling estimator is statistically efficient. Measurement constraints require that the sampling probabilities depend only on a very small set of the responses. A sampling procedure that uses responses on at most a small pilot sample will be called “response-free.” We propose a response-free sampling procedure (OSUMC) for generalized linear models (GLMs). Under the A-optimality criterion, i.e., the trace of the asymptotic variance, the resulting estimator is statistically efficient within a class of sampling estimators. We establish the unconditional asymptotic distribution of a general class of response-free sampling estimators. This result is novel compared with the existing conditional results obtained by conditioning on both covariates and responses. Under our unconditional framework, the subsamples are no longer independent, and new martingale techniques are developed for our asymptotic theory. We further derive the A-optimal response-free sampling distribution. Since this distribution depends on population-level quantities, we propose the Optimal Sampling Under Measurement Constraints (OSUMC) algorithm to approximate the theoretical optimal sampling. Finally, we conduct an intensive empirical study to demonstrate the advantages of the OSUMC algorithm over existing methods from both statistical and computational perspectives. We find that OSUMC’s performance is comparable to that of sampling algorithms that use complete responses. This shows that, provided an efficient algorithm such as OSUMC is used, there is little or no loss in accuracy due to the unavailability of responses because of measurement constraints.
In the second part of this work, we develop uniform inference methods for the conditional mode based on quantile regression. Specifically, we propose to estimate the conditional mode by minimizing the derivative of the estimated conditional quantile function, defined by smoothing the linear quantile regression estimator, and we develop two bootstrap methods for our conditional mode estimator: a novel pivotal bootstrap and the nonparametric bootstrap. Building on high-dimensional Gaussian approximation techniques, we establish the validity of simultaneous confidence rectangles for the conditional mode constructed from the two bootstrap methods. We also extend the preceding analysis to the case where the dimension of the covariate vector increases with the sample size. Finally, we conduct simulation experiments and a real data analysis using U.S. wage data to demonstrate the finite-sample performance of our inference method. The supplemental materials include the wage dataset, R code, and an appendix containing proofs of the main results, additional simulation results, a discussion of model misspecification and quantile crossing, and additional details of the numerical implementation.
In the third part of this work, we develop a multi-round aggregated one-step estimator and a scalable bootstrap method for distributed sparse least absolute deviation (LAD) regression with high-dimensional covariates. The proposed one-step estimator is based on multi-round distributed quantile regression and linear regression estimators. We derive convergence rates and sparsity properties of the new multi-round estimators and show that our multi-round one-step estimator requires less restrictive sample complexity than one-shot aggregation for valid inference. We also develop a novel pivotal bootstrap for simultaneous inference that is scalable to the distributed setting. Building on high-dimensional Gaussian approximation techniques, we establish the validity of simultaneous confidence rectangles constructed from the pivotal bootstrap. Finally, we conduct numerical experiments using simulated data, which demonstrate encouraging performance of our methods.

Description

228 pages

Date Issued

2022-08

Keywords

Distributed inference; High-dimensional CLT; Mode estimation; Optimal sampling; Pivotal bootstrap; Quantile regression

Committee Chair

Ruppert, David
Kato, Kengo

Committee Member

Wegkamp, Marten H.
Ning, Yang

Degree Discipline

Statistics

Degree Name

Ph. D., Statistics

Degree Level

Doctor of Philosophy

Rights

Attribution 4.0 International

Types

dissertation or thesis
