Show simple item record

dc.contributor.author  Jin, Ze
dc.date.accessioned  2018-10-23T13:34:51Z
dc.date.available  2018-10-23T13:34:51Z
dc.date.issued  2018-08-30
dc.identifier.other  Jin_cornellgrad_0058F_11115
dc.identifier.other  http://dissertations.umi.com/cornellgrad:11115
dc.identifier.other  bibid: 10489773
dc.identifier.uri  https://hdl.handle.net/1813/59677
dc.description.abstract  My PhD research focuses on measuring and testing mutual dependence and conditional mean dependence, and on applying these measures to machine learning problems, as elaborated in the following four chapters.

Chapter 1 – We propose three new measures of mutual dependence between multiple random vectors. Each measure is zero if and only if the random vectors are mutually independent. The first generalizes distance covariance from pairwise dependence to mutual dependence, while the other two are sums of squared distance covariances. The proposed measures share similar properties and asymptotic distributions with distance covariance, and capture non-linear and non-monotone mutual dependence between the random vectors. Inspired by complete and incomplete V-statistics, we define empirical and simplified empirical measures as a trade-off between computational complexity and statistical power when testing mutual independence. The implementation of the corresponding tests is demonstrated by both simulation results and real data examples (see the illustrative sketch after this record).

Chapter 2 – We apply both distance-based and kernel-based mutual dependence measures to independent component analysis (ICA), and generalize dCovICA to MDMICA, minimizing empirical dependence measures as an objective function in both deflation and parallel manners. To solve this minimization problem, we introduce Latin hypercube sampling (LHS) and a global optimization method, Bayesian optimization (BO), to improve the initialization of the Newton-type local optimization method. The performance of MDMICA is evaluated in various simulation studies and an image data example. When the ICA model is correct, MDMICA achieves results competitive with existing approaches. When the ICA model is misspecified, the independent components estimated by MDMICA are less mutually dependent than the observed components, whereas those estimated by other approaches are prone to be even more mutually dependent than the observed components.

Chapter 3 – Independent component analysis (ICA) decomposes multivariate data into mutually independent components (ICs). The ICA model is subject to the constraint that at most one of these components is Gaussian, which is required for model identifiability. Linear non-Gaussian component analysis (LNGCA) generalizes the ICA model to a linear latent factor model with any number of both non-Gaussian components (signals) and Gaussian components (noise), where observations are linear combinations of independent components. Although the individual Gaussian components are not identifiable, the Gaussian subspace is. We introduce an estimator, along with its optimization approach, in which non-Gaussian and Gaussian components are estimated simultaneously: the discrepancy of each non-Gaussian component from Gaussianity is maximized while the discrepancy of each Gaussian component from Gaussianity is minimized. When the number of non-Gaussian components is unknown, we develop a statistical test to determine it based on resampling and the discrepancy of estimated components. Through a variety of simulation studies, we demonstrate the improvements of our estimator over competing estimators, and we illustrate the effectiveness of our test for determining the number of non-Gaussian components. Further, we apply our method to real data examples and show its practical value.

Chapter 4 – A crucial problem in statistics is deciding whether additional variables are needed in a regression model. We propose a new multivariate test of the conditional mean independence of Y given X conditioning on some known effect Z, i.e., E(Y|X,Z) = E(Y|Z). Assuming that E(Y|Z) and Z are linearly related, we reformulate an equivalent notion of conditional mean independence through a transformation, which is approximated in practice. We apply the martingale difference divergence (MDD) to measure conditional mean dependence, and show that the estimation error from the approximation is negligible, as it has no impact on the asymptotic distribution of the test statistic under some regularity assumptions. The implementation of our test is demonstrated by both simulations and a financial data example.
dc.language.iso  en_US
dc.rights  Attribution 4.0 International
dc.rights.uri  https://creativecommons.org/licenses/by/4.0/
dc.subject  Statistics
dc.subject  Mathematics
dc.subject  Computer science
dc.subject  conditional mean independence
dc.subject  independent component analysis
dc.subject  linear non-Gaussian component analysis
dc.subject  multivariate analysis
dc.subject  mutual independence
dc.subject  V-statistics
dc.title  Measuring Statistical Dependence and Its Applications in Machine Learning
dc.type  dissertation or thesis
thesis.degree.discipline  Statistics
thesis.degree.grantor  Cornell University
thesis.degree.level  Doctor of Philosophy
thesis.degree.name  Ph. D., Statistics
dc.contributor.chair  Matteson, David
dc.contributor.committeeMember  Ruppert, David
dc.contributor.committeeMember  Weinberger, Kilian Quirin
dcterms.license  https://hdl.handle.net/1813/59810
dc.identifier.doi  https://doi.org/10.7298/X4FT8J98
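
The mutual dependence measures described in Chapter 1 of the abstract above build on distance covariance. The minimal Python sketch below shows the standard double-centering computation of the empirical (squared) distance covariance, an all-pairs sum of squared distance covariances as one illustrative combination, and a simple permutation test of mutual independence. The function names, the all-pairs form, and the permutation scheme are assumptions made for illustration only; they are not the specific estimators or tests defined in the thesis.

import numpy as np
from scipy.spatial.distance import cdist

def dcov2(x, y):
    """V-statistic estimate of the squared distance covariance between x and y.

    x, y: arrays with n rows (samples); 1-D inputs are treated as n x 1.
    Illustrative sketch only, not the estimator defined in the thesis.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    a = cdist(x, x)  # pairwise Euclidean distances within x
    b = cdist(y, y)  # pairwise Euclidean distances within y
    # Double-center each distance matrix: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    return (A * B).mean()

def pairwise_mutual_dependence(components):
    """Sum of squared distance covariances over all pairs of components.

    `components` is a list of arrays sharing the same number of rows. This
    all-pairs form is only one illustrative combination; the thesis defines
    its own specific mutual dependence measures.
    """
    d = len(components)
    return sum(dcov2(components[i], components[j])
               for i in range(d) for j in range(i + 1, d))

def permutation_test(components, n_perm=200, seed=0):
    """Simple permutation test of mutual independence (illustrative).

    Each component's rows are permuted independently to emulate the null of
    mutual independence; the p-value is the fraction of permuted statistics
    at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    components = [np.asarray(c) for c in components]
    n = components[0].shape[0]
    observed = pairwise_mutual_dependence(components)
    null_stats = []
    for _ in range(n_perm):
        shuffled = [c[rng.permutation(n)] for c in components]
        null_stats.append(pairwise_mutual_dependence(shuffled))
    p_value = (1 + sum(s >= observed for s in null_stats)) / (1 + n_perm)
    return observed, p_value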

