eCommons

 

Similarity Measures and Anomaly Detection for Mixed Data

dc.contributor.author: Davidow, Matthew
dc.contributor.chair: Matteson, David
dc.contributor.committeeMember: Samorodnitsky, Gennady
dc.contributor.committeeMember: Rand, Richard Herbert
dc.date.accessioned: 2021-03-15T13:31:31Z
dc.date.available: 2021-03-15T13:31:31Z
dc.date.issued: 2020-12
dc.description: 111 pages
dc.description.abstract: We introduce a novel anomaly detection methodology for the unsupervised, mixed data case. The approach uses factor analysis to integrate information from both continuous and categorical variables. Due to the difficulty of unsupervised anomaly detection, we also propose an ensemble methodology to combine the outputs from multiple scoring algorithms. In practice, anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data. We focus on unsupervised detection and the continuous and categorical (mixed) variable case. We show that detecting anomalies in mixed data is enhanced by first embedding the data and then applying an anomaly scoring scheme. We propose a kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space. Practical anomaly detection requires applying numerous approaches due to the inherent difficulty of unsupervised learning. Direct comparison between complex or opaque anomaly detection algorithms is intractable; we instead propose a framework for associating the scores of multiple methods. Our aim is to answer the question: How should one measure the similarity between anomaly scores generated by different methods? The crux of scoring lies in the extremes, which identify the most anomalous observations. A pair of algorithms is defined here to be similar if each assigns its highest scores to roughly the same small fraction of observations. To formalize this, we propose a measure based on extremal similarity in scoring distributions through a novel upper quadrant modeling approach, and contrast it with tail and other dependence measures. We use our similarity method as the first step of an ensemble meta-scorer to combine the outputs of many different anomaly detection scores.
In ecology, many predictions are made about the survival rates of different species under various global warming scenarios. Because of the uncertainty associated with a range of different climate models, the predictions vary greatly, and their outputs call for a reliable interpretation. We propose an interpretable and flexible cosine similarity based method to measure the similarity between geographic range maps that vary depending on the climate model. Using the similarities between range maps, we propose a spectral clustering technique that allows for flexibility in the choice of similarity used between range maps. The clustering results based on the similarity of range maps illustrate that it is crucial to incorporate ecological information to understand the relevant differences between range maps based on different climate models.
dc.identifier.doi: https://doi.org/10.7298/sfej-ge23
dc.identifier.other: Davidow_cornellgrad_0058F_12382
dc.identifier.other: http://dissertations.umi.com/cornellgrad:12382
dc.identifier.uri: https://hdl.handle.net/1813/103214
dc.language.iso: en
dc.subject: anomaly detection
dc.subject: copula
dc.subject: ensemble
dc.subject: spectral clustering
dc.subject: tail dependence
dc.title: Similarity Measures and Anomaly Detection for Mixed Data
dc.type: dissertation or thesis
dcterms.license: https://hdl.handle.net/1813/59810
thesis.degree.discipline: Applied Mathematics
thesis.degree.grantor: Cornell University
thesis.degree.level: Doctor of Philosophy
thesis.degree.name: Ph. D., Applied Mathematics
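
The abstract's working definition of similar scorers (each assigns its highest scores to roughly the same small fraction of observations) can be illustrated with a simple top-fraction overlap. This is only a rough sketch and not the upper quadrant model the dissertation proposes; the function name `top_fraction_overlap`, the `fraction` parameter, and the choice of Jaccard overlap are assumptions made here for illustration.

```python
import numpy as np


def top_fraction_overlap(scores_a, scores_b, fraction=0.05):
    """Jaccard overlap of the observations ranked in the top `fraction`
    by each of two anomaly scorers (higher score = more anomalous).

    Hypothetical helper for illustration; not the dissertation's measure.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("scores must be 1-D arrays of equal length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must lie in (0, 1]")

    # Number of observations treated as "most anomalous" by each scorer.
    k = max(1, int(np.ceil(fraction * a.size)))

    # Indices of the k largest scores under each method.
    top_a = set(np.argsort(-a, kind="stable")[:k].tolist())
    top_b = set(np.argsort(-b, kind="stable")[:k].tolist())

    # 1.0 when both methods flag the same observations, 0.0 when disjoint.
    return len(top_a & top_b) / len(top_a | top_b)
```

Because only the ranking of the extreme scores matters, two scorers on very different numeric scales can still be compared; any monotone rescaling of either score vector leaves the result unchanged.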

Files

Original bundle
Name: Davidow_cornellgrad_0058F_12382.pdf
Size: 7.52 MB
Format: Adobe Portable Document Format