Cornell University

eCommons

Similarity Measures and Anomaly Detection for Mixed Data

File(s)
Davidow_cornellgrad_0058F_12382.pdf (7.52 MB)
Permanent Link(s)
https://doi.org/10.7298/sfej-ge23
https://hdl.handle.net/1813/103214
Collections
Cornell Theses and Dissertations
Author
Davidow, Matthew
Abstract

We introduce a novel anomaly detection methodology for the unsupervised, mixed-data case, in which observations contain both continuous and categorical variables. In practice, anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data. The approach uses factor analysis to integrate information from both variable types. Because unsupervised anomaly detection is difficult, we also propose an ensemble methodology to combine the outputs of multiple scoring algorithms. We show that detecting anomalies in mixed data is enhanced by first embedding the data and then applying an anomaly scoring scheme. We propose a kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space. Practical anomaly detection requires applying numerous approaches because of the inherent difficulty of unsupervised learning. Direct comparison between complex or opaque anomaly detection algorithms is intractable; we instead propose a framework for associating the scores of multiple methods. Our aim is to answer the question: how should one measure the similarity between anomaly scores generated by different methods? The crux of scoring lies in the extremes, which identify the most anomalous observations. We define a pair of algorithms to be similar if each assigns its highest scores to roughly the same small fraction of observations. To formalize this, we propose a measure based on extremal similarity in scoring distributions through a novel upper quadrant modeling approach, and contrast it with tail dependence and other dependence measures. We use our similarity method as the first step of an ensemble meta-scorer that combines the outputs of many different anomaly detection scores.
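As a rough illustration of the informal notion of similarity stated in the abstract (two scorers agree when their highest scores fall on roughly the same small fraction of observations), the following Python sketch computes the Jaccard overlap of the top-scoring sets. This is not the upper quadrant measure proposed in the dissertation; the function name `top_fraction_overlap`, the `fraction` parameter, and the choice of Jaccard overlap are assumptions made only for illustration.

```python
import numpy as np

def top_fraction_overlap(scores_a, scores_b, fraction=0.05):
    """Jaccard overlap of the top-`fraction` observations under two scorers.

    Hypothetical helper: a crude stand-in for an extremal similarity
    measure, not the dissertation's upper quadrant model.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("scores must be 1-D arrays of equal length")
    # Number of observations treated as "most anomalous" (at least one).
    k = max(1, int(np.ceil(fraction * a.size)))
    # Indices of the k largest scores under each scorer.
    top_a = set(np.argsort(-a, kind="stable")[:k].tolist())
    top_b = set(np.argsort(-b, kind="stable")[:k].tolist())
    return len(top_a & top_b) / len(top_a | top_b)
```

Because only the ranks of the largest scores matter, the overlap is unchanged by any strictly increasing transformation of either score vector, which is convenient when different methods report scores on different scales.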
Ecologists make a plethora of predictions about the survival rates of different species under various global warming scenarios. Because of the uncertainty associated with the range of available climate models, these predictions vary greatly, and their outputs call for reliable interpretation. We propose an interpretable and flexible cosine-similarity-based method to measure the similarity between geographic range maps that vary with the climate model. Using the similarities between range maps, we propose a spectral clustering technique that allows flexibility in the choice of similarity between range maps. The clustering results illustrate that it is crucial to incorporate ecological information in order to understand the relevant differences between range maps based on different climate models.
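The idea of comparing range maps by cosine similarity and then clustering them spectrally can be sketched as follows. This is a minimal sketch under stated assumptions, not the dissertation's method: each range map is assumed to be a flattened vector of non-negative cell values (for example, predicted presence probabilities), the function names are hypothetical, and the clustering shown is only a two-way split using the sign of the second eigenvector of the normalized graph Laplacian.

```python
import numpy as np

def cosine_similarity_matrix(maps):
    """Pairwise cosine similarity between flattened range maps.

    `maps` has shape (n_maps, n_cells); entries are assumed non-negative.
    """
    X = np.asarray(maps, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("each map needs at least one nonzero cell")
    U = X / norms
    return U @ U.T

def spectral_bipartition(S):
    """Split items into two groups from a non-negative similarity matrix S.

    Uses the sign of the eigenvector for the second-smallest eigenvalue
    of the symmetric normalized Laplacian I - D^{-1/2} S D^{-1/2}.
    """
    S = np.asarray(S, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(S.shape[0]) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    return (eigvecs[:, 1] > 0).astype(int)
```

Any other symmetric, non-negative similarity matrix could be passed to `spectral_bipartition` in place of the cosine similarity, which mirrors the flexibility in the choice of similarity that the abstract describes. The group labels 0 and 1 are arbitrary, since the sign of an eigenvector is not determined.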

Description
111 pages
Date Issued
2020-12
Keywords
anomaly detection
copula
ensemble
spectral clustering
tail dependence
Committee Chair
Matteson, David
Committee Member
Samorodnitsky, Gennady
Rand, Richard Herbert
Degree Discipline
Applied Mathematics
Degree Name
Ph. D., Applied Mathematics
Degree Level
Doctor of Philosophy
Type
dissertation or thesis
Link(s) to Catalog Record
https://newcatalog.library.cornell.edu/catalog/13312063
