Similarity Measures and Anomaly Detection for Mixed Data

We introduce a novel methodology for unsupervised anomaly detection in mixed data, that is, data with both continuous and categorical variables. The approach uses factor analysis to integrate information from both variable types. Because unsupervised anomaly detection is difficult, we also propose an ensemble methodology that combines the outputs of multiple scoring algorithms. In practice, anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data. We show that detecting anomalies in mixed data is enhanced by first embedding the data and then applying an anomaly scoring scheme. We propose a kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space. Practical anomaly detection requires applying numerous approaches because of the inherent difficulty of unsupervised learning. Direct comparison between complex or opaque anomaly detection algorithms is intractable; we instead propose a framework for associating the scores of multiple methods. Our aim is to answer the question: how should one measure the similarity between anomaly scores generated by different methods? The crux of scoring is in the extremes, which identify the most anomalous observations. We define a pair of algorithms to be similar if each assigns its highest scores to roughly the same small fraction of observations. To formalize this, we propose a measure based on extremal similarity in scoring distributions through a novel upper quadrant modeling approach, and contrast it with tail and other dependence measures. We use our similarity measure as the first step of an ensemble meta-scorer that combines many different anomaly detection scores.
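The working definition of similarity in the abstract (two scorers are similar if their highest scores fall on roughly the same small fraction of observations) can be illustrated with a simple set overlap. The sketch below is not the upper quadrant modeling approach proposed in the thesis; it is a plain Jaccard overlap of the top-scoring sets, and the function name `top_fraction_overlap` and the parameter `fraction` are hypothetical choices made for illustration.

```python
def top_fraction_overlap(scores_a, scores_b, fraction=0.05):
    """Jaccard overlap of the top-scoring observations of two scorers.

    Illustrative stand-in only: this is NOT the thesis's upper quadrant
    measure. Higher scores are assumed to mean "more anomalous".
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must score the same observations")
    n = len(scores_a)
    # Number of observations treated as "most anomalous" (at least one).
    k = max(1, int(round(fraction * n)))

    def top_indices(scores):
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    top_a, top_b = top_indices(scores_a), top_indices(scores_b)
    # 1.0 means identical top sets, 0.0 means disjoint top sets.
    return len(top_a & top_b) / len(top_a | top_b)
```

Because only the ranking of each scorer enters the computation, the two score vectors may be on entirely different scales. A measure of this kind ignores how far into the tail each observation lies, which is one reason a model-based extremal measure may be preferable.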
Ecologists make many predictions about the survival rates of different species under various global warming scenarios. Because of the uncertainty associated with the range of available climate models, these predictions vary greatly and call for a reliable interpretation. We propose an interpretable and flexible cosine-similarity-based method to measure the similarity between geographic range maps that vary with the climate model. Using these similarities, we propose a spectral clustering technique that allows flexibility in the choice of similarity between range maps. The clustering results illustrate that it is crucial to incorporate ecological information in order to understand the relevant differences between range maps based on different climate models.
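As a rough illustration of the pipeline described in this paragraph, the sketch below computes pairwise cosine similarities between range maps and then splits the maps into two groups with a basic spectral bipartition. It assumes each range map is a non-negative grid (for example presence/absence per cell) with at least one occupied cell; the function names are hypothetical, and the two-way split via the normalized Laplacian is a textbook simplification, not necessarily the clustering procedure used in the thesis.

```python
import numpy as np

def cosine_similarity_matrix(range_maps):
    """Pairwise cosine similarity between flattened range maps."""
    X = np.asarray(range_maps, dtype=float).reshape(len(range_maps), -1)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard: an all-zero map gets similarity 0
    U = X / norms
    return U @ U.T

def spectral_bipartition(similarity):
    """Split items into two clusters using the sign of the Fiedler vector
    of the symmetric normalized graph Laplacian."""
    S = np.asarray(similarity, dtype=float)
    degree = np.maximum(S.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1] * d_inv_sqrt    # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# Toy example: four hypothetical 2x3 presence/absence maps.
maps = [
    [[1, 1, 0], [1, 0, 0]],
    [[1, 1, 0], [1, 1, 0]],
    [[0, 0, 1], [0, 0, 1]],
    [[0, 0, 1], [0, 1, 1]],
]
S = cosine_similarity_matrix(maps)
labels = spectral_bipartition(S)
```

Since the clustering step only consumes the similarity matrix, any other similarity between range maps can be substituted for the cosine similarity without changing `spectral_bipartition`.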
111 pages
anomaly detection; copula; ensemble; spectral clustering; tail dependence
Committee Chair
Matteson, David
Committee Member
Samorodnitsky, Gennady
Rand, Richard Herbert
Degree Discipline
Applied Mathematics
Degree Name
Ph. D., Applied Mathematics
Degree Level
Doctor of Philosophy
dissertation or thesis