Similarity Measures and Anomaly Detection for Mixed Data

We introduce a novel methodology for unsupervised anomaly detection in mixed data, that is, data containing both continuous and categorical variables. The approach uses factor analysis to integrate information from both variable types. Because unsupervised anomaly detection is difficult, we also propose an ensemble methodology that combines the outputs of multiple scoring algorithms. In practice, anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data. We show that anomaly detection in mixed data is enhanced by first embedding the data and then applying an anomaly scoring scheme. We propose a kurtosis-weighted Factor Analysis of Mixed Data to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space. Practical anomaly detection requires applying numerous approaches because of the inherent difficulty of unsupervised learning. Direct comparison between complex or opaque anomaly detection algorithms is intractable; we instead propose a framework for associating the scores of multiple methods. Our aim is to answer the question: how should one measure the similarity between anomaly scores generated by different methods? The crux of scoring lies in the extremes, which identify the most anomalous observations. We define a pair of algorithms to be similar if each assigns its highest scores to roughly the same small fraction of observations. To formalize this, we propose a measure based on extremal similarity in scoring distributions through a novel upper quadrant modeling approach, and contrast it with tail dependence and other dependence measures. We use our similarity method as the first step of an ensemble meta-scorer that combines the outputs of many different anomaly detection scores.
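The notion of similarity described above, two scorers agreeing on which small fraction of observations is most anomalous, can be illustrated with a simple set overlap. The sketch below is only a naive baseline and is not the upper quadrant model proposed in the dissertation; the function name `top_fraction_overlap`, the `fraction` parameter, and the use of the Jaccard index are assumptions made for illustration.

```python
def top_fraction_overlap(scores_a, scores_b, fraction=0.05):
    """Jaccard overlap between the top-scored observations of two scorers.

    Hypothetical helper: a naive stand-in for an extremal similarity
    measure, not the dissertation's upper quadrant approach.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must score the same observations")
    n = len(scores_a)
    if n == 0:
        raise ValueError("at least one observation is required")
    # Number of observations treated as "most anomalous" (at least one).
    k = max(1, int(round(fraction * n)))

    def top_indices(scores):
        # Indices of the k largest scores; ties keep their original order.
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    top_a, top_b = top_indices(scores_a), top_indices(scores_b)
    return len(top_a & top_b) / len(top_a | top_b)


# Two scorers on different scales; only the ranking of the extremes matters.
scores_a = [0.10, 0.20, 0.90, 0.30, 0.95, 0.05, 0.40, 0.20, 0.10, 0.30]
scores_b = [5, 1, 40, 2, 38, 3, 1, 2, 4, 50]
overlap = top_fraction_overlap(scores_a, scores_b, fraction=0.2)
```

Because the overlap depends only on which observations rank highest, it is unaffected by the scale of each scorer's output, which is the property that makes scores from opaque methods comparable at all.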
In ecology, many predictions are made about the survival rates of different species under various global warming scenarios. Because of the uncertainty associated with the range of climate models, these predictions vary greatly, and their outputs call for a reliable interpretation. We propose an interpretable and flexible method based on cosine similarity to measure the similarity between geographic range maps that vary with the climate model. Using these similarities, we propose a spectral clustering technique that allows flexibility in the choice of similarity between range maps. The clustering results illustrate that it is crucial to incorporate ecological information to understand the relevant differences between range maps based on different climate models.
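As a minimal sketch of the cosine similarity idea, each range map can be treated as a grid of cell values (for example, predicted presence or suitability) and flattened into a vector. The grid representation, the function names, and the handling of empty maps are assumptions for illustration; the dissertation's actual construction, including how ecological information enters the similarity, may differ.

```python
import math


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two range maps given as equal-sized grids."""
    flat_a = [value for row in map_a for value in row]
    flat_b = [value for row in map_b for value in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("range maps must cover the same grid")
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty map is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(maps):
    """Pairwise cosine similarities, usable as a precomputed affinity."""
    return [[cosine_similarity(a, b) for b in maps] for a in maps]


# Three toy 2x2 presence maps, standing in for different climate models.
north = [[1, 1], [0, 0]]
north_west = [[1, 0], [0, 0]]
south = [[0, 0], [1, 1]]
affinity = similarity_matrix([north, north_west, south])
```

A matrix like `affinity` is the kind of input a spectral clustering routine accepts as a precomputed similarity, which is what allows the choice of similarity to be swapped without changing the clustering step.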

111 pages


anomaly detection; copula; ensemble; spectral clustering; tail dependence


Committee Chair

Matteson, David

Committee Member

Samorodnitsky, Gennady
Rand, Richard Herbert

Degree Discipline

Applied Mathematics

Degree Name

Ph. D., Applied Mathematics

Degree Level

Doctor of Philosophy

dissertation or thesis
