Relatedness inference, pedigree reconstruction, and cancer genomics in large cohorts.

As the cost of collecting genetic information decreases and the size of available genetic datasets increases, tools that efficiently process these datasets become more important. Our goal is to develop multiple algorithms that can extract useful quantities from these large datasets to inform further analyses in medicine and population genetics.

We first present IBIS, an identity-by-descent (IBD) detector we developed that locates long regions of allele sharing between unphased individuals. We compared this algorithm's performance to several contemporary alternatives, some that rely on phasing and some that do not. We determined that, in addition to being comparatively efficient, our algorithm's ability to infer IBD segments ≥7 cM compares favorably to other methods. With these segments, we found that IBIS can classify first- through third-degree relatives in real data at rates meeting or exceeding other methods, and that it identifies fourth- through sixth-degree pairs at rates close to the top methods.

Next, we discuss our algorithm that leverages read data from multiple tumor samples to handle the complex problem of constructing phylogenetic trees that include structural variants (SVs). The algorithm, Meltos, first uses tumor phylogenetic trees built on somatic single nucleotide variants (SNVs) as a scaffold, and then attempts to insert high-confidence SVs to produce a comprehensive lineage tree. We found that using evolutionary constraints on variant allele frequencies, combined with a new probabilistic formula for calculating those frequencies, provides evidence that helps weed out false positive SVs and place true positive SVs into the phylogenetic framework.

Lastly, we return to the goal of working with IBD information. We present our algorithm, PELICAN, which takes the information from IBIS, together with statistical likelihoods for specific second-degree relationships provided by CREST (Qiao, Sannerud, et al. 2021), to quickly and accurately infer pedigrees from large genome datasets. We use a combination of likelihoods and biological constraints to perform a backtracking search that exhaustively checks the entire set of possible pedigrees for those with the highest likelihood, and PELICAN is able to do so at speeds comparable to other state-of-the-art algorithms.
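As a rough illustration of how detected IBD segments can translate into a relatedness degree, the sketch below sums segment lengths into a kinship coefficient and bins it with the commonly used power-of-two kinship thresholds. This is not the IBIS implementation: the function names, the `(length_cm, ibd_type)` segment representation, and the 3500 cM genome length are assumptions made for the example, and only the 7 cM minimum segment length is taken from the abstract.

```python
# Hypothetical sketch: NOT the IBIS code. Names and data layout are assumptions.

def kinship_from_segments(segments, genome_cm=3500.0, min_cm=7.0):
    """Estimate a kinship coefficient from IBD segments.

    segments: iterable of (length_cm, ibd_type) with ibd_type 1 or 2.
    genome_cm: assumed total genetic map length (illustrative value).
    min_cm: segments shorter than this are ignored.
    """
    if genome_cm <= 0:
        raise ValueError("genome_cm must be positive")
    ibd1 = sum(length for length, kind in segments if kind == 1 and length >= min_cm)
    ibd2 = sum(length for length, kind in segments if kind == 2 and length >= min_cm)
    # kinship = (IBD1 fraction) / 4 + (IBD2 fraction) / 2
    return (ibd1 / 4.0 + ibd2 / 2.0) / genome_cm


def degree_from_kinship(phi, max_degree=6):
    """Bin a kinship coefficient into a relatedness degree.

    Degree d covers kinship in [2**-(d + 1.5), 2**-(d + 0.5)).
    Returns 0 for duplicate or identical-twin pairs and None for pairs
    below the max_degree threshold.
    """
    if phi >= 2 ** -1.5:
        return 0
    for degree in range(1, max_degree + 1):
        if phi >= 2 ** -(degree + 1.5):
            return degree
    return None


# Idealized full siblings: half the genome IBD1, a quarter IBD2.
full_sibs = [(1750.0, 1), (875.0, 2)]
# Idealized first cousins: a quarter of the genome IBD1.
first_cousins = [(875.0, 1)]

sib_degree = degree_from_kinship(kinship_from_segments(full_sibs))
cousin_degree = degree_from_kinship(kinship_from_segments(first_cousins))
```

Real pairs deviate from these idealized sharing fractions, which is one reason classification accuracy drops for more distant degrees.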

190 pages


Algorithm; Identical by Descent; Pedigrees; Phylogenetic trees; Population Genetics


Committee Chair

Williams, Amy L.

Committee Member

Mezey, Jason G.
Clark, Andrew

Degree Discipline

Computational Biology

Degree Name

Ph. D., Computational Biology

Degree Level

Doctor of Philosophy

dissertation or thesis
