THE POPULATION GENETIC VARIATION OF INTERSPERSED AND TANDEM REPEATS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Iskander Said August 2023 © 2023 Iskander Gaizka Said THE POPULATION GENETIC VARIATION OF INTERSPERSED AND TANDEM REPEATS Iskander Gaizka Said, Ph.D. Cornell University 2023 Despite making up large and essential portions of eukaryotic genomes repeat DNA has largely eluded study due to limitations in alignment and assembly of Next Generation short-read sequencing. Standard technologies and methods are largely inadequate to study the abundance and variation of repeat sequences at the population scale. Therefore, I have developed and applied new methods and algorithms to study the population variation of interspersed and tandem repeats, specifically transposable elements and simple satellites. Firstly, using hierarchical clustering and population genetic theory I have developed a method to infer transposable element clades and their age, circumventing difficult problems of phasing and mapping. This method resulted in a discovery that host piRNA regulatory mechanisms are turning over to regulate newly emerging transposable element variants shedding light on an evolutionary arms race. Secondly, I applied k-mer counting methods to human population genomics data to quantify abundance and variation of simple satellites discovering undescribed telomeric and centromeric satellites. I then applied a mixed modelling framework to find associations between centromeric ancestry and simple satellite abundances which lead to the discovery of an expansion of a centromeric satellite copy number in a cluster of African and Latin American individuals with shared centromeric ancestry. Overall, this dissertation represents an analysis framework for the development of computational strategies to mine population genomics data for repeat variation. The approaches I have developed and deployed have allowed me to make inferences about repeat abundance and variation in large population genomic datasets that were previously unfeasible. iii BIOGRAPHICAL SKETCH Iskander was born in Pasadena, California at an extremely young age unable to walk or speak. Thanks to the care of his parents and grandmother he was able to overcome these impairments and quickly gained ambulatory and verbal capabilities -- a development that his family would eventually regret. After eighteen years of finding ways to avoid doing schoolwork he decided to go to the University of California, Santa Cruz to pursue his dreams of being a lay- about and writer. During his first year of university by happenstance he took a course on human genomics, which caused the first of many existential career crises. The result of this crisis was the switching of his major from Literature to Molecular, Cell and Development Biology and eventually picking up a minor in Bioinformatics. Iskander’s first foray into research began as an undergraduate in the lab of Professor Bill Saxton. Under the tutelage of Inna Djagaeva, the lab manager, he was taught everything he knows about pushing flies and molecular biology. That initial exposure to research really fomented the passion for hands-on learning and discovery that Iskander would carry through the rest of his career. Iskander’s education in evolutionary biology began after he graduated from university and began work as a lab technician for Professor Russell Corbett-Detig. There he learned population genetics and genomics and truly began to appreciate evolution. Through the encouragement of Russ and his peers he went to Cornell University for his Ph.D. in Genetics, Genomics and Development in 2018. In his first year he was awarded the National Science Foundation Graduate Research Fellowship fully funding his Ph.D. for three years and joined the labs of Andrew Clark and Daniel Barbash soon after. He quickly fell in love with the weird, wild and untamed world of repetitive DNA with a particular fascination in understanding the evolution of transposable elements and using bioinformatics and population genetics to do so. iv DEDICATION inch by inch the tiny snail climbs mt fuji v ACKNOWLEDGEMENTS Firstly, I want to thank my mother, Marta Rosales, and father, Haroon Said, for always supporting me and giving me the courage to move forward even when things seem bleak. I want to thank my brother from another mother, Alex, for always having his heart open to me and being the brother I never had. I would like to thank my advisors Andy Clark and Dan Barbash for being mentors in science, careers and life. Andy, the summer sails, and lake house lab retreats were some of my fondest memories of graduate school. Thank you for your haikus and nuggets of wisdom. Dan, I often chuckle fondly thinking of your party where Lenny ran around with his pants off and smacked Dan Z. in the goolies. Thank you for your sardonic humor and sharp wit. Thank you as well to my lab mates in the Clark and Barbash lab who have been incredible fonts of knowledge and juvenile humor. I owe many of you debts of gratitude for helping me with complex problems and giving me feedback when I’ve been stuck. In particular, I need to thank Mike McGurk who I spent hours with discussing Bayesian statistics and the finer points of vegetarian meat substitutes. Your scientific rigor and attention to detail is something that I attempt to emulate every time I perform an analysis. Thank you for taking the time all those years ago to discuss TEs with me. To Elissa Cosgrove, thank you for your massive intellect that is only dwarfed by your enormous heart. You have always kept an open door for me when I repeatedly needed help using the computational resources or deploying some statistical model. Without you I would never have been able to finish my thesis. And thank you for your aviations as dangerous as they might be. vi Thank you to my friends. There are many of you and I may not name you all, but just know that you are all loved dearly. To my first DnD group in grad school: Melissa, Michelle, Roman, Victoria and our later additions Ian and Manisha, you have been incredible sources of joy and laughter during the uneasiness of grad school. Thank you for spending countless afternoons in my studio apartment laced with the smell of expo marker and coffee as we ate Chinese food and went on misadventures. You are not just my adventuring party, but also my lifelong friends. To Melissa, you are the sun, the moon and all of the stars. You are the light at daybreak. You are the first warn day of spring. Thank you for being with me always. For sticking with me through the hardest times. For being patient with me when no one else will. For sharing your life with me – the good and the bad. For your warmth, your laughter, and your smile. Thank you for being you. To everyone who helped me, I love you all. vii TABLE OF CONTENTS Biographical Sketch iii Dedication iv Acknowledgements v Chapter 1: An introduction to the evolution of genomic repeats. 1 Transposable element and host population dynamics as an arms race 3 Evolutionary models of satellites 10 Problems in repeat genomics and specific solutions 12 Overview of dissertation 14 Works Cited 16 Chapter 2: Patterns of piRNA regulation in Drosophila revealed through transposable element clade inference 24 Abstract 25 Introduction 26 New approaches 29 Results 34 Discussion 53 Methods 59 Acknowledgements 79 Works Cited 80 Chapter 3: The structure of simple satellite variation in the human genome and its correlation with centromere ancestry 87 Abstract 88 Introduction 89 Results 93 Discussion 116 Methods 125 Acknowledgements 135 Works Cited 136 Chapter 4: Conclusions and the future of repeat genomics. 143 Works Cited 146 1 Chapter 1: An introduction to the evolution of genomic repeats. Since the discovery of DNA as the information molecule of life it has been predicted that the organismal genome should be primarily composed of DNA encoding essential components for an organism's survival and development. Advancements in genetics and evolutionary theory put this idea on notice, but it was not until the advent of genomics and the sequencing of the human genome that this theory was ultimately shown to be incorrect. It was discovered that only 1-2% of the human genome is composed of DNA coding for proteins (Lander et al. 2001; Piovesan et al. 2019). The vast majority of the human genome is non-coding and although some non-coding DNA is essential for regulatory functions, much of it is large stretches of heterochromatic and transcriptionally silent repetitive DNA with no obvious function (Ecker et al. 2012; ENCODE Project Consortium 2012; Nurk et al. 2022). The amount of repetitive DNA varies across species with humans having ~60% repetitive DNA, while some organisms have >90% repetitive DNA (SanMiguel et al. 1996; Nurk et al. 2022). Repetitive DNA has often been ascribed as “junk”, evolutionary detritus that accumulates over time – like waves washing up trash to the shore (Ohno 1972; Palazzo and Gregory 2014). The discovery that the overwhelming abundance of the genome is non-coding “junk” has prompted a new synthesis of the evolutionary theories that govern the genome. The first evolutionary theories of “junk” came long before the sequencing of the human genome. Under the “neutral” and “nearly-neutral” theories of evolution the abundance of non- coding DNA can be explained by the fact that most changes in the genome will have a neutral or nearly-neutral effect on organismal fitness (Ohta 1973; Kimura 1979; Ohta 1992). In this way neutral polymorphisms and repetitive DNA can accumulate and dissipate in the genome with little consequence for the organism. However, the proliferation of repetitive DNA may additionally be driven by selfishness. The selfish DNA concept proposes that a DNA need not be beneficial, or even neutral, to the organismal fitness for it to exist. It simply needs to be able to replicate and proliferate in such a way to ensure its inheritance into the subsequent 2 generation (Doolittle and Sapienza 1980; Orgel and Crick 1980). From this synthesis, the existence of repetitive DNA can be explained as either nearly-neutral accretion or through an active selfish and parasitic process. Which mode governs the evolution of repeats depends on the class of repeat. There are two major classes of repetitive DNA: interspersed repeats and tandem repeats. Interspersed repeats are primarily transposable elements (TEs), which are mobile genetic elements that replicate and disperse throughout the genome and are often characterized as genetic parasites. TEs are autonomous, or semi-autonomous entities, with domains that encode proteins necessary for their replication and internal promoters. They are diverse and can be subdivided as DNA or RNA elements depending on the molecular intermediate used during replication. DNA TEs typically encode a transposase enzyme that can “cut and paste” a TE sequence from one portion of the genome to another. RNA TEs instead replicate through a reverse transcriptase that reverse transcribes the TE mRNA intermediate into cDNA that is then incorporated into the host genome. TEs may be further subdivided by family using phylogenetic and sequence-based characteristics, and variation within families are denominated as subfamilies (Wells and Feschotte 2020). TEs were first discovered through cytological and genetic experimentation in maize and since then have been appreciated as a major component of the genome and their movement can make them a significant evolutionary force (McCLINTOCK 1950). The second class of repeats are tandem repeats, arrays of a tandemly repeating motif, or monomer. Tandem repeats can be divided into microsatellites which are short arrays, usually a few hundred basepairs long, of small (1-6 basepair) monomers, or satellite arrays which have long (exceeding 1 Mb) arrays and may have small (simple), or long (complex) monomers. Repeat-length expansion of microsatellites have been implicated in numerous human diseases including Fragile-X syndrome and Friedreich's Ataxia (Brouwer et al. 2009; Neil et al. 2021). Satellite arrays of both the simple and complex type make up large stretches of the genome and 3 some such as the Alpha Satellite in mammals have critical functions. Alpha Satellite is particularly interesting because it is an essential component of the centromere that recruits the kinetochore for cell division (Lee et al. 1997; Ohzeki et al. 2002; Okada et al. 2007). However, most satellite arrays have unknown functions, if any. Because of this the accretion of most satellite arrays in the genome is proposed to occur in a nearly-neutral fashion. However, there is growing evidence that calls the neutrality of satellites into question. Satellites may be modifying the fitness of the organism in a way not directly related to a singular function. Constitutive heterochromatin can modify the epigenetic landscape and change gene expression in trans. This effect can be seen most clearly in the effect of repeat content in the Y- chromosome of D. melanogaster modifying phenotypes and sequestering heterochromatic proteins (Dimitri and Pisano 1989; Berloco et al. 2014; Kelsey and Clark 2017; Delanoue et al. 2023 Jun 12). Therefore, these trans-acting effects necessitate a reinterpretation of the selective forces acting on satellites. In a similar vein, TE proliferation is selfish, but a great deal of genomic novelty and function is attributable to TEs as well (Chuong et al. 2017; Cosby et al. 2019). Therefore, thinking of TEs as a purely negative force is incorrect. Additionally, satellite arrays and TEs seem to diverge quite rapidly between species and are capable of generating species barriers (Bayes and Malik 2009; Jurka et al. 2011). Therefore, although repetitive elements may proliferate selfishly, or aggregate in a nearly neutral fashion they are fundamental components of the genome. Repeats are implicated in such a multitude of evolutionary processes that considering them as mere “junk” is inaccurate. A new synthesis of the evolutionary theory of repeats should incorporate the complex ways that repeats shape the evolution of genomes and modify our thinking to consider repeats as weird and wild biological entities. Transposable element and host population dynamics as an arms race 4 Transposable elements (TEs) are highly prolific, inhabiting nearly all eukaryotes and make up large portions of the genome: ~20% in Drosophila melanogaster, ~50% in humans, and up to 90% in Zea mays (SanMiguel et al. 1996; Adams et al. 2000; Nurk et al. 2022). TEs are considered selfish elements because their proliferation in the genome is often at the expense of the host’s organismal fitness. Strong effects on host fitness are best exemplified by P-element-induced hybrid dysgenesis in D. melanogaster. Progeny of D. melanogaster males containing the P-element TE mated to females lacking P-elements have rampant mobilization and transposition of P-elements in the germline, which leads to gonadal atrophy and sterility (Kidwell 1985). A nearly identical mechanism of hybrid dysgenesis is seen in the I-element TE in D. melanogaster as well (Olovnikov et al. 2013; Ryazansky et al. 2017). Although most TE insertions are deleterious, some TE insertions can be adaptive to the host organism. The rapid rates of transposition, relative to nucleotide mutations, make TEs an excellent substrate for genomic novelty and adaptation (Adrion et al. 2017; Feusier et al. 2019). In natural populations of D. melanogaster there is an astonishing rate of adaptive TE insertions under positive selection, likely fueling adaptation of the species during its migration out of Africa and into more temperate climates as well as pesticide resistance (Aminetzach et al. 2005; González et al. 2008). In an experimental example , a P-element insertion was found to be under positive selection in a population of D. melanogaster that had P-elements introduced as an invading TE (Wang et al. 2023). Beyond fuelling adaptation, TEs have become inextricably linked to their hosts, for example being co-opted by mammalian genomes to play integral roles in development, as well performing necessary functions such as maintenance of D. melanogaster telomeres (Cosby et al. 2019). However, given that TE insertions can account for massive proportions of the genome, it is highly unlikely that all TE insertions are either strongly deleterious or under positive selection. In fact, the fitness effects of TEs are primarily modeled as weakly deleterious. That is, each insertion itself has a nearly neutral effect on host fitness but effects are cumulative, only strongly 5 affecting the host’s fitness at a high enough copy number. The primary models of cumulative weakly deleterious effects are the synergistic and additive models of fitness. Under synergistic epistasis, the deleterious effects on host fitness increase non-linearly with the number of TE insertions, while the additive model is linear (Charlesworth and Charlesworth 1983; Langley et al. 1983; Charlesworth and Langley 1989). The synergistic model is a preferred explanation over the linear additive model as it is thought to most accurately model the mechanisms by which TE insertions affect host fitness, which are ectopic recombination and changes in chromatin landscapes (Langley et al. 1988; Lee and Langley 2010; Lee 2015). Additionally, there is strong empirical evidence of synergistic epistasis of TE insertions in D. melanogaster populations (Lee 2022). Ectopic recombination is a mechanism by which recombination occurs between two heterologous loci, in this case TE insertions, leading to deleterious structural rearrangements (Davis et al. 1987; Kupiec and Petes 1988; Montgomery et al. 1991; Lim and Simmons 1994; Mieczkowski et al. 2006). The probability of an ectopic recombination event will increase with the square of the number of TE insertions as there is a probability of recombination between any pair of TE insertions (Montgomery et al. 1987; Langley et al. 1988). TE-induced changes in the chromatin landscape is a mechanism more recently implicated in reducing host fitness. Under this model, epigenetic silencing of TEs through host regulatory mechanisms can spread to nearby genes and reduce organismal fitness (Lee and Langley 2010; Lee 2015; Choi and Lee 2020). These two mechanisms are not mutually exclusive and there is evidence that both play a significant role. Synergistic epistasis paired with intergenerational transposition of TEs leads to an equilibrium “steady-state” of TE copy number known as “transposition-selection” balance, where TE insertions are kept at an equilibrium by selection (Charlesworth and Charlesworth 1983; Langley et al. 1983; Charlesworth and Langley 1989). For selection to control the copy number of TEs, transposition rate must be sufficiently low to not outpace the purging via selection 6 (Charlesworth and Langley 1989). However, TEs with transposition rates low enough to be balanced by selection are very unlikely to overcome drift when first invading a population (Le Rouzic and Capy 2005; Le Rouzic et al. 2007; Groth and Blumenstiel 2017). Therefore, recent models propose that TEs must have a high initial transposition rate, that then decreases at equilibrium, likely through host regulation, allowing for a “steady-state” and preservation of the host population (Le Rouzic and Capy 2005; Groth and Blumenstiel 2017; Kofler 2019; Tomar et al. 2023). This has been observed in natural populations of Drosophila simulans, where P- elements were introduced into naive populations resulting in a burst of transposition activity followed by copy-number equilibrium once host regulation was achieved (Kofler et al. 2018; Kofler et al. 2022). Host regulation of TEs in animals is primarily through the piwi-associated small RNA (piRNA) pathway. piRNAs are small non-coding RNAs that function as an adaptive immune system for TEs. These small RNAs are produced from a long non-coding RNA precursor that is transcribed from a highly repetitive genomic locus filled with TE sequences, called a piRNA cluster (Aravin et al. 2007; Brennecke et al. 2007). This long precursor transcript is sequentially cleaved into small (21-30 bp) RNAs that complex with ago and piwi clade proteins to form a silencing complex. The piRNA silencing functions through a complementarity mechanism between the piRNA and the TE transcript. Once paired, the TE-piRNA complex will signal to other members of the piwi pathway to degrade the nascent transcript and process it into piRNAs that form a positive feedback loop by associating with the long non-coding piRNA precursor to increase production of piRNAs and therefore establish constitutive silencing. This process is called “ping-pong” amplification due to the recursive process of piRNA generation (Czech et al. 2018). TE repression is a critical mechanism underlying the host’s fitness and is even implicated in aspects of human health. There is a growing body of evidence that de-repression of TEs in aging somatic tissue can lead to increased inflammation and other age-associated phenotypes (Gorbunova et al. 2021; Rigal et al. 2022). Additionally, de-repression of TEs in cancerous cells 7 is quite common, which may lead to further genomic instability and cancer progression (Clayton et al. 2016; Kong et al. 2019). But, it is unknown whether the de-repression of TEs in these examples are driven by defects in the piRNA system or other failures in the heterochromatization machinery. There are several competing hypotheses about how piRNA clusters are formed in the genome. There is the “trap” model wherein as TEs invade in a population, by chance, one will insert in a piRNA cluster. When this happens the host genome now has acquired immunity to this TE and the immunity will spread in the population, the TE invasion will be quelled, and copy number will stabilize (Khurana et al. 2011; Kofler et al. 2018). Evidence for this model is attributed to the presence of large piRNA clusters found in the D. melanogaster genome full of ancient and fragmented TEs, such as the flamenco locus, suggesting the recurrent capture of TE sequences within piRNA clusters (Brennecke et al. 2007; Mével-Ninio et al. 2007). There is also evidence in natural populations of insertions of P-elements in large ancestral piRNA clusters during a recent invasion (Zhang et al. 2020). Additionally, large piRNA clusters tend to be enriched for young, recently active TEs, further evidence for the “trap” model of piRNA clusters (Srivastav et al. 2023 May 10). The sequence composition of the piRNA clusters seems to be highly polymorphic as well, suggesting a rapid turnover of sequence as new TEs invade and become trapped (Zanni et al. 2013; Zhang et al. 2020). The other competing model is the de novo model. Under this model, piRNAs are not primarily sourced from large permanent piRNA clusters, but instead from smaller dynamic piRNA clusters dispersed throughout the genome. There is significant evidence that euchromatic TE insertions can be induced to produce piRNAs as de novo piRNA clusters through increased TE activity, and that this piRNA cluster licensing is stably inheritable (de Vanssay et al. 2012; Olovnikov et al. 2013; Shpiz et al. 2014). Furthermore, piRNA clusters are evolutionarily labile and divergent, with some of the largest “trap” heterochromatic piRNA clusters being completely dispensable for TE regulation (Gebert et al. 2021). Overall, it is likely 8 that piRNA clusters form through mechanisms similar to both the “trap” model and de novo model (Srivastav et al. 2023 May 10). However, there is emerging evidence of an alternative model, the “crank-up” model that suggests that it is not the formation of piRNA clusters that is important, but rather an siRNA-induced activation of the “ping-pong” cycle that established host control over TE expression (Selvaraju et al. 2022). Regardless of the exact model of piRNA cluster formation, piRNAs play an important role in the evolutionary dynamics of TEs and modern models of TE population genetics reflect this. Agent-based simulations of piRNA and TE dynamics have revealed that piRNA clusters can rapidly control TE invasions if they are sufficiently large to trap TE sequences, approximately ~0.2-3% of the genome depending on the population size and dominance of repression (Kofler 2020). A unified analytical model of TE-piRNA population dynamics was recently proposed, which satisfactorily describes the equilibrium states of this new “transposition-selection-regulation” balance. Importantly, the initial high transposition rates have no effect on the final TE copy number equilibrium, which has been recapitulated in an empirical study of P-element invasions of natural populations (Kofler et al. 2022; Tomar et al. 2023). The pairing of empirical data and population-genetic models has revealed an evolutionary arms race between TE invaders and the host’s piRNA adaptive immune system (Luo et al. 2020). These arms race models clearly describe how host genomes adapt to the invasion of TEs by limiting their transposition, but there is little description on how TEs might evolve to adapt to their hosts. TEs, much like other replicative entities, are subject to natural selection. But for TEs their fitness is linked to their ability to transpose and any mutation that increases TEs transposition would increase their fitness (Brunet and Doolittle 2015; Luo et al. 2020). Although population genetic models of TE sequence evolution have described the evolution and competition of derived TE subfamilies, these models tend to assume most mutations are either deleterious or neutral (Le Rouzic and Capy 2006; Kijima and Innan 2013; Iwasaki et al. 2020). However, positive selection to escape from host repression mechanisms may be playing a role 9 in driving a TE to adapt. For example the CACTA TE in Oryza sativa produces a micro RNA that reduces the expression of methylation machinery, effectively de-repressing it (Nosaka et al. 2013). But, it is possible that sophisticated mechanisms such as these are not necessary to escape silencing. When considering the piRNA pathway, it is possible that divergence of the TE from the piRNA sequences may be adaptive. That is because piRNAs must base-pair with the TE transcript and if that base-pairing is disrupted then that TE may escape repression. How this would function mechanistically is unclear as the specificity of piRNAs seems to vary across species. In Caenorhabditis elegans, disrupting the base pairing of a piRNA’s highly specific “seed” sequence will de-repress a silenced target, and even outside the “seed” no more than three mismatches are tolerated (Zhang et al. 2018). The piRNA pathway in D. melanogaster, on the other hand, seems more robust to mismatches as silencing of TEs is established as long as piRNA sequences have a 90% or greater sequence complementarity (Kotov et al. 2019). However, full de-repression of a TE subfamily may not be necessary as small gains in transposition relative to other subfamilies could give it a sufficient edge to outcompete its brethren (Le Rouzic and Capy 2006). There already exists evidence of variation in LINE1 transposition rates between subfamilies, emphasizing how TE subfamilies may be under selection to increase their transposition (Lutz et al. 2003; Seleme et al. 2006). However, it is unclear whether these differences reflect escape from host regulation. The population genetic and evolutionary models of TEs have matured considerably over the course of their first descriptions (Charlesworth and Charlesworth 1983; Langley et al. 1983; Charlesworth and Langley 1989). From the first models describing “transposition-selection” balance to the discovery and integration of host regulatory piRNAs, these fundamental theories have been instrumental in understanding the evolution of TEs (Kofler 2019; Tomar et al. 2023). And while evolutionary models of TE sequence evolution and subfamily competition have provided thorough descriptions of subfamily dynamics, there is still a significant gap in the 10 population genetic theory of the TE-host arms race (Le Rouzic and Capy 2006; Kijima and Innan 2013; Iwasaki et al. 2020). TE sequence evolution within a population has yet to be put into context with the emergence of piRNA host regulation in a population model. The challenges to describing this model are extensive, as it is unclear how the piRNA system acts as a selective force on TE sequences and which model of piRNA silencing best describes how TEs are regulated, e.g. “trap” model, de novo, or “crank-up”. Ultimately, significant advancements must be made on the empirical study of the population variation and evolution of TEs and piRNAs to be able to construct a robust theoretical model of the TE-host arms race. Evolutionary models of satellites Microsatellites typically change in copy number through replication-mediated repeat expansions and contractions, such as polymerase slippage (Khristich and Mirkin 2020). These changes in copy-number have been modeled in a stepwise fashion with neutral effects on host fitness (Ohta and Kimura 1973; Di Rienzo et al. 1994; Sainudiin et al. 2004). Recent appreciation that microsatellite copy number variation increases risk for several diseases and disorders such as gout and autism, as well as affecting gene expression, has led to models of selection on microsatellites (Rockman and Wray 2002; Mitra et al. 2021; Mukamel et al. 2021). Selection models on microsatellites must take into account that copy number changes can be frequent and recurrent and therefore standard models of selection do not apply. Instead, microsatellite copy number is modeled under “mutation-selection-drift” equilibrium where copy number oscillates at a fitness optimum (Haasl and Payseur 2013). These models applied to human population genomic datasets found that the aggregate fitness burden of microsatellites is much higher than that of single nucleotide polymorphisms (Huang et al. 2022). In contrast, mutational models of long satellite arrays consider changes in copy number as largely driven by unequal meiotic exchange during recombination rather than replication slippage (Smith 1976; Perelson and Bell 1977). During unequal exchange sister chromatids 11 containing highly repetitive satellite arrays can mispair and cross over, producing new sister chromatids with different sized satellite arrays. The combination of unequal exchange with stabilizing selection on the satellite array size is predicted to maintain homogeneous satellite arrays, and experiments have shown that reducing selective constraints increases satellite size variability (Kimura and Crow 1964; Flynn et al. 2017). The birth of satellite arrays from single copy sequence is predicted to occur from any locus where selection is relaxed (Smith 1974; Smith 1976). The initial amplification may occur through unequal exchange, double-stranded break repair, replication slippage, TE insertion or other undescribed mechanisms (Liao et al. 1997; Dover 2002). The extinction of the satellite array may also occur through unequal exchange and intra-chromatid recombination amplified by genetic drift (Charlesworth et al. 1986; Walsh 1987). Although it is predicted that satellite accumulation should be largely neutral with increasing array sizes being constrained by stabilizing selection, some satellites may be positively affecting host fitness. Pericentric heterochromatin seems to have a conserved function in the formation of “chromocenters'', an essential structure in the aggregation of the genome in the nucleus, and heterochromatic repeat content can modify the epigenome (Kelsey and Clark 2017; Jagannathan et al. 2018; Delanoue et al. 2023 Jun 12). These results suggest that although specific satellites may not have unique functions, satellite DNA in bulk plays an important role in the genome. Therefore, modeling satellite accumulation as a largely neutral process may not be accurate. There are empirical studies of within-species variation that reveal high variability in satellite array presence and copy number, as well as heterogeneous lineage- specific satellite copy-number expansion in human Y-haplogroups and inbred mouse lines suggesting that the mutational and selective processes affecting satellite array size are not constant (Altemose et al. 2014; Wei et al. 2014; Flynn et al. 2018; Flynn et al. 2021). However, it is unclear whether these rapid rates of expansion and contraction match expected gains and losses given genetic drift and unequal exchange under a neutral model. Therefore, significant 12 advancements must be made in the population-level study of satellite arrays and the evolutionary models to better understand the biological processes influencing their rapid evolution. Problems in repeat genomics and specific solutions In the previous sections I have outlined significant gaps in the field’s understanding of the population genetic and evolutionary models of repeats. These gaps are driven by major limitations of modern genomic technologies. Modern genomics has been revolutionized by the ease and cost effectiveness of Illumina based Next Generation Sequencing (NGS). These NGS technologies have made sequencing genomes incredibly cheap and easy, but are fundamentally ill equipped for handling repeat sequences. This is primarily because NGS sequencers generate “reads”, strings of typically 75 or 150 basepairs from input DNA representing small fragments of the genome (Bentley et al. 2008). In a typical genomics workflow, these reads are aligned to an organismal reference genome to find reads that uniquely map to genomic loci. This workflow poses two major problems to the study of repeats. Firstly, the reference genome may not contain highly repetitive loci for reads to be aligned to as repeat-rich regions tend to be absent from genome assemblies. In the case of the human genome the centromeres and acrocentric arms of the human genome were completely missing until very recently (Nurk et al. 2022). Secondly,reads derived from repetitive DNA are non- unique and will thus map to multiple loci, making it impossible to know which loci they were derived from. There can also be biases in k-mer content between library preparations and sequencing platform and GC-amplification bias, all of which must be accounted for in downstream analyses. Alternative technologies to Illumina sequencers, such as PacBio or Oxford Nanopore can however circumvent these problems. These two technologies generate long-reads, an average of 30 kb long and up to 2.3 Mb in the most extreme cases (Amarasinghe et al. 2020). 13 This allows the sequencer to read through long satellite arrays and provides fully haplotype- resolved TE insertions. The combination of these technologies with proximity-ligation sequencing or optical mapping are capable of producing highly contiguous genome assemblies with fully assembled centromeres and telomeres (Nurk et al. 2022). However, long-read technologies are not without their problems as they can severely under-represent the abundance of simple satellites (Flynn et al. 2020). Additionally, long-read sequencing is much more expensive than short-read sequencing and is currently unfeasible at the scale necessary for population genetic inference. Currently, most population-scale genomics datasets are sequenced using NGS technologies and several methodologies have been developed to mine these datasets for specific types of repeats and solve specific problems. For TEs the “McClintock” suite of tools is a highly vetted meta-pipeline containing several packages for the detection of TEs in short-read data (Nelson et al. 2017). Within this suite there are specialized tools including for detection of TE insertions in pooled population sequencing and detection of non-reference TE insertions (Kofler et al. 2016; Yu et al. 2021). TE detection methods tend to rely on the fact that TE insertions are flanked by unique genomic sequences, therefore it is possible to detect polymorphic TE insertions by examining discordant read-pairs. There are several high quality tools to detect microsatellites (short tandem repeats, or variable number tandem repeats) in population-level NGS data, including the “-STR” collection of tools, adVNTR, and others (Bakhtiari et al. 2018; Mousavi et al. 2019; Mitra et al. 2021; Mousavi et al. 2021). These tools are similar to TE detection tools in that they use the unique genomic sequence flanking the microsatellite to localize the repeat locus, but are also able to use statistical models to estimate the repeat unit length (Mousavi et al. 2019). Long satellite arrays and TE insertions nested within repeats are much more complicated to analyze. Tools such as k-Seek and TAREAN examine unassembled short-reads to look for tandemly repeating k-mers and can generate consensus sequences and estimate abundances of satellites, while ConTExt can discover 14 tandem arrays of TEs using mixture modeling (Wei et al. 2014; Novák et al. 2017; McGurk and Barbash 2018). Each toolset seeks to solve a specific problem for a specific type of repeat and because of the great diversity of repeats there is a great diversity of tools. If the ultimate goal is to analyze population-scale genomic datasets to construct evolutionary and population genetic models then we must find new ways to analyze repeats at scale. And depending on the type of repeat and the specific mode of evolution of the repeat, e.g., sequence, copy number, location, different methodologies and frameworks must be developed then applied. Eventually, long-read genomics may become affordable enough that NGS repeat analysis toolkits will no longer be necessary, but even so similar statistical principles will need to be applied to account for biases and errors in long-read sequencing. Overview of dissertation The broad aims of this dissertation are firstly to provide descriptions of the population variation of repetitive elements, and secondly to leverage population scale data to contribute to the advancement of the evolutionary models of repeats. I aim to do this by first developing new methodologies for studying repeats using existing population scale NGS data. And then by using the results of these new methodologies to make inferences on the evolution of repetitive sequences within populations. In Chapter 2, I develop a specialized method for inferring TE clades from population- scale NGS data to study the sequence variation of TEs in populations. This approach was successfully deployed to a large population genomics panel of D. melanogaster strains, finding considerable population specific variation. The clades are sets of sequence polymorphisms in the TE sequence that are strongly correlated and represent a statistical inference of underlying lineages of TEs. I next applied population genetic theory to the copy number distributions of clades to infer their putative age, and by integrating piRNA data found that the sequences of 15 younger TE clades are more often silenced and used as piRNA generating loci than older clades. This tool has revealed important features of the evolutionary dynamics of the TE-host arms race. From these results, I propose a model that TEs used to generate silencing piRNAs turn over as newer, younger and active derived TE lineages emerge in the population. In this way the host can maintain silencing of TE lineages that are most active and not expend resources on silencing older and inactive TE lineages. In Chapter 3, I focus on the population variation of satellite repeats and apply an existing method to mine tandem k-mer content, k-Seek, to 2,504 human NGS libraries (Wei et al. 2014). From these data I describe the population variation of simple satellite repeats, finding some modest African-specific population structure and the presence of rare satellite expansions within single individuals. Next, I find that AT- and AG-rich satellites are more interspersed than random chance and tend to covary across individuals, suggesting a concerted amplification of higher order structures within the genome. Lastly, I compute genetic relatedness between individuals at the centromeres and find that centromeric ancestries are predictive of centromeric simple satellite copy number. In particular, I find an expansion of Human Satellite 2 copy number within a cluster of African centromeric ancestry that is also found in admixed Puerto Rican and Colombian individuals. By combining satellite mining techniques and large scale polymorphism data I show that there is structured variation of centromeres and satellites in human populations. These results present a framework for analyzing satellite evolution and variation from short-read data and provide descriptions of population patterns necessary for constructing better models of repeat evolution. Chapter 4 discusses the future direction of the evolutionary models of repeats and the future of using NGS data for empirical study of repeats in populations. 16 Works Cited Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. 2000. The genome sequence of Drosophila melanogaster. Science. 287(5461):2185–2195. Adrion JR, Song MJ, Schrider DR, Hahn MW, Schaack S. 2017. Genome-Wide Estimates of Transposable Element Insertion and Deletion Rates in Drosophila Melanogaster. Genome Biol Evol. 9(5):1329–1340. Altemose N, Miga KH, Maggioni M, Willard HF. 2014. Genomic characterization of large heterochromatic gaps in the human genome assembly. PLoS Comput Biol. 10(5):e1003628. Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21(1):30. Aminetzach YT, Macpherson JM, Petrov DA. 2005. Pesticide resistance via transposition- mediated adaptive gene truncation in Drosophila. Science. 309(5735):764–767. Aravin AA, Hannon GJ, Brennecke J. 2007. The Piwi-piRNA pathway provides an adaptive defense in the transposon arms race. Science. 318(5851):761–764. Bakhtiari M, Shleizer-Burko S, Gymrek M, Bansal V, Bafna V. 2018. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 28(11):1709–1719. Bayes JJ, Malik HS. 2009. Altered heterochromatin binding by a hybrid sterility protein in Drosophila sibling species. Science. 326(5959):1538–1541. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 456(7218):53–59. Berloco M, Palumbo G, Piacentini L, Pimpinelli S, Fanti L. 2014. Position effect variegation and viability are both sensitive to dosage of constitutive heterochromatin in Drosophila. G3 . 4(9):1709–1716. Brennecke J, Aravin AA, Stark A, Dus M, Kellis M, Sachidanandam R, Hannon GJ. 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell. 128(6):1089–1103. Brouwer JR, Willemsen R, Oostra BA. 2009. Microsatellite repeat instability and neurological disease. Bioessays. 31(1):71–83. Brunet TDP, Doolittle WF. 2015. Multilevel Selection Theory and the Evolutionary Functions of Transposable Elements. Genome Biol Evol. 7(8):2445–2457. Charlesworth B, Charlesworth D. 1983. The population dynamics of transposable elements. Genet Res . 42(1):1–27. Charlesworth B, Langley CH. 1989. The population genetics of Drosophila transposable elements. Annu Rev Genet. 23:251–287. 17 Charlesworth B, Langley CH, Stephan W. 1986. The evolution of restricted recombination and the accumulation of repeated DNA sequences. Genetics. 112(4):947–962. Choi JY, Lee YCG. 2020. Double-edged sword: The evolutionary consequences of the epigenetic silencing of transposable elements. PLoS Genet. 16(7):e1008872. Chuong EB, Elde NC, Feschotte C. 2017. Regulatory activities of transposable elements: from conflicts to benefits. Nat Rev Genet. 18(2):71–86. Clayton EA, Wang L, Rishishwar L, Wang J, McDonald JF, Jordan IK. 2016. Patterns of Transposable Element Expression and Insertion in Cancer. Front Mol Biosci. 3:76. Cosby RL, Chang N-C, Feschotte C. 2019. Host-transposon interactions: conflict, cooperation, and cooption. Genes Dev. 33(17-18):1098–1116. Czech B, Munafò M, Ciabrelli F, Eastwood EL, Fabry MH, Kneuss E, Hannon GJ. 2018. piRNA- Guided Genome Defense: From Biogenesis to Silencing. Annu Rev Genet. 52:131–157. Davis PS, Shen MW, Judd BH. 1987. Asymmetrical pairings of transposons in and proximal to the white locus of Drosophila account for four classes of regularly occurring exchange products. Proc Natl Acad Sci U S A. 84(1):174–178. Delanoue R, Clot C, Leray C, Pihl T, Hudry B. 2023 Jun 12. Y chromosome toxicity does not contribute to sex-specific differences in longevity. Nat Ecol Evol. doi:10.1038/s41559-023- 02089-7. http://dx.doi.org/10.1038/s41559-023-02089-7. Dimitri P, Pisano C. 1989. Position effect variegation in Drosophila melanogaster: relationship between suppression effect and the amount of Y chromosome. Genetics. 122(4):793–800. Di Rienzo A, Peterson AC, Garza JC, Valdes AM, Slatkin M, Freimer NB. 1994. Mutational processes of simple-sequence repeat loci in human populations. Proc Natl Acad Sci U S A. 91(8):3166–3170. Doolittle WF, Sapienza C. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature. 284(5757):601–603. Dover G. 2002. Molecular drive. Trends Genet. 18(11):587–589. Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. 2012. Genomics: ENCODE explained. Nature. 489(7414):52–55. ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature. 489(7414):57–74. Feusier J, Watkins WS, Thomas J, Farrell A, Witherspoon DJ, Baird L, Ha H, Xing J, Jorde LB. 2019. Pedigree-based estimation of human mobile element retrotransposition rates. Genome Res. 29(10):1567–1577. Flynn JM, Brown EJ, Clark AG. 2021. Copy number evolution in simple and complex tandem repeats across the C57BL/6 and C57BL/10 inbred mouse lines. G3 . 11(8). doi:10.1093/g3journal/jkab184. http://dx.doi.org/10.1093/g3journal/jkab184. 18 Flynn JM, Caldas I, Cristescu ME, Clark AG. 2017. Selection Constrains High Rates of Tandem Repetitive DNA Mutation in Daphnia pulex. Genetics. 207(2):697–710. Flynn JM, Long M, Wing RA, Clark AG. 2020. Evolutionary Dynamics of Abundant 7-bp Satellites in the Genome of Drosophila virilis. Mol Biol Evol. 37(5):1362–1375. Flynn JM, Lower SE, Barbash DA, Clark AG. 2018. Rates and Patterns of Mutation in Tandem Repetitive DNA in Six Independent Lineages of Chlamydomonas reinhardtii. Genome Biol Evol. 10(7):1673–1686. Gebert D, Neubert LK, Lloyd C, Gui J, Lehmann R, Teixeira FK. 2021. Large Drosophila germline piRNA clusters are evolutionarily labile and dispensable for transposon regulation. Mol Cell. 81(19):3965–3978.e5. González J, Lenkov K, Lipatov M, Macpherson JM, Petrov DA. 2008. High rate of recent transposable element-induced adaptation in Drosophila melanogaster. PLoS Biol. 6(10):e251. Gorbunova V, Seluanov A, Mita P, McKerrow W, Fenyö D, Boeke JD, Linker SB, Gage FH, Kreiling JA, Petrashen AP, et al. 2021. The role of retrotransposable elements in ageing and age-associated diseases. Nature. 596(7870):43–53. Groth SB, Blumenstiel JP. 2017. Horizontal Transfer Can Drive a Greater Transposable Element Load in Large Populations. J Hered. 108(1):36–44. Haasl RJ, Payseur BA. 2013. Microsatellites as targets of natural selection. Mol Biol Evol. 30(2):285–298. Huang B, Durvasula A, Mousavi N, Ziaei-Jam H, Maksimov M, Lohmueller KE, Gymrek M. 2022. Genome-wide selection inference at short tandem repeats. bioRxiv.:2022.05.12.491726. doi:10.1101/2022.05.12.491726. [accessed 2023 May 31]. https://www.biorxiv.org/content/10.1101/2022.05.12.491726v1. Iwasaki WM, Kijima TE, Innan H. 2020. Population Genetics and Molecular Evolution of DNA Sequences in Transposable Elements. II. Accumulation of Variation and Evolution of a New Subfamily. Mol Biol Evol. 37(2):355–364. Jagannathan M, Cummings R, Yamashita YM. 2018. A conserved function for pericentromeric satellite DNA. Elife. 7. doi:10.7554/eLife.34122. http://dx.doi.org/10.7554/eLife.34122. Jurka J, Bao W, Kojima KK. 2011. Families of transposable elements, population structure and the origin of species. Biol Direct. 6:44. Kelsey KJP, Clark AG. 2017. Variation in Position Effect Variegation Within a Natural Population. Genetics. 207(3):1157–1166. Khristich AN, Mirkin SM. 2020. On the wrong DNA track: Molecular mechanisms of repeat- mediated genome instability. J Biol Chem. 295(13):4134–4170. Khurana JS, Wang J, Xu J, Koppetsch BS, Thomson TC, Nowosielska A, Li C, Zamore PD, Weng Z, Theurkauf WE. 2011. Adaptation to P element transposon invasion in Drosophila melanogaster. Cell. 147(7):1551–1563. 19 Kidwell MG. 1985. Hybrid dysgenesis in Drosophila melanogaster: nature and inheritance of P element regulation. Genetics. 111(2):337–350. Kijima TE, Innan H. 2013. Population genetics and molecular evolution of DNA sequences in transposable elements. I. A simulation framework. Genetics. 195(3):957–967. Kimura M. 1979. The neutral theory of molecular evolution. Sci Am. 241(5):98–100, 102, 108 passim. Kimura M, Crow JF. 1964. THE NUMBER OF ALLELES THAT CAN BE MAINTAINED IN A FINITE POPULATION. Genetics. 49(4):725–738. Kofler R. 2019. Dynamics of Transposable Element Invasions with piRNA Clusters. Mol Biol Evol. 36(7):1457–1472. Kofler R. 2020. piRNA Clusters Need a Minimum Size to Control Transposable Element Invasions. Genome Biol Evol. 12(5):736–749. Kofler R, Gómez-Sánchez D, Schlötterer C. 2016. PoPoolationTE2: Comparative Population Genomics of Transposable Elements Using Pool-Seq. Mol Biol Evol. 33(10):2759–2764. Kofler R, Nolte V, Schlötterer C. 2022. The Transposition Rate Has Little Influence on the Plateauing Level of the P-element. Mol Biol Evol. 39(7). doi:10.1093/molbev/msac141. http://dx.doi.org/10.1093/molbev/msac141. Kofler R, Senti K-A, Nolte V, Tobler R, Schlötterer C. 2018. Molecular dissection of a natural transposable element invasion. Genome Res. 28(6):824–835. Kong Y, Rose CM, Cass AA, Williams AG, Darwish M, Lianoglou S, Haverty PM, Tong A-J, Blanchette C, Albert ML, et al. 2019. Transposable element expression in tumors is associated with immune infiltration and increased antigenicity. Nat Commun. 10(1):5228. Kotov AA, Adashev VE, Godneeva BK, Ninova M, Shatskikh AS, Bazylev SS, Aravin AA, Olenina LV. 2019. piRNA silencing contributes to interspecies hybrid sterility and reproductive isolation in Drosophila melanogaster. Nucleic Acids Res. 47(8):4255–4271. Kupiec M, Petes TD. 1988. Allelic and ectopic recombination between Ty elements in yeast. Genetics. 119(3):549–559. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. 2001. Initial sequencing and analysis of the human genome. Nature. 409(6822):860–921. Langley CH, Brookfield JF, Kaplan N. 1983. Transposable elements in mendelian populations. I. A theory. Genetics. 104(3):457–471. Langley CH, Montgomery E, Hudson R, Kaplan N, Charlesworth B. 1988. On the role of unequal exchange in the containment of transposable element copy number. Genet Res. 52(3):223–235. Lee C, Wevrick R, Fisher RB, Ferguson-Smith MA, Lin CC. 1997. Human centromeric DNAs. Hum Genet. 100(3-4):291–304. 20 Lee YCG. 2015. The Role of piRNA-Mediated Epigenetic Silencing in the Population Dynamics of Transposable Elements in Drosophila melanogaster. PLoS Genet. 11(6):e1005269. Lee YCG. 2022. Synergistic epistasis of the deleterious effects of transposable elements. Genetics. 220(2). doi:10.1093/genetics/iyab211. http://dx.doi.org/10.1093/genetics/iyab211. Lee YCG, Langley CH. 2010. Transposable elements in natural populations of Drosophila melanogaster. Philos Trans R Soc Lond B Biol Sci. 365(1544):1219–1228. Le Rouzic A, Boutin TS, Capy P. 2007. Long-term evolution of transposable elements. Proc Natl Acad Sci U S A. 104(49):19375–19380. Le Rouzic A, Capy P. 2005. The first steps of transposable elements invasion: parasitic strategy vs. genetic drift. Genetics. 169(2):1033–1043. Le Rouzic A, Capy P. 2006. Population genetics models of competition between transposable element subfamilies. Genetics. 174(2):785–793. Liao D, Pavelitz T, Kidd JR, Kidd KK, Weiner AM. 1997. Concerted evolution of the tandemly repeated genes encoding human U2 snRNA (the RNU2 locus) involves rapid intrachromosomal homogenization and rare interchromosomal gene conversion. EMBO J. 16(3):588–598. Lim JK, Simmons MJ. 1994. Gross chromosome rearrangements mediated by transposable elements in Drosophila melanogaster. Bioessays. 16(4):269–275. Luo S, Zhang H, Duan Y, Yao X, Clark AG, Lu J. 2020. The evolutionary arms race between transposable elements and piRNAs in Drosophila melanogaster. BMC Evol Biol. 20(1):14. Lutz SM, Vincent BJ, Kazazian HH Jr, Batzer MA, Moran JV. 2003. Allelic heterogeneity in LINE-1 retrotransposition activity. Am J Hum Genet. 73(6):1431–1437. McCLINTOCK B. 1950. The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A. 36(6):344–355. McGurk MP, Barbash DA. 2018. Double insertion of transposable elements provides a substrate for the evolution of satellite DNA. Genome Res. 28(5):714–725. Mével-Ninio M, Pelisson A, Kinder J, Campos AR, Bucheton A. 2007. The flamenco locus controls the gypsy and ZAM retroviruses and is required for Drosophila oogenesis. Genetics. 175(4):1615–1624. Mieczkowski PA, Lemoine FJ, Petes TD. 2006. Recombination between retrotransposons as a source of chromosome rearrangements in the yeast Saccharomyces cerevisiae. DNA Repair . 5(9-10):1010–1020. Mitra I, Huang B, Mousavi N, Ma N, Lamkin M, Yanicky R, Shleizer-Burko S, Lohmueller KE, Gymrek M. 2021. Patterns of de novo tandem repeat mutations and their role in autism. Nature. 589(7841):246–250. Montgomery EA, Huang SM, Langley CH, Judd BH. 1991. Chromosome rearrangement by ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics. 129(4):1085–1098. 21 Montgomery E, Charlesworth B, Langley CH. 1987. A test for the role of natural selection in the stabilization of transposable element copy number in a population of Drosophila melanogaster. Genet Res. 49(1):31–41. Mousavi N, Margoliash J, Pusarla N, Saini S, Yanicky R, Gymrek M. 2021. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics. 37(5):731–733. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. 2019. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 47(15):e90. Mukamel RE, Handsaker RE, Sherman MA, Barton AR, Zheng Y, McCarroll SA, Loh P-R. 2021. Protein-coding repeat polymorphisms strongly shape diverse human phenotypes. Science. 373(6562):1499–1505. Neil AJ, Hisey JA, Quasem I, McGinty RJ, Hitczenko M, Khristich AN, Mirkin SM. 2021. Replication-independent instability of Friedreich’s ataxia GAA repeats during chronological aging. Proc Natl Acad Sci U S A. 118(5). doi:10.1073/pnas.2013080118. http://dx.doi.org/10.1073/pnas.2013080118. Nelson MG, Linheiro RS, Bergman CM. 2017. McClintock: An Integrated Pipeline for Detecting Transposable Element Insertions in Whole-Genome Shotgun Sequencing Data. G3 . 7(8):2763– 2778. Nosaka M, Ishiwata A, Shimizu-Sato S, Ono A, Ishimoto K, Noda Y, Sato Y. 2013. The copy number of rice CACTA DNA transposons carrying MIR820 does not correlate with MIR820 expression. Plant Signal Behav. 8(8). doi:10.4161/psb.25169. http://dx.doi.org/10.4161/psb.25169. Novák P, Ávila Robledillo L, Koblížková A, Vrbová I, Neumann P, Macas J. 2017. TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res. 45(12):e111. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. 2022. The complete sequence of a human genome. Science. 376(6588):44–53. Ohno S. 1972. So much “junk” DNA in our genome. Brookhaven Symp Biol. 23:366–370. Ohta T. 1973. Slightly deleterious mutant substitutions in evolution. Nature. 246(5428):96–98. Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23(1):263– 286. Ohta T, Kimura M. 1973. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet Res. 22(2):201–204. Ohzeki J-I, Nakano M, Okada T, Masumoto H. 2002. CENP-B box is required for de novo centromere chromatin assembly on human alphoid DNA. J Cell Biol. 159(5):765–775. Okada T, Ohzeki J-I, Nakano M, Yoda K, Brinkley WR, Larionov V, Masumoto H. 2007. CENP- B controls centromere formation depending on the chromatin context. Cell. 131(7):1287–1300. 22 Olovnikov I, Ryazansky S, Shpiz S, Lavrov S, Abramov Y, Vaury C, Jensen S, Kalmykova A. 2013. De novo piRNA cluster formation in the Drosophila germ line triggered by transgenes containing a transcribed transposon fragment. Nucleic Acids Res. 41(11):5757–5768. Orgel LE, Crick FH. 1980. Selfish DNA: the ultimate parasite. Nature. 284(5757):604–607. Palazzo AF, Gregory TR. 2014. The case for junk DNA. PLoS Genet. 10(5):e1004351. Perelson AS, Bell GI. 1977. Mathematical models for the evolution of multigene families by unequal crossing over. Nature. 265(5592):304–310. Piovesan A, Antonaros F, Vitale L, Strippoli P, Pelleri MC, Caracausi M. 2019. Human protein- coding genes and gene feature statistics in 2019. BMC Res Notes. 12(1):315. Rigal J, Martin Anduaga A, Bitman E, Rivellese E, Kadener S, Marr MT. 2022. Artificially stimulating retrotransposon activity increases mortality and accelerates a subset of aging phenotypes in Drosophila. Elife. 11. doi:10.7554/eLife.80169. http://dx.doi.org/10.7554/eLife.80169. Rockman MV, Wray GA. 2002. Abundant raw material for cis-regulatory evolution in humans. Mol Biol Evol. 19(11):1991–2004. Ryazansky S, Radion E, Mironova A, Akulenko N, Abramov Y, Morgunova V, Kordyukova MY, Olovnikov I, Kalmykova A. 2017. Natural variation of piRNA expression affects immunity to transposable elements. PLoS Genet. 13(4):e1006731. Sainudiin R, Durrett RT, Aquadro CF, Nielsen R. 2004. Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics. 168(1):383–395. SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, Melake-Berhan A, Springer PS, Edwards KJ, Lee M, Avramova Z, et al. 1996. Nested retrotransposons in the intergenic regions of the maize genome. Science. 274(5288):765–768. Seleme M del C, Vetter MR, Cordaux R, Bastone L, Batzer MA, Kazazian HH Jr. 2006. Extensive individual variation in L1 retrotransposition capability contributes to human genetic diversity. Proc Natl Acad Sci U S A. 103(17):6611–6616. Selvaraju D, Wierzbicki F, Kofler R. 2022. P-element invasions in Drosophila erecta shed light on the establishment of host control over a transposable element. bioRxiv.:2022.12.22.521571. doi:10.1101/2022.12.22.521571. [accessed 2023 May 25]. https://www.biorxiv.org/content/10.1101/2022.12.22.521571v1. Shpiz S, Ryazansky S, Olovnikov I, Abramov Y, Kalmykova A. 2014. Euchromatic transposon insertions trigger production of novel Pi- and endo-siRNAs at the target sites in the drosophila germline. PLoS Genet. 10(2):e1004138. Smith GP. 1974. Unequal crossover and the evolution of multigene families. Cold Spring Harb Symp Quant Biol. 38:507–513. Smith GP. 1976. Evolution of repeated DNA sequences by unequal crossover. Science. 191(4227):528–535. 23 Srivastav S, Feschotte C, Clark AG. 2023 May 10. Rapid evolution of piRNA clusters in the Drosophila melanogaster ovary. bioRxiv. doi:10.1101/2023.05.08.539910. http://dx.doi.org/10.1101/2023.05.08.539910. Tomar SS, Hua-Van A, Le Rouzic A. 2023. A population genetics theory for piRNA-regulated transposable elements. Theor Popul Biol. 150:1–13. de Vanssay A, Bougé A-L, Boivin A, Hermant C, Teysset L, Delmarre V, Antoniewski C, Ronsseray S. 2012. Paramutation in Drosophila linked to emergence of a piRNA-producing locus. Nature. 490(7418):112–115. Walsh JB. 1987. Persistence of tandem arrays: implications for satellite and simple-sequence DNAs. Genetics. 115(3):553–567. Wang L, Zhang S, Hadjipanteli S, Saiz L, Nguyen L, Silva E, Kelleher E. 2023. P-element invasion fuels molecular adaptation in laboratory populations of Drosophila melanogaster. Evolution. 77(4):980–994. Wei KH-C, Grenier JK, Barbash DA, Clark AG. 2014. Correlated variation and population differentiation in satellite DNA abundance among lines of Drosophila melanogaster. Proc Natl Acad Sci U S A. 111(52):18793–18798. Wells JN, Feschotte C. 2020. A Field Guide to Eukaryotic Transposable Elements. Annu Rev Genet. 54:539–561. Yu T, Huang X, Dou S, Tang X, Luo S, Theurkauf WE, Lu J, Weng Z. 2021. A benchmark and an algorithm for detecting germline transposon insertions and measuring de novo transposon insertion frequencies. Nucleic Acids Res. 49(8):e44. Zanni V, Eymery A, Coiffet M, Zytnicki M, Luyten I, Quesneville H, Vaury C, Jensen S. 2013. Distribution, evolution, and diversity of retrotransposons at the flamenco locus reflect the regulatory properties of piRNA clusters. Proc Natl Acad Sci U S A. 110(49):19842–19847. Zhang D, Tu S, Stubna M, Wu W-S, Huang W-C, Weng Z, Lee H-C. 2018. The piRNA targeting rules and the resistance to piRNA silencing in endogenous genes. Science. 359(6375):587– 592. Zhang S, Pointer B, Kelleher ES. 2020. Rapid evolution of piRNA-mediated silencing of an invading transposable element was driven by abundant de novo mutations. Genome Res. 30(4):566–575. 24 Chapter 2: Patterns of piRNA regulation in Drosophila revealed through transposable element clade inference Iskander Said, Michael P. McGurk, Andrew G. Clark, Daniel A. Barbash Published in Molecular Biology and Evolution, Volume 39, Issue 1, January 2022 25 Abstract Transposable elements (TEs) are self-replicating “genetic parasites” ubiquitous to eukaryotic genomes. In addition to conflict between TEs and their host genomes, TEs of the same family are in competition with each other. They compete for the same genomic niches while experiencing the same regime of copy-number selection. This suggests that competition among TEs may favor the emergence of new variants that can outcompete their ancestral forms. To investigate the sequence evolution of TEs, we developed a method to infer clades: collections of TEs that share SNP variants and represent distinct TE family lineages. We applied this method to a panel of 85 Drosophila melanogaster genomes and found that the genetic variation of several TE families shows significant population structure that arises from the population- specific expansions of single clades. We used population genetic theory to classify these clades into younger versus older clades and found that younger clades are associated with a greater abundance of sense and antisense piRNAs per copy than older ones. Further, we find that the abundance of younger, but not older clades, is positively correlated with antisense piRNA production, suggesting a general pattern where hosts preferentially produce antisense piRNAs from recently active TE variants. Together these findings suggest a pattern whereby new TE variants arise by mutation and then increase in copy number, followed by the host producing antisense piRNAs that may be used to silence these emerging variants. 26 Introduction Transposable elements (TEs) are mobile, selfish genetic elements commonly thought of as “genetic parasites''. At the start of an invasion TEs begin as a single copy within a host genome, but can transpose and expand rapidly in copy number throughout the population in each successive generation by using the host’s replication machinery (Doolittle and Sapienza 1980; Orgel and Crick 1980). In Drosophila, explosive growth in copy number during a TE invasion is thought to be quickly followed by the host acquiring resistance to TE transpositions, commonly through host production of piwi-interacting small RNAs, piRNAs which interfere with TE transcripts (Le Rouzic and Capy 2005; Aravin et al. 2007; Brennecke et al. 2007; Kofler et al. 2018). In the germline, piRNA-mediated silencing is established through the activation of two synergistic pathways. In the “primary” piRNA pathway, transcription of long precursor sense and antisense RNAs from TE-rich loci called piRNA clusters are processed into 21-30 base-pair-long antisense piRNAs that complex with Piwi clade proteins, bind to nascent sense TE transcripts by recognizing sequence complementarity, and then recruit additional proteins to transcriptionally silence homologous TEs. In the “secondary” piRNA pathway, Piwi-bound piRNAs degrade TE transcripts and form sense piRNAs that bind to antisense piRNA precursors and create a positive feedback loop, known as the Ping-Pong cycle, that establishes constitutive silencing (Aravin et al. 2007; Brennecke et al. 2007; Le Thomas et al. 2014; Czech et al. 2018). Ultimately the TE copy number may reach a steady state, with the rate of transposition dampened by piRNA silencing as well as selection against the deleterious consequences to reproductive fitness of the host organism (Charlesworth and Charlesworth 1983; Lee and Langley 2010; Kelleher et al. 2020). However, as TEs expand in copy number, they also acquire polymorphisms in their sequences, which may lead to the formation of new lineages or subfamilies (Moody 1988; Kimmel and Mathaes 2010; Kijima and Innan 2013; Iwasaki et al. 2020). Multiple lineages of a TE will compete with each other, as long as their polymorphisms are not deactivating (Abrusán and Krambeck 2006; Le Rouzic and Capy 2006; Iwasaki et al. 27 2020). Much in the same way individuals within an ecological system are constrained by a carrying capacity, variants of the same TE may be constrained by the copy-number carrying capacity of the host (Brookfield 2005; Le Rouzic and Capy 2006). This dynamic produces an arena of genomic competition of TE variants where selection may drive the propagation of more fit TE lineages, while less fit lineages are purged (Le Rouzic and Capy 2006). The study of selection on TE population variation has often focused on the fitness of the host organism rather than on the TEs themselves. Much of it centered on the variation of TE insertions within and between populations as well as fitness and phenotypic effects associated with particular insertion loci (Cridland et al. 2013; Blumenstiel et al. 2014; Kofler et al. 2015). The study of selection on sequence variation of TEs, on the other hand, is much more limited. TEs are typically categorized into classes and subclasses based first on their mechanism of transposition, and then on presence of shared motifs, relative sequence identity, and phylogenetic characteristics (Wicker et al. 2007; Arkhipova 2017; Makałowski et al. 2019). There is extensive systemization of TE families, describing their consensus sequences, open reading frames, and insertion site preferences (Bao et al. 2015). Due to the challenges of quantifying variation within repetitive sequences, however, the empirical study of TE sequence polymorphism is largely limited to analyses of reference genome assemblies. For example, the sequence variation of TE families in the Drosophila melanogaster reference genome has been comprehensively described (Kaminker et al. 2002; Lerat et al. 2003; Bergman and Bensasson 2007). In another example, phylogenetic and evolutionary analyses on retrotransposons within the Oryza sativa genome revealed strong purifying selection on protein-coding regions, with occasional bursts of positive selection (Baucom et al. 2009). To our knowledge, studies examining TE sequence variation from population samples are rare, likely because reference genomes are the primary source of full- length TE sequence data. 28 Ideally, to apply population genetic and molecular evolutionary principles to the genetic variation of TEs, we would study the complete sequences of individual TE insertions across many genomes. This is especially necessary if the aim is to assess competition between TE subfamily lineages, where reconstructing the underlying phylogenies would yield insight into the dynamics of how lineages diversify and potentially compete with each other. However, most current population genomic data comes from short-read sequencing, which does not permit an unambiguous assembly containing all the TEs and their internal SNPs. The problem is related to haplotype phasing, which can be done with short reads (Clark 1990; Excoffier and Slatkin 1995; Browning and Browning 2007; Delaneau et al. 2008), except here the TE insertions are at nonhomologous positions. Furthermore, the high multiplicity of TEs greatly complicates the task of determining which internal polymorphisms co-occur in the same insertion, such that with short reads, unambiguous TE haplotypes cannot be recovered as complete sequences of linked SNPs. Although new long-read technologies, like PacBio and Oxford Nanopore, have emerged that greatly reduce phasing problems and allow for more complete analysis of TEs (Long et al. 2018; Miller et al. 2018; Chakraborty et al. 2019; Ellison and Cao 2020; Wierzbicki et al. 2020; Gebert et al. 2021 Jul 30), their higher cost and relatively high error rates have limited their application to large-scale population genomics studies. To gain insight into the sequence evolution of actively invading TEs, we sought to resolve some of these challenges by leveraging a straightforward intuition: If a set of SNPs co- occur in the same TE lineage, their copy-number variation should be correlated across genomes, covarying as the copy number of that lineage varies across genomes. To this end, we took advantage of the large sample sizes in a population-genomic dataset to quantify positive correlations in the copy number of SNPs across multiple individuals, and from these we identify groups of SNPs that we infer to co-occur within TE lineages. We refer to these groups of SNPs as clades, which are inferred to distinguish lineages of TE subfamilies while sidestepping the task of reconstructing the full phylogenies from short-read data. We use the term “TE lineages” 29 to describe the true but unknown genealogical relationship of TE sequences in a family, while “clades” are statistical inferences of this genealogy informed by the positive correlation of SNP copy number. Applying our method to a set of 85 D. melanogaster genomes from the Global Diversity Lines (GDL) (Grenier et al. 2015), we inferred clades in 41 recently active TE families. We then used public PacBio datasets and simulations to validate our inferred clades (Long et al. 2018; Chakraborty et al. 2019). We analyzed the population variation of TE variants and found significant population structure driven by population-specific TE clades, several which are likely active. We additionally analyzed several piRNA libraries from ovaries focused on SNPs that distinguish clades, and found piRNAs are especially enriched for younger TE clades. New approaches 30 Figure 1. Outline and examples of the TE clade inference method. (a) Cartoon depicting the method. The genomes of three individuals (orange rectangles) contain copies of an ancestral TE (blue rectangles) and a derived TE with SNPs A and B (blue rectangles with a red and purple stripe, respectively). As the copy number of the derived TE varies in copy number across individuals 1-3, so does the copy number of SNPs A and B. This relationship in copy number is depicted as a cartoon scatterplot, where each red dot represents the copy number of SNPs A and B in one of nine individuals sampled in the population. The copy number of the SNPs is positively correlated because the SNPs are physically linked. (b) Scatterplot depicting the correlation in copy number across GDL individuals for two SNPs in the Jockey element. Each C_2238, C_4204 Pearson’s r (a) (b) (c) (d) 31 dot represents the copy number of the SNPs C at position 2238 (C_2238) and C at position 2402 (C_4204) for each individual, colored by their population of origin. The degree of correlation of these two SNPs is high (Pearson’s r = 0.82), suggesting that they are physically linked and represent a clade. Black dashed line is a linear fit of the data drawn for emphasis. (c) Heatmap showing correlation of the copy number of all SNPs from the Jockey element. Cells in the heatmap are seriated via hierarchical clustering to create clusters of tightly correlated SNPs, which are inferred to be Jockey clades segregating in the population. The cells are shaded by the pairwise Pearson’s correlation between SNP copy number. The SNPs from (b) are outlined in a block box. SNPs are annotated by the cluster they belong to i.e. clade. Only clusters of two or more SNPs are annotated and included in downstream analysis. Clades are additionally annotated by predicted age as described in the section, The majority of clades are predicted to be young and recently active. (d) The percent of clades inferred from GDL data that were then detected in a set of PacBio genomes (includes only clades where at least two SNPs were detected at any frequency in the PacBio data). The results are separated by TE family, with total clades shown on the far right. Fraction of clades validated over the total number of clades found are placed above each bar. Hierarchical clustering of SNPs uncovers TE clades in NGS data We leverage large population-genomic datasets to detect TE clades by inferring the co- occurrences of SNPs within a TE lineage. We expect that if two alleles exist within the same lineage they will correlate in copy number, varying together as the TEs of that lineage vary in copy number (Figure 1a). We apply this principle to all pairwise combinations of SNPs within an element to compute a correlation matrix and then use hierarchical clustering to cluster groups of SNPs that are strongly correlated. The result is clusters of SNPs that co-vary in their copy number across samples; because these are inferred to occur within the same TE lineage we refer to these clusters as “clades”. Hierarchical clustering is a particularly appropriate choice for this problem as SNPs within TE lineages are truly related to each other in an underlying tree-like structure that is analogous to a hierarchical clustering dendrogram. The correlations between alleles are unlikely to be a result of co-transposition of multiple TEs because the linkage between TEs is very low and the sampling variation in these data is quite high. We employed this clade inference method using short-read libraries from the Global Diversity Lines (GDL), 85 D. melanogaster lines from populations in Beijing, Ithaca, the Netherlands, Tasmania, and Zimbabwe (Grenier et al. 2015). We aligned the short-read data to 32 the TE consensus sequences of 41 recently active TEs and the D. melanogaster reference genome using ConTExt (McGurk and Barbash 2018). Then we calculated allele frequencies of SNPs from the read pileups of the alignments and calculated copy number from the read depth. In brief, copy number was estimated by dividing the observed read depth at each position on the TE consensus sequence by the expected read depth of single copy sequences inferred from the read depth of the reference genome, with corrections for GC bias (McGurk et al. 2020). For each individual we took the allele frequencies at each position and multiplied them by the estimated copy number at each position to generate the copy number of alleles at each position for that individual. We compute pairwise correlations between the copy number of alleles across individuals (Figure 1b), and then employ hierarchical clustering to cluster positively correlated SNPs, thus inferring clades (Figure 1c, Supplemental File 6, https://github.com/is-the- biologist/TE_CladeInference). For each of the 41 TEs analyzed, we report the SNP clusters, as well as the copy number of each inferred clade, calculated by averaging the copy number of the individual alleles (e.g. C_2238, C_4204 in Figure 1) (Supplemental File 1, https://github.com/is- the-biologist/TE_CladeInference). One important consideration of this method is that we are not identifying the full set of TE insertions within an inferred clade nor the complete sequences of the insertions at any particular locus that belong to an inferred clade. Rather, we are identifying sets of SNPs that distinguish lineages from each other (lineage-informative SNPs). Lineage-informative SNPs are clustered together by our method and output as inferred clades, which are statistical inferences of the set of true lineages that exist for a TE family. However, because all TE lineages of a family are related by an underlying phylogeny, there likely does not exist a single correlation cut- off that optimally groups TEs into distinct clades. Rather, any chosen threshold induces some degree of coarse-graining in how it collapses this phylogeny, splitting and merging lineages of TE variants into clades. Generally, a higher stringency in the clustering cut-off will produce many small clusters of tightly correlated SNPs that split lineages, while low stringency cut-offs will 33 produce a few large clusters that merge lineages together. By analyzing the phylogeny of full length TE insertions for a given family, we found that insertions that belong to clades inferred under stringent clustering parameters tend to be more closely related phylogenetically than those inferred under lenient clustering parameters (Supplemental Figure 1). Therefore, the degree of correlation between the copy number of alleles is related to the phylogenetic distance between the TE insertions that bear those alleles. However, this relationship is not always perfect. For example, a pair of alleles that are present in all TE insertions in a population may be strongly correlated, but the phylogenetic distance between TE insertions bearing those alleles may be relatively large, as for “Cluster_7'' of the Jockey element (Supplemental Figure 1d, 1e). Due to the inherent coarse-graining of hierarchical clustering, whether or not SNPs are merged into one clade depends on the clustering cut-off and how often the lineage-informative SNPs co-occur. This can result in an insertion belonging to more than one clade, depending on the correlation cut-off chosen. For example, two closely related lineages may share an ancestral set of SNPs, but have recently diverged such that one lineage acquired a small number of polymorphisms in addition to the ancestral SNPs. Depending on the stringency of the clustering cut-off these two lineages may be called as a single clade containing all of the SNPs or as two clades: one only with the ancestral subset of SNPs and one only with the derived SNPs. In the latter case of two clades, TE insertions from the derived lineage have both the ancestral and derived SNPs and would be classified as belonging to both clades. Therefore, clades should not necessarily be interpreted as distinct waves of invasion nor as being collections of unique insertions, but rather as clusters of SNPs that co-occur within insertions that are statistical representations of evolving lineages. We attempt to address these caveats and trade-offs in our analysis by using simulations and PacBio data to validate our inferences. When inferring clades in the GDL short-read data, we chose stringent clustering parameters to make the clade calls conservative (splitting distantly related lineages). These sets of parameters were chosen to essentialize TE clades to a 34 minimum number of core SNPs that co-occur with a high positive correlation, while increasing the number of distinct, resolved clades. High stringency cut-offs also minimize the number of false positives that can result from performing many thousand pairwise correlations of allele copy number (Supplemental Figure 1a, 1b). Parameters could be tuned to be less stringent to define clades harboring greater internal SNP variation, but the sets of parameters we chose performed well in our validation as 70% of detectable clades inferred from the GDL data were detected in long-read PacBio assemblies (Figure 1d). Undetected clades were a mix of closely related lineages that had been merged into a single cluster and errors in clustering (see Methods). Our method performed well on simulated datasets under a wide range of clustering parameters and under various degrees of simulated sequencing error, but was most negatively affected by excessive fragmentation of elements (Supplemental Figure 2). We additionally assessed the robustness of our method to smaller sample sizes by down-sampling the GDL data and found that when sample sizes were below 50 libraries the ability to recover accurate cluster labels dropped, especially in LINE-like retrotransposons (Supplemental Figure 3a). Results Diversity and variation of TE clades in the GDL To determine whether high sequence diversity of a TE family is due to the evolution of many distinct lineages or a few highly diverged lineages, we assessed the sequence diversity of TE families and the number of clades segregating in the GDL. We calculated the average nucleotide diversity across the GDL of active TE families and found a positive relationship between the number of clades and nucleotide diversity (Figure 2a; Pearson’s r = 0.74; p-value < 0.05). Both the telomeric TEs and LTR retrotransposons have several families with a high number of clades and high sequence diversity. This observation conflicts with previous reports of LTR retrotransposons being typically less diverse than other classes of transposons because they are younger and had recently invaded D. melanogaster (Bergman and Bensasson 2007). 35 Our higher estimates of sequence diversity may be a result of our sampling population-level data as opposed to previous studies that were limited to a single reference genome. It should also be noted that the LTR families Zam, Gypsy and Gypsy1 are the most diverse, have the highest number of clades and are thought to be older than other LTR families in our list (Kofler et al. 2012; Kelleher and Barbash 2013; Kofler et al. 2015). Due to their age, their high diversity likely reflects inactive and degenerate lineages. However, nucleotide diversity is also a product of the mutational processes of a TE, and differences in sequence diversity between classes may also be driven by differences in the rate of mutation. Figure 2. Summary statistics of TE clades inferred from GDL short-reads. (a) Average nucleotide diversity for each TE family vs. the number of clades inferred for that family, colored by TE class. (b) The number of phase-informative SNPs that compose each clade inferred for every TE family. Each point represents a clade and is colored by TE class. (c) Histogram of the clade population frequency of all clades from 41 recently active TE families. (d) Boxplots of the clade population frequency of all clades separated by TE family. Each point represents a clade and is colored by TE class. (a) (b) (c) (d) 36 Besides the LTR retrotransposons, there are also several LINE-like retrotransposon and DNA transposon families with high sequence diversity, but they tend to have fewer clades than non-LINE retrotransposons with similar sequence diversity. For example, the R1 family has a high sequence diversity (𝜋 ≈ 0.026) and only eight clades, while the LTR retrotransposon Zam has comparable sequence diversity and 94 clades. This pattern may be driven by merging SNPs into large clades rather than splitting them into many small ones, so we characterized each clade by the number of lineage-informative SNPs it contains (Figure 2b). In general, the distribution of lineage-informative SNPs is small and tightly distributed, with a median number of two and an interquartile range (IQR) of one. The small cluster size is indicative of a preference for splitting multiple related clades rather than merging them into larger clusters. Clusters of SNPs we discover may therefore not be mutually exclusive and may occur together within a subset of insertions, but with a degree of positive correlation insufficient to pass the clustering threshold. This preference for splitting would upwardly bias the number of clades that we estimate, as large clusters of SNPs with distant genealogical relationships might be broken up into many small clusters of SNPs that are closely related. The exact number of clades segregating in each TE family is affected by parameterization of the clustering as well as the degree of fragmentation of the TE in the genomes themselves. Therefore, some caution must be taken when considering the absolute numbers of clades segregating in a TE family and not relative proportions. However, the inference of clades using simulated data shows that the number of clades and quality of clustering is surprisingly robust to clustering parameters (Supplemental Figure 2). Although most TE families have clades with a small number of SNPs, R1 clades are notable outliers, with a median of 14 SNPs and IQR 19.75 and two clades with 60 and 100 SNPs each. This might be explained by the presence of two independently evolving populations of R1 elements in D. melanogaster, the hundreds of R1 insertions in the highly repetitive ribosomal DNA array, and a separate lineage of divergent elements that comprise a megabase- 37 sized satellite array (Wellauer and Dawid 1977; Roiha et al. 1981; Xiong and Eickbush 1988; Luan et al. 1993; McGurk and Barbash 2018). Divergence between these lineages likely explains the high sequence diversity. The similarity of sequences within each lineage and dissimilarity between lineages may favor their merging into a handful of large clades during clustering. We next addressed whether a single clade dominates a transposable element family in terms of copy number or if instead the clades occur at roughly equal frequency. We determined the proportion of all copies of a TE family in the GDL population belonging to a clade by calculating the “clade population frequency”, dividing the total clade copy number by the average copy number across the lineage-informative positions (e.g. position 2238, position 4204 in Figure 1) summed across the GDL (Supplemental File 1, https://github.com/is-the- biologist/TE_CladeInference). It should be noted that clade population frequency in this context is not the frequency of any insertion of a TE in the population, but the number of copies belonging to a clade out of all copies in a population. We find that most clades occur in the population at a frequency between ~10-30% (mean 17%; Figure 2c). There is a notable lack of low frequency clades, likely due to the filtering of low copy-number and low population- frequency alleles before calling clades. We found 21 clades with high population frequency ( > 40%) occurring in 13 different TE families, including telomeric TEs and several LTR and LINE- like retrotransposons. The Diver and Nomad LTR retrotransposons had the greatest number of high frequency clades (3-4 per family), but with dramatically different clade population frequency distributions. The Diver frequency distribution had many clades spread across the entire range from low to high, while the Nomad distribution was clearly split between a handful of low and high frequency clades (Figure 2d). Jockey, Doc, and DM412 are examples of TE families with Nomad-like frequency distributions, while Gypsy1, Zam, and I-element are examples of TE families with more uniform clade frequency distribution similar to Diver (Figure 2d). The Nomad- like frequency distributions may reflect a relatively fast copy-number expansion of a handful of 38 clades that outcompeted other lineages, while Diver-like distributions may reflect gradual diversification and slow increase in copy number of many clades, possibly driven by stochastic processes. One important consideration is that due to the way population frequency was calculated, clades with SNPs in commonly deleted portions of a TE may be at a high frequency despite being at a relatively low copy number. This is particularly important for LINE-like elements and DNA transposons that are frequently truncated and internally deleted. Therefore, the clade population frequency does not necessarily reflect the proportion of TE insertions in the clade, but instead the number of TE insertions that have those nucleotide sites in the given clade. High population-frequency clades in Gypsy, Zam, HeT-A2, and Tart-B1 had very low copy numbers (~1-2 copies on average; Supplemental File 1, https://github.com/is-the- biologist/TE_CladeInference), likely due to having SNPs in commonly deleted portions of their respective TE sequences. However, we found that many high population-frequency clades were at high copy number (10-40 copies on average). The high frequency clades of Jockey, DM412, and Doc were particularly striking as they are at high copy number and dominate other clades of their respective families. These clades may be at high frequency due to age, having a competitive edge over other variants, or by pure chance. The majority of clades are predicted to be young and recently active The frequency of individual TE insertions in a population, or “insertion-site frequency”, is an informative parameter in estimating the age and potential activity of TEs. While the limitations of short-read data prevent us from mapping SNPs to specific insertions and estimating insertion-site frequency, the insertion-site frequency spectrum of a TE influences the variance of its copy-number distribution across individuals. Our approach infers TE clades, which are collections of TE insertions that share a subset of SNPs, but the same idea applies -- the insertion-site frequency spectrum, and therefore copy-number distribution, of clades is also 39 expected to be related to their age. We therefore applied population genetic theory to predict the age of clades from their copy-number distributions. The copy-number distribution of a transposable element lineage in a population is related to the insertion-site frequency spectrum: 𝑉𝑛 = 𝑛 (1 − 𝑛 𝑇 ) − 𝑇σ𝑥 2 + 4 ∑ 𝐷𝑖,𝑗 i> n) and the effect of linkage disequilibrium between insertions is small (4∑Di, j ≅0), then the mean and variance of the copy-number distribution will be Poisson distributed: 𝑉𝑛 ≈ 𝑛 𝑉𝑛 𝑛 ≈ 1 An older lineage, on the other hand, will have insertions at variable frequencies (𝜎x 2 > 0) -- due to drift increasing the frequency of older insertions. This results in the variance of the copy-number distribution being less than the mean i.e. “underdispersed” relative to the Poisson expectation: 40 𝑉𝑛 ≈ 𝑛 − 𝑇σ𝑥 2 𝑉𝑛 ≈ 𝑛 − 𝑇σ𝑥 2 𝑉𝑛 < 𝑛 𝑉𝑛 𝑛 < 1 Therefore, an underdispersed copy-number distribution is indicative of a lineage with a broad insertion-site frequency spectrum, which would indicate older age and inactivity (Charlesworth and Charlesworth 1983; Langley et al. 1983; McGurk et al. 2020). This is further complicated by linkage disequilibrium between insertions (e.g. population structure). For a recently active lineage (𝜎x 2 ≅ 0) with population structure, the copy-number distribution will no longer be Poisson. Instead, the variance of the copy number will be greater than the mean i.e. “overdispersed”: 𝑉𝑛 ≈ 𝑛 + 4 ∑ 𝐷𝑖,𝑗 i 𝑛 𝑉𝑛 𝑛 > 1 It is possible that an older TE lineage (𝜎x 2 > 0) could be experiencing population structure as well, in which case whether or not the copy number distribution remains underdispersed 41 depends on how strong the linkage is between insertions. More generally, when T𝜎x 2 > 4∑Di,j the TE will be underdispersed, when T𝜎x 2 < 4∑Di,j the TE will be overdispersed and when T𝜎x 2 ≅ 4∑Di,j the TE will fit a Poisson (Charlesworth and Charlesworth 1983; Langley et al. 1983; McGurk et al. 2020). Although the theoretical expectation is that young lineages will fit a Poisson and old lineages will be underdispersed, whether these expectations are borne out in the data is dependent on the population structure and demographic history of the organism. Further complicating our expectations are the complex life histories of TEs, such as recurrent invasions or extended continuous activity, wherein ancient TE insertions would be at high frequency but recent insertions would be at low frequency, thus creating an “underdispersed” copy number distribution despite having recent transpositions. Although these caveats should not be discounted, the copy-number distributions of known active and inactive TE families do recapitulate these expectations (McGurk et al. 2020). Therefore, we analyzed the copy-number distribution of clades (as a proxy for the true lineages) by using a two-tailed dispersion test with multiple testing correction to ask whether the distributions are overdispersed, underdispersed, or fit a Poisson (Figure 3a) (Yang et al. 2009). We predict that clades with copy-number distributions that fit a Poisson or are overdispersed are younger, recently active lineages and clades that are underdispersed are older, likely inactive lineages. By using this test we are not directly assaying transpositional activity of these clades, but are inferring their age based on the theoretical expectations of their copy-number distribution. These inferences of age are imperfect due to the caveats mentioned above, so we sought additional support for our inferences of young and old clades by finding insertions belonging to these clades in the PacBio data and determining their sequence diversity, length, and genomic position. We found that insertions belonging to young clades tend to have much lower sequence diversity (𝞹 = ~0.05), are more often full length and are found less often in heterochromatin than insertions belonging to old clades (𝞹 = ~0.16) (Mann-Whitney U: p-value < 0.05; Supplemental Figure 4). These results confirm that clades with underdispersed copy- 42 number distributions are composed of older, more degenerate insertions that have accumulated in the heterochromatin and match our theoretical expectation. Although in general the Poisson fit of the copy-number distribution of clades is highly concordant with other predictors of age, not all clades that were predicted to be young or old fully conformed with expectations. Ultimately, these inferences of age are an imperfect proxy for activity and are not directly assaying the ability to transpose. Figure 3. Age of clades are inferred by their copy-number distribution. (a) Mean-variance relationship of the clade copy-number distributions for clades from all families. The copy-number distributions for each clade were tested for goodness of fit to a Poisson distribution, and then colored based on acceptance or rejection of this test: “overdispersed” (rejected, red), “dispersed” (fail to reject, yellow), or “underdispersed” (rejected, purple). (b) Clade population frequency of young (red: “dispersed'' and “overdispersed”) or old (purple: “underdispersed”) clades across all TE families. (c) Number of discovered clades per TE family that are young (red: “dispersed'' and “overdispersed”) or old (purple: “underdispersed”). (d) Boxplot of the copy- number distribution of the sole putatively active I-element clade from (c) for each GDL population. There is a significant elevation in the copy number of this clade in Beijing. Of the clades that were at a high clade population frequency (> 40%), the clades of telomeric TEs and other families such as Jockey, Copia, Nomad and DM412 had copy-number (a) (b) (c) (d) 43 distributions consistent with recent activity (dispersed or overdispersed), while the high frequency clades of Zam and Stalker-4 were classified as older lineages (underdispersed). High frequency clades in Doc and Diver, on the other hand, were a mixture of young and old clades. We found that on a