THE POPULATION GENETIC VARIATION OF INTERSPERSED AND TANDEM REPEATS 

 
A Dissertation 

Presented to the Faculty of the Graduate School 

of Cornell University 

in Partial Fulfillment of the Requirements for the Degree of 

Doctor of Philosophy 

 
by 

Iskander Said 

August 2023 


© 2023 Iskander Gaizka Said 

 
THE POPULATION GENETIC VARIATION OF INTERSPERSED AND TANDEM REPEATS 

Iskander Gaizka Said, Ph.D. 

Cornell University 2023 

 
Despite making up large and essential portions of eukaryotic genomes repeat DNA has 

largely eluded study due to limitations in alignment and assembly of Next Generation short-read 

sequencing. Standard technologies and methods are largely inadequate to study the abundance 

and variation of repeat sequences at the population scale. Therefore, I have developed and 

applied new methods and algorithms to study the population variation of interspersed and 

tandem repeats, specifically transposable elements and simple satellites. Firstly, using 

hierarchical clustering and population genetic theory I have developed a method to infer 

transposable element clades and their age, circumventing difficult problems of phasing and 

mapping. This method resulted in a discovery that host piRNA regulatory mechanisms are 

turning over to regulate newly emerging transposable element variants shedding light on an 

evolutionary arms race. Secondly, I applied k-mer counting methods to human population 

genomics data to quantify abundance and variation of simple satellites discovering undescribed 

telomeric and centromeric satellites. I then applied a mixed modelling framework to find 

associations between centromeric ancestry and simple satellite abundances which lead to the 

discovery of an expansion of a centromeric satellite copy number in a cluster of African and 

Latin American individuals with shared centromeric ancestry. Overall, this dissertation 

represents an analysis framework for the development of computational strategies to mine 

population genomics data for repeat variation. The approaches I have developed and deployed 

have allowed me to make inferences about repeat abundance and variation in large population 

genomic datasets that were previously unfeasible. 


 iii 

BIOGRAPHICAL SKETCH 

 
Iskander was born in Pasadena, California at an extremely young age unable to walk or 

speak. Thanks to the care of his parents and grandmother he was able to overcome these 

impairments and quickly gained ambulatory and verbal capabilities -- a development that his 

family would eventually regret. After eighteen years of finding ways to avoid doing schoolwork 

he decided to go to the University of California, Santa Cruz to pursue his dreams of being a lay-

about and writer. During his first year of university by happenstance he took a course on human 

genomics, which caused the first of many existential career crises. The result of this crisis was 

the switching of his major from Literature to Molecular, Cell and Development Biology and 

eventually picking up a minor in Bioinformatics. Iskander’s first foray into research began as an 

undergraduate in the lab of Professor Bill Saxton. Under the tutelage of Inna Djagaeva, the lab 

manager, he was taught everything he knows about pushing flies and molecular biology. That 

initial exposure to research really fomented the passion for hands-on learning and discovery 

that Iskander would carry through the rest of his career. Iskander’s education in evolutionary 

biology began after he graduated from university and began work as a lab technician for 

Professor Russell Corbett-Detig. There he learned population genetics and genomics and truly 

began to appreciate evolution. Through the encouragement of Russ and his peers he went to 

Cornell University for his Ph.D. in Genetics, Genomics and Development in 2018. In his first 

year he was awarded the National Science Foundation Graduate Research Fellowship fully 

funding his Ph.D. for three years and joined the labs of Andrew Clark and Daniel Barbash soon 

after. He quickly fell in love with the weird, wild and untamed world of repetitive DNA with a 

particular fascination in understanding the evolution of transposable elements and using 

bioinformatics and population genetics to do so.  

 
 iv 

 
DEDICATION 

inch by inch 

the tiny snail 

climbs mt fuji 

 
 v 

ACKNOWLEDGEMENTS 

 
Firstly, I want to thank my mother, Marta Rosales, and father, Haroon Said, for always 

supporting me and giving me the courage to move forward even when things seem bleak. I want 

to thank my brother from another mother, Alex, for always having his heart open to me and 

being the brother I never had. 

 
I would like to thank my advisors Andy Clark and Dan Barbash for being mentors in science, 

careers and life. Andy, the summer sails, and lake house lab retreats were some of my fondest 

memories of graduate school. Thank you for your haikus and nuggets of wisdom. Dan, I often 

chuckle fondly thinking of your party where Lenny ran around with his pants off and smacked 

Dan Z. in the goolies. Thank you for your sardonic humor and sharp wit.  

 
Thank you as well to my lab mates in the Clark and Barbash lab who have been incredible fonts 

of knowledge and juvenile humor. I owe many of you debts of gratitude for helping me with 

complex problems and giving me feedback when I’ve been stuck.  

 
In particular, I need to thank Mike McGurk who I spent hours with discussing Bayesian statistics 

and the finer points of vegetarian meat substitutes. Your scientific rigor and attention to detail is 

something that I attempt to emulate every time I perform an analysis. Thank you for taking the 

time all those years ago to discuss TEs with me.  

 
To Elissa Cosgrove, thank you for your massive intellect that is only dwarfed by your enormous 

heart. You have always kept an open door for me when I repeatedly needed help using the 

computational resources or deploying some statistical model. Without you I would never have 

been able to finish my thesis. And thank you for your aviations as dangerous as they might be. 


 vi 

 
Thank you to my friends. There are many of you and I may not name you all, but just know that 

you are all loved dearly. To my first DnD group in grad school: Melissa, Michelle, Roman, 

Victoria and our later additions Ian and Manisha, you have been incredible sources of joy and 

laughter during the uneasiness of grad school. Thank you for spending countless afternoons in 

my studio apartment laced with the smell of expo marker and coffee as we ate Chinese food 

and went on misadventures. You are not just my adventuring party, but also my lifelong friends.  

 
To Melissa, you are the sun, the moon and all of the stars. You are the light at daybreak. You are 

the first warn day of spring. Thank you for being with me always. For sticking with me through 

the hardest times. For being patient with me when no one else will. For sharing your life with me 

– the good and the bad. For your warmth, your laughter, and your smile. Thank you for being 

you. 

 
To everyone who helped me, I love you all. 

 
 vii 

TABLE OF CONTENTS 

 
Biographical Sketch iii 

Dedication iv 

Acknowledgements v 

Chapter 1: An introduction to the evolution of genomic repeats. 1 

Transposable element and host population dynamics as an arms race 3 

Evolutionary models of satellites 10 

Problems in repeat genomics and specific solutions 12 

Overview of dissertation 14 

Works Cited 16 

Chapter 2: Patterns of piRNA regulation in Drosophila revealed through transposable element 
clade inference 24 

Abstract 25 

Introduction 26 

New approaches 29 

Results 34 

Discussion 53 

Methods 59 

Acknowledgements 79 

Works Cited 80 

Chapter 3: The structure of simple satellite variation in the human genome and its correlation 
with centromere ancestry 87 

Abstract 88 

Introduction 89 

Results 93 

Discussion 116 

Methods 125 

Acknowledgements 135 

Works Cited 136 

Chapter 4: Conclusions and the future of repeat genomics. 143 

Works Cited 146 

 
 1 

Chapter 1: An introduction to the evolution of genomic repeats. 

 Since the discovery of DNA as the information molecule of life it has been predicted that 

the organismal genome should be primarily composed of DNA encoding essential components 

for an organism's survival and development. Advancements in genetics and evolutionary theory 

put this idea on notice, but it was not until the advent of genomics and the sequencing of the 

human genome that this theory was ultimately shown to be incorrect. It was discovered that only 

1-2% of the human genome is composed of DNA coding for proteins (Lander et al. 2001; 

Piovesan et al. 2019). The vast majority of the human genome is non-coding and although 

some non-coding DNA is essential for regulatory functions, much of it is large stretches of 

heterochromatic and transcriptionally silent repetitive DNA with no obvious function (Ecker et al. 

2012; ENCODE Project Consortium 2012; Nurk et al. 2022). The amount of repetitive DNA 

varies across species with humans having ~60% repetitive DNA, while some organisms have 

>90% repetitive DNA (SanMiguel et al. 1996; Nurk et al. 2022). Repetitive DNA has often been 

ascribed as “junk”, evolutionary detritus that accumulates over time – like waves washing up 

trash to the shore (Ohno 1972; Palazzo and Gregory 2014). The discovery that the 

overwhelming abundance of the genome is non-coding “junk” has prompted a new synthesis of 

the evolutionary theories that govern the genome. 

 The first evolutionary theories of “junk” came long before the sequencing of the human 

genome. Under the “neutral” and “nearly-neutral” theories of evolution the abundance of non-

coding DNA can be explained by the fact that most changes in the genome will have a neutral 

or nearly-neutral effect on organismal fitness (Ohta 1973; Kimura 1979; Ohta 1992). In this way 

neutral polymorphisms and repetitive DNA can accumulate and dissipate in the genome with 

little consequence for the organism. However, the proliferation of repetitive DNA may 

additionally be driven by selfishness. The selfish DNA concept proposes that a DNA need not 

be beneficial, or even neutral, to the organismal fitness for it to exist. It simply needs to be able 

to replicate and proliferate in such a way to ensure its inheritance into the subsequent 


 2 

generation (Doolittle and Sapienza 1980; Orgel and Crick 1980). From this synthesis, the 

existence of repetitive DNA can be explained as either nearly-neutral accretion or through an 

active selfish and parasitic process. Which mode governs the evolution of repeats depends on 

the class of repeat.  

 There are two major classes of repetitive DNA: interspersed repeats and tandem 

repeats. Interspersed repeats are primarily transposable elements (TEs), which are mobile 

genetic elements that replicate and disperse throughout the genome and are often 

characterized as genetic parasites. TEs are autonomous, or semi-autonomous entities, with 

domains that encode proteins necessary for their replication and internal promoters. They are 

diverse and can be subdivided as DNA or RNA elements depending on the molecular 

intermediate used during replication. DNA TEs typically encode a transposase enzyme that can 

“cut and paste” a TE sequence from one portion of the genome to another. RNA TEs instead 

replicate through a reverse transcriptase that reverse transcribes the TE mRNA intermediate 

into cDNA that is then incorporated into the host genome. TEs may be further subdivided by 

family using phylogenetic and sequence-based characteristics, and variation within families are 

denominated as subfamilies (Wells and Feschotte 2020). TEs were first discovered through 

cytological and genetic experimentation in maize and since then have been appreciated as a 

major component of the genome and their movement can make them a significant evolutionary 

force (McCLINTOCK 1950).  

 The second class of repeats are tandem repeats, arrays of a tandemly repeating motif, 

or monomer. Tandem repeats can be divided into microsatellites which are short arrays, usually 

a few hundred basepairs long, of small (1-6 basepair) monomers, or satellite arrays which have 

long (exceeding 1 Mb) arrays and may have small (simple), or long (complex) monomers. 

Repeat-length expansion of microsatellites have been implicated in numerous human diseases 

including Fragile-X syndrome and Friedreich's Ataxia (Brouwer et al. 2009; Neil et al. 2021). 

Satellite arrays of both the simple and complex type make up large stretches of the genome and 


 3 

some such as the Alpha Satellite in mammals have critical functions. Alpha Satellite is 

particularly interesting because it is an essential component of the centromere that recruits the 

kinetochore for cell division (Lee et al. 1997; Ohzeki et al. 2002; Okada et al. 2007). However, 

most satellite arrays have unknown functions, if any. Because of this the accretion of most 

satellite arrays in the genome is proposed to occur in a nearly-neutral fashion.  

 However, there is growing evidence that calls the neutrality of satellites into question. 

Satellites may be modifying the fitness of the organism in a way not directly related to a singular 

function. Constitutive heterochromatin can modify the epigenetic landscape and change gene 

expression in trans. This effect can be seen most clearly in the effect of repeat content in the Y-

chromosome of D. melanogaster modifying phenotypes and sequestering heterochromatic 

proteins (Dimitri and Pisano 1989; Berloco et al. 2014; Kelsey and Clark 2017; Delanoue et al. 

2023 Jun 12). Therefore, these trans-acting effects necessitate a reinterpretation of the 

selective forces acting on satellites. In a similar vein, TE proliferation is selfish, but a great deal 

of genomic novelty and function is attributable to TEs as well (Chuong et al. 2017; Cosby et al. 

2019). Therefore, thinking of TEs as a purely negative force is incorrect. Additionally, satellite 

arrays and TEs seem to diverge quite rapidly between species and are capable of generating 

species barriers (Bayes and Malik 2009; Jurka et al. 2011). Therefore, although repetitive 

elements may proliferate selfishly, or aggregate in a nearly neutral fashion they are fundamental 

components of the genome. Repeats are implicated in such a multitude of evolutionary 

processes that considering them as mere “junk” is inaccurate. A new synthesis of the 

evolutionary theory of repeats should incorporate the complex ways that repeats shape the 

evolution of genomes and modify our thinking to consider repeats as weird and wild biological 

entities. 

 
Transposable element and host population dynamics as an arms race 


 4 

 Transposable elements (TEs) are highly prolific, inhabiting nearly all eukaryotes and 

make up large portions of the genome: ~20% in Drosophila melanogaster, ~50% in humans, 

and up to 90% in Zea mays (SanMiguel et al. 1996; Adams et al. 2000; Nurk et al. 2022). TEs 

are considered selfish elements because their proliferation in the genome is often at the 

expense of the host’s organismal fitness. Strong effects on host fitness are best exemplified by 

P-element-induced hybrid dysgenesis in D. melanogaster. Progeny of D. melanogaster males 

containing the P-element TE mated to females lacking P-elements have rampant mobilization 

and transposition of P-elements in the germline, which leads to gonadal atrophy and sterility 

(Kidwell 1985). A nearly identical mechanism of hybrid dysgenesis is seen in the I-element TE in 

D. melanogaster as well (Olovnikov et al. 2013; Ryazansky et al. 2017).  

Although most TE insertions are deleterious, some TE insertions can be adaptive to the 

host organism. The rapid rates of transposition, relative to nucleotide mutations, make TEs an 

excellent substrate for genomic novelty and adaptation (Adrion et al. 2017; Feusier et al. 2019). 

In natural populations of D. melanogaster there is an astonishing rate of adaptive TE insertions 

under positive selection, likely fueling adaptation of the species during its migration out of Africa 

and into more temperate climates as well as pesticide resistance (Aminetzach et al. 2005; 

González et al. 2008). In an experimental example , a P-element insertion was found to be 

under positive selection in a population of D. melanogaster that had P-elements introduced as 

an invading TE (Wang et al. 2023). Beyond fuelling adaptation, TEs have become inextricably 

linked to their hosts, for example being co-opted by mammalian genomes to play integral roles 

in development, as well performing necessary functions such as maintenance of D. 

melanogaster telomeres (Cosby et al. 2019). 

 However, given that TE insertions can account for massive proportions of the genome, it 

is highly unlikely that all TE insertions are either strongly deleterious or under positive selection. 

In fact, the fitness effects of TEs are primarily modeled as weakly deleterious. That is, each 

insertion itself has a nearly neutral effect on host fitness but effects are cumulative, only strongly 


 5 

affecting the host’s fitness at a high enough copy number. The primary models of cumulative 

weakly deleterious effects are the synergistic and additive models of fitness. Under synergistic 

epistasis, the deleterious effects on host fitness increase non-linearly with the number of TE 

insertions, while the additive model is linear (Charlesworth and Charlesworth 1983; Langley et 

al. 1983; Charlesworth and Langley 1989). The synergistic model is a preferred explanation 

over the linear additive model as it is thought to most accurately model the mechanisms by 

which TE insertions affect host fitness, which are ectopic recombination and changes in 

chromatin landscapes (Langley et al. 1988; Lee and Langley 2010; Lee 2015). Additionally, 

there is strong empirical evidence of synergistic epistasis of TE insertions in D. melanogaster 

populations (Lee 2022). 

 Ectopic recombination is a mechanism by which recombination occurs between two 

heterologous loci, in this case TE insertions, leading to deleterious structural rearrangements 

(Davis et al. 1987; Kupiec and Petes 1988; Montgomery et al. 1991; Lim and Simmons 1994; 

Mieczkowski et al. 2006). The probability of an ectopic recombination event will increase with 

the square of the number of TE insertions as there is a probability of recombination between 

any pair of TE insertions (Montgomery et al. 1987; Langley et al. 1988). TE-induced changes in 

the chromatin landscape is a mechanism more recently implicated in reducing host fitness. 

Under this model, epigenetic silencing of TEs through host regulatory mechanisms can spread 

to nearby genes and reduce organismal fitness (Lee and Langley 2010; Lee 2015; Choi and Lee 

2020). These two mechanisms are not mutually exclusive and there is evidence that both play a 

significant role.  

Synergistic epistasis paired with intergenerational transposition of TEs leads to an 

equilibrium “steady-state” of TE copy number known as “transposition-selection” balance, where 

TE insertions are kept at an equilibrium by selection (Charlesworth and Charlesworth 1983; 

Langley et al. 1983; Charlesworth and Langley 1989). For selection to control the copy number 

of TEs, transposition rate must be sufficiently low to not outpace the purging via selection 


 6 

(Charlesworth and Langley 1989). However, TEs with transposition rates low enough to be 

balanced by selection are very unlikely to overcome drift when first invading a population (Le 

Rouzic and Capy 2005; Le Rouzic et al. 2007; Groth and Blumenstiel 2017). Therefore, recent 

models propose that TEs must have a high initial transposition rate, that then decreases at 

equilibrium, likely through host regulation, allowing for a “steady-state” and preservation of the 

host population (Le Rouzic and Capy 2005; Groth and Blumenstiel 2017; Kofler 2019; Tomar et 

al. 2023). This has been observed in natural populations of Drosophila simulans, where P-

elements were introduced into naive populations resulting in a burst of transposition activity 

followed by copy-number equilibrium once host regulation was achieved (Kofler et al. 2018; 

Kofler et al. 2022).  

 Host regulation of TEs in animals is primarily through the piwi-associated small RNA 

(piRNA) pathway. piRNAs are small non-coding RNAs that function as an adaptive immune 

system for TEs. These small RNAs are produced from a long non-coding RNA precursor that is 

transcribed from a highly repetitive genomic locus filled with TE sequences, called a piRNA 

cluster (Aravin et al. 2007; Brennecke et al. 2007). This long precursor transcript is sequentially 

cleaved into small (21-30 bp) RNAs that complex with ago and piwi clade proteins to form a 

silencing complex. The piRNA silencing functions through a complementarity mechanism 

between the piRNA and the TE transcript. Once paired, the TE-piRNA complex will signal to 

other members of the piwi pathway to degrade the nascent transcript and process it into piRNAs 

that form a positive feedback loop by associating with the long non-coding piRNA precursor to 

increase production of piRNAs and therefore establish constitutive silencing. This process is 

called “ping-pong” amplification due to the recursive process of piRNA generation (Czech et al. 

2018). TE repression is a critical mechanism underlying the host’s fitness and is even implicated 

in aspects of human health. There is a growing body of evidence that de-repression of TEs in 

aging somatic tissue can lead to increased inflammation and other age-associated phenotypes 

(Gorbunova et al. 2021; Rigal et al. 2022).  Additionally, de-repression of TEs in cancerous cells 


 7 

is quite common, which may lead to further genomic instability and cancer progression (Clayton 

et al. 2016; Kong et al. 2019). But, it is unknown whether the de-repression of TEs in these 

examples are driven by defects in the piRNA system or other failures in the 

heterochromatization machinery. 

 There are several competing hypotheses about how piRNA clusters are formed in the 

genome. There is the “trap” model wherein as TEs invade in a population, by chance, one will 

insert in a piRNA cluster. When this happens the host genome now has acquired immunity to 

this TE and the immunity will spread in the population, the TE invasion will be quelled, and copy 

number will stabilize (Khurana et al. 2011; Kofler et al. 2018). Evidence for this model is 

attributed to the presence of large piRNA clusters found in the D. melanogaster genome full of 

ancient and fragmented TEs, such as the flamenco locus, suggesting the recurrent capture of 

TE sequences within piRNA clusters (Brennecke et al. 2007; Mével-Ninio et al. 2007). There is 

also evidence in natural populations of insertions of P-elements in large ancestral piRNA 

clusters during a recent invasion (Zhang et al. 2020). Additionally, large piRNA clusters tend to 

be enriched for young, recently active TEs, further evidence for the “trap” model of piRNA 

clusters (Srivastav et al. 2023 May 10). The sequence composition of the piRNA clusters seems 

to be highly polymorphic as well, suggesting a rapid turnover of sequence as new TEs invade 

and become trapped (Zanni et al. 2013; Zhang et al. 2020). 

 The other competing model is the de novo model. Under this model, piRNAs are not 

primarily sourced from large permanent piRNA clusters, but instead from smaller dynamic 

piRNA clusters dispersed throughout the genome. There is significant evidence that 

euchromatic TE insertions can be induced to produce piRNAs as de novo piRNA clusters 

through increased TE activity, and that this piRNA cluster licensing is stably inheritable (de 

Vanssay et al. 2012; Olovnikov et al. 2013; Shpiz et al. 2014). Furthermore, piRNA clusters are 

evolutionarily labile and divergent, with some of the largest “trap” heterochromatic piRNA 

clusters being completely dispensable for TE regulation (Gebert et al. 2021). Overall, it is likely 


 8 

that piRNA clusters form through mechanisms similar to both the “trap” model and de novo 

model (Srivastav et al. 2023 May 10). However, there is emerging evidence of an alternative 

model, the “crank-up” model that suggests that it is not the formation of piRNA clusters that is 

important, but rather an siRNA-induced activation of the “ping-pong” cycle that established host 

control over TE expression (Selvaraju et al. 2022). 

 Regardless of the exact model of piRNA cluster formation, piRNAs play an important 

role in the evolutionary dynamics of TEs and modern models of TE population genetics reflect 

this. Agent-based simulations of piRNA and TE dynamics have revealed that piRNA clusters 

can rapidly control TE invasions if they are sufficiently large to trap TE sequences, 

approximately ~0.2-3% of the genome depending on the population size and dominance of 

repression (Kofler 2020). A unified analytical model of TE-piRNA population dynamics was 

recently proposed, which satisfactorily describes the equilibrium states of this new 

“transposition-selection-regulation” balance. Importantly, the initial high transposition rates have 

no effect on the final TE copy number equilibrium, which has been recapitulated in an empirical 

study of P-element invasions of natural populations (Kofler et al. 2022; Tomar et al. 2023). The 

pairing of empirical data and population-genetic models has revealed an evolutionary arms race 

between TE invaders and the host’s piRNA adaptive immune system (Luo et al. 2020). 

These arms race models clearly describe how host genomes adapt to the invasion of 

TEs by limiting their transposition, but there is little description on how TEs might evolve to 

adapt to their hosts. TEs, much like other replicative entities, are subject to natural selection. 

But for TEs their fitness is linked to their ability to transpose and any mutation that increases 

TEs transposition would increase their fitness (Brunet and Doolittle 2015; Luo et al. 2020). 

Although population genetic models of TE sequence evolution have described the evolution and 

competition of derived TE subfamilies, these models tend to assume most mutations are either 

deleterious or neutral (Le Rouzic and Capy 2006; Kijima and Innan 2013; Iwasaki et al. 2020). 

However, positive selection to escape from host repression mechanisms may be playing a role 


 9 

in driving a TE to adapt. For example the CACTA TE in Oryza sativa produces a micro RNA that 

reduces the expression of methylation machinery, effectively de-repressing it (Nosaka et al. 

2013). But, it is possible that sophisticated mechanisms such as these are not necessary to 

escape silencing. 

When considering the piRNA pathway, it is possible that divergence of the TE from the 

piRNA sequences may be adaptive. That is because piRNAs must base-pair with the TE 

transcript and if that base-pairing is disrupted then that TE may escape repression. How this 

would function mechanistically is unclear as the specificity of piRNAs seems to vary across 

species. In Caenorhabditis elegans, disrupting the base pairing of a piRNA’s highly specific 

“seed” sequence will de-repress a silenced target, and even outside the “seed” no more than 

three mismatches are tolerated (Zhang et al. 2018). The piRNA pathway in D. melanogaster, on 

the other hand, seems more robust to mismatches as silencing of TEs is established as long as 

piRNA sequences have a 90% or greater sequence complementarity (Kotov et al. 2019). 

However, full de-repression of a TE subfamily may not be necessary as small gains in 

transposition relative to other subfamilies could give it a sufficient edge to outcompete its 

brethren (Le Rouzic and Capy 2006). There already exists evidence of variation in LINE1 

transposition rates between subfamilies, emphasizing how TE subfamilies may be under 

selection to increase their transposition (Lutz et al. 2003; Seleme et al. 2006). However, it is 

unclear whether these differences reflect escape from host regulation. 

The population genetic and evolutionary models of TEs have matured considerably over 

the course of their first descriptions (Charlesworth and Charlesworth 1983; Langley et al. 1983; 

Charlesworth and Langley 1989). From the first models describing “transposition-selection” 

balance to the discovery and integration of host regulatory piRNAs, these fundamental theories 

have been instrumental in understanding the evolution of TEs (Kofler 2019; Tomar et al. 2023). 

And while evolutionary models of TE sequence evolution and subfamily competition have 

provided thorough descriptions of subfamily dynamics, there is still a significant gap in the 


 10 

population genetic theory of the TE-host arms race (Le Rouzic and Capy 2006; Kijima and 

Innan 2013; Iwasaki et al. 2020). TE sequence evolution within a population has yet to be put 

into context with the emergence of piRNA host regulation in a population model. The challenges 

to describing this model are extensive, as it is unclear how the piRNA system acts as a selective 

force on TE sequences and which model of piRNA silencing best describes how TEs are 

regulated, e.g. “trap” model, de novo, or “crank-up”. Ultimately, significant advancements must 

be made on the empirical study of the population variation and evolution of TEs and piRNAs to 

be able to construct a robust theoretical model of the TE-host arms race. 

 
Evolutionary models of satellites 

 Microsatellites typically change in copy number through replication-mediated repeat 

expansions and contractions, such as polymerase slippage (Khristich and Mirkin 2020). These 

changes in copy-number have been modeled in a stepwise fashion with neutral effects on host 

fitness (Ohta and Kimura 1973; Di Rienzo et al. 1994; Sainudiin et al. 2004). Recent 

appreciation that microsatellite copy number variation increases risk for several diseases and 

disorders such as gout and autism, as well as affecting gene expression, has led to models of 

selection on microsatellites (Rockman and Wray 2002; Mitra et al. 2021; Mukamel et al. 2021). 

Selection models on microsatellites must take into account that copy number changes can be 

frequent and recurrent and therefore standard models of selection do not apply. Instead, 

microsatellite copy number is modeled under “mutation-selection-drift” equilibrium where copy 

number oscillates at a fitness optimum (Haasl and Payseur 2013). These models applied to 

human population genomic datasets found that the aggregate fitness burden of microsatellites is 

much higher than that of single nucleotide polymorphisms (Huang et al. 2022).  

 In contrast, mutational models of long satellite arrays consider changes in copy number 

as largely driven by unequal meiotic exchange during recombination rather than replication 

slippage (Smith 1976; Perelson and Bell 1977). During unequal exchange sister chromatids 


 11 

containing highly repetitive satellite arrays can mispair and cross over, producing new sister 

chromatids with different sized satellite arrays. The combination of unequal exchange with 

stabilizing selection on the satellite array size is predicted to maintain homogeneous satellite 

arrays, and experiments have shown that reducing selective constraints increases satellite size 

variability (Kimura and Crow 1964; Flynn et al. 2017). The birth of satellite arrays from single 

copy sequence is predicted to occur from any locus where selection is relaxed (Smith 1974; 

Smith 1976). The initial amplification may occur through unequal exchange, double-stranded 

break repair, replication slippage, TE insertion or other undescribed mechanisms (Liao et al. 

1997; Dover 2002). The extinction of the satellite array may also occur through unequal 

exchange and intra-chromatid recombination amplified by genetic drift (Charlesworth et al. 

1986; Walsh 1987).  

 Although it is predicted that satellite accumulation should be largely neutral with 

increasing array sizes being constrained by stabilizing selection, some satellites may be 

positively affecting host fitness. Pericentric heterochromatin seems to have a conserved 

function in the formation of “chromocenters'', an essential structure in the aggregation of the 

genome in the nucleus, and heterochromatic repeat content can modify the epigenome (Kelsey 

and Clark 2017; Jagannathan et al. 2018; Delanoue et al. 2023 Jun 12). These results suggest 

that although specific satellites may not have unique functions, satellite DNA in bulk plays an 

important role in the genome. Therefore, modeling satellite accumulation as a largely neutral 

process may not be accurate. There are empirical studies of within-species variation that reveal 

high variability in satellite array presence and copy number, as well as heterogeneous lineage-

specific satellite copy-number expansion in human Y-haplogroups and inbred mouse lines 

suggesting that the mutational and selective processes affecting satellite array size are not 

constant (Altemose et al. 2014; Wei et al. 2014; Flynn et al. 2018; Flynn et al. 2021). However, it 

is unclear whether these rapid rates of expansion and contraction match expected gains and 

losses given genetic drift and unequal exchange under a neutral model. Therefore, significant 


 12 

advancements must be made in the population-level study of satellite arrays and the 

evolutionary models to better understand the biological processes influencing their rapid 

evolution.  

 
Problems in repeat genomics and specific solutions 

 In the previous sections I have outlined significant gaps in the field’s understanding of 

the population genetic and evolutionary models of repeats. These gaps are driven by major 

limitations of modern genomic technologies. Modern genomics has been revolutionized by the 

ease and cost effectiveness of Illumina based Next Generation Sequencing (NGS). These NGS 

technologies have made sequencing genomes incredibly cheap and easy, but are 

fundamentally ill equipped for handling repeat sequences. This is primarily because NGS 

sequencers generate “reads”, strings of typically 75 or 150 basepairs from input DNA 

representing small fragments of the genome (Bentley et al. 2008). In a typical genomics 

workflow, these reads are aligned to an organismal reference genome to find reads that 

uniquely map to genomic loci. This workflow poses two major problems to the study of repeats. 

Firstly, the reference genome may not contain highly repetitive loci for reads to be aligned to as 

repeat-rich regions tend to be absent from genome assemblies. In the case of the human 

genome the centromeres and acrocentric arms of the human genome were completely missing 

until very recently (Nurk et al. 2022). Secondly,reads derived from repetitive DNA are non-

unique and will thus map to multiple loci, making it impossible to know which loci they were 

derived from. There can also be biases in k-mer content between library preparations and 

sequencing platform and GC-amplification bias, all of which must be accounted for in 

downstream analyses. 

 Alternative technologies to Illumina sequencers, such as PacBio or Oxford Nanopore 

can however circumvent these problems. These two technologies generate long-reads, an 

average of 30 kb long and up to 2.3 Mb in the most extreme cases (Amarasinghe et al. 2020). 


 13 

This allows the sequencer to read through long satellite arrays and provides fully haplotype-

resolved TE insertions. The combination of these technologies with proximity-ligation 

sequencing or optical mapping are capable of producing highly contiguous genome assemblies 

with fully assembled centromeres and telomeres (Nurk et al. 2022). However, long-read 

technologies are not without their problems as they can severely under-represent the 

abundance of simple satellites  (Flynn et al. 2020). Additionally, long-read sequencing is much 

more expensive than short-read sequencing and is currently unfeasible at the scale necessary 

for population genetic inference.  

 Currently, most population-scale genomics datasets are sequenced using NGS 

technologies and several methodologies have been developed to mine these datasets for 

specific types of repeats and solve specific problems. For TEs the “McClintock” suite of tools is 

a highly vetted meta-pipeline containing several packages for the detection of TEs in short-read 

data (Nelson et al. 2017). Within this suite there are specialized tools including for detection of 

TE insertions in pooled population sequencing and detection of non-reference TE insertions 

(Kofler et al. 2016; Yu et al. 2021). TE detection methods tend to rely on the fact that TE 

insertions are flanked by unique genomic sequences, therefore it is possible to detect 

polymorphic TE insertions by examining discordant read-pairs. There are several high quality 

tools to detect microsatellites (short tandem repeats, or variable number tandem repeats) in 

population-level NGS data, including the “-STR” collection of tools, adVNTR, and others 

(Bakhtiari et al. 2018; Mousavi et al. 2019; Mitra et al. 2021; Mousavi et al. 2021). These tools 

are similar to TE detection tools in that they use the unique genomic sequence flanking the 

microsatellite to localize the repeat locus, but are also able to use statistical models to estimate 

the repeat unit length (Mousavi et al. 2019). Long satellite arrays and TE insertions nested 

within repeats are much more complicated to analyze. Tools such as k-Seek and TAREAN 

examine unassembled short-reads to look for tandemly repeating k-mers and can generate 

consensus sequences and estimate abundances of satellites, while ConTExt can discover 


 14 

tandem arrays of TEs using mixture modeling (Wei et al. 2014; Novák et al. 2017; McGurk and 

Barbash 2018). 

Each toolset seeks to solve a specific problem for a specific type of repeat and because 

of the great diversity of repeats there is a great diversity of tools. If the ultimate goal is to 

analyze population-scale genomic datasets to construct evolutionary and population genetic 

models then we must find new ways to analyze repeats at scale. And depending on the type of 

repeat and the specific mode of evolution of the repeat, e.g., sequence, copy number, location, 

different methodologies and frameworks must be developed then applied. Eventually, long-read 

genomics may become affordable enough that NGS repeat analysis toolkits will no longer be 

necessary, but even so similar statistical principles will need to be applied to account for biases 

and errors in long-read sequencing. 

 
Overview of dissertation 

 The broad aims of this dissertation are firstly to provide descriptions of the population 

variation of repetitive elements, and secondly to leverage population scale data to contribute to 

the advancement of the evolutionary models of repeats. I aim to do this by first developing new 

methodologies for studying repeats using existing population scale NGS data. And then by 

using the results of these new methodologies to make inferences on the evolution of repetitive 

sequences within populations.  

In Chapter 2, I develop a specialized method for inferring TE clades from population-

scale NGS data to study the sequence variation of TEs in populations. This approach was 

successfully deployed to a large population genomics panel of D. melanogaster strains, finding 

considerable population specific variation. The clades are sets of sequence polymorphisms in 

the TE sequence that are strongly correlated and represent a statistical inference of underlying 

lineages of TEs. I next applied population genetic theory to the copy number distributions of 

clades to infer their putative age, and by integrating piRNA data found that the sequences of 


 15 

younger TE clades are more often silenced and used as piRNA generating loci than older 

clades. This tool has revealed important features of the evolutionary dynamics of the TE-host 

arms race. From these results, I propose a model that TEs used to generate silencing piRNAs 

turn over as newer, younger and active derived TE lineages emerge in the population. In this 

way the host can maintain silencing of TE lineages that are most active and not expend 

resources on silencing older and inactive TE lineages.  

In Chapter 3, I focus on the population variation of satellite repeats and apply an 

existing method to mine tandem k-mer content, k-Seek, to 2,504 human NGS libraries (Wei et 

al. 2014). From these data I describe the population variation of simple satellite repeats, finding 

some modest African-specific population structure and the presence of rare satellite expansions 

within single individuals. Next, I find that AT- and AG-rich satellites are more interspersed than 

random chance and tend to covary across individuals, suggesting a concerted amplification of 

higher order structures within the genome. Lastly, I compute genetic relatedness between 

individuals at the centromeres and find that centromeric ancestries are predictive of centromeric 

simple satellite copy number. In particular, I find an expansion of Human Satellite 2 copy 

number within a cluster of African centromeric ancestry that is also found in admixed Puerto 

Rican and Colombian individuals. By combining satellite mining techniques and large scale 

polymorphism data I show that there is structured variation of centromeres and satellites in 

human populations. These results present a framework for analyzing satellite evolution and 

variation from short-read data and provide descriptions of population patterns necessary for 

constructing better models of repeat evolution. 

 Chapter 4 discusses the future direction of the evolutionary models of repeats and the 

future of using NGS data for empirical study of repeats in populations. 

 
 16 

Works Cited 

Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li 
PW, Hoskins RA, Galle RF, et al. 2000. The genome sequence of Drosophila melanogaster. 
Science. 287(5461):2185–2195. 

Adrion JR, Song MJ, Schrider DR, Hahn MW, Schaack S. 2017. Genome-Wide Estimates of 
Transposable Element Insertion and Deletion Rates in Drosophila Melanogaster. Genome Biol 
Evol. 9(5):1329–1340. 

Altemose N, Miga KH, Maggioni M, Willard HF. 2014. Genomic characterization of large 
heterochromatic gaps in the human genome assembly. PLoS Comput Biol. 10(5):e1003628. 

Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. 2020. Opportunities and 
challenges in long-read sequencing data analysis. Genome Biol. 21(1):30. 

Aminetzach YT, Macpherson JM, Petrov DA. 2005. Pesticide resistance via transposition-
mediated adaptive gene truncation in Drosophila. Science. 309(5735):764–767. 

Aravin AA, Hannon GJ, Brennecke J. 2007. The Piwi-piRNA pathway provides an adaptive 
defense in the transposon arms race. Science. 318(5851):761–764. 

Bakhtiari M, Shleizer-Burko S, Gymrek M, Bansal V, Bafna V. 2018. Targeted genotyping of 
variable number tandem repeats with adVNTR. Genome Res. 28(11):1709–1719. 

Bayes JJ, Malik HS. 2009. Altered heterochromatin binding by a hybrid sterility protein in 
Drosophila sibling species. Science. 326(5959):1538–1541. 

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers 
DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using 
reversible terminator chemistry. Nature. 456(7218):53–59. 

Berloco M, Palumbo G, Piacentini L, Pimpinelli S, Fanti L. 2014. Position effect variegation and 
viability are both sensitive to dosage of constitutive heterochromatin in Drosophila. G3 . 
4(9):1709–1716. 

Brennecke J, Aravin AA, Stark A, Dus M, Kellis M, Sachidanandam R, Hannon GJ. 2007. 
Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. 
Cell. 128(6):1089–1103. 

Brouwer JR, Willemsen R, Oostra BA. 2009. Microsatellite repeat instability and neurological 
disease. Bioessays. 31(1):71–83. 

Brunet TDP, Doolittle WF. 2015. Multilevel Selection Theory and the Evolutionary Functions of 
Transposable Elements. Genome Biol Evol. 7(8):2445–2457. 

Charlesworth B, Charlesworth D. 1983. The population dynamics of transposable elements. 
Genet Res . 42(1):1–27. 

Charlesworth B, Langley CH. 1989. The population genetics of Drosophila transposable 
elements. Annu Rev Genet. 23:251–287. 


 17 

Charlesworth B, Langley CH, Stephan W. 1986. The evolution of restricted recombination and 
the accumulation of repeated DNA sequences. Genetics. 112(4):947–962. 

Choi JY, Lee YCG. 2020. Double-edged sword: The evolutionary consequences of the 
epigenetic silencing of transposable elements. PLoS Genet. 16(7):e1008872. 

Chuong EB, Elde NC, Feschotte C. 2017. Regulatory activities of transposable elements: from 
conflicts to benefits. Nat Rev Genet. 18(2):71–86. 

Clayton EA, Wang L, Rishishwar L, Wang J, McDonald JF, Jordan IK. 2016. Patterns of 
Transposable Element Expression and Insertion in Cancer. Front Mol Biosci. 3:76. 

Cosby RL, Chang N-C, Feschotte C. 2019. Host-transposon interactions: conflict, cooperation, 
and cooption. Genes Dev. 33(17-18):1098–1116. 

Czech B, Munafò M, Ciabrelli F, Eastwood EL, Fabry MH, Kneuss E, Hannon GJ. 2018. piRNA-
Guided Genome Defense: From Biogenesis to Silencing. Annu Rev Genet. 52:131–157. 

Davis PS, Shen MW, Judd BH. 1987. Asymmetrical pairings of transposons in and proximal to 
the white locus of Drosophila account for four classes of regularly occurring exchange products. 
Proc Natl Acad Sci U S A. 84(1):174–178. 

Delanoue R, Clot C, Leray C, Pihl T, Hudry B. 2023 Jun 12. Y chromosome toxicity does not 
contribute to sex-specific differences in longevity. Nat Ecol Evol. doi:10.1038/s41559-023-
02089-7. http://dx.doi.org/10.1038/s41559-023-02089-7. 

Dimitri P, Pisano C. 1989. Position effect variegation in Drosophila melanogaster: relationship 
between suppression effect and the amount of Y chromosome. Genetics. 122(4):793–800. 

Di Rienzo A, Peterson AC, Garza JC, Valdes AM, Slatkin M, Freimer NB. 1994. Mutational 
processes of simple-sequence repeat loci in human populations. Proc Natl Acad Sci U S A. 
91(8):3166–3170. 

Doolittle WF, Sapienza C. 1980. Selfish genes, the phenotype paradigm and genome evolution. 
Nature. 284(5757):601–603. 

Dover G. 2002. Molecular drive. Trends Genet. 18(11):587–589. 

Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. 2012. Genomics: ENCODE 
explained. Nature. 489(7414):52–55. 

ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human 
genome. Nature. 489(7414):57–74. 

Feusier J, Watkins WS, Thomas J, Farrell A, Witherspoon DJ, Baird L, Ha H, Xing J, Jorde LB. 
2019. Pedigree-based estimation of human mobile element retrotransposition rates. Genome 
Res. 29(10):1567–1577. 

Flynn JM, Brown EJ, Clark AG. 2021. Copy number evolution in simple and complex tandem 
repeats across the C57BL/6 and C57BL/10 inbred mouse lines. G3 . 11(8). 
doi:10.1093/g3journal/jkab184. http://dx.doi.org/10.1093/g3journal/jkab184. 


 18 

Flynn JM, Caldas I, Cristescu ME, Clark AG. 2017. Selection Constrains High Rates of Tandem 
Repetitive DNA Mutation in Daphnia pulex. Genetics. 207(2):697–710. 

Flynn JM, Long M, Wing RA, Clark AG. 2020. Evolutionary Dynamics of Abundant 7-bp 
Satellites in the Genome of Drosophila virilis. Mol Biol Evol. 37(5):1362–1375. 

Flynn JM, Lower SE, Barbash DA, Clark AG. 2018. Rates and Patterns of Mutation in Tandem 
Repetitive DNA in Six Independent Lineages of Chlamydomonas reinhardtii. Genome Biol Evol. 
10(7):1673–1686. 

Gebert D, Neubert LK, Lloyd C, Gui J, Lehmann R, Teixeira FK. 2021. Large Drosophila 
germline piRNA clusters are evolutionarily labile and dispensable for transposon regulation. Mol 
Cell. 81(19):3965–3978.e5. 

González J, Lenkov K, Lipatov M, Macpherson JM, Petrov DA. 2008. High rate of recent 
transposable element-induced adaptation in Drosophila melanogaster. PLoS Biol. 6(10):e251. 

Gorbunova V, Seluanov A, Mita P, McKerrow W, Fenyö D, Boeke JD, Linker SB, Gage FH, 
Kreiling JA, Petrashen AP, et al. 2021. The role of retrotransposable elements in ageing and 
age-associated diseases. Nature. 596(7870):43–53. 

Groth SB, Blumenstiel JP. 2017. Horizontal Transfer Can Drive a Greater Transposable 
Element Load in Large Populations. J Hered. 108(1):36–44. 

Haasl RJ, Payseur BA. 2013. Microsatellites as targets of natural selection. Mol Biol Evol. 
30(2):285–298. 

Huang B, Durvasula A, Mousavi N, Ziaei-Jam H, Maksimov M, Lohmueller KE, Gymrek M. 
2022. Genome-wide selection inference at short tandem repeats. bioRxiv.:2022.05.12.491726. 
doi:10.1101/2022.05.12.491726. [accessed 2023 May 31]. 
https://www.biorxiv.org/content/10.1101/2022.05.12.491726v1. 

Iwasaki WM, Kijima TE, Innan H. 2020. Population Genetics and Molecular Evolution of DNA 
Sequences in Transposable Elements. II. Accumulation of Variation and Evolution of a New 
Subfamily. Mol Biol Evol. 37(2):355–364. 

Jagannathan M, Cummings R, Yamashita YM. 2018. A conserved function for pericentromeric 
satellite DNA. Elife. 7. doi:10.7554/eLife.34122. http://dx.doi.org/10.7554/eLife.34122. 

Jurka J, Bao W, Kojima KK. 2011. Families of transposable elements, population structure and 
the origin of species. Biol Direct. 6:44. 

Kelsey KJP, Clark AG. 2017. Variation in Position Effect Variegation Within a Natural 
Population. Genetics. 207(3):1157–1166. 

Khristich AN, Mirkin SM. 2020. On the wrong DNA track: Molecular mechanisms of repeat-
mediated genome instability. J Biol Chem. 295(13):4134–4170. 

Khurana JS, Wang J, Xu J, Koppetsch BS, Thomson TC, Nowosielska A, Li C, Zamore PD, 
Weng Z, Theurkauf WE. 2011. Adaptation to P element transposon invasion in Drosophila 
melanogaster. Cell. 147(7):1551–1563. 


 19 

Kidwell MG. 1985. Hybrid dysgenesis in Drosophila melanogaster: nature and inheritance of P 
element regulation. Genetics. 111(2):337–350. 

Kijima TE, Innan H. 2013. Population genetics and molecular evolution of DNA sequences in 
transposable elements. I. A simulation framework. Genetics. 195(3):957–967. 

Kimura M. 1979. The neutral theory of molecular evolution. Sci Am. 241(5):98–100, 102, 108 
passim. 

Kimura M, Crow JF. 1964. THE NUMBER OF ALLELES THAT CAN BE MAINTAINED IN A 
FINITE POPULATION. Genetics. 49(4):725–738. 

Kofler R. 2019. Dynamics of Transposable Element Invasions with piRNA Clusters. Mol Biol 
Evol. 36(7):1457–1472. 

Kofler R. 2020. piRNA Clusters Need a Minimum Size to Control Transposable Element 
Invasions. Genome Biol Evol. 12(5):736–749. 

Kofler R, Gómez-Sánchez D, Schlötterer C. 2016. PoPoolationTE2: Comparative Population 
Genomics of Transposable Elements Using Pool-Seq. Mol Biol Evol. 33(10):2759–2764. 

Kofler R, Nolte V, Schlötterer C. 2022. The Transposition Rate Has Little Influence on the 
Plateauing Level of the P-element. Mol Biol Evol. 39(7). doi:10.1093/molbev/msac141. 
http://dx.doi.org/10.1093/molbev/msac141. 

Kofler R, Senti K-A, Nolte V, Tobler R, Schlötterer C. 2018. Molecular dissection of a natural 
transposable element invasion. Genome Res. 28(6):824–835. 

Kong Y, Rose CM, Cass AA, Williams AG, Darwish M, Lianoglou S, Haverty PM, Tong A-J, 
Blanchette C, Albert ML, et al. 2019. Transposable element expression in tumors is associated 
with immune infiltration and increased antigenicity. Nat Commun. 10(1):5228. 

Kotov AA, Adashev VE, Godneeva BK, Ninova M, Shatskikh AS, Bazylev SS, Aravin AA, 
Olenina LV. 2019. piRNA silencing contributes to interspecies hybrid sterility and reproductive 
isolation in Drosophila melanogaster. Nucleic Acids Res. 47(8):4255–4271. 

Kupiec M, Petes TD. 1988. Allelic and ectopic recombination between Ty elements in yeast. 
Genetics. 119(3):549–559. 

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, 
FitzHugh W, et al. 2001. Initial sequencing and analysis of the human genome. Nature. 
409(6822):860–921. 

Langley CH, Brookfield JF, Kaplan N. 1983. Transposable elements in mendelian populations. I. 
A theory. Genetics. 104(3):457–471. 

Langley CH, Montgomery E, Hudson R, Kaplan N, Charlesworth B. 1988. On the role of 
unequal exchange in the containment of transposable element copy number. Genet Res. 
52(3):223–235. 

Lee C, Wevrick R, Fisher RB, Ferguson-Smith MA, Lin CC. 1997. Human centromeric DNAs. 
Hum Genet. 100(3-4):291–304. 


 20 

Lee YCG. 2015. The Role of piRNA-Mediated Epigenetic Silencing in the Population Dynamics 
of Transposable Elements in Drosophila melanogaster. PLoS Genet. 11(6):e1005269. 

Lee YCG. 2022. Synergistic epistasis of the deleterious effects of transposable elements. 
Genetics. 220(2). doi:10.1093/genetics/iyab211. http://dx.doi.org/10.1093/genetics/iyab211. 

Lee YCG, Langley CH. 2010. Transposable elements in natural populations of Drosophila 
melanogaster. Philos Trans R Soc Lond B Biol Sci. 365(1544):1219–1228. 

Le Rouzic A, Boutin TS, Capy P. 2007. Long-term evolution of transposable elements. Proc Natl 
Acad Sci U S A. 104(49):19375–19380. 

Le Rouzic A, Capy P. 2005. The first steps of transposable elements invasion: parasitic strategy 
vs. genetic drift. Genetics. 169(2):1033–1043. 

Le Rouzic A, Capy P. 2006. Population genetics models of competition between transposable 
element subfamilies. Genetics. 174(2):785–793. 

Liao D, Pavelitz T, Kidd JR, Kidd KK, Weiner AM. 1997. Concerted evolution of the tandemly 
repeated genes encoding human U2 snRNA (the RNU2 locus) involves rapid intrachromosomal 
homogenization and rare interchromosomal gene conversion. EMBO J. 16(3):588–598. 

Lim JK, Simmons MJ. 1994. Gross chromosome rearrangements mediated by transposable 
elements in Drosophila melanogaster. Bioessays. 16(4):269–275. 

Luo S, Zhang H, Duan Y, Yao X, Clark AG, Lu J. 2020. The evolutionary arms race between 
transposable elements and piRNAs in Drosophila melanogaster. BMC Evol Biol. 20(1):14. 

Lutz SM, Vincent BJ, Kazazian HH Jr, Batzer MA, Moran JV. 2003. Allelic heterogeneity in 
LINE-1 retrotransposition activity. Am J Hum Genet. 73(6):1431–1437. 

McCLINTOCK B. 1950. The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U 
S A. 36(6):344–355. 

McGurk MP, Barbash DA. 2018. Double insertion of transposable elements provides a substrate 
for the evolution of satellite DNA. Genome Res. 28(5):714–725. 

Mével-Ninio M, Pelisson A, Kinder J, Campos AR, Bucheton A. 2007. The flamenco locus 
controls the gypsy and ZAM retroviruses and is required for Drosophila oogenesis. Genetics. 
175(4):1615–1624. 

Mieczkowski PA, Lemoine FJ, Petes TD. 2006. Recombination between retrotransposons as a 
source of chromosome rearrangements in the yeast Saccharomyces cerevisiae. DNA Repair . 
5(9-10):1010–1020. 

Mitra I, Huang B, Mousavi N, Ma N, Lamkin M, Yanicky R, Shleizer-Burko S, Lohmueller KE, 
Gymrek M. 2021. Patterns of de novo tandem repeat mutations and their role in autism. Nature. 
589(7841):246–250. 

Montgomery EA, Huang SM, Langley CH, Judd BH. 1991. Chromosome rearrangement by 
ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics. 
129(4):1085–1098. 


 21 

Montgomery E, Charlesworth B, Langley CH. 1987. A test for the role of natural selection in the 
stabilization of transposable element copy number in a population of Drosophila melanogaster. 
Genet Res. 49(1):31–41. 

Mousavi N, Margoliash J, Pusarla N, Saini S, Yanicky R, Gymrek M. 2021. TRTools: a toolkit for 
genome-wide analysis of tandem repeats. Bioinformatics. 37(5):731–733. 

Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. 2019. Profiling the genome-wide 
landscape of tandem repeat expansions. Nucleic Acids Res. 47(15):e90. 

Mukamel RE, Handsaker RE, Sherman MA, Barton AR, Zheng Y, McCarroll SA, Loh P-R. 2021. 
Protein-coding repeat polymorphisms strongly shape diverse human phenotypes. Science. 
373(6562):1499–1505. 

Neil AJ, Hisey JA, Quasem I, McGinty RJ, Hitczenko M, Khristich AN, Mirkin SM. 2021. 
Replication-independent instability of Friedreich’s ataxia GAA repeats during chronological 
aging. Proc Natl Acad Sci U S A. 118(5). doi:10.1073/pnas.2013080118. 
http://dx.doi.org/10.1073/pnas.2013080118. 

Nelson MG, Linheiro RS, Bergman CM. 2017. McClintock: An Integrated Pipeline for Detecting 
Transposable Element Insertions in Whole-Genome Shotgun Sequencing Data. G3 . 7(8):2763–
2778. 

Nosaka M, Ishiwata A, Shimizu-Sato S, Ono A, Ishimoto K, Noda Y, Sato Y. 2013. The copy 
number of rice CACTA DNA transposons carrying MIR820 does not correlate with MIR820 
expression. Plant Signal Behav. 8(8). doi:10.4161/psb.25169. 
http://dx.doi.org/10.4161/psb.25169. 

Novák P, Ávila Robledillo L, Koblížková A, Vrbová I, Neumann P, Macas J. 2017. TAREAN: a 
computational tool for identification and characterization of satellite DNA from unassembled 
short reads. Nucleic Acids Res. 45(12):e111. 

Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, 
Uralsky L, Gershman A, et al. 2022. The complete sequence of a human genome. Science. 
376(6588):44–53. 

Ohno S. 1972. So much “junk” DNA in our genome. Brookhaven Symp Biol. 23:366–370. 

Ohta T. 1973. Slightly deleterious mutant substitutions in evolution. Nature. 246(5428):96–98. 

Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 23(1):263–
286. 

Ohta T, Kimura M. 1973. A model of mutation appropriate to estimate the number of 
electrophoretically detectable alleles in a finite population. Genet Res. 22(2):201–204. 

Ohzeki J-I, Nakano M, Okada T, Masumoto H. 2002. CENP-B box is required for de novo 
centromere chromatin assembly on human alphoid DNA. J Cell Biol. 159(5):765–775. 

Okada T, Ohzeki J-I, Nakano M, Yoda K, Brinkley WR, Larionov V, Masumoto H. 2007. CENP-
B controls centromere formation depending on the chromatin context. Cell. 131(7):1287–1300. 


 22 

Olovnikov I, Ryazansky S, Shpiz S, Lavrov S, Abramov Y, Vaury C, Jensen S, Kalmykova A. 
2013. De novo piRNA cluster formation in the Drosophila germ line triggered by transgenes 
containing a transcribed transposon fragment. Nucleic Acids Res. 41(11):5757–5768. 

Orgel LE, Crick FH. 1980. Selfish DNA: the ultimate parasite. Nature. 284(5757):604–607. 

Palazzo AF, Gregory TR. 2014. The case for junk DNA. PLoS Genet. 10(5):e1004351. 

Perelson AS, Bell GI. 1977. Mathematical models for the evolution of multigene families by 
unequal crossing over. Nature. 265(5592):304–310. 

Piovesan A, Antonaros F, Vitale L, Strippoli P, Pelleri MC, Caracausi M. 2019. Human protein-
coding genes and gene feature statistics in 2019. BMC Res Notes. 12(1):315. 

Rigal J, Martin Anduaga A, Bitman E, Rivellese E, Kadener S, Marr MT. 2022. Artificially 
stimulating retrotransposon activity increases mortality and accelerates a subset of aging 
phenotypes in Drosophila. Elife. 11. doi:10.7554/eLife.80169. 
http://dx.doi.org/10.7554/eLife.80169. 

Rockman MV, Wray GA. 2002. Abundant raw material for cis-regulatory evolution in humans. 
Mol Biol Evol. 19(11):1991–2004. 

Ryazansky S, Radion E, Mironova A, Akulenko N, Abramov Y, Morgunova V, Kordyukova MY, 
Olovnikov I, Kalmykova A. 2017. Natural variation of piRNA expression affects immunity to 
transposable elements. PLoS Genet. 13(4):e1006731. 

Sainudiin R, Durrett RT, Aquadro CF, Nielsen R. 2004. Microsatellite mutation models: insights 
from a comparison of humans and chimpanzees. Genetics. 168(1):383–395. 

SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, Melake-Berhan A, Springer 
PS, Edwards KJ, Lee M, Avramova Z, et al. 1996. Nested retrotransposons in the intergenic 
regions of the maize genome. Science. 274(5288):765–768. 

Seleme M del C, Vetter MR, Cordaux R, Bastone L, Batzer MA, Kazazian HH Jr. 2006. 
Extensive individual variation in L1 retrotransposition capability contributes to human genetic 
diversity. Proc Natl Acad Sci U S A. 103(17):6611–6616. 

Selvaraju D, Wierzbicki F, Kofler R. 2022. P-element invasions in Drosophila erecta shed light 
on the establishment of host control over a transposable element. bioRxiv.:2022.12.22.521571. 
doi:10.1101/2022.12.22.521571. [accessed 2023 May 25]. 
https://www.biorxiv.org/content/10.1101/2022.12.22.521571v1. 

Shpiz S, Ryazansky S, Olovnikov I, Abramov Y, Kalmykova A. 2014. Euchromatic transposon 
insertions trigger production of novel Pi- and endo-siRNAs at the target sites in the drosophila 
germline. PLoS Genet. 10(2):e1004138. 

Smith GP. 1974. Unequal crossover and the evolution of multigene families. Cold Spring Harb 
Symp Quant Biol. 38:507–513. 

Smith GP. 1976. Evolution of repeated DNA sequences by unequal crossover. Science. 
191(4227):528–535. 


 23 

Srivastav S, Feschotte C, Clark AG. 2023 May 10. Rapid evolution of piRNA clusters in the 
Drosophila melanogaster ovary. bioRxiv. doi:10.1101/2023.05.08.539910. 
http://dx.doi.org/10.1101/2023.05.08.539910. 

Tomar SS, Hua-Van A, Le Rouzic A. 2023. A population genetics theory for piRNA-regulated 
transposable elements. Theor Popul Biol. 150:1–13. 

de Vanssay A, Bougé A-L, Boivin A, Hermant C, Teysset L, Delmarre V, Antoniewski C, 
Ronsseray S. 2012. Paramutation in Drosophila linked to emergence of a piRNA-producing 
locus. Nature. 490(7418):112–115. 

Walsh JB. 1987. Persistence of tandem arrays: implications for satellite and simple-sequence 
DNAs. Genetics. 115(3):553–567. 

Wang L, Zhang S, Hadjipanteli S, Saiz L, Nguyen L, Silva E, Kelleher E. 2023. P-element 
invasion fuels molecular adaptation in laboratory populations of Drosophila melanogaster. 
Evolution. 77(4):980–994. 

Wei KH-C, Grenier JK, Barbash DA, Clark AG. 2014. Correlated variation and population 
differentiation in satellite DNA abundance among lines of Drosophila melanogaster. Proc Natl 
Acad Sci U S A. 111(52):18793–18798. 

Wells JN, Feschotte C. 2020. A Field Guide to Eukaryotic Transposable Elements. Annu Rev 
Genet. 54:539–561. 

Yu T, Huang X, Dou S, Tang X, Luo S, Theurkauf WE, Lu J, Weng Z. 2021. A benchmark and 
an algorithm for detecting germline transposon insertions and measuring de novo transposon 
insertion frequencies. Nucleic Acids Res. 49(8):e44. 

Zanni V, Eymery A, Coiffet M, Zytnicki M, Luyten I, Quesneville H, Vaury C, Jensen S. 2013. 
Distribution, evolution, and diversity of retrotransposons at the flamenco locus reflect the 
regulatory properties of piRNA clusters. Proc Natl Acad Sci U S A. 110(49):19842–19847. 

Zhang D, Tu S, Stubna M, Wu W-S, Huang W-C, Weng Z, Lee H-C. 2018. The piRNA targeting 
rules and the resistance to piRNA silencing in endogenous genes. Science. 359(6375):587–
592. 

Zhang S, Pointer B, Kelleher ES. 2020. Rapid evolution of piRNA-mediated silencing of an 
invading transposable element was driven by abundant de novo mutations. Genome Res. 
30(4):566–575. 

 
 24 

Chapter 2: Patterns of piRNA regulation in Drosophila revealed through transposable 

element clade inference 

 
Iskander Said, Michael P. McGurk, Andrew G. Clark, Daniel A. Barbash 

Published in Molecular Biology and Evolution, Volume 39, Issue 1, January 2022 

 
 25 

Abstract 

Transposable elements (TEs) are self-replicating “genetic parasites” ubiquitous to eukaryotic 

genomes. In addition to conflict between TEs and their host genomes, TEs of the same family 

are in competition with each other. They compete for the same genomic niches while 

experiencing the same regime of copy-number selection. This suggests that competition among 

TEs may favor the emergence of new variants that can outcompete their ancestral forms. To 

investigate the sequence evolution of TEs, we developed a method to infer clades: collections of 

TEs that share SNP variants and represent distinct TE family lineages. We applied this method 

to a panel of 85 Drosophila melanogaster genomes and found that the genetic variation of 

several TE families shows significant population structure that arises from the population-

specific expansions of single clades. We used population genetic theory to classify these clades 

into younger versus older clades and found that younger clades are associated with a greater 

abundance of sense and antisense piRNAs per copy than older ones. Further, we find that the 

abundance of younger, but not older clades, is positively correlated with antisense piRNA 

production, suggesting a general pattern where hosts preferentially produce antisense piRNAs 

from recently active TE variants. Together these findings suggest a pattern whereby new TE 

variants arise by mutation and then increase in copy number, followed by the host producing 

antisense piRNAs that may be used to silence these emerging variants. 

  
 26 

Introduction 

 Transposable elements (TEs) are mobile, selfish genetic elements commonly thought of 

as “genetic parasites''. At the start of an invasion TEs begin as a single copy within a host 

genome, but can transpose and expand rapidly in copy number throughout the population in 

each successive generation by using the host’s replication machinery (Doolittle and Sapienza 

1980; Orgel and Crick 1980). In Drosophila, explosive growth in copy number during a TE 

invasion is thought to be quickly followed by the host acquiring resistance to TE transpositions, 

commonly through host production of piwi-interacting small RNAs, piRNAs which interfere with 

TE transcripts (Le Rouzic and Capy 2005; Aravin et al. 2007; Brennecke et al. 2007; Kofler et al. 

2018). In the germline, piRNA-mediated silencing is established through the activation of two 

synergistic pathways. In the “primary” piRNA pathway, transcription of long precursor sense and 

antisense RNAs from TE-rich loci called piRNA clusters are processed into 21-30 base-pair-long 

antisense piRNAs that complex with Piwi clade proteins, bind to nascent sense TE transcripts 

by recognizing sequence complementarity, and then recruit additional proteins to 

transcriptionally silence homologous TEs. In the “secondary” piRNA pathway, Piwi-bound 

piRNAs degrade TE transcripts and form sense piRNAs that bind to antisense piRNA precursors 

and create a positive feedback loop, known as the Ping-Pong cycle, that establishes constitutive 

silencing (Aravin et al. 2007; Brennecke et al. 2007; Le Thomas et al. 2014; Czech et al. 2018). 

Ultimately the TE copy number may reach a steady state, with the rate of transposition 

dampened by piRNA silencing as well as selection against the deleterious consequences to 

reproductive fitness of the host organism (Charlesworth and Charlesworth 1983; Lee and 

Langley 2010; Kelleher et al. 2020). However, as TEs expand in copy number, they also acquire 

polymorphisms in their sequences, which may lead to the formation of new lineages or 

subfamilies (Moody 1988; Kimmel and Mathaes 2010; Kijima and Innan 2013; Iwasaki et al. 

2020). Multiple lineages of a TE will compete with each other, as long as their polymorphisms 

are not deactivating (Abrusán and Krambeck 2006; Le Rouzic and Capy 2006; Iwasaki et al. 


 27 

2020). Much in the same way individuals within an ecological system are constrained by a 

carrying capacity, variants of the same TE may be constrained by the copy-number carrying 

capacity of the host (Brookfield 2005; Le Rouzic and Capy 2006). This dynamic produces an 

arena of genomic competition of TE variants where selection may drive the propagation of more 

fit TE lineages, while less fit lineages are purged (Le Rouzic and Capy 2006).  

The study of selection on TE population variation has often focused on the fitness of the 

host organism rather than on the TEs themselves. Much of it centered on the variation of TE 

insertions within and between populations as well as fitness and phenotypic effects associated 

with particular insertion loci (Cridland et al. 2013; Blumenstiel et al. 2014; Kofler et al. 2015). 

The study of selection on sequence variation of TEs, on the other hand, is much more limited. 

TEs are typically categorized into classes and subclasses based first on their mechanism of 

transposition, and then on presence of shared motifs, relative sequence identity, and 

phylogenetic characteristics (Wicker et al. 2007; Arkhipova 2017; Makałowski et al. 2019). There 

is extensive systemization of TE families, describing their consensus sequences, open reading 

frames, and insertion site preferences (Bao et al. 2015).  

Due to the challenges of quantifying variation within repetitive sequences, however, the 

empirical study of TE sequence polymorphism is largely limited to analyses of reference 

genome assemblies. For example, the sequence variation of TE families in the Drosophila 

melanogaster reference genome has been comprehensively described (Kaminker et al. 2002; 

Lerat et al. 2003; Bergman and Bensasson 2007). In another example, phylogenetic and 

evolutionary analyses on retrotransposons within the Oryza sativa genome revealed strong 

purifying selection on protein-coding regions, with occasional bursts of positive selection 

(Baucom et al. 2009). To our knowledge, studies examining TE sequence variation from 

population samples are rare, likely because reference genomes are the primary source of full-

length TE sequence data. 


 28 

Ideally, to apply population genetic and molecular evolutionary principles to the genetic 

variation of TEs, we would study the complete sequences of individual TE insertions across 

many genomes. This is especially necessary if the aim is to assess competition between TE 

subfamily lineages, where reconstructing the underlying phylogenies would yield insight into the 

dynamics of how lineages diversify and potentially compete with each other. However, most 

current population genomic data comes from short-read sequencing, which does not permit an 

unambiguous assembly containing all the TEs and their internal SNPs. The problem is related to 

haplotype phasing, which can be done with short reads (Clark 1990; Excoffier and Slatkin 1995; 

Browning and Browning 2007; Delaneau et al. 2008), except here the TE insertions are at 

nonhomologous positions. Furthermore, the high multiplicity of TEs greatly complicates the task 

of determining which internal polymorphisms co-occur in the same insertion, such that with short 

reads, unambiguous TE haplotypes cannot be recovered as complete sequences of linked 

SNPs. Although new long-read technologies, like PacBio and Oxford Nanopore, have emerged 

that greatly reduce phasing problems and allow for more complete analysis of TEs (Long et al. 

2018; Miller et al. 2018; Chakraborty et al. 2019; Ellison and Cao 2020; Wierzbicki et al. 2020; 

Gebert et al. 2021 Jul 30), their higher cost and relatively high error rates have limited their 

application to large-scale population genomics studies.  

To gain insight into the sequence evolution of actively invading TEs, we sought to 

resolve some of these challenges by leveraging a straightforward intuition: If a set of SNPs co-

occur in the same TE lineage, their copy-number variation should be correlated across 

genomes, covarying as the copy number of that lineage varies across genomes. To this end, we 

took advantage of the large sample sizes in a population-genomic dataset to quantify positive 

correlations in the copy number of SNPs across multiple individuals, and from these we identify  

groups of SNPs that we infer to co-occur within TE lineages. We refer to these groups of SNPs 

as clades, which are inferred to distinguish lineages of TE subfamilies while sidestepping the 

task of reconstructing the full phylogenies from short-read data. We use the term “TE lineages” 


 29 

to describe the true but unknown genealogical relationship of TE sequences in a family, while 

“clades” are statistical inferences of this genealogy informed by the positive correlation of SNP 

copy number. Applying our method to a set of 85 D. melanogaster genomes from the Global 

Diversity Lines (GDL) (Grenier et al. 2015), we inferred clades in 41 recently active TE families. 

We then used public PacBio datasets and simulations to validate our inferred clades (Long et al. 

2018; Chakraborty et al. 2019). We analyzed the population variation of TE variants and found 

significant population structure driven by population-specific TE clades, several which are likely 

active. We additionally analyzed several piRNA libraries from ovaries focused on SNPs that 

distinguish clades, and found piRNAs are especially enriched for younger TE clades.  

 
New approaches 


 30 

 
Figure 1. Outline and examples of the TE clade inference method. (a) Cartoon depicting the 
method. The genomes of three individuals (orange rectangles) contain copies of an ancestral 
TE (blue rectangles) and a derived TE with SNPs A and B (blue rectangles with a red and 
purple stripe, respectively). As the copy number of the derived TE varies in copy number across 
individuals 1-3, so does the copy number of SNPs A and B. This relationship in copy number is 
depicted as a cartoon scatterplot, where each red dot represents the copy number of SNPs A 
and B in one of nine individuals sampled in the population. The copy number of the SNPs is 
positively correlated because the SNPs are physically linked. (b) Scatterplot depicting the 
correlation in copy number across GDL individuals for two SNPs in the Jockey element. Each 

C_2238, C_4204

Pearson’s r

(a)

(b) (c)

(d)


 31 

dot represents the copy number of the SNPs C at position 2238 (C_2238) and C at position 
2402 (C_4204) for each individual, colored by their population of origin. The degree of 
correlation of these two SNPs is high (Pearson’s r = 0.82), suggesting that they are physically 
linked and represent a clade. Black dashed line is a linear fit of the data drawn for emphasis. (c) 
Heatmap showing correlation of the copy number of all SNPs from the Jockey element. Cells in 
the heatmap are seriated via hierarchical clustering to create clusters of tightly correlated SNPs, 
which are inferred to be Jockey clades segregating in the population. The cells are shaded by 
the pairwise Pearson’s correlation between SNP copy number. The SNPs from (b) are outlined 
in a block box. SNPs are annotated by the cluster they belong to i.e. clade. Only clusters of two 
or more SNPs are annotated and included in downstream analysis. Clades are additionally 
annotated by predicted age as described in the section, The majority of clades are predicted to 
be young and recently active. (d) The percent of clades inferred from GDL data that were then 
detected in a set of PacBio genomes (includes only clades where at least two SNPs were 
detected at any frequency in the PacBio data). The results are separated by TE family, with total 
clades shown on the far right. Fraction of clades validated over the total number of clades found 
are placed above each bar. 

 
Hierarchical clustering of SNPs uncovers TE clades in NGS data 

 We leverage large population-genomic datasets to detect TE clades by inferring the co-

occurrences of SNPs within a TE lineage. We expect that if two alleles exist within the same 

lineage they will correlate in copy number, varying together as the TEs of that lineage vary in 

copy number (Figure 1a). We apply this principle to all pairwise combinations of SNPs within an 

element to compute a correlation matrix and then use hierarchical clustering to cluster groups of 

SNPs that are strongly correlated. The result is clusters of SNPs that co-vary in their copy 

number across samples; because these are inferred to occur within the same TE lineage we 

refer to these clusters as “clades”. Hierarchical clustering is a particularly appropriate choice for 

this problem as SNPs within TE lineages are truly related to each other in an underlying tree-like 

structure that is analogous to a hierarchical clustering dendrogram. The correlations between 

alleles are unlikely to be a result of co-transposition of multiple TEs because the linkage 

between TEs is very low and the sampling variation in these data is quite high.  

  We employed this clade inference method using short-read libraries from the Global 

Diversity Lines (GDL), 85 D. melanogaster lines from populations in Beijing, Ithaca, the 

Netherlands, Tasmania, and Zimbabwe (Grenier et al. 2015). We aligned the short-read data to 


 32 

the TE consensus sequences of 41 recently active TEs and the D. melanogaster reference 

genome using ConTExt (McGurk and Barbash 2018). Then we calculated allele frequencies of 

SNPs from the read pileups of the alignments and calculated copy number from the read depth. 

In brief, copy number was estimated by dividing the observed read depth at each position on the 

TE consensus sequence by the expected read depth of single copy sequences inferred from the 

read depth of the reference genome, with corrections for GC bias (McGurk et al. 2020). For 

each individual we took the allele frequencies at each position and multiplied them by the 

estimated copy number at each position to generate the copy number of alleles at each position 

for that individual. We compute pairwise correlations between the copy number of alleles across 

individuals (Figure 1b), and then employ hierarchical clustering to cluster positively correlated 

SNPs, thus inferring clades (Figure 1c, Supplemental File 6, https://github.com/is-the-

biologist/TE_CladeInference). For each of the 41 TEs analyzed, we report the SNP clusters, as 

well as the copy number of each inferred clade, calculated by averaging the copy number of the 

individual alleles (e.g. C_2238, C_4204 in Figure 1) (Supplemental File 1, https://github.com/is-

the-biologist/TE_CladeInference). 

One important consideration of this method is that we are not identifying the full set of 

TE insertions within an inferred clade nor the complete sequences of the insertions at any 

particular locus that belong to an inferred clade. Rather, we are identifying sets of SNPs that 

distinguish lineages from each other (lineage-informative SNPs). Lineage-informative SNPs are 

clustered together by our method and output as inferred clades, which are statistical inferences 

of the set of true lineages that exist for a TE family. However, because all TE lineages of a 

family are related by an underlying phylogeny, there likely does not exist a single correlation cut-

off that optimally groups TEs into distinct clades. Rather, any chosen threshold induces some 

degree of coarse-graining in how it collapses this phylogeny, splitting and merging lineages of 

TE variants into clades. Generally, a higher stringency in the clustering cut-off will produce many 

small clusters of tightly correlated SNPs that split lineages, while low stringency cut-offs will 


 33 

produce a few large clusters that merge lineages together. By analyzing the phylogeny of full 

length TE insertions for a given family, we found that insertions that belong to clades inferred 

under stringent clustering parameters tend to be more closely related phylogenetically than 

those inferred under lenient clustering parameters (Supplemental Figure 1). Therefore, the 

degree of correlation between the copy number of alleles is related to the phylogenetic distance 

between the TE insertions that bear those alleles. However, this relationship is not always 

perfect. For example, a pair of alleles that are present in all TE insertions in a population may be 

strongly correlated, but the phylogenetic distance between TE insertions bearing those alleles 

may be relatively large, as for “Cluster_7'' of the Jockey element (Supplemental Figure 1d, 1e).  

Due to the inherent coarse-graining of hierarchical clustering, whether or not SNPs are 

merged into one clade depends on the clustering cut-off and how often the lineage-informative 

SNPs co-occur. This can result in an insertion belonging to more than one clade, depending on 

the correlation cut-off chosen. For example, two closely related lineages may share an ancestral 

set of SNPs, but have recently diverged such that one lineage acquired a small number of 

polymorphisms in addition to the ancestral SNPs. Depending on the stringency of the clustering 

cut-off these two lineages may be called as a single clade containing all of the SNPs or as two 

clades: one only with the ancestral subset of SNPs and one only with the derived SNPs. In the 

latter case of two clades, TE insertions from the derived lineage have both the ancestral and 

derived SNPs and would be classified as belonging to both clades. Therefore, clades should not 

necessarily be interpreted as distinct waves of invasion nor as being collections of unique 

insertions, but rather as clusters of SNPs that co-occur within insertions that are statistical 

representations of evolving lineages. 

We attempt to address these caveats and trade-offs in our analysis by using simulations 

and PacBio data to validate our inferences. When inferring clades in the GDL short-read data, 

we chose stringent clustering parameters to make the clade calls conservative (splitting distantly 

related lineages). These sets of parameters were chosen to essentialize TE clades to a 


 34 

minimum number of core SNPs that co-occur with a high positive correlation, while increasing 

the number of distinct, resolved clades. High stringency cut-offs also minimize the number of 

false positives that can result from performing many thousand pairwise correlations of allele 

copy number (Supplemental Figure 1a, 1b). Parameters could be tuned to be less stringent to 

define clades harboring greater internal SNP variation, but the sets of parameters we chose 

performed well in our validation as 70% of detectable clades inferred from the GDL data were 

detected in long-read PacBio assemblies (Figure 1d). Undetected clades were a mix of closely 

related lineages that had been merged into a single cluster and errors in clustering (see 

Methods). Our method performed well on simulated datasets under a wide range of clustering 

parameters and under various degrees of simulated sequencing error, but was most negatively 

affected by excessive fragmentation of elements (Supplemental Figure 2). We additionally 

assessed the robustness of our method to smaller sample sizes by down-sampling the GDL 

data and found that when sample sizes were below 50 libraries the ability to recover accurate 

cluster labels dropped, especially in LINE-like retrotransposons (Supplemental Figure 3a).  

 
Results 

Diversity and variation of TE clades in the GDL 

 To determine whether high sequence diversity of a TE family is due to the evolution of 

many distinct lineages or a few highly diverged lineages, we assessed the sequence diversity of 

TE families and the number of clades segregating in the GDL. We calculated the average 

nucleotide diversity across the GDL of active TE families and found a positive relationship 

between the number of clades and nucleotide diversity (Figure 2a; Pearson’s r = 0.74; p-value < 

0.05). Both the telomeric TEs and LTR retrotransposons have several families with a high 

number of clades and high sequence diversity. This observation conflicts with previous reports 

of LTR retrotransposons being typically less diverse than other classes of transposons because 

they are younger and had recently invaded D. melanogaster (Bergman and Bensasson 2007). 


 35 

Our higher estimates of sequence diversity may be a result of our sampling population-level 

data as opposed to previous studies that were limited to a single reference genome. It should 

also be noted that the LTR families Zam, Gypsy and Gypsy1 are the most diverse, have the 

highest number of clades and are thought to be older than other LTR families in our list  (Kofler 

et al. 2012; Kelleher and Barbash 2013; Kofler et al. 2015). Due to their age, their high diversity 

likely reflects inactive and degenerate lineages. However, nucleotide diversity is also a product 

of the mutational processes of a TE, and differences in sequence diversity between classes may 

also be driven by differences in the rate of mutation.   

 
Figure 2. Summary statistics of TE clades inferred from GDL short-reads. (a) Average 
nucleotide diversity for each TE family vs. the number of clades inferred for that family, colored 
by TE class. (b) The number of phase-informative SNPs that compose each clade inferred for 
every TE family. Each point represents a clade and is colored by TE class. (c)  Histogram of the 
clade population frequency of all clades from 41 recently active TE families. (d) Boxplots of the 
clade population frequency of all clades separated by TE family. Each point represents a clade 
and is colored by TE class. 

 
(a) (b)

(c) (d)


 36 

 Besides the LTR retrotransposons, there are also several LINE-like retrotransposon and 

DNA transposon families with high sequence diversity, but they tend to have fewer clades than 

non-LINE retrotransposons with similar sequence diversity. For example, the R1 family has a 

high sequence diversity (𝜋 ≈  0.026) and only eight clades, while the LTR retrotransposon Zam 

has comparable sequence diversity and 94 clades. This pattern may be driven by merging 

SNPs into large clades rather than splitting them into many small ones, so we characterized 

each clade by the number of lineage-informative SNPs it contains (Figure 2b). In general, the 

distribution of lineage-informative SNPs is small and tightly distributed, with a median number of 

two and an interquartile range (IQR) of one. The small cluster size is indicative of a preference 

for splitting multiple related clades rather than merging them into larger clusters. Clusters of 

SNPs we discover may therefore not be mutually exclusive and may occur together within a 

subset of insertions, but with a degree of positive correlation insufficient to pass the clustering 

threshold. This preference for splitting would upwardly bias the number of clades that we 

estimate, as large clusters of SNPs with distant genealogical relationships might be broken up 

into many small clusters of SNPs that are closely related. The exact number of clades 

segregating in each TE family is affected by parameterization of the clustering as well as the 

degree of fragmentation of the TE in the genomes themselves. Therefore, some caution must 

be taken when considering the absolute numbers of clades segregating in a TE family and not 

relative proportions. However, the inference of clades using simulated data shows that the 

number of clades and quality of clustering is surprisingly robust to clustering parameters 

(Supplemental Figure 2).  

Although most TE families have clades with a small number of SNPs, R1 clades are 

notable outliers, with a median of 14 SNPs and IQR 19.75 and two clades with 60 and 100 

SNPs each. This might be explained by the presence of two independently evolving populations 

of R1 elements in D. melanogaster, the hundreds of R1 insertions in the highly repetitive 

ribosomal DNA array, and a separate lineage of divergent elements that comprise a megabase-


 37 

sized satellite array (Wellauer and Dawid 1977; Roiha et al. 1981; Xiong and Eickbush 1988; 

Luan et al. 1993; McGurk and Barbash 2018). Divergence between these lineages likely 

explains the high sequence diversity. The similarity of sequences within each lineage and 

dissimilarity between lineages may favor their merging into a handful of large clades during 

clustering.  

We next addressed whether a single clade dominates a transposable element family in 

terms of copy number or if instead the clades occur at roughly equal frequency. We determined 

the proportion of all copies of a TE family in the GDL population belonging to a clade by 

calculating the “clade population frequency”, dividing the total clade copy number by the 

average copy number across the lineage-informative positions (e.g. position 2238, position 4204 

in Figure 1) summed across the GDL (Supplemental File 1, https://github.com/is-the-

biologist/TE_CladeInference). It should be noted that clade population frequency in this context 

is not the frequency of any insertion of a TE in the population, but the number of copies 

belonging to a clade out of all copies in a population. We find that most clades occur in the 

population at a frequency between ~10-30% (mean 17%; Figure 2c). There is a notable lack of 

low frequency clades, likely due to the filtering of low copy-number and low population-

frequency alleles before calling clades. We found 21 clades with high population frequency ( > 

40%) occurring in 13 different TE families, including telomeric TEs and several LTR and LINE-

like retrotransposons. The Diver and Nomad LTR retrotransposons had the greatest number of 

high frequency clades (3-4 per family), but with dramatically different clade population frequency 

distributions. The Diver frequency distribution had many clades spread across the entire range 

from low to high, while the Nomad distribution was clearly split between a handful of low and 

high frequency clades (Figure 2d). Jockey, Doc, and DM412 are examples of TE families with 

Nomad-like frequency distributions, while Gypsy1, Zam, and I-element are examples of TE 

families with more uniform clade frequency distribution similar to Diver (Figure 2d). The Nomad-

like frequency distributions may reflect a relatively fast copy-number expansion of a handful of 


 38 

clades that outcompeted other lineages, while Diver-like distributions may reflect gradual 

diversification and slow increase in copy number of many clades, possibly driven by stochastic 

processes.  

One important consideration is that due to the way population frequency was calculated, 

clades with SNPs in commonly deleted portions of a TE may be at a high frequency despite 

being at a relatively low copy number. This is particularly important for LINE-like elements and 

DNA transposons that are frequently truncated and internally deleted. Therefore, the clade 

population frequency does not necessarily reflect the proportion of TE insertions in the clade, 

but instead the number of TE insertions that have those nucleotide sites in the given clade. High 

population-frequency clades in Gypsy, Zam, HeT-A2, and Tart-B1 had very low copy numbers 

(~1-2 copies on average; Supplemental File 1, https://github.com/is-the-

biologist/TE_CladeInference), likely due to having SNPs in commonly deleted portions of their 

respective TE sequences. However, we found that many high population-frequency clades were 

at high copy number (10-40 copies on average). The high frequency clades of Jockey, DM412, 

and Doc were particularly striking as they are at high copy number and dominate other clades of 

their respective families. These clades may be at high frequency due to age, having a 

competitive edge over other variants, or by pure chance. 

 
The majority of clades are predicted to be young and recently active 

The frequency of individual TE insertions in a population, or “insertion-site frequency”, is 

an informative parameter in estimating the age and potential activity of TEs. While the 

limitations of short-read data prevent us from mapping SNPs to specific insertions and 

estimating insertion-site frequency, the insertion-site frequency spectrum of a TE influences the 

variance of its copy-number distribution across individuals. Our approach infers TE clades, 

which are collections of TE insertions that share a subset of SNPs, but the same idea applies -- 

the insertion-site frequency spectrum, and therefore copy-number distribution, of clades is also 


 39 

expected to be related to their age. We therefore applied population genetic theory to predict the 

age of clades from their copy-number distributions. The copy-number distribution of a 

transposable element lineage in a population is related to the insertion-site frequency spectrum: 

 
𝑉𝑛 = 𝑛 (1 −
𝑛

𝑇
) − 𝑇σ𝑥

2 + 4 ∑ 𝐷𝑖,𝑗

i<j

 
Where, Vn and n are the variance and the mean of the copy-number distribution, respectively. T  

is the number of occupiable sites of a TE, 𝜎x
2 is the variance of the insertion-site frequency, and 

4∑Di,,j represents the sum of the coefficients of linkage disequilibrium between insertions 

(Charlesworth and Charlesworth 1983; Langley et al. 1983).  

In a simple scenario, a recently active, young lineage will have insertions that are mostly 

at a low population frequency with little variance (𝜎x
2 ≅ 0). When the number of occupiable sites 

is large (T >> n) and the effect of linkage disequilibrium between insertions is small (4∑Di, j ≅0), 

then the mean and variance of the copy-number distribution will be Poisson distributed: 

 
𝑉𝑛 ≈ 𝑛 

 
𝑉𝑛

𝑛
≈ 1 

 
An older lineage, on the other hand, will have insertions at variable frequencies (𝜎x
2 > 0) 

-- due to drift increasing the frequency of older insertions. This results in the variance of the 

copy-number distribution being less than the mean i.e. “underdispersed” relative to the Poisson 

expectation: 

 
 40 

𝑉𝑛 ≈ 𝑛 − 𝑇σ𝑥
2 

 
𝑉𝑛 ≈ 𝑛 − 𝑇σ𝑥
2 

 
𝑉𝑛 < 𝑛 

 
𝑉𝑛

𝑛
< 1 

 
Therefore, an underdispersed copy-number distribution is indicative of a lineage with a broad 

insertion-site frequency spectrum, which would indicate older age and inactivity (Charlesworth 

and Charlesworth 1983; Langley et al. 1983; McGurk et al. 2020). 

 This is further complicated by linkage disequilibrium between insertions (e.g. population 

structure). For a recently active lineage (𝜎x
2 ≅ 0) with population structure, the copy-number 

distribution will no longer be Poisson. Instead, the variance of the copy number will be greater 

than the mean i.e. “overdispersed”: 

 
𝑉𝑛 ≈ 𝑛 + 4 ∑ 𝐷𝑖,𝑗

i<j

 
𝑉𝑛 > 𝑛 

 
𝑉𝑛

𝑛
> 1 

 
It is possible that an older TE lineage (𝜎x
2 > 0) could be experiencing population structure as 

well, in which case whether or not the copy number distribution remains underdispersed 


 41 

depends on how strong the linkage is between insertions. More generally, when T𝜎x
2 > 4∑Di,j the 

TE will be underdispersed, when T𝜎x
2 < 4∑Di,j the TE will be overdispersed and when T𝜎x

2 ≅ 

4∑Di,j the TE will fit a Poisson (Charlesworth and Charlesworth 1983; Langley et al. 1983; 

McGurk et al. 2020). Although the theoretical expectation is that young lineages will fit a Poisson 

and old lineages will be underdispersed, whether these expectations are borne out in the data is 

dependent on the population structure and demographic history of the organism. Further 

complicating our expectations are the complex life histories of TEs, such as recurrent invasions 

or extended continuous activity, wherein ancient TE insertions would be at high frequency but 

recent insertions would be at low frequency, thus creating an “underdispersed” copy number 

distribution despite having recent transpositions.  

Although these caveats should not be discounted, the copy-number distributions of 

known active and inactive TE families do recapitulate these expectations (McGurk et al. 2020). 

Therefore, we analyzed the copy-number distribution of clades (as a proxy for the true lineages) 

by using a two-tailed dispersion test with multiple testing correction to ask whether the 

distributions are overdispersed, underdispersed, or fit a Poisson (Figure 3a) (Yang et al. 2009). 

We predict that clades with copy-number distributions that fit a Poisson or are overdispersed are 

younger, recently active lineages and clades that are underdispersed are older, likely inactive 

lineages. By using this test we are not directly assaying transpositional activity of these clades, 

but are inferring their age based on the theoretical expectations of their copy-number 

distribution. These inferences of age are imperfect due to the caveats mentioned above, so we 

sought additional support for our inferences of young and old clades by finding insertions 

belonging to these clades in the PacBio data and determining their sequence diversity, length, 

and genomic position. We found that insertions belonging to young clades tend to have much 

lower sequence diversity (𝞹 = ~0.05), are more often full length and are found less often in 

heterochromatin than insertions belonging to old clades (𝞹 = ~0.16) (Mann-Whitney U: p-value 

< 0.05; Supplemental Figure 4). These results confirm that clades with underdispersed copy-


 42 

number distributions are composed of older, more degenerate insertions that have accumulated 

in the heterochromatin and match our theoretical expectation. Although in general the Poisson 

fit of the copy-number distribution of clades is highly concordant with other predictors of age, not 

all clades that were predicted to be young or old fully conformed with expectations. Ultimately, 

these inferences of age are an imperfect proxy for activity and are not directly assaying the 

ability to transpose. 

 
Figure 3. Age of clades are inferred by their copy-number distribution. (a) Mean-variance 
relationship of the clade copy-number distributions for clades from all families. The copy-number 
distributions for each clade were tested for goodness of fit to a Poisson distribution, and then 
colored based on acceptance or rejection of this test: “overdispersed” (rejected, red), 
“dispersed” (fail to reject, yellow), or “underdispersed” (rejected, purple). (b) Clade population 
frequency of young (red: “dispersed'' and “overdispersed”) or old (purple: “underdispersed”) 
clades across all TE families. (c) Number of discovered clades per TE family that are young 
(red: “dispersed'' and “overdispersed”) or old (purple: “underdispersed”). (d) Boxplot of the copy-
number distribution of the sole putatively active I-element clade from (c) for each GDL 
population. There is a significant elevation in the copy number of this clade in Beijing. 
 

Of the clades that were at a high clade population frequency (> 40%), the clades of 

telomeric TEs and other families such as Jockey, Copia, Nomad and DM412 had copy-number 

(a) (b)

(c) (d)


 43 

distributions consistent with recent activity (dispersed or overdispersed), while the high 

frequency clades of Zam and Stalker-4 were classified as older lineages (underdispersed). High 

frequency clades in Doc and Diver, on the other hand, were a mixture of young and old clades. 

We found that on a