EVOLUTION AND POPULATION DYNAMICS OF MYCOBACTERIAL 
PATHOGENS WITH APPLICATIONS FOR DISEASE CONTROL 
 
 
 
 
 
 
 
 
A Dissertation 
Presented to the Faculty of the Graduate School 
of Cornell University 
In Partial Fulfillment of the Requirements for the Degree of 
Doctor of Philosophy 
 
 
 
 
 
 
by 
Kristina Ceres 
August 2022
 
 
 
 
 
 
 
 
 
 
 
 
 
© 2022 Kristina Ceres
 
 
EVOLUTION AND POPULATION DYNAMICS OF MYCOBACTERIAL 
PATHOGENS WITH APPLICATIONS FOR DISEASE CONTROL 
 
 
Kristina Ceres, Ph.D. 
Cornell University 2022 
 
The field of mathematical epidemiology aims to understand the distribution of diseases 
in populations using mathematical models. Building models involves distilling the 
biological facts of a disease system and using mathematical theory to describe the 
system dynamics. Similarly, molecular epidemiology relies on population genetic 
theory to link pathogen genetic data to epidemiological inference. This thesis aims to 
both understand the biological principles that drive the evolution of mycobacterial 
pathogens, but to also determine which biological processes can be described by current 
theory, and which questions require new analytic approaches to fully describe.  
Here we describe a new hidden Markov model to characterize the progression 
of Mycobacterium avium subsp. paratuberculosis infection, the cause of Johne’s disease 
in ruminants through disease states. We find a minority of infected animals remain low-
shedders throughout the duration of infection, whereas others progress to a high 
shedding disease state. Our modeling framework could be used to predict future 
progression patterns based on curing shedding state, which could aid in decision making 
on the farm.  
 
 Next we study the pangenome of Mycobacterium bovis, the cause of bovine 
tuberculosis. Recent studies suggest that M. bovis has a large accessory genome, yet 
there is no known mechanism for horizontal gene transfer in M. bovis. We found 
significant errors in accessory gene classification, correcting for errors, we show that 
M. bovis has a much smaller accessory genome than previously described with little 
gene content variation generated over outbreaks. 
 Finally, we study the within-host evolution of M. bovis and employ flexible 
forward genetic simulation tools to evolutionary processes that deviate from standard 
evolutionary models. We found that the distribution of mutations that evolve de novo 
within animal hosts is random, suggesting M. bovis evolution is driven by drift at short 
time scales. We also find that the combined effect relatively rapid mutation rates and 
diversity reducing infection bottlenecks and skewed offspring distributions best 
describe within host evolution M. bovis. Lastly, we explore the effects of a rapid 
mutation rate on the evolution of antimicrobial resistance genotypes in the absence of 
antimicrobial use.  
 
  
  
 
 
BIOGRAPHICAL SKETCH 
Kristina pursued a bachelor’s in animal science degree at Cornell University, with the 
intention becoming a small animal veterinarian after graduation. During her time as an 
undergraduate, Kristina had the opportunities to take international courses in field 
research in Kenya, and evolutionary biology in the Galapagos, and studied conservation 
medicine in South Africa experience. These experiences shaped her interests in wildlife 
conservation and sparked her curiosity for research. Kristina participated in 
undergraduate research in the Behling-Kelly laboratory studying novel 
immunohistochemistry markers for canine lymphoma. After starting veterinary school 
in 2015, Kristina joined the Gröhn laboratory first as a part time research assistant, and 
then as a summer student through the veterinary leadership program to develop skills in 
epidemiology. During her summer in the Gröhn laboratory, Kristina developed skills in 
mathematical modeling and a passion for mathematical biology and decided to apply 
for the Cornell combined DVM-PhD program to pursue a PhD in infectious disease 
epidemiology. Before she was accepted into the combined degree program, Kristina 
rotated through the Ivanek laboratory, where she was involved in a scoping review on 
Listeria monocytogenes sampling strategies in food processing facilities, and in the 
design of a survey used to understand perceptions of antimicrobial use among dairy 
farmers. In 2017 Kristina officially joined the Gröhn laboratory as a PhD student, and 
began working on developing mathematical modeling of Mycobacterium avium subsp. 
paratuberculosis infections in dairy cattle. Kristina later transitioned into molecular 
epidemiology work with the assistance of collaborators at the USDA National 
Veterinary Services Laboratories and was awarded a USDA NIFA predoctoral 
fellowship to study M. bovis evolution and epidemiology. In addition to Kristina’s 
dissertation work, she has had the pleasure to work with the Stanhope and Goodman 
laboratories on the epidemiology, zoonotic transmission and gene distribution of 
v 
 
Escherichia coli infections in dogs and human hosts, and on the temporal dating of 
coronaviruses in a multi-host system. Kristina also collaborated with the USDA 
National Veterinary Services Laboratories on studying the phylogeographic distribution 
and temporal dating of M. bovis introductions into the Dominican Republic. Lastly, 
Kristina has greatly enjoyed working with Michael Stanhope on shark genomics 
projects, which allowed her to develop skills in vertebrate genomics and impact wildlife 
conservation through studying the historical distributions of shark species including the 
critically endangered great hammerhead.  
 In the fall of 2022 Kristina will re-join the veterinary program to complete her 
doctorate in veterinary medicine degree. During her remaining time in school, Kristina 
plans to continue work in M. bovis model development and plans to further develop 
skills in genomics, evolutionary biology and conservation medicine. After graduation 
she intends to pursue a research career at the intersection of infectious disease 
epidemiology, and host-pathogen evolution with an emphasis on wildlife conservation 
and One Health.  
 
 
 
 
 
 
 
 
 
 
 
vi 
 
 
 
 
 
 
 
This work is dedicated to my family, whose never-ending support has allowed me to 
develop my perfect career, and for the animals, particularly Oreo, Timbi and Olive who 
have set me on this path. 
vii 
 
ACKNOWLEDGMENTS 
 
First, I would like to thank my advisor Yrjö Gröhn, who first took a chance on me when 
I had no experience in mathematical biology, statistics or epidemiology. His support 
and gentile mentoring style over the past six years has allowed me to explore multiple 
disciplines and gain expertise in a variety of fields to develop my own unique research 
niche. Although my research plan was not always clear, Yrjö consistently worked hard 
to ensure I could develop a project that I was passionate about, and I am very thankful 
for his consistent guidance and for always being available for advice.  
I would also like to thank my committee members for their support during my 
PhD. I would like to thank Michael Stanhope for not only helping me learn the ropes in 
comparative genomics and, but for also treating me as a collaborator, which helped me 
develop confidence in bioinformatics and as a scientist. Our many conversations about 
pathogen evolution, electroreception and interesting shark genes helped me realize the 
broad applications for the skills I was developing in molecular evolution, and has 
enriched my PhD experience. I would also like to thank Brian VanderVen for pushing 
me to dig into the basic biology of M. bovis, and for using his expertise in tuberculosis 
biology to provide helpful feedback. Lastly, I would like to thank Marty Wells for his 
guidance in statistics, and for his course Theory of Statistics, which allowed me to 
develop an intuitive understanding of statistics.   
I would also like to thank Dr. Suelee Robbe-Austerman and her colleagues at 
the USDA National Veterinary Services Laboratories for brainstorming ideas for M. 
bovis research, helping me transition into bioinformatics work, and providing 
exceptional support in helping secure a large travel grant, and in supporting my 
fellowship application.  
 
viii 
 
Next I would like to thank my non-research mentors in the veterinary school 
and in the BBS graduate program, whose career advice has helped me develop my 
dream career, and my wonderful peers in the veterinary and graduate programs, for 
their support. I would particularly like to thank my friends in the combined degree 
program, especially Frances Chen and Erica Lachenauer who immediately made me 
feel welcome in the combined degree program, and Eileen Troconis and Kieran Koch-
Laskowski for both their friendship and their leadership in our program. I would also 
like to thank Alison Stout, Casey Cazer and Scarlett Lee for their friendship, career 
advice and shared research interests that provided me with a sense of community in 
the BBS program. Lastly, I would like to thank my friends outside of Cornell: Maggie 
Meurer, Leah Oliver, Pam Skelton, and especially Biagio DiSalvo for always being 
there for me – thank you for being a (best) friend.  
Lastly, I would like to thank my family for their exceptionally valuable support 
outside of school, which enabled my success in school. I would like to particularly 
thank my parents and my sister and brother-in-law (and my newest family member, 
my niece Harper), for giving me the opportunity to pursue my dreams and for their 
love and support. Finally, I would like to thank Shay and our pets Timbi and Olive for 
their everyday support and love, which keeps me pushing forward. I would like to 
thank Shay for not only the providing the typical support system expected of a great 
partner, but also for teaching me the basics of writing code, which gave me a 
foundation in computer science that helped me in every aspect of my PhD work. 
Without him, I would not have had the confidence to pursue computational biology 
research and for his unique support, I will always be grateful.   
ix 
 
TABLE OF CONTENTS 
CHAPTER 1 ........................................................................................................................................... 16 
PATHOGENIC MYCOBACTERIA .............................................................................................................. 16 
IMPACT OF MYCOBACTERIAL DISEASES OF CATTLE .............................................................................. 16 
GENOMIC EPIDEMIOLOGY OF M. BOVIS ................................................................................................ 19 
WITHIN HOST EVOLUTION AND THE EMERGENCE OF RESISTANCE ........................................................ 21 
DISSERTATION OBJECTIVES ................................................................................................................. 22 
REFERENCES ........................................................................................................................................ 24 
CHAPTER 2 ........................................................................................................................................... 27 
ABSTRACT ............................................................................................................................................ 28 
INTRODUCTION .................................................................................................................................... 29 
METHODS ............................................................................................................................................. 31 
RESULTS AND DISCUSSION ................................................................................................................... 39 
FUTURE DIRECTIONS AND CONCLUSIONS ............................................................................................. 47 
ACKNOWLEDGEMENTS ......................................................................................................................... 50 
REFERENCES ........................................................................................................................................ 51 
SUPPLEMENTARY MATERIALS ............................................................................................................. 53 
CHAPTER 3 ........................................................................................................................................... 59 
ABSTRACT ............................................................................................................................................ 60 
IMPACT STATEMENT ............................................................................................................................. 61 
DATA SUMMARY .................................................................................................................................. 61 
INTRODUCTION .................................................................................................................................... 62 
METHODS ............................................................................................................................................. 65 
RESULTS .............................................................................................................................................. 70 
DISCUSSION ......................................................................................................................................... 87 
CONCLUSIONS ...................................................................................................................................... 92 
ACKNOWLEDGEMENTS ......................................................................................................................... 92 
REFERENCES ........................................................................................................................................ 94 
SUPPLEMENTARY MATERIALS ............................................................................................................. 98 
CHAPTER 4 ......................................................................................................................................... 107 
ABSTRACT .......................................................................................................................................... 108 
INTRODUCTION .................................................................................................................................. 109 
METHODS ........................................................................................................................................... 111 
RESULTS ............................................................................................................................................ 119 
SIMULATION RESULTS ........................................................................................................................ 124 
DISCUSSION ....................................................................................................................................... 127 
CONCLUSIONS .................................................................................................................................... 133 
REFERENCES ...................................................................................................................................... 134 
SUPPLEMENTARY MATERIALS ............................................................................................................ 137 
CHAPTER 5 ......................................................................................................................................... 141 
ABSTRACT .......................................................................................................................................... 142 
INTRODUCTION .................................................................................................................................. 143 
METHODS ........................................................................................................................................... 146 
RESULTS ............................................................................................................................................ 152 
DISCUSSION ....................................................................................................................................... 161 
CONCLUSIONS .................................................................................................................................... 166 
REFERENCES ...................................................................................................................................... 167 
SUPPLEMENTARY MATERIALS ........................................................................................................... 170 
x 
 
CONCLUSIONS AND FUTURE DIRECTIONS ............................................................................. 171 
CONCLUSIONS .................................................................................................................................... 171 
FUTURE DIRECTIONS .......................................................................................................................... 174 
REFERENCES ...................................................................................................................................... 176 
 
 
xi 
 
LIST OF FIGURES 
 
Figure 1. MAP fecal shedding patterns from five seven cows. .................................... 33 
 
Figure 2. Diagram of hidden Markov model 1. ............................................................ 36 
 
Figure 3. Transient distributions and emission distributions in two and three hidden 
state models. ................................................................................................................. 42 
 
Figure 4. Progression patterns through disease states. ................................................. 45 
 
Figure 5 M. bovis core phylogeny and sample geography. .......................................... 71 
 
Figure 6. Population structure assessed in SNP variation and gene presence absence 
patterns. ........................................................................................................................ 73 
 
Figure 7. Accessory gene labels are created through redundant gene annotation. ....... 77 
 
Figure 8. Filtered gene presence absence patterns in the M. bovis pangenome. .......... 80 
 
Figure 9. Gene content variation in a Minnesota outbreak. ......................................... 83 
 
Figure 10. Methodological differences result in drastically different pangenome sizes.
 ...................................................................................................................................... 86 
 
Figure 11. Within host population simulation schematic. .......................................... 115 
 
Figure 12. Phylogeny of study samples colored by geographic region and three levels 
of RhierBAPS phylogenetic clustering. ..................................................................... 120 
 
Figure 13. Distinguishing polyclonal from single source infections using allele 
frequencies. ................................................................................................................. 122 
 
Figure 14. Model comparison using log10 Bayes factors. ........................................... 125 
 
Figure 15. Model parameter estimates calculated with Approximate Bayesian 
Computation. .............................................................................................................. 126 
 
Figure 16. Simulation structure and graphical hypothesis. ........................................ 149 
 
Figure 17. Phylogeny of TB-Profiler predicted resistant samples and closely related 
samples. ...................................................................................................................... 153 
 
xii 
 
Figure 18. TB-Profiler predicted resistance genotypes for M. bovis samples from 
human, cattle, and deer hosts. ..................................................................................... 155 
 
Figure 19. Correlation between resistance-causing mutations and potential 
compensatory mutations. ............................................................................................ 156 
 
Figure 20. Distribution of mutations associated with rifampicin resistant, and other 
correlated mutations in RNA polymerase subunit genes rpoB and rpoC. Clades 
without correlated rpoB/C mutations were collapsed for clarity. ............................... 157 
 
Figure 21. Variable and fixed mutations in RNA polymerase subunit genes plotted 
along a radial depiction of the M. bovis phylogeny shown in Figure 17. .................. 158 
 
Figure 22. Allele frequencies of variable SNPs associated with resistance and linked 
rpoC compensatory mutations. ................................................................................... 159 
 
Figure 23. Prevalence of resistance-causing mutations katG that emerged during 
forward genetic simulations. ...................................................................................... 161 
xiii 
 
LIST OF TABLES 
 
Table 1 Descriptions and initial values of parameters in three hidden state models. ... 38 
 
Table 2. Model comparison using AIC. ....................................................................... 39 
 
Table 3. Transition and emission probability estimates. .............................................. 40 
 
Table 4. Components of each model structure tested ................................................. 115 
 
Table 5. Prior distributions for simulation parameters ............................................... 118 
 
Table 6. Fixed model parameter values. ..................................................................... 118 
 
Table 7. Parameter values used in model assessment. ............................................... 118 
 
Table 8. Within host mutations in genes associated with antimicrobial resistance .... 123 
 
Table 9. Distribution of de novo SNPs among gene categories ................................. 124 
 
Table 10. Simulation model parameters. .................................................................... 151 
xiv 
 
  LIST OF ABBREVIATIONS 
 
MAC Mycobacterium avium complex 
MTBC Mycobacterium tuberculosis complex 
MAP Mycobacterium avium subsp paratuberculosis 
SFS Site frequency spectrum 
AMR Antimicrobial resistance 
ROS Reactive oxygen species 
RDQMA Regional Dairy Quality Alliance Management 
CFU Colony forming units 
HMM Hidden Markov Model 
SNP Single nucleotide polymorphism 
fSNP Fixed single nucleotide polymorphism 
vSNP Variable single nucleotide polymorphism 
indel Insertion/deletion   
MCMC Markov Chain Monte Carlo 
ABC Approximate Bayesian Computation 
PAS para-aminosalicylic acid 
CNN Convolutional Neural Network 
       
xv 
 
CHAPTER 1 
Introduction 
Pathogenic mycobacteria 
 
The genus Mycobacteria contains medically important human and animal pathogens 
including the Mycobacterium avium complex (MAC) and the Mycobacterium 
tuberculosis complex (MTBC) and environmental, opportunistic pathogens including 
Mycobacterium canetti, and environmental non-pathogenetic species. The MAC and 
MTBC fall within two distinct clades within the slow growing mycobacteria, that 
evolved from a rapidly growing ancestor [1]. The MTBC, including M. tuberculosis, 
the cause of tuberculosis in humans, and M. bovis, the cause of bovine tuberculosis 
evolved from a free-living ancestor, which, through a combination of gene gain and 
gene loss adapted to an obligate intracellular lifestyle characterized by clonal growth 
and little to no recombination [2]. Similarly, Mycobacterium avium subsp. 
paratuberculosis (MAP), the cause of Johne’s disease in ruminants evolves clonally; 
however horizontal gene transfer is common in other M. avium species [3].  
 
Impact of mycobacterial diseases of cattle 
 
Both MAP and M. bovis cause granulomatous disease in ruminants with long latency 
periods, which prevents rapid infection detection by clinical signs. MAP causes Johne’s 
disease, which is characterized by a severe granulomatous enteritis. MAP infection 
causes a progressive loss of milk production that precedes clinical signs of disease 
driven by intestinal malabsorption including diarrhea, wasting and eventually death [4]. 
16 
17 
Johne’s disease is endemic in the United States and is responsible for severe economic 
losses to the dairy industry through decreased milk production [5, 6]. Once a farm is 
infected with MAP, it is difficult to control because of its environmental persistence and 
because of the long latency period of 2-5 years before clinical signs of disease appear 
[7-9]. Because of its endemic status in the United States, much of the current disease 
control efforts are toward disease control rather than eradication. Additionally, there is 
variability in fecal shedding patterns among infected animals, and high shedding 
animals increase environmental contamination and MAP persistence on dairy farms 
[10]. Mathematical modeling has been used extensively to aid in disease management 
on farms [11-13] and has also elucidated the impact of high-shedding animals on the 
MAP infection on dairy farms [12].   
  In contrast, M. bovis, the cause of bovine tuberculosis, has been the object of a 
National Eradication Program since its inception in 1917. Although eradication efforts 
successfully reduced the prevalence of M. bovis infection from 5% to less than 0.001% 
[14], bovine tuberculosis remains an important agricultural disease from both an 
economic and a public health perspective. M. bovis has a wide host range including cattle, 
white-tailed deer (Odocoileus virginianus), and humans [15], and causes decreased milk 
and meat production in cattle. Humans can become infected by contact with infected 
animals or, more commonly through consumption of unpasteurized milk [16]. Risk of 
zoonotic M. bovis infection has been greatly reduced in developed countries by a 
combination of effective test and slaughter control programs and milk pasteurization; 
however, M. bovis is still responsible for 1-2% of human tuberculosis cases in the US 
 
18 
[17]. Despite over 100 years of directed national control, outbreaks still occur in cattle 
every year. 
There are two explanations for the difficulty in eliminating bovine tuberculosis. 
First, M. bovis infection in cattle remains subclinical, and potentially undetected for 
months to years until lesions become large enough to impair organ function [18]. 
Second, there are multiple sources of M. bovis infection in cattle including spillover 
from wildlife, animal movement, zoonotic transmission, and fomite transmission. The 
only known wildlife reservoir of bovine tuberculosis in the US is a population of white-
tailed deer in Michigan, and cross species transmission remains an important source of 
cattle infection in this region [19]. Cattle trade between Mexico and the US is common 
and bovine tuberculosis prevalence is 14.2% in some regions in Mexico [20]. 
Movement of animals from Mexico to the US could result in transmission; however, 
animal movement and wildlife do not account for all outbreaks in the US. Remaining 
cases may be linked to zoonotic transmission [21, 22]. 
Due to the potential lag time between infection and detection, as well as the 
multiple possible sources of infection, identifying the source of infection in a herd is 
difficult. Although thorough sampling is routine in traceback investigations, animal 
movement records are often incomplete, and other potential sources of infection, 
including zoonotic transmission, are difficult to characterize. To maximize the 
information available from outbreak investigations, whole genome sequencing of M. 
bovis isolates from infected herds was incorporated into the bovine tuberculosis 
eradication program by the National Veterinary Services Laboratories in 2013 [21].  
 
19 
Genomic epidemiology of M. bovis 
 
Pathogen whole genome sequencing can provide useful information on outbreak 
dynamics that is difficult or impossible to discern using non-molecular epidemiologic 
methods including determining single vs multiple introductions of pathogen into a 
population [23], understanding the role of different host species in multi-host 
transmission events [19], characterizing the role of within-host evolution on 
transmission dynamics [24] and estimating estimate the time of transmission events 
[25]. Molecular epidemiology, at its core is based on the foundational mathematical 
theory developed by Wright [26], Fisher [27], Kimura [28] and others, that describes 
how genomes evolve through mutation, drift and selection. Models including the 
Wright-Fisher model and the Kingman’s coalescent theory [29] enable linking 
evolutionary theory to real polymorphism data, enabling inference of epidemic history 
through the evolution of pathogens. The power of molecular data to understand how 
pathogens evolve and move through host populations has exploded with the increased 
accessibility of pathogen whole genome sequencing, however, model assumptions must 
be evaluated to ensure proper application to host-pathogen systems. In particular, 
methods that rely on the Kingman coalescent theory based on the Wright-Fisher 
population model for inference on epidemiologically relevant evolutionary processes 
may be inappropriate for some host-pathogen systems. Specifically, strong linkage 
disequilibrium in Mycobacterium tuberculosis complex (MTBC) species including M. 
bovis due to low recombination rates violates the independent sites assumption of the 
Wright Fisher model. Additionally, population bottlenecks caused by transmission and 
 
20 
potential differences in reproductive success of individuals violates the small variance 
offspring distribution assumption of the Kingman’s coalescent [30]. 
Violations of these basic assumptions in the evolution of MTBC can be 
visualized in the site frequency spectrum (SFS), where an excess of rare variants is seen 
compared to the neutral SFS calculated under the infinite mutations model and the 
Kingman’s coalescent [30, 31]. Both clonal progeny skew caused by wide variance 
offspring distribution and exponential population growth, consistent with demography 
in an outbreak setting, have been demonstrated to cause a shift in the SFS towards an 
excess of rare variants [32]. Additionally, the effects of progeny skew and exponential 
growth can be distinguished by differing right tail distributions in the SFS. Lastly, 
negative selection can also cause a left shift in the SFS because deleterious alleles are 
prevented from reaching high frequencies in the population. Recent studies have found 
multiple-merger coalescent models [33]  and explicit modeling of skewed offspring 
distributions [34] describe M. tuberculosis evolution better than standard Kingman’s 
coalescent based models. Additionally, since both M. bovis and MAP evolve clonally 
with no horizontal gene transfer, the pangenome of both species should reflect this trend 
toward genome reduction with compact accessory genomes. Recently, the MAP 
pangenome was found to have a compact pangenome with no detected recombination 
[35] however MTBC pangenome estimates suggest both M. tuberculosis and M. bovis 
have rather large pangenomes, inconsistent with an inability to acquire new genes 
through horizontal gene transfer [36, 37]. Applying models that do not adequately 
account for the evolutionary biology of a pathogen system may lead to erroneous 
conclusions. For example, using a simulation structure that incorporated high variance 
 
21 
offspring distributions, Morales-Arce et al. found that the observed distribution of 
genetic variants within hosts was best described by a mutation rate orders of magnitude 
greater than that described using coalescent methods [34]. 
 
Within host evolution and the emergence of resistance  
 
A consequence of a rapid mutation rate may be that antimicrobial resistant genotypes 
may evolve by drift more commonly than expected assuming a slow mutation rate. 
Antimicrobial resistance (AMR) is a global health crisis in the control of tuberculosis 
in humans, and since there is no known mechanism for horizontal gene transfer, all new 
resistance evolution occurs through mutation. Although it is well understood that 
mutations arising through drift or a combination of drift and selection cause AMR, the 
cellular mechanisms of mutation generation in the context of host and pathogen 
interaction are less understood. Bacterial mutations may arise due to random errors in 
replication but may also result from host driven interactions. For example, upon 
phagocytosis, the macrophage produces reactive oxygen species (ROS) which can cause 
double stranded DNA breaks and can kill engulfed pathogens. If a bacterial cell survives 
the macrophage ROS defense, it may need to repair in double stranded breaks. MTBC 
species can repair double stranded breaks through homologous recombination or non-
homologous ends joining, however these mechanisms, particularly non-homologous 
ends joining, are error prone and can create mutations. Additionally, if ROS driven DNA 
damage occurs in DNA repair genes, a strain can become hypermutable [38]. Therefore, 
ROS driven DNA damage could be an additional source of potential AMR causing 
mutations. Variability in ROS production by the host or ROS tolerance by bacteria could 
 
22 
also influence intracellular survival and the evolution of AMR, and mutations have been 
identified in M. tuberculosis strains related to ROS tolerance [39]. Similarly, variability 
in T cell response may also influence bacterial mutations and within host diversity. In a 
mouse model of tuberculosis, comparing strain diversity in T cell knock out mice to 
wild type mice, Copin et al. found that mice without T cells had clonal infections, 
whereas mice with T cells contained strains with multiple different mutations; therefore, 
T cells may play a role in generating within host diversity [40]. These random or host-
driven diversity generating mechanisms coupled with population dynamics influenced 
by small population sizes during transmission events may lead to the evolution of AMR 
genotypes without targeted positive selection by antimicrobial treatment. A low level of 
AMR may evolve in M. bovis in animal hosts in the absence of antimicrobial use, and 
this basal AMR prevalence may reflect the underlying distribution of AMR genotypes 
in M. tuberculosis that selection acts upon under exposure to antimicrobials.  
 
Dissertation Objectives 
 
In this dissertation we aim to both understand the evolutionary mechanisms that drive 
genetic variation in mycobacterial pathogens and to develop tools to aid in disease 
control. In chapter 2 we discuss a time-series analysis of MAP fecal shedding data to 
predict future progression patterns by classing individual animals into MAP shedding 
groups based on current and past colony forming unit counts. In chapter 3 we dive into 
the methodological challenges associated with using short read data in constructing 
accurate pangenome estimates, and find that M. bovis has a compact genome, consistent 
with clonal evolution. In chapter 4 we investigate the distinct roles of selection, 
 
23 
demography, and progeny skew in M. bovis evolution over the course of an outbreak in 
a cattle herd and find that the combined diversity reducing effects of skewed offspring 
distributions and population bottlenecks and the diversity generating effect of a 
relatively rapid mutation rate describe the within-host evolution of M. bovis in animal 
hosts. Lastly, in chapter 5 we explore the effects of a rapid mutation rate and population 
bottlenecks on the evolution of antimicrobial resistance in the absence of antimicrobial 
treatment.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
24 
References 
 
1. Bachmann NL, Salamzade R, Manson AL, Whittington R, Sintchenko V et 
al. Key Transitions in the Evolution of Rapid and Slow Growing. Front Microbiol 
2019;10:3019. 
2. Veyrier FJ, Dufort A, Behr MA. The rise and fall of the Mycobacterium 
tuberculosis genome. Trends Microbiol 2011;19(4):156-161. 
3. Reva O, Korotetskiy I, Ilin A. Role of the horizontal gene exchange in 
evolution of pathogenic Mycobacteria. BMC Evol Biol 2015;15 Suppl 1:S2. 
4. Rathnaiah G, Zinniel DK, Bannantine JP, Stabel JR, Gröhn YT et al. 
Pathogenesis, Molecular Genetics, and Genomics of. Front Vet Sci 2017;4:187. 
5. USDA-APHIS. NAHMS. Johne’s Disease on U.S. Dairies, 1991-2007. 2008. 
6. Ott SL, Wells SJ, Wagner BA. Herd-level economic losses associated with 
Johne's disease on US dairy operations. Prev Vet Med 1999;40(3-4):179-192. 
7. Whittington RJ, Sergeant ES. Progress towards understanding the spread, 
detection and control of Mycobacterium avium subsp paratuberculosis in animal 
populations. Aust Vet J 2001;79(4):267-278. 
8. Cocito C, Gilot P, Coene M, de Kesel M, Poupart P et al. Paratuberculosis. 
Clin Microbiol Rev 1994;7(3):328-345. 
9. Whitlock RH, Buergelt C. Preclinical and clinical manifestations of 
paratuberculosis (including pathology). Vet Clin North Am Food Anim Pract 
1996;12(2):345-356. 
10. Slater N, Mitchell RM, Whitlock RH, Fyock T, Pradhan AK et al. Impact of 
the shedding level on transmission of persistent infections in Mycobacterium avium 
subspecies paratuberculosis (MAP). Vet Res 2016;47:38. 
11. Mitchell RM, Whitlock RH, Grohn YT, Schukken YH. Back to the real 
world: connecting models with data. Prev Vet Med 2015;118(2-3):215-225. 
12. Smith RL, Schukken YH, Grohn YT. A new compartmental model of 
Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet 
Med 2015;122(3):298-305. 
13. Al-Mamun MA, Smith RL, Schukken YH, Grohn YT. Use of an Individual-
based Model to Control Transmission Pathways of Mycobacterium avium Subsp. 
paratuberculosis Infection in Cattle Herds. Sci Rep 2017;7(1):11845. 
14. Humphrey HM, Orloski KA, Olea-Popelka FJ. Bovine tuberculosis slaughter 
surveillance in the United States 2001-2010: assessment of its traceback investigation 
function. BMC Vet Res 2014;10:182. 
15. Fitzgerald SD, Kaneene JB. Wildlife reservoirs of bovine tuberculosis 
worldwide: hosts, pathology, surveillance, and control. Vet Pathol 2013;50(3):488-499. 
16. Pérez-Lago L, Comas I, Navarro Y, González-Candelas F, Herranz M et al. 
Whole genome sequencing analysis of intrapatient microevolution in Mycobacterium 
tuberculosis: potential impact on the inference of tuberculosis transmission. J Infect Dis 
2014;209(1):98-108. 
17. Scott C, Cavanaugh JS, Pratt R, Silk BJ, LoBue P et al. Human Tuberculosis 
Caused by Mycobacterium bovis in the United States, 2006-2013. Clin Infect Dis 
2016;63(5):594-601. 
 
25 
18. Domingo M, Vidal E, Marco A. Pathology of bovine tuberculosis. Res Vet Sci 
2014;97 Suppl:S20-29. 
19. Salvador LCM, O'Brien DJ, Cosgrove MK, Stuber TP, Schooley AM et al. 
Disease management at the wildlife-livestock interface: Using whole-genome 
sequencing to study the role of elk in Mycobacterium bovis transmission in Michigan, 
USA. Mol Ecol 2019;28(9):2192-2205. 
20. Álvarez JBD, Clifford JR. United States – Mexico Joint Strategic Plan for 
Collaboration on Bovine Tuberculosis In: APHIS, editor.: USDA; 2013. 
21. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole 
Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United 
States, 1989-2018. Front Vet Sci 2018;5:253. 
22. Lombard JE, Patton EA, Gibbons-Burgener SN, Klos RF, Tans-Kersten JL 
et al. Human-to-Cattle. Front Vet Sci 2021;8:691192. 
23. Pandey P, Bhatnagar AK, Mohan A, Sachdeva KS, Samantaray JC et al. 
Mycobacterium tuberculosis polyclonal infections through treatment and recurrence. 
PLoS One 2020;15(8):e0237345. 
24. Theys K, Libin P, Pineda-Peña AC, Nowé A, Vandamme AM et al. The 
impact of HIV-1 within-host evolution on transmission dynamics. Curr Opin Virol 
2018;28:92-101. 
25. Salje H, Wesolowski A, Brown TS, Kiang MV, Berry IM et al. 
Reconstructing unseen transmission events to infer dengue dynamics from viral 
sequences. Nat Commun 2021;12(1):1810. 
26. Wright S. Evolution in Mendelian Populations. Genetics 1931;16(2):97-159. 
27. Fisher RA. Darwinian evolution of mutations. Eugen Rev 1922;14(1):31-34. 
28. Kimura M. The neutral theory of molecular evolution. Sci Am 1979;241(5):98-
100, 102, 108 passim. 
29. Kingman JFC. The coalescent. Stochastic Processes and their Applications 
1982;13(3):235-248. 
30. Eldon B, Wakeley J. Coalescent processes when the distribution of offspring 
number among individuals is highly skewed. Genetics 2006;172(4):2621-2633. 
31. Li LM, Grassly NC, Fraser C. Genomic analysis of emerging pathogens: 
methods, application and future trends. Genome Biol 2014;15(11):541. 
32. Irwin KK, Laurent S, Matuszewski S, Vuilleumier S, Ormond L et al. On 
the importance of skewed offspring distributions and background selection in virus 
population genetics. Heredity (Edinb) 2016;117(6):393-399. 
33. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of 
Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 
34. Morales-Arce AY, Harris RB, Stone AC, Jensen JD. Evaluating the 
contributions of purifying selection and progeny-skew in dictating within-host 
Mycobacterium tuberculosis evolution. Evolution 2020;74(5):992-1001. 
35. Richards VP, Nigsch A, Pavinski Bitar P, Sun Q, Stuber T et al. Evolutionary 
genomic and bacteria GWAS analysis of Mycobacterium avium subsp. paratuberculosis 
and dairy cattle Johne's disease phenotypes. Appl Environ Microbiol 2021. 
36. Reis AC, Cunha MV. The open pan-genome architecture and virulence 
landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 
 
26 
37. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning 
and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic 
signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 
38. Gorna AE, Bowater RP, Dziadek J. DNA repair systems and the pathogenesis 
of Mycobacterium tuberculosis: varying activities at different stages of infection. Clin 
Sci (Lond) 2010;119(5):187-202. 
39. Mestre O, Hurtado-Ortiz R, Dos Vultos T, Namouchi A, Cimino M et al. 
High throughput phenotypic selection of Mycobacterium tuberculosis mutants with 
impaired resistance to reactive oxygen species identifies genes important for 
intracellular growth. PLoS One 2013;8(1):e53486. 
40. Copin R, Wang X, Louie E, Escuyer V, Coscolla M et al. Within Host 
Evolution Selects for a Dominant Genotype of Mycobacterium tuberculosis while T 
Cells Increase Pathogen Genetic Diversity. PLoS Pathog 2016;12(12):e1006111. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
27 
 
 
 
 
 
 
 
 
 
CHAPTER 2 
Characterizing infectious disease progression through discrete states using hidden 
Markov models 
 
 
Kristina M. Ceres1, Ynte H. Schukken2, Yrjö T. Gröhn1  
 
1  Department of Population Medicine and Diagnostic Sciences, College of Veterinary 
Medicine, Cornell University, Ithaca, NY 
2 Department of Animal Sciences, Wageningen University, Wageningen, the 
Netherlands   
 
Chapter 2 is published in PLOS ONE 
 
 
 
 
 
 
 
 
 
 
 
 
28 
Abstract 
 
Infectious disease management relies on accurate characterization of disease 
progression so that transmission can be prevented. Slowly progressing infectious 
diseases can be difficult to characterize because of a latency period between the time an 
individual is infected and when they show clinical signs of disease. The introduction of 
Mycobacterium avium ssp. paratuberculosis (MAP), the cause of Johne’s disease, onto 
a dairy farm could be undetected by farmers for years before any animal shows clinical 
signs of disease. In this time period infected animals may shed thousands of colony 
forming units. Parameterizing trajectories through disease states from infection to 
clinical disease can help farmers to develop control programs based on targeting 
individual disease state, potentially reducing both transmission and production losses 
due to disease. We suspect that there are two distinct progression pathways; one where 
animals progress to a high-shedding disease state, and another where animals maintain 
a low-level of shedding without clinical disease. We fit continuous-time hidden Markov 
models to multi-year longitudinal fecal sampling data from three US dairy farms, and 
estimated model parameters using a modified Baum-Welch expectation maximization 
algorithm. Using posterior decoding, we observed two distinct shedding patterns: cows 
that had observations associated with a high-shedding disease state, and cows that did 
not. This model framework can be employed prospectively to determine which cows 
are likely to progress to clinical disease and may be applied to characterize disease 
progression of other slowly progressing infectious diseases. 
 
 
 
29 
Introduction 
 
Slowly progressing infectious diseases, like tuberculosis in humans and animals, 
HIV/AIDS in humans, and Johne’s disease in cattle, are difficult to characterize because 
they are often associated with a latency period between the time an individual is infected 
and when they show clinical signs or symptoms of disease. Treatment efficacy and 
transmission dynamics are dependent on the state of infection, and understanding an 
individual’s infection state, and probability of transitioning among infectious states at 
the population level, can help provide better treatment protocols and targeted 
epidemiologic interventions. For example, only patients with active tuberculosis 
infections are likely to transmit Mycobacterium tuberculosis to others; however, 
targeting latently infected individuals that have a high probability to progress to active 
disease with preventative therapeutics could prevent transmission from occurring [1].   
Johne's disease is caused by Mycobacterium avium subsp. paratuberculosis 
(MAP) and is characterized by a chronic granulomatous enteritis leading to diarrhea, 
wasting and eventually death. Johne’s disease is considered an important production 
disease, especially in the dairy industry, and is estimated to cost 200 million dollars per 
year in production losses and early culling in the United States [2]. MAP has also been 
implicated as a possible cause or contributing factor to Crohn’s disease in humans; 
therefore, zoonotic MAP transmission may be relevant to public health [3].  Johne's 
Disease is pervasive in dairy farms in the United States. In 2007, 68 percent of dairy 
operations tested positive for MAP in at least one environmental sample [4]. MAP is 
mainly transmitted among cattle through the fecal-oral route and animals can become 
infected by eating contaminated material from their environment [5]. Calves are more 
 
30 
susceptible than other age groups, and often become infected either in-utero or soon 
after birth; however, they may not show clinical signs of disease for 2-5 years [5-7]. 
MAP can also survive for up to a year in certain environmental conditions, which could 
allow new infections even after an infected animal is removed from the herd [5, 8]. Due 
to, among other reasons, persistence in the environment and a long latency period before 
infected animals show clinical signs, Johne's disease is very difficult to control.  
In order to effectively control Johne's disease, it is essential to understand how 
the infection progresses. It is currently hypothesized that animals in certain disease 
states transiently shed MAP in their feces, and it is known that some animals shed much 
more MAP than others [9]. Schukken et al. observed two distinct patterns of disease 
progression among infected animals [10]; animals that progressed to a high shedding 
disease state, and those that controlled infection. Furthermore, Mitchell et al. observed 
that most animals that shed MAP intermittently never progress to a high shedding state, 
but those that do progress maintain a consistent high level of shedding [11]. 
Johne's disease infection dynamics have been studied extensively using 
compartmental and agent-based models [12-14]; however, these models contain a large 
number of parameters, and it is sometimes necessary to rely on assumptions instead of 
data in order to fully parameterize the model. In order to study Johne's and other disease 
dynamics it is usually necessary for the investigator to divide disease progression into a 
series of discrete states and determine reasonable rates of movement between the states 
to parameterize the model. Some model parameters can be estimated using field data, 
but parameters regarding transition among disease states are less obviously estimable, 
especially in the case of Johne's disease, where animals can maintain an infectious state 
 
31 
for years without any visible signs of disease. With slowly progressing diseases like 
Johne's, it may be difficult or in many animals impossible to observe direct signs of 
subclinical disease states, so determining parameters for rates of transition among 
disease states is non-trivial. Hidden Markov models (HMMs) offer one solution to the 
problem of estimating these parameters by using unsupervised machine learning to 
estimate model parameters including disease state transition rates. A validated HMM 
can also be used as a diagnostic to predict disease progression outcomes in infected 
cows.    
We constructed sequences of MAP fecal culture results from repeated samples 
of individual animals sampled biannually up to 6 years from three Northeast US dairy 
herds, and used these sequences to infer the underlying hidden Johne's disease states in 
a HMM framework. Our objectives were to estimate disease state and transition 
parameters among infected cows, and to determine if two or more types of disease 
progression existed. We hypothesized that the most probable path through hidden states 
identified by posterior decoding would show two distinct most probable state 
progression paths that represent separate MAP infection progression types.  
 
Methods 
 
Data description 
 
Fecal samples were taken from all adult animals on three northeastern farms 
between 2004 and 2009 biannually, totaling 6530 samples from 1714 cows as part of 
the Regional Dairy Quality Alliance Management (RDQMA) program [15]. The overall 
period prevalence was 5.9%, and fecal shedding was variable among infected cows 
 
32 
ranging from 0 to 6360 colony forming units (CFU). Cows were sampled between 2 and 
11 times with only one cow sampled 11 times over 5.5 years. Fecal samples from each 
cow were placed in four 15 ml solid medium slant tubes, and were processed and 
cultured as described previously [15, 16]. The sum of CFU cultured from the four tubes 
was multiplied by 5.3 to account for dilutions, and the log base 10 of the resulting 
quantity was used as the observed CFU count for model fitting. Positive samples were 
defined as greater than 0 CFU in a fecal sample. Samples were taken on average 15 
weeks apart with a median of 14 weeks apart; however, sampling intervals ranged from 
1 to 40 weeks. To avoid misclassification of healthy cows as infected, we only included 
cows that had at least two positive fecal culture samples out of three consecutive 
samples. Samples that were not readable due to contamination with other bacterial 
growth or fungal growth were removed from the analysis. After removing samples that 
did not meet inclusion criteria, we were left with 129 observations. Fecal shedding 
patterns of seven representative cows are shown in Figure 1.  
 
 
 
 
 
 
 
 
 
 
33 
 
Figure 1. MAP fecal shedding patterns from five seven cows. 
Shedding patterns were variable among individuals. The number of samples taken from 
each cow was also variable, ranging from 2 to 11 samples. The variation in both CFU 
count and number of samples may be related to disease progression, as animals with 
high CFU counts may be culled early due to clinical MAP infection related production 
losses.   
 
 
 
 
 
 
 
34 
Model description  
 
A HMM is a probabilistic model which describes the transition through unobserved 
states in a Markovian system. Although the hidden state of the system cannot be 
observed directly, information about the state can be inferred indirectly through model 
emissions. This framework is useful for modeling MAP infection progression because 
the underlying disease state is not readily observable, but MAP shedding in feces can 
be easily measured and can be used in the HMM framework as model emissions. We 
developed four candidate continuous time HMMs to model MAP infection progression 
in dairy cows. Since samples were taken at irregular intervals, we chose to model disease 
progression using a continuous time HMM framework as opposed to a discrete time 
framework. In a continuous time HMM, both the hidden state at the time a sample was 
taken and the number and types of transitions between hidden states between two 
consecutive observations are unknown [17]. Continuous time HMMs are defined by a 
vector of initial hidden state probabilities, 𝝅𝟎 that define the probability of being in each 
state at time 0, a state transition rate matrix Q that defines the rate of transition among 
hidden states, and emission probability distributions, 𝑬.	The emission probability 
distributions, which are defined as the probability of observing a log10 CFU given an 
underlying hidden disease state k, were modeled using gamma distributions, 
𝐸"~𝐺𝑎𝑚𝑚𝑎(𝛼" , 𝜃"),	where		𝛼" 	and 𝜃"are gamma distribution shape and scale 
parameters, respectively. The probability of transitioning from state j to state k depends 
on the interval of time t and is defined as	𝑃#"(𝑡) = 	 𝑒$%.  
The number of true disease states is unknown, but models with very large state 
spaces may not provide a useful representation of MAP infection from veterinary 
 
35 
standpoint since as the number of states increases the differences in disease phenotype 
between states may diminish, and states without a remarkable associated change in 
disease phenotype do not have obvious clinical relevance.  Additionally, the potential 
model state space that can be explored is limited by our relatively low sample size of 
cows with MAP infection. Therefore, we chose to create a candidate model set including 
models with two to five hidden states to capture clinically relevant HMM structures. A 
graphical description of an example three hidden state model is shown in Figure 2. In 
Figure 2, the state labels “0”, “1” and “2” represent disease states associated with low, 
moderate, and high CFU counts, respectively. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
36 
 
Figure 2. Diagram of hidden Markov model 1.   
This diagram represents a hidden Markov model with three hidden states labeled 0, 1, 
and 2. The model is shown with initial state probabilities, π&, time dependent transition 
probabilities, P and emissions, E, which have gamma distributions.  
 
 
37 
Parameter learning 
 
Maximum likelihood estimates for model parameters, 𝜋&, 𝑄, 𝛼, 𝜃, were 
computed using a modified Baum-Welch expectation maximization (EM) algorithm 
developed for continuous time HMMs by Liu et al. [17]. We used the “Expm” method, 
which relies on the matrix exponential of an auxiliary matrix A to calculate EM 
estimates of model parameters. The auxiliary matrix, 𝐴 = 7𝑄 𝐵0 𝑄:,	was calculated for 
each state i, and edge i,j.  B is a square matrix with the same dimensions of Q and has 1 
in the (i,j) position and 0 elsewhere. This method was determined to be the most robust 
EM algorithm of the three presented by Liu et al. At each iteration of Expm EM, gamma 
shape and scale parameters were updated using the Newton-Raphson method, which is 
further described in the supplemental S1 file. The algorithm guarantees an increase in 
model likelihood with each iteration, and was terminated when the change in likelihood 
was less than	0.001.	All algorithms were implemented in Python 3.7.0 using open 
source libraries: Scipy [18], NumPy[19], and pandas[20].  
 
Initial parameterization and model selection 
 
Each model was initialized with 250 sets of parameter values drawn from uniform 
distributions shown in Table 1. Initial state probabilities 𝝅𝟎 were normalized such that 
each set of initial state probabilities	summed to 1. The fit of each model state structure 
was assessed using Akaike’s information criterion (AIC) calculated as 2𝑝 − 2𝑙𝑜𝑔(𝐿), 
where L is the model likelihood and p is the number of parameters estimated in a specific 
model.  
 
 
38 
Table 1 Descriptions and initial values of parameters in three hidden state models. 
Parameter Description Initial Value  
𝜋&! 	 Probability of starting in each hidden state k Unif(0,1) 
𝛼" 	 Gamma distribution shape parameter for state k Unif(1,20) 
𝜃" 	 Gamma distribution scale parameter for state k Unif(0.01, 1) 
𝑞#" 	 Transition rate from state j to state k Unif(0.01,5) 
Four hidden Markov model structures with two, three, four or five hidden states were 
initialized with 250 sets of random parameter values drawn from uniform distributions. 
We chose parameter values for transition rates and emission distributions that would 
produce reasonable state transition probabilities and gamma distributions to fit the 
observed CFU values.  We allowed the initial state probabilities to range from 0 to 1 
reflecting our limited knowledge on the proportion of animals in each underlying 
disease state.  
 
Bootstrap confidence intervals 
 
The CFU shedding dataset of 129 cows was sampled with replacement to 
generate 1000 bootstrap samples of the same size as the original dataset using the 
resample function from sklearn[21]. Model transition and emission parameters were 
estimated for each sample using the initialization parameter combination that produced 
the lowest AIC score. 95% confidence intervals were calculated for each transition and 
emission probability using the 2.5 and 97.5 percentiles of the parameter values 
generated using the bootstrap samples.  
 
 
 
 
39 
Results and discussion 
 
Model likelihood and AIC 
 
The AIC values for each model type (two, three, four, five states) is shown in 
Table 2. 
Table 2. Model comparison using AIC. 
Model pi log(Li) AICi Rel. Li w(AIC)i 
3 State 14 -95.83 219.66 1 0.993 
2 State 7 -107.86 229.72 6.5 E -3 0.006 
4 State 23 -106.24 258.48 3.7 E -9 3.7 E -9 
5 State 34 -104.38 276.76 4.0 E -13 4.0 E -13 
 
Rel. Li represents the relative log likelihood for model i and is calculated as 
𝑒𝑥𝑝 E'()"#$*'()#F		where is AICmin the minimum AIC value in the candidate set. w(AIC)+ i 
is the AIC weight for model i and is calculated as ,-..0#∑ . pi is the number of parameters !,-.	0!
in model i, and log(Li) is the log likelihood for model i.  
 
The three state model had the lowest AIC and had the largest AIC weight so it was 
determined to be the best fitting model out of the four types tested to minimize 
information loss.  This suggests that the three state model fits the data best with our 
limited sample size, but it is possible that with a larger dataset a different state structure 
would be preferred. For example, if there were more consecutive samples taken for each 
cow over a longer period of time, there may be stronger support for a model with a 
larger number of states as this dataset may contain finer scale changes in shedding 
patterns that were not evident in our comparatively sparse dataset. On the other hand, 
factors such as individual heterogeneity and outliers may lead AIC to favor models with 
 
40 
larger state spaces when smaller state spaces are more biologically realistic[22]. 
Although the three state model had the lowest AIC, since AIC may favor larger state 
spaces, and since both the two and three state models had AIC weights greater than zero, 
for the remainder of this paper will focus on results for both the two and three state 
models. The initial parameter combination that produced the lowest AIC value for the 
two, three, four and five state models are shown in supplementary S1 Table. Parameter 
estimates for the four and five state models are shown in supplementary S2 Table.  
Parameter estimates 
 
EM estimates for two and three hidden state model parameters are shown in 
Table 3.  
Table 3. Transition and emission probability estimates. 
Parameter Description Estimate 95% CI 
Two state models 
𝜋&	 Probability of starting in state 0 0.45 (0.30, 0.64) 
𝛼& Gamma distribution shape parameter 17.56 (4.81, 26.33) 
for state 0 
𝛼3 Gamma distribution shape parameter 17.86 (13.86, 184.37) 
for state 1 
𝜃& Gamma distribution scale parameter 0.14 (0.09, 0.82) 
for state 0 
𝜃3 Gamma distribution scale parameter 0.39 (0.04, 0.48) 
for state 1 
𝛼&𝜃& Gamma distribution mean for state 0 2.44 (2.30, 3.95) 
𝛼3𝜃3 Gamma distribution mean for state 1 6.87 (6.49, 8.03) 
𝑞&3 Transition rate from state 0 to 1 0.03 (0.01, 2.90) 
𝑞3& Transition rate from state 1 to 0 2.00e-3 (1.22e-07, 1.02) 
Three state models 
𝜋&	 Probability of starting in state 0 0.43 (0.29, 0.62) 
𝜋3	 Probability of starting in state 1 0.21 (4.11e-36, 0.39) 
𝛼& Gamma distribution shape parameter 19.41 (4.79, 30.07) 
for state 0 
 
41 
𝛼3 Gamma distribution shape parameter 16.24 (12.29, 15307.72) 
for state 1 
𝛼+ Gamma distribution shape parameter 173.20 (36.00, 239.35) 
for state 2 
𝜃& Gamma distribution scale parameter 0.12 (0.08, 0.78) 
for state 0 
𝜃3 Gamma distribution scale parameter 0.35 (3.27e-4,0.39) 
for state 1 
𝜃+ Gamma distribution scale parameter 0.05 (0.03, 0.20) 
for state 2 
𝛼&𝜃& Gamma distribution mean for state 0 2.40 (2.17, 3.75) 
𝛼3𝜃3 Gamma distribution mean for state 1 5.64 (3.83, 6.18) 
𝛼+𝜃+ Gamma distribution mean for state 2 7.80 (7.14, 8.12) 
𝑞&3 Transition rate from state 0 to 1 0.03 (0.01, 0.98) 
𝑞&+ Transition rate from state 0 to 2 1.13e-40 (1.89e-46, 0.48) 
𝑞3& Transition rate from state 1 to 0 4.65e-39 (1.86e-131, 1.46) 
𝑞3+ Transition rate from state 1 to 2 0.03 (0.02, 6.15) 
𝑞+& Transition rate from state 2 to 0 3.00e-3 (5.12e-143, 0.60) 
𝑞+3 Transition rate from state 2 to 1 1.50e-06 (1.50e-8, 1.49) 
 
Parameter estimates for the two and three hidden disease models were generated using 
the modified Baum-Welch Expectation Maximization algorithm and bootstrap 
confidence intervals, using initial values in supplementary Table S1. 
 
Transient distributions of the Markov chain are shown in Figure 3A. A 
histogram of the transformed CFU data is shown in Figure 3B and fitted emission 
distributions for the two and three state models are shown in Figure 3C. The CFU count 
data shows a large number of low CFU observations, fewer mid-range CFU counts 
between 3 and 5 CFU and a larger number of observations between 5 and 9 CFU. The 
CFU count data are well approximated by the fitted gamma emissions, which show 
higher densities for low and high CFU counts than for mid-range CFU counts.  
 
42 
 
Figure 3. Transient distributions and emission distributions in two and three hidden 
state models. 
The transient distribution for each Markov chain was calculated for time t=0,..,300 
weeks (A).  Fitted emission distributions are shown in (C), and the observed data are 
shown in grey (B). The number of random samples used to create each plotted state 
emission distribution was proportional to the stationary distribution value for that 
hidden state. 
 
The stationary distributions 𝜋4	are calculated by solving 𝜋4𝑄 = 0, and can be 
approximated by the equilibrium distribution shown in Figure 3A, where for large t,  
𝜋*$%& ≈ 𝜋4. The stationary distribution for the two state model was 𝜋4 = [0.07, 0.93], 
and for the three state model was 𝜋4 = [0.09, 0.09, 0.81]. For both the stationary 
 
43 
distributions in the two and three state models, there was a very high probability of being 
in the highest disease state (state 1 in the two state model and state 2 in the three state 
model). Thus, the majority of cows are expected to reach a disease state associated with 
high CFU shedding; however, a minority of cows are expected to have observations 
associated with a low-shedding disease state.  
The sojourn time before transitioning between hidden states is exponentially 
distributed with parameter -𝑞55, where 𝑞55 	is a diagonal element of the Q matrix. In both 
the two and three state models, the probability of remaining within a state is much higher 
than transitioning between states. The average sojourn time for state 0 in the two state 
model was 37.7 weeks, and the average sojourn time in the three state model was 38.3 
weeks for state 0 and 37.8 weeks for state 2; therefore, on average cows remained in a 
low CFU-emitting disease hidden state for about 38 weeks before transitioning to a 
higher CFU-emitting disease state. Unlike the two state model, the three state model 
contains an intermediate state (state 1) that is associated with a moderate level of MAP 
shedding. In the three state model, the transition rate from state 0 to 1 is much higher 
than the transition rate from state 2 to 1, and the transition rate from state 1 to 2 is much 
higher than the transition rate from state 1 to 0. Similarly, in both the two and three state 
models, the transition rate between lowest shedding disease states to the highest 
shedding disease state was much higher that the transition rate from a high shedding 
disease state to lower shedding disease states. Together, this indicates that transition 
from any higher shedding state to a lower shedding state is less likely than either 
remaining in the current state or transitioning to a higher shedding disease state, which 
is consistent with the known progression of MAP infection from subclinical infection 
 
44 
associated with low MAP shedding to clinical Johne’s disease associated with high 
MAP shedding. In addition to transitioning from low shedding to high shedding states, 
transition rate estimates in the three state model suggest that it is possible to transition 
from state 2 to state 0, and that this transition is more likely than transitioning from state 
2 to 1. This type of transition may exist in an intermittent high shedding progression 
pattern, where cows have observations associated with a high shedding disease state 
flanked by observations associated with a low shedding disease state; however, this 
pattern is only clearly demonstrated by one cow, C_106, shown in Figure 4B, where 
between 50 and 65 weeks there is a transition from state 2 to state 0.  
The maximum posterior probabilities of a cow being in a particular disease state 
at each sampling time were constructed to visualize progression patterns of individual 
cows, and these individual progression patterns are shown in Figure 4A. The observed 
and predicted log(CFU) MAP values for representative cows is shown in Figure 4B. 
The predicted log(CFU) MAP values were generated by taking 1000 random samples 
from the fitted gamma emission distributions from the state with the highest posterior 
probability for each observation. The majority of the predicted log(MAP) 10-90 
percentiles overlapped the observed log(MAP) values; however, the three state model 
10-90 percentile range predictions were narrower and overlapped observed values more 
often than the two state model predictions indicating that the three state model fit the 
data better, especially for intermediate log(CFU) values.  
 
45 
 
Figure 4. Progression patterns through disease states. 
Progression patterns of individual cows through disease states estimated using posterior 
decoding are shown for both the two state and the three state models (A). The color of 
the line indicates the state with the highest posterior probability at the last observation. 
 
46 
The lines were jittered vertically so that overlapping lines could be distinguished. The 
empirical and predicted log(CFU) MAP value from seven representative cows shows 
variability in fecal shedding over the sampling period, with some cows shedding very 
high CFU counts over the sampling period and others maintaining a lower CFU count 
(B) The median, and 10-90 percentile range for the predicted log(CFU) distributions are 
shown as points and bars, respectively. The predicted CFU values are colored according 
to the fitted state emission distribution for each sample. The empirical log(MAP) CFU 
counts are shown in black. 
 
In both the two and three state model structures, the majority of cows had 
increasing fecal CFU counts over the sampling period; however, a small group of cows 
maintained a low CFU count. This group, labeled as having an observation at the last 
sampling interval in the lowest disease state in Figure 4A, represents the proportion of 
cows with consecutive low CFU counts that did not progress to a high-shedding disease 
state. The stationary distributions of the Markov chains and the differential posterior 
decoded patterns demonstrate that there are at least two types of MAP infection 
progression patterns, where most cows progress to a disease state associated with higher 
MAP shedding in both the two and three state models. These “progressors” shed higher 
MAP loads over the course of their infection than “non-progressors” that remain in 
lower CFU-emitting disease states. This suggests that there may be a minority of 
infected cattle that do not shed large amounts of MAP, and thus may play a smaller role 
in the transmission chain within a farm. If these animals are high producers, it may be 
reasonable to keep them on the farm despite their positive MAP infection status. 
 
47 
 
Potential factors related to disease progression path 
 
Our study provides additional evidence to support that at least two progression 
patterns exist among cows infected with MAP. Previous studies have suggested that age 
at time of infection plays a role in disease progression type [11]; however, we were not 
able to determine the age at the time of infection since all animals were infected 
naturally, and we only had samples from adult animals. There is also evidence that host 
immune factors are related to Johne’s disease progression in cattle and other species. 
De Silva et al. found that high fecal MAP DNA quantity and lower interferon gamma 
response early in life were positively associated with progressing to clinical disease in 
sheep, suggesting that early immune response may play a role in disease progression 
path [23]. Koets et al. found that cows with mutations in the Toll-like receptor 2 gene 
may show an increased macrophage activity, increased T cell activation and reduced 
susceptibility to MAP infection [24]. Future research incorporating disease path 
progression using HMM and immune monitoring could lead to breeding cattle that are 
resistant to progression to the state of high shedding.  
 
Future directions and Conclusions  
 
Mathematical modeling has been an important tool to study Johne’s disease 
control due to the complex nature of the slowly progressing disease dynamics and farm 
dynamics. Some previous modeling studies have included non-progressing and 
progressing patterns through discrete disease states [13, 25]; however no study has 
estimated probabilities of transition among those disease states, or the probability of 
 
48 
shedding CFU count given membership in a specific disease state. The addition of state 
transition probabilities from our model can improve the accuracy of predicted 
progression patterns in mathematical models, and the addition of emission probability 
estimates to simulation models can more accurately model disease transmission 
dynamics. In addition to the utility of our model estimates in simulation studies, our 
HMM can also be used for disease progression path prediction on infected farms. After 
validation in an external population, the model can be incorporated into Johne’s disease 
control programs to predict which progression paths are more likely given a series of 
fecal culture samples from individual cows. With this knowledge, and with knowledge 
from future simulation studies that take advantage of a multi-progression path disease 
structure, farmers can make more informed decisions on which animals to cull 
depending on their predicted disease paths.  
The proposed model includes simplifying assumptions that do not fully 
characterize the complex nature of disease progression. One such simplification is 
creating discrete disease states and fecal shedding states, where in reality disease 
progression is a continuous process through many possible disease states. Our ability to 
resolve fine-scale differences between disease states is limited by the intervals between 
samples, and by the number of samples taken from each cow. It is possible that including 
covariates such as milk production data or sampling cows more frequently and over a 
longer period of time could provide a more accurate description of disease progression. 
Our analysis may also underestimate the true prevalence of MAP infected cows at each 
sampling interval because some positive animals may not have been identified using 
fecal culture. Other diagnostic tests exist relying on serological data, but fecal culture 
 
49 
may be the most reliable marker for disease progression [26]. Another potential 
limitation is that expectation maximization may identify local maxima, and may not 
find global maxima; however, initializing the model with a range of parameter values 
allowed us to explore a variety of potential maxima. Additionally, our analysis may be 
influenced by survival bias because cows with more advanced MAP infection may 
produce less milk than cows with subclinical infection, which results in higher culling 
rates in lower producing cows. This could result in an underrepresentation of severe 
cases with higher MAP CFU counts than would be expected if no culling decisions were 
related to MAP infection stage. Although we expect this survival bias to reduce the 
expected sojourn time of the hidden states associated with higher CFU counts, we 
believe our data are representative of MAP infection patterns on typical dairy farms that 
make culling decisions based on milk production statistics. Lastly, the parameter 
estimates generated by the model reflect progression patterns among cows included in 
our study, and the model needs to be validated in an external population to determine if 
the progression patterns are true of all infected animals.  
Johne’s disease is difficult to control due to a number of factors including its 
slow progression and is extremely difficult to eliminate from a farm due to 
environmental persistence. We determined that there are at least two distinct progressing 
patterns, non-progressing and progressing, among infected animals in the three herds 
studied. Non-progressors are likely to have low fecal MAP CFU counts associated with 
a low-shedding disease state, whereas progressors are more likely to transition from low 
to high disease state, which is associated with a high (> 600) CFU count. Parameters 
from this model can be used to inform disease simulation models to test control 
 
50 
strategies that include methods targeted at managing progressors. After model 
validation, this model can also be used on infected farms to predict future disease states 
after a relatively small number of observed fecal culture counts. This modeling 
framework can be used to classify progression in other slowly progressing infectious 
diseases.  
 
Acknowledgements  
 
We would like to acknowledge the RDQMA research team for their support in 
collecting the samples and providing farm data. We would also like to acknowledge Dr. 
Annete Nigsch for her assistance in the MAP database curation.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51 
References  
1. Ehlers S, Schaible UE. The granuloma in tuberculosis: dynamics of a host-
pathogen collusion. Front Immunol. 2012;3:411. Epub 2013/01/12. doi: 
10.3389/fimmu.2012.00411.  
2. Ott SL, Wells SJ, Wagner BA. Herd-level economic losses associated with 
Johne's disease on US dairy operations. Prev Vet Med. 1999;40(3-4):179-92. Epub 
1999/07/29. doi: 10.1016/s0167-5877(99)00037-9.  
3. McNees AL, Markesich D, Zayyani NR, Graham DY. Mycobacterium 
paratuberculosis as a cause of Crohn's disease. Expert Rev Gastroenterol Hepatol. 
2015;9(12):1523-34. Epub 2015/10/17. doi: 10.1586/17474124.2015.1093931. 
4. USDA-APHIS. NAHMS. Johne’s Disease on U.S. Dairies, 1991-2007. 2008. 
5. Whittington RJ, Sergeant ES. Progress towards understanding the spread, 
detection and control of Mycobacterium avium subsp paratuberculosis in animal 
populations. Aust Vet J. 2001;79(4):267-78. Epub 2001/05/15. doi: 10.1111/j.1751-
0813.2001.tb11980.x.  
6. Cocito C, Gilot P, Coene M, de Kesel M, Poupart P, Vannuffel P. 
Paratuberculosis. Clin Microbiol Rev. 1994;7(3):328-45. Epub 1994/07/01. doi: 
10.1128/cmr.7.3.328.  
7. Whitlock RH, Buergelt C. Preclinical and clinical manifestations of 
paratuberculosis (including pathology). Vet Clin North Am Food Anim Pract. 
1996;12(2):345-56. Epub 1996/07/01. doi: 10.1016/s0749-0720(15)30410-2.  
8. Grewal SK, Rajeev S, Sreevatsan S, Michel FC, Jr. Persistence of 
Mycobacterium avium subsp. paratuberculosis and other zoonotic pathogens during 
simulated composting, manure packing, and liquid storage of dairy manure. Appl 
Environ Microbiol. 2006;72(1):565-74. Epub 2006/01/05. doi: 10.1128/AEM.72.1.565-
574.2006.  
9. Slater N, Mitchell RM, Whitlock RH, Fyock T, Pradhan AK, Knupfer E, et al. 
Impact of the shedding level on transmission of persistent infections in Mycobacterium 
avium subspecies paratuberculosis (MAP). Vet Res. 2016;47:38. Epub 2016/03/02. doi: 
10.1186/s13567-016-0323-3.  
10. Schukken YH, Whitlock RH, Wolfgang D, Grohn Y, Beaver A, VanKessel J, et 
al. Longitudinal data collection of Mycobacterium avium subspecies Paratuberculosis 
infections in dairy herds: the value of precise field data. Vet Res. 2015;46:65. Epub 
2015/06/21. doi: 10.1186/s13567-015-0187-y.  
11. Mitchell RM, Schukken Y, Koets A, Weber M, Bakker D, Stabel J, et al. 
Differences in intermittent and continuous fecal shedding patterns between natural and 
experimental Mycobacterium avium subspecies paratuberculosis infections in cattle. 
Vet Res. 2015;46:66. Epub 2015/06/21. doi: 10.1186/s13567-015-0188-x.  
12. Mitchell RM, Whitlock RH, Grohn YT, Schukken YH. Back to the real world: 
connecting models with data. Prev Vet Med. 2015;118(2-3):215-25. Epub 2015/01/15. 
doi: 10.1016/j.prevetmed.2014.12.009.  
13. Smith RL, Schukken YH, Grohn YT. A new compartmental model of 
Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet 
Med. 2015;122(3):298-305. Epub 2015/11/02. doi: 10.1016/j.prevetmed.2015.10.008.  
 
52 
14. Al-Mamun MA, Smith RL, Schukken YH, Grohn YT. Use of an Individual-
based Model to Control Transmission Pathways of Mycobacterium avium Subsp. 
paratuberculosis Infection in Cattle Herds. Sci Rep. 2017;7(1):11845. Epub 2017/09/21. 
doi: 10.1038/s41598-017-12078-z.  
15. Pradhan AK, Van Kessel JS, Karns JS, Wolfgang DR, Hovingh E, Nelen KA, et 
al. Dynamics of endemic infectious diseases of animal and human importance on three 
dairy herds in the northeastern United States. J Dairy Sci. 2009;92(4):1811-25. Epub 
2009/03/25. doi: 10.3168/jds.2008-1486.  
16. Pradhan AK, Mitchell RM, Kramer AJ, Zurakowski MJ, Fyock TL, Whitlock 
RH, et al. Molecular epidemiology of Mycobacterium avium subsp. paratuberculosis in 
a longitudinal study of three dairy herds. J Clin Microbiol. 2011;49(3):893-901. Epub 
2011/01/07. doi: 10.1128/JCM.01107-10.  
17. Liu YY, Li S, Li F, Song L, Rehg JM. Efficient Learning of Continuous-Time 
Hidden Markov Models for Disease Progression. Adv Neural Inf Process Syst. 
2015;28:3599-607. Epub 2015/01/01.  
18. Jones Eea. Open source scientific tools for Python. 2001. 
19. van der Walt S, Colbert SC, Varoquaux G. The NumPy Array: A Structure for 
Efficient Numerical Computation. Computing in Science & Engineering. 
2011;13(2):22-30. doi: 10.1109/mcse.2011.37. 
20. W. M. Data Structures for Statistical Computing in Python. 2010:51-6. 
21. Pedregosa al. Scikit-learn: Machine Learning in Python. Journal of Machine 
Learning Research. 2011;12:2825-30. 
22. Pohle J, Langrock R, van Beest FM, Schmidt NM. Selecting the Number of 
States in Hidden Markov Models: Pragmatic Solutions Illustrated Using Animal 
Movement. Journal of Agricultural, Biological and Environmental Statistics. 
2017;22:270-93. 
23. de Silva K, Begg DJ, Plain KM, Purdie AC, Kawaji S, Dhand NK, et al. Can 
early host responses to mycobacterial infection predict eventual disease outcomes? Prev 
Vet Med. 2013;112(3-4):203-12. Epub 2013/09/17. doi: 
10.1016/j.prevetmed.2013.08.006.  
24. Koets A, Santema W, Mertens H, Oostenrijk D, Keestra M, Overdijk M, et al. 
Susceptibility to paratuberculosis infection in cattle is associated with single nucleotide 
polymorphisms in Toll-like receptor 2 which modulate immune responses against 
Mycobacterium avium subspecies paratuberculosis. Prev Vet Med. 2010;93(4):305-15. 
Epub 2009/12/17. doi: 10.1016/j.prevetmed.2009.11.008.  
25. Marquetoux N, Mitchell R, Ridler A, Heuer C, Wilson P. A synthesis of the 
patho-physiology of Mycobacterium avium subspecies paratuberculosis infection in 
sheep to inform mathematical modelling of ovine paratuberculosis. Vet Res. 
2018;49(1):27. Epub 2018/03/09. doi: 10.1186/s13567-018-0522-1.  
26. Magombedze G, Shiri T, Eda S, Stabel JR. Inferring biomarkers for 
Mycobacterium avium subsp. paratuberculosis infection and disease progression in 
cattle using experimental data. Sci Rep. 2017;7:44765. Epub 2017/03/21. doi: 
10.1038/srep44765.  
 
 
 
53 
Supplementary Materials  
File S1. Description of gamma parameter estimation 
 
Newton-Raphson Estimation of gamma emission parameters. 
MAP CFU emissions 𝑥,	were modeled with gamma distributions, with the following 
pdf: 
𝑥6*3𝑒*7/9
𝑓(𝑥|𝛼, 𝜃) = Γ(𝛼)𝜃6 	 
And log likelihood function:  
𝑥
log(𝐿) = Π;5:3𝑓(𝑥|𝛼, 𝜃) = 	−𝑁𝑙𝑜𝑔Γ(𝛼) − 𝛼𝑁𝑙𝑜𝑔𝜃 + (𝛼 − 1) log(∑𝑥
5
5) − ∑ 𝜃  
The maximum likelihood estimator for the scale parameter 𝜃X is 7̅, however there is no 
6
closed form maximum likelihood estimate for the shape parameter 𝛼.	Therefore, an 
estimate was generated using Newton Raphson estimation. The methods of moments 
estimator was used as the initial estimate for 𝛼: 
𝑁𝑥+
𝛼& = ∑; (𝑥 − ?̅?)+
	 
5:3 5
Following the initial estimation of 𝛼&,	Newton Raphson Estimation was used to 
iteratively calculate 𝛼[" 	for each state k: 
log(𝛼 ) − 𝜓(𝛼 ) − log(?̅?) + 𝑙]𝑜]]𝑔]]𝑥]
𝛼=-> = 𝑎?.@ −
?.@ ?.@
1  − 𝜓A𝛼 (𝛼?.@)?.@
where 𝑥]]] = ∑ B#,!7# ,  𝑙]𝑜]]𝑔]]𝑥] = ∑ B#,!CDE	 7#" 5 ∑ 5 ∑  and 𝑝
G!(5)J!(5)
B B 5,"
= .  
! #,! ! #,! ∑!G!(5)J!(5)		
Estimation was stopped when the 𝛼 value changed by less than 0.0001 or if the 
number of Newton-Raphson iterations exceeded 10000. If the number of iterations 
exceeded this threshold, the run was terminated. 
 
54 
Table S1. Initial parameter values  
Parameter Description Initial Value  
Two state models 
𝜋&	 Estimate for probability of starting in state 0 0.44 
𝛼& Gamma distribution shape estimate for state 0 4.42 
𝛼3 Gamma distribution shape estimate for state 1 10.18 
𝜃& Gamma distribution scale estimate for state 0 0.84 
𝜃3 Gamma distribution scale estimate for state 1 0.45 
𝑞&3 Transition rate estimate from state 0 to 1 4.35 
𝑞3& Transition rate estimate from state 1 to 0 0.38 
Three state models 
𝜋&	 Estimate for probability of starting in state 0 0.28 
𝜋3	 Estimate for probability of starting in state 1 0.40 
𝛼& Gamma distribution shape estimate for state 0 1.09 
𝛼3 Gamma distribution shape estimate for state 1 9.88 
𝛼+ Gamma distribution shape estimate for state 2 19.8 
𝜃& Gamma distribution scale estimate for state 0 0.42 
𝜃3 Gamma distribution scale estimate for state 1 0.76 
𝜃+ Gamma distribution scale estimate for state 2 0.51 
𝑞&3 Transition rate estimate from state 0 to 1 0.60 
𝑞&+ Transition rate estimate from state 0 to 2 0.20 
𝑞3& Transition rate estimate from state 1 to 0 2.26 
𝑞3+ Transition rate estimate from state 1 to 2 4.60 
𝑞+& Transition rate estimate from state 2 to 0 0.99 
𝑞+3 Transition rate estimate from state 2 to 1 0.66 
Four state models 
𝜋&	 Estimate for probability of starting in state 0 0.23 
𝜋3	 Estimate for probability of starting in state 1 0.29 
𝜋+	 Estimate for probability of starting in state 2 0.45 
𝛼& Gamma distribution shape estimate for state 0 2.83 
𝛼3 Gamma distribution shape estimate for state 1 8.65 
𝛼+ Gamma distribution shape estimate for state 2 13.4 
𝛼K Gamma distribution shape estimate for state 3 18.9 
𝜃& Gamma distribution scale estimate for state 0 0.62 
𝜃3 Gamma distribution scale estimate for state 1 0.97 
 
55 
𝜃+ Gamma distribution scale estimate for state 2 0.96 
𝜃K Gamma distribution scale estimate for state 3 0.45 
𝑞&3 Transition rate estimate from state 0 to 1 2.13 
𝑞&+ Transition rate estimate from state 0 to 2 1.15 
𝑞&K Transition rate estimate from state 0 to 3 2.45 
𝑞3& Transition rate estimate from state 1 to 0 1.71 
𝑞3+ Transition rate estimate from state 1 to 2 1.11 
𝑞3K Transition rate estimate from state 1 to 3 1.36 
𝑞+& Transition rate estimate from state 2 to 0 4.12 
𝑞+3 Transition rate estimate from state 2 to 1 1.02 
𝑞+K Transition rate estimate from state 2 to 3 3.82 
𝑞K& Transition rate estimate from state 3 to 0 4.83 
𝑞K3 Transition rate estimate from state 3 to 1 4.01 
𝑞K+ Transition rate estimate from state 3 to 2 1.26 
Five state models 
𝜋&	 Estimate for probability of starting in state 0 0.05 
𝜋3	 Estimate for probability of starting in state 1 0.14 
𝜋+	 Estimate for probability of starting in state 2 0.31 
𝜋K	 Estimate for probability of starting in state 3 0.13 
𝛼& Gamma distribution shape estimate for state 0 9.69 
𝛼3 Gamma distribution shape estimate for state 1 11.4 
𝛼+ Gamma distribution shape estimate for state 2 15.0 
𝛼K Gamma distribution shape estimate for state 3 15.0 
𝛼L Gamma distribution shape estimate for state 4 18.5 
𝜃& Gamma distribution scale estimate for state 0 0.74 
𝜃3 Gamma distribution scale estimate for state 1 0.60 
𝜃+ Gamma distribution scale estimate for state 2 0.36 
𝜃K Gamma distribution scale estimate for state 3 0.22 
𝜃L Gamma distribution scale estimate for state 4 0.88 
𝑞&3 Transition rate estimate from state 0 to 1 4.79 
𝑞&+ Transition rate estimate from state 0 to 2 3.66 
𝑞&K Transition rate estimate from state 0 to 3 3.29 
𝑞&L Transition rate estimate from state 0 to 4 2.05 
𝑞3& Transition rate estimate from state 1 to 0 0.12 
𝑞3+ Transition rate estimate from state 1 to 2 3.53 
 
56 
𝑞3K Transition rate estimate from state 1 to 3 1.58 
𝑞3L Transition rate estimate from state 1 to 4 1.13 
𝑞+& Transition rate estimate from state 2 to 0 2.88 
𝑞+3 Transition rate estimate from state 2 to 1 0.84 
𝑞+K Transition rate estimate from state 2 to 3 4.81 
𝑞+L Transition rate estimate from state 2 to 4 1.93 
𝑞K& Transition rate estimate from state 3 to 0 0.68 
𝑞K3 Transition rate estimate from state 3 to 1 2.54 
𝑞K+ Transition rate estimate from state 3 to 2 0.59 
𝑞KL Transition rate estimate from state 3 to 4 3.72 
𝑞L& Transition rate estimate from state 4 to 0 1.46 
𝑞L3 Transition rate estimate from state 4 to 1 4.36 
𝑞L+ Transition rate estimate from state 4 to 2 2.58 
𝑞LK Transition rate estimate from state 4 to 3 4.11 
 
These input parameter values produced the lowest AIC values for the two, three, four 
and five state models.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
57 
S2 Table. Four and five state model parameter estimates. 
Parameter Description 4 state model 5 state model 
Four state models  
𝜋&	 Probability of starting in state 0 0.448 1.728e-51 
𝜋3	 Probability of starting in state 1 0.196 0.211 
𝜋+	 Probability of starting in state 2 0.195 0.173 
𝜋K	 Probability of starting in state 3 0.160 0.427 
𝜋L	 Probability of starting in state 4 -- 0.190 
𝛼& Gamma distribution shape parameter 17.67 81742 
for state 0 
𝛼3 Gamma distribution shape parameter 21.72 229.0 
for state 1 
𝛼+ Gamma distribution shape parameter 4923 19.09 
for state 2 
𝛼K Gamma distribution shape parameter 336.8 19.29 
for state 3 
𝛼L Gamma distribution shape parameter -- 5055 
for state 4 
𝜃& Gamma distribution scale parameter 0.139 7.526e-5 
for state 0 
𝜃3 Gamma distribution scale parameter 0.261 0.033 
for state 1 
𝜃+ Gamma distribution scale parameter 0.002 0.271 
for state 2 
𝜃K Gamma distribution scale parameter 0.023 0.124 
for state 3 
𝜃L Gamma distribution scale parameter -- 0.002 
for state 4 
𝛼&𝜃& Gamma distribution mean for state 0 2.448 6.152 
𝛼3𝜃3 Gamma distribution mean for state 1 5.674 7.581 
𝛼+𝜃+ Gamma distribution mean for state 2 8.643 5.182 
𝛼K𝜃K Gamma distribution mean for state 3 7.642 2.400 
𝛼L𝜃L Gamma distribution mean for state 4 -- 8.646 
𝑞&3 Transition rate from state 0 to 1 2.373 6.419 
𝑞&+ Transition rate from state 0 to 2 1.561 5.271 
𝑞&K Transition rate from state 0 to 3 3.263 3.628 
𝑞&L Transition rate from state 0 to 4 -- 2.566 
𝑞3& Transition rate from state 1 to 0 1.537 0.086 
𝑞3+ Transition rate from state 1 to 2 1.351 3.798 
𝑞3K Transition rate from state 1 to 3 1.626 1.298 
 
58 
𝑞3L Transition rate from state 1 to 4 -- 1.057 
𝑞+& Transition rate from state 2 to 0 3.036 1.997 
𝑞+3 Transition rate from state 2 to 1 0.838 0.784 
𝑞+K Transition rate from state 2 to 3 3.755 3.683 
𝑞+L Transition rate from state 2 to 4 -- 1.679 
𝑞K& Transition rate from state 3 to 0 3.628 0.619 
𝑞K3 Transition rate from state 3 to 1 3.348 3.087 
𝑞K+ Transition rate from state 3 to 2 1.283 0.777 
𝑞KL Transition rate from state 3 to 4 -- 4.209 
𝑞L& Transition rate from state 4 to 0 -- 1.165 
𝑞L3 Transition rate from state 4 to 1 -- 4.681 
𝑞L+ Transition rate from state 4 to 2 -- 2.974 
𝑞LK Transition rate from state 4 to 3 -- 3.633 
Parameter estimates for the two and three hidden disease models were generated using 
the modified Baum-Welch Expectation Maximization algorithm 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
59 
 
 
 
 
 
CHAPTER 3 
 
 
A critical evaluation of Mycobacterium bovis pangenomics, with reference to its utility 
in outbreak investigation 
 
 
Kristina M. Ceres1,2, Michael J. Stanhope1,2 & Yrjö T. Gröhn1,2 
1. Department of Population Medicine and Diagnostic Sciences, College of 
Veterinary Medicine, Cornell University 
2. Population and Ecosystem Health, College of Veterinary Medicine, Cornell 
University 
 
Chapter 3 is accepted for publication in Microbial Genomics 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60 
 
Abstract 
 
The increased accessibility of next generation sequencing has allowed enough genomes 
from a given bacterial species to be sequenced to describe the distribution of genes in 
the pangenome, without limiting analyses to genes present in reference strains. 
Although some taxa have thousands of whole genome sequences available on public 
databases, most genomes were sequenced with short read technology, resulting in 
incomplete assemblies. Studying pangenomes could lead to important insights into 
adaptation, pathogenicity, or molecular epidemiology, however given the known 
information loss inherent in analyzing contig-level assemblies, these inferences may be 
biased or inaccurate. In this study we describe the pangenome of a clonally evolving 
pathogen, Mycobacterium bovis, and examine the utility of gene content variation in M. 
bovis outbreak investigation. We constructed the M. bovis pangenome using 1463 de 
novo assembled genomes. We tested the assumption of strict clonal evolution by 
studying evidence of recombination in core genes and analyzing the distribution of 
accessory genes among core monophyletic groups. To determine if gene content 
variation could be utilized in outbreak investigation, we carefully examined accessory 
genes detected in a well described M. bovis outbreak in Minnesota. We found significant 
errors in accessory gene classification. After accounting for these errors, we show that 
M. bovis has a much smaller accessory genome than previously described and provide 
evidence supporting ongoing clonal evolution and a closed pangenome, with little gene 
content variation generated over outbreaks. We also identified frameshift mutations in 
multiple genes, including a mutation in glpK, which has recently been associated with 
 
61 
antibiotic tolerance in Mycobacterium tuberculosis. A pangenomic approach enables a 
more comprehensive analysis of genome dynamics than is possible with reference-
based approaches; however, without critical evaluation of accessory gene content, 
inferences of transmission patterns employing these loci could be misguided.  
 
Impact statement 
 
Dissecting true gene content variation, from noise generated by analysis of incomplete 
genomes, is a necessary step to understand the role of horizontal gene transfer in the 
evolution of the M. bovis pangenome. Here we show that analysis of incomplete 
genomes can lead to incorrect identification of accessory genes, which artificially 
inflates the pangenome size. When correcting for these annotation errors, we show 
that M. bovis evolves clonally, and has a very compact accessory genome. We also 
show that there is limited variation in gene content among outbreak sequences and that 
this variation is driven by indels, suggesting indel variation could be useful in 
distinguishing between highly similar SNP profiles in epidemiological analysis. 
 
Data summary 
 
Sequence reads are available on the National Centre for Biotechnology Information 
(NCBI) Sequence Reads Archive. A full list of accession numbers for sequences used 
in this study are available in Supplemental File 1. The M. bovis gene presence/absence 
matrix is available in Supplemental File 2. The quality of genome assemblies was 
assessed using contig count, N50 and checkM completeness and contamination 
measures in addition to manual screening. Data processing scripts are publicly available 
 
62 
on https://github.com/kmceres/Mbovis_pangenome. Supplementary materials are 
available on Figshare: https://figshare.com/s/c9665c733243e68326f7 
 
Introduction 
 
Mycobacterium bovis is a member of the Mycobacterium tuberculosis complex 
(MTBC), the collection of pathogenic bacteria including M. tuberculosis, which causes 
tuberculosis in humans and M. bovis, which causes bovine tuberculosis in cattle and 
other mammalian species. MTBC species are unique among bacterial taxa because 
unlike many species, they have highly conserved genomes and are thought to evolve 
strictly clonally [1, 2]. A consequence of strict clonality is relatively little gene content 
variation compared to species that can generate diversity through horizontal gene 
transfer (HGT).  
In addition to HGT, variability in gene content can arise through de novo gene 
formation, or through gene deletion. The competing forces of gene gain and gene loss 
create a U-shaped gene frequency spectrum, with many genes shared by most genomes 
on one end of a gene frequency spectrum, and a similar number of rare genes on the 
other end. Since MTBC genomes are thought to evolve clonally, without HGT, MTBC 
accessory genomes are hypothesized to be very small, and have fewer rare genes 
compared to other taxa. Recent MTBC studies, however, challenge the idea of a small 
accessory MTBC genome, with estimates of MTBC accessory genome size ranging 
from 2086 genes (36% of all genes) [3] to 7620 genes (69% of all genes) [4]. 
Importantly, a recent M. bovis study by Reis and Cunha suggests that M. bovis has an 
open pangenome [5], implying new M. bovis genes would be discovered with each new 
 
63 
genome sequenced. If M. bovis does have an open pangenome, it would suggest M. 
bovis does not evolve clonally.  
Similarly, studies investigating homologous recombination in MTBC have 
yielded conflicting results. Experimental evidence has shown that while the closely 
related free-living Mycobacterium canettii could accept donor DNA via homologous 
recombination, MTBC could not [1]. However, despite this lack of experimental 
evidence, computational methods sometimes identify recombination in MTBC genomes 
[3, 6, 7]. Some of the recombination in MTBC species identified through computational 
methods stems from poor quality data or unreliable alignments. For example, Godfroid 
et al. identified an imbalance in the number of recombination events between terminal 
branches and internal nodes, with a skew towards more recombination events on 
terminal branches and an unequal distribution of recombination events among genomes 
[8]. Furthermore, phylogenies created with these erroneous regions conflicted with the 
established M. tuberculosis phylogenies, leading the authors to conclude that any true 
recombination signal that was identified was greatly outweighed by noise stemming 
from poor data quality [8]. 
Resolving the conflict between bioinformatic detection of recombination and 
large accessory genomes and experimental evidence suggesting clonality is necessary 
to understand MTBC genome dynamics and their implication for outbreak analysis and 
the evolution of important phenotypes like antimicrobial resistance. Whole genome 
sequencing is an indispensable tool in outbreak investigation, enabling the identification 
of epidemiological links between individuals or groups that would otherwise be difficult 
to ascertain by comparing pathogen sequences. However, accurate reconstruction of a 
 
64 
phylogenetic tree relies on identifying regions of the genome that evolve clonally 
because phylogenetic trees are usually bifurcating trees where each node has exactly 
one parent. To accurately model evolution of a pathogen with proportions of the 
sequence evolving through HGT, nodes must be allowed to have more than one parent 
reflecting the inheritance of genetic material from more than one parental source, 
leading to a network structure or ancestral recombination graph, rather than a bifurcating 
tree structure. Therefore, to infer transmission relationships using a bifurcating tree, 
regions of the genome that are affected by recombination are often excised from 
alignments. If there is ongoing HGT in M. bovis and other MTBC species, alignments 
used to create trees relevant to transmission analysis must be carefully curated to ensure 
only clonally evolving genome segments are included. Additionally, if MTBC species 
have true accessory genomes gene content variation could be used in outbreak analysis 
to discriminate between highly similar core genome sequences.  
Here we investigate the evolutionary forces that shape the M. bovis pangenome 
and explore evolutionary scenarios that could have resulted in observed gene content 
variation using de novo assembled draft M. bovis. We look for evidence of homologous 
recombination and true gene content variation, and carefully examine accessory genes 
for assembly or annotation errors. Since there is no known mechanism for HGT in the 
MTBC, we suspect that gene deletion and gene duplication are the major drivers of 
pangenomic diversity in the MTBC. Consequently, after accounting for technical errors 
created during sequencing or assembly or bioinformatic misclassification, we expect 
that gene content variation patterns will closely resemble core genome single nucleotide 
polymorphism (SNP) variation patterns. We show that M. bovis has a much smaller 
 
65 
accessory genome than previously described and provide evidence supporting ongoing 
clonal evolution. This finding suggests careful examination of pangenome and 
recombination results from bioinformatic analyses is necessary to accurately describe 
MTBC genome dynamics.  
 
Methods 
 
Sequence assembly and annotation 
Raw M. bovis sequence files from Almaw et al. [9], Loiseau et al [10],  Orloski et al. 
[11], Reis & Cunha [6], Rodrigues et al. [12], Zwyer et al. [13] and from BioProject 
PRJNA769553 were downloaded from the NCBI Sequence Read Archive. A full list of 
downloaded samples can be found in Supplemental File 1. Sequences were assembled 
using SPAdes genome assembler v3.14.0[14] in careful mode, and polished using Pilon 
v1.23 [15]. The quality of assemblies was assessed using QUAST v5.0.2 [16] and 
checkM v1.1.3 [17]. To minimize annotation issues inherent with draft genomes, only 
sequences with less than or equal to 150 contigs and less than 5% contamination were 
used for further analysis. It was noted upon visual inspection that some contigs 
contained duplicate gene sequences that were the reverse compliment of each other, 
such that the entire contig was one palindromic sequence. Our view is this was likely 
caused by assembly error, and therefore systematically identified all such palindromic 
sequences (81 contigs total) by searching for contigs that contained reverse compliments 
of the same gene with identical DNA sequences. Each identified contig was visually 
inspected, and sequences that had any confirmed palindromes were removed. Sequences 
 
66 
were annotated using prokka v1.14.16 [18] using the AF2122/97 (NCBI accession 
NC_002945.4) M. bovis reference genome as a guide [19].  
 
Sequence selection 
Before assembling genomes, we determined which lineage each sequence belonged to 
according to the new M. bovis lineage classification scheme suggested by Zwyer et al. 
[13]. We called SNPs for all downloaded fastq using the vSNP pipeline 
(https://github.com/USDA-VS/vSNP) against the Mycobacterium tuberculosis H37Rv 
reference genome (NCBI accession NC_000962.3). SNPs were filtered out if they had 
a read depth less than 10, an allele frequency less than 0.9, and a mapping quality less 
than 30 using vcffilter in vcflib v1.0.2 [20]. Lineages were assigned using H37Rv 
genome positions in Table 4 of Zwyer et al. [13]. 
 
Pangenome analysis 
Panaroo was used to characterize the M. bovis pangenome [21]. To identify the 
reference coordinates of partial genes undergoing pseudogenization, the consensus gene 
sequence created by Panaroo for each gene was blasted to the AF2122/97 M. bovis 
reference, using a similarity e-value cutoff of 1x10-10. A network of strong reference 
gene – accessory gene BLASTn matches was created using the IRanges R package 
v2.24.1, and this network was filtered to contain only gene-reference matches where at 
least 75% of the length of the Panaroo-identified gene overlapped with a reference gene. 
The resulting network was used to identify gene fragments that all mapped to the same 
reference gene, indicating that each fragment was part of the same original reference 
 
67 
gene undergoing pseudogenization. Since we expected to identify gene content variation 
driven by deletion, we investigated whether certain functional categories of genes were 
more likely to be pseudogenized. We used PANTHER and the PANTHER GO Slim 
annotation set to analyze whether certain Gene Ontology groups were overrepresented 
in genes undergoing pseudogenization compared to core genes [22]. Additionally, as a 
negative control for detecting pseudogenization, we determined whether any of the 
pseudogenes identified in this analysis were essential in the H37Rv M. tuberculosis 
reference strain [23]. Recently updated AF2122/97 functional annotations were used to 
map AF2122/97 annotated genes to orthologous coding sequences in M. tuberculosis 
H37Rv [24].  
 
Core genome recombination detection  
A core genome alignment was constructed using parsnp v 1.2 [25], using the M. 
tuberculsosis H37Rv genome as a reference and outgroup. Parsnp uses MUSCLE [26] 
to create gene alignments and FastTree2 [27] to create a core phylogeny. Recombination 
was detected using Gubbins [28] with default parameters and a starting tree created by 
parsnp using a GTR substitution model. For each gene with detected recombination, 
whole-contig alignments were created using progressiveMauve [29] and reviewed in 
Geneious Prime v 2021.2.2 to rule out detection by assembly or alignment error. All 
regions of the genome that Gubbins determined to be recombinant were excised from 
the core gene alignment to ensure only regions of clonal evolution were included in the 
population structure analysis. Trees visualizations were created using iTOL v5 [30] and 
FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). In the process of creating a 
 
68 
core genome alignment, parsnp excises regions of the alignment that are missing from 
at least one genome. Since this resulting alignment is conservative, some potential 
recombination events may be excised from the alignment before it is fed to Gubbins. 
Therefore, in addition to the recombination analysis on the core genome alignment, 
individual gene alignments  that had previously been implicated in recombination events 
in M. bovis [6]  were also assessed for recombination using fastGEAR [31]. The 
individual gene alignments were created using the built-in MAFFT aligner included in 
Panaroo [21, 32]. 
 
SNP validation analysis 
Filtered SNPs were used to validate potential recombination hotspots identified by 
Gubbins and fastGEAR. SNPs were considered valid if they passed the filtering criteria 
described in section 5.2. SNP quality was also verified by manually inspecting read 
mapping files using the Integrative  Genome Viewer [33]. The functional consequence 
of SNPs was assessed using snpEff [34].  
 
Population structure assessed by core and accessory genes 
M. bovis population structure was compared using core SNPs and accessory gene 
content. To analyze population structure using the core genome, principal component 
analysis (PCA) was applied directly to Boolean vector transformed SNP vectors 
following the methods presented in Konishi et al. [35]. In accessory genes population 
structure was assessed using PCA on the Jaccard distance calculated using binary gene 
presence absence patterns using the R package abdiv v0.2.0 [36]. A dendrogram was 
 
69 
created from accessory gene clusters using Ward’s hierarchical clustering on the Jaccard 
distance matrix. All post-processing analysis was done in RStudio 2021.09.0 [37] using 
R v 4.1.2 [38].  
 
Outbreak analysis 
To investigate the utility of gene presence absence data in outbreak investigation, we 
analyzed the pangenome of 62 M. bovis genomes from well-described outbreak that 
occurred in Minnesota from 2005-2012 and was documented in Glaser et al. [39]. The 
outbreak pangenome was assessed using Panaroo and the methods described in section 
5.3. In addition to gene presence absence patterns, we also created a core genome 
phylogenetic tree using the SNP matrix from Table S2 in Glaser et al [39] and IQ-TREE-
2 v 2.1.1 [40] with a GTR substitution model. We also used ScarTrek [41] to call indels. 
Indels were removed if they occurred in both the outgroup, sequences from a Texas 
outbreak (13-3900, 12-5094 and 12-5090), and the Minnesota sequences. Binary indel 
presence/absence information was used to create an indel tree with a JC2 substitution 
model in IQ-TREE-2 [40].  
 
Investigating pangenome openness 
To compare the impact of methodological artifacts on M. bovis pangenome analysis, we 
reanalyzed the genomes included in a recent M. bovis pangenome study by Reis and 
Cunha [5]. We downloaded raw or complete sequence files listed in Supplementary 
Table 1 of Reis and Cunha [5] from NCBI. We processed the non-complete genomes 
and annotated all genomes using the sequence assembly and annotation methods 
 
70 
described in section 5.1. We assembled the pangenome using Panaroo and the post-hoc 
filtering methods described in section 5.3.  
All R scripts and parameters used for other bioinformatic programs are available 
on https://github.com/kmceres/Mbovis_pangenome.  
 
Results 
 
Sequence assembly  
3626 raw fastq files were downloaded from the NCBI Sequence Read Archive. SNPs 
were called on all download samples and were used to assign clonal complex lineage 
labels. After lineage labels were determined, 1781 samples were chosen to be assembled 
to maximize representation of the known M. bovis clonal complexes in our final dataset. 
After assembly and quality filtering, 1463 whole genome sequences remained. All 
sequences were near complete (>= 90% completeness), and all sequences had low or 
moderate contamination (<10% contamination).  949 sequences had no contamination. 
Assembly quality figures are shown in Supplemental Figure 1.  
 
Core phylogeny 
The core phylogeny shown in Figure 5 contained eight clonal complexes labeled using 
the lineage designations, described in Zwyer et al 2018 [13]. The recombination free 
core phylogeny was created using RAxML [42] with a GTR substitution model in 
Gubbins and was rooted at the M. tuberculosis reference, H37Rv. Each core 
phylogenetic group contained samples from multiple geographic regions, reflecting 
multiple transmission events between geographic regions over the complex history of 
 
71 
M. bovis evolution [43]. Our sample includes genomes from every continent except 
Antarctica (Figure 5C and D), and in a variety of host species (Figure 5C), however, 
most samples were from cattle hosts. Lineages La1.7 and La1.8 and samples from the 
United States, Mexico, and New Zealand were over-represented, reflecting the intensive 
ongoing bovine tuberculosis molecular surveillance programs in those regions.  
 
 
Figure 5 M. bovis core phylogeny and sample geography. 
A. A recombination-free core phylogenetic tree constructed in Gubbins with a GTR 
substitution model is shown with clonal complex labels. Sample collection continent 
distribution is shown in B. and host distribution is shown in C. The number of samples 
included from each country is shown in D.  
 
72 
Population structure analysis  
Panaroo identified 3681 core genes present in greater than 95% of isolates, and 1441 
accessory genes including 970 genes present in less than 15% of isolates. The Panaroo 
estimate of accessory genome size was smaller than previous studies [3-5, 43-46], which 
is more consistent with a hypothesis of clonal evolution.  
The population structure assessed using core SNPs and accessory gene presence 
absence patterns is shown in Figure 6. Principal components analysis of the core SNPs 
and pangenome content are shown in Figure 6B and 6C, respectively. A dendrogram 
was created to further visualize the similarity between gene content variation and core 
SNP variation. If the pangenome evolved clonally we would expect to see high 
concordance between core phylogenetic groups and accessory gene content patterns, 
although incomplete concordance was observed.  
 
 
 
 
 
 
 
73 
 
Figure 6. Population structure assessed in SNP variation and gene presence absence 
patterns. 
A. Gene content similarity dendrogram created using the Jaccard distance of the 
pangenome gene presence absence matrix. Core clonal complexes are labeled along the 
circumference of the tree. B-C. Core SNP and gene content variation principal 
components analysis colored by core clonal complexes created with core SNP principal 
components. SNP variation is shown in B and gene content variation is shown in C. 
Gene content variation is not concordant with SNP variation.  
 
74 
To test whether this lack of perfect correlation between genome content variation and 
core phylogenetic grouping was true evidence of a departure from clonality or if there 
was classification error generating the discordant signal, we investigated if and where 
accessory gene consensus sequences mapped to the M. bovis reference genome 
AF2122/97. Only 38 genes did not map to the AF2122/97 reference genome. Two out 
of the 38 had no BLAST matches, 29 mapped to Bos taurus reference genomes and are 
most likely host genome contamination, five mapped to non-MTBC microbes, including 
the Escherichia phage PhiX174 and one, espI_2 mapped to Esx-1 protein EspI on 
several MTBC genomes including M. bovis Mb3601. Similarly, esxM did not blast to 
AF2122/97, but did map to several other MTBC reference genomes. An alignment of 
espI_2 between genomes included in this study, AF2122/97, Mb3601 and M. 
tuberculosis H37rv shows that AF2122/97, and many samples included in this study 
have a 2049 base pair deletion of a duplicated copy of espI (Supplemental Figure 2). 
This one gene deletion is the only example we found of a gene present in our sample 
but not in AF2122/97, indicating that gene content variation is not generated by the 
addition of genes from external sources. We cannot rule out that the phage PhiX174 
infected some M. bovis genomes, however all contigs that contained genes that mapped 
to PhiX174 did not contain any M. bovis genes, suggesting the sequence fragments 
containing PhiX174 genes were not part of M. bovis genomes. Since M. bovis does not 
contain plasmids, it is more likely that these partial phage genes were remnants of the 
standard Illumina phiX internal sequencing control that did not get fully removed as 
part of their automated demultiplexing process. Because we did not find genes contained 
 
75 
in M. bovis genomes that did not map to a M. bovis reference genome, it is unlikely that 
M. bovis acquires new genes from other prokaryotes via HGT.  
 
Redundant labeling of accessory genes 
Further inspection of alignments of contigs with and without accessory genes showed 
that most accessory genes were redundant annotations of the same gene undergoing 
pseudogenization. A subset of redundant gene groups with at least six genes mapped to 
the same reference genome are shown in Supplemental Figure 3. The gene presence 
absence pattern of each redundant gene group is plotted along the circumference of the 
core phylogeny and is colored by the reference gene to which the redundant gene group 
maps. There are two distinct patterns of gene content variation among redundant 
accessory genes. Either the redundant genes are mostly found on the same genome, or 
the genes within a group are more diffusely present or absent in genomes. This gene 
content variation pattern was further explored in Figure 7A, which shows the 
distribution of gene groups with at least two redundant genes mapped to a single 
reference genome, colored by the fraction of the time that all of the genes in a given 
group were found on the same genome. A total of 29 out of the 60 reference genes that 
had three or more mapped redundant gene groups were PE and PPE family genes. 
PE/PPE genes are highly repetitive and have an increased number of indel mutations 
compared to genes with higher sequence complexity [41]; therefore, it is likely some of 
the redundant annotation is due to frameshift mutations caused by indels. The low 
sequence complexity of PE/PPE genes also makes them harder to sequence accurately, 
so a fraction of the redundant annotation may also be attributed to sequencing or 
 
76 
assembly error. Because PE and PPE gene families are difficult to sequence accurately, 
they are often excluded from molecular epidemiological analyses [47].  
Example gene alignments demonstrating the two causes of redundant annotation 
are shown in Figure 7B and C. The accessory genes in B were caused by annotation 
error, likely due to low coverage at the 3’ end of pks7. In C, the difference in accessory 
gene labeling was caused by a deletion in BQ2027_MB2105 in a subset of genomes 
including SAMN03300658. This deletion led to the annotation of an additional gene, 
labeled 369. An additional source of annotation error resulted from palindromic contigs, 
which likely resulted from assembly error. Although we removed palindromic contigs 
through visual inspection described in section 5.1, additional palindromic contigs 
remained in the analysis. The result of palindrominc contigs was identical sets of genes 
sequence annotated with different gene names. On non-palindromic contig genes from 
the identical sets received either the original gene annotation or the reverse compliment 
annotation. This resulted in an inflation of accessory genome count because core genes 
were designated as accessory due to inconsistent annotation. When we removed 
redundant gene labels generated by mislabeling at the ends of contigs, and palindromic 
contig labels, the size of the accessory genome decreased from 494 accessory genes 
present in between 15 and 95% of genomes and 970 rare genes present in less than 15% 
of genomes to 52 accessory and 85 rare genes (Figure 7D). The remaining accessory 
genes were either paralogs (33 genes) or pseudogenes (104 genes). 
 
 
 
 
77 
 
Figure 7. Accessory gene labels are created through redundant gene annotation. 
A. The distribution of all redundantly labeled genes, colored by the fraction of the time 
all genes in a redundant gene annotation group were found in the same genome. Within 
each reference genome, the number of mapped genes that were found within 1000 bp 
of the end of a contig are shown (black circles). B and C. Alignments of two example 
genes, pks7 and BQ2027_MB2105 (purple) with redundantly annotated accessory genes 
(blue). D. After removing redundant gene annotations, the accessory genome size (genes 
present in less than 95% of genomes) decreased from 1464 to 137 genes.  
 
 
78 
Although these redundantly annotated genes do not represent true accessory gene 
content variation, they do represent true differences in genotypes. This set of redundant 
genes contains genes with frameshift mutations leading to premature stop codons and 
loss of gene function, including mutations in single base repeats creating 
homopolymeric tracts, that may be reversible [41, 48]. We found insertions and 
deletions in homopolymeric tracts in multiple genes including frdBC and a C537, 
7Cà8C insertion in glpK, which was shown to cause reversible antibiotic tolerance in 
M. tuberculosis [48]. Interestingly, the frameshift-causing homopolymeric tract 
mutations in glpK are roughly uniformly distributed across the phylogeny, indicating 
that these mutations occur in the absence of a clear selective force (Supplemental Figure 
4A). An example alignment of the variable homopolymeric region in sequences from 
an outbreak that occurred in New Mexico, which had variable presence or absence 
indicating that variation in homopolymer tract length occurred within relatively short 
timespans (Supplemental Figure 4B).  
 
Attributes of accessory genes 
The gene presence absence patterns of the filtered set of accessory genes are shown in 
Figure 8. Three patterns of gene content variation were observed. First, it is evident that 
8 accessory genes have perfectly correlated presence and absence patterns, shown in 
Figure 8. These genes are a group of 8 pseudogenes that are remnants of a phage, 
PhiRv1, also identified in the M. bovis accessory genome by Zimpel et al.[43], which 
was thought to be acquired when the ancestor of the MTBC transitioned to an 
intracellular lifestyle. This phage is absent in the M. bovis BCG vaccine strain, and it is 
 
79 
hypothesized that the loss of this phage is associated with the reduced virulence of the 
vaccine strain [49]. PhiRv1 genes are present most lineages but are notably absent from 
a monophyletic group within La1.8 consisting of 49% of the La1.8 samples included in 
this study. Absence of PhiRv1 genes from this subgroup could be used as a phylogenetic 
marker. We ran an additional PCA analysis on the Jaccard distance of the filtered 
presence/absence matrix and found that there were two main clusters of accessory 
genes. Although accessory genes do not appear to cluster by core genome lineage 
(Supplemental Figure 5A), the two distinct PCA clusters are separated by the presence 
and absence of the PhiRv1 genes (Supplemental Figure 5B). The second group consists 
of genes identified to be paralogs by Panaroo. This group consists mainly of genes 
involved in transposition (transposases, endonucleases), but also contains PE/PPE 
family genes and four genes that form a unique cluster in M. bovis La1.7. These genes 
originated from a single, duplication event of four genes (bpa, ragH, glfR, and a 
hypothetical protein) within rfbB (Supplemental Figure 6). The last pattern of accessory 
gene content is more diffuse presence or absence, and this pattern is driven by accessory 
genes that are undergoing pseudeogenization. We were not able to identify any 
functional category that was more likely to be undergoing pseudogenization in the 
PANTHER overrepresentation test, suggesting that the process of gene 
pseudogenization may be somewhat random; however, essential genes seem to be 
protected from pseudogenization. Only one detected pseudogene mapped to an essential 
M. tuberculosis H37Rv gene, glmS. Although 14 genomes contained this pseudogene, 
each genome had a different insertion sequence all located in a repetitive region of glmS; 
This region of glmS may either be hypervariable or difficult to sequence accurately. 
 
80 
 
Figure 8. Filtered gene presence absence patterns in the M. bovis pangenome. 
The core genome phylogeny from Figure 5 is shown, labeled with clonal complexes. 
The phage PhiRv1 genes are also labeled. Branch lengths are in SNP units.  
 
Gene content variation in outbreaks 
Variable gene content over an outbreak in Minnesota is shown in Figure 9A next to a 
core genome phylogenetic tree. The tree is rooted at the common ancestor to a closely 
related outgroup of sequences from a Texas outbreak (13-3900, 12-5094 and 12-5090) 
as in Glaser et al. [39]. Only three genes were found to be variable over this outbreak, 
group_2069, frdC and frdD. Upon further inspection of contig alignments, group_2069 
was found to be annotated in genomes that lacked a 60 bp deletion between thrB and 
thrC compared to the reference genome AF2122/97. Due to the sparse distribution of 
group_2069 among outbreak samples, a more parsimonious explanation may be that the 
genomes with group_2069 have a 60 bp insertion relative to the other outbreak genomes, 
 
81 
which created a new open reading frame. The inserted sequence is identical to an 
adjacent 57 base pairs in the thrB/C intergenic region, suggesting this sequence was 
duplicated in genomes with group_2069 annotated. The other two genes, frdC and frdD 
were components of the AF2122/97 genes frdBC and frdD which were annotated 
differently depending on the insertion or deletion of base pairs in the homopolomyeric 
tract of frdBC.  An alignment of the sequence differences in the frdBC homopolymeric 
tract are shown next to the gene presence absence matrix in Figure 9A. All gene content 
variation that was identified in the pangenome of the outbreak sequences was driven by 
indels; however, not all indels that occurred during the outbreak were identified in the 
pangenome analysis since only indels that changed an open reading frame were noted 
as variable gene content. Figure 9B shows a the relationships of indel presence and 
absence of outbreak sequences detected by ScarTrek [41]. As expected, there is more 
variation in indels than was captured in the gene presence absence analysis.  
Glaser et al. included two different phylogenetic analyses [39]. The first (Glaser 
et al. Figure 6) was a maximum likelihood phylogeny resulting in the same tree as shown 
in Figure 9A, where two cattle samples from the same herd (05-9955 and 05-9963) were 
placed in a clade with a deer sample (08-0863) because these three samples shared one 
SNP absent in other samples. The second tree included temporal information and 
produced a Bayesian time calibrated tree which placed two samples from the same cattle 
herd (05-9955 and 05-9963) at the base of the tree (Glaser et al. Figure 5), and deer 
sample in a distant clade. The second tree would suggest that the cattle and deer samples 
were not closely related and perhaps were not linked in a transmission event. 
Interestingly, when comparing the SNP tree to the indel tree in Figure 9, there is another 
 
82 
difference in these three samples shown in these highlighted samples in Figure 9. The 
positions of these three samples on the indel tree in Figure 9B adds an additional 
dimension to the two phylogenetic analyses presented in Glaser et al, showing cow 
sample 05-9963 and deer sample 08-0863 share an additional indel not present in 05-
9955. The combined analysis of SNPs and indels suggests that these samples may be 
more closely related than was suggested in the timed tree. In conclusion, although true 
gene content variability may occasionally occur during outbreaks in the form of 
duplication or larger gene deletion events, indels are likely more common, and may be 
useful in outbreak investigation when few SNP differences exist among samples.  
  
 
83 
 
Figure 9. Gene content variation in a Minnesota outbreak. 
Figure 9A shows a core genome phylogeny created with IQ-TREE2 using a GTR 
substitution model, and a gene presence absence matrix. Variation in gene content is 
observed in three genes, not including PE/PPE genes, however all gene content variation 
was driven by indels. An alignment of a variable homopolymeric tract region of M. 
bovis AF2122/97 fdrBC is shown next to the gene presence absence matrix. This 
homopolymeric tract variation resulted in differential gene presence absence annotation 
of frdD and frdB genes among outbreak samples. Indel variation is shown in a tree 
 
84 
created with binary indel presence/absence data in IQ-TREE2 with a JC2 substitution 
model in B. Three samples are highlighted to show disparate topologies of the SNP and 
indel trees. Three samples that had conflicting phylogenetic positions are highlighted in 
green.  
 
Recombination 
To determine if there is evidence of ongoing HGT in the core genome, we used Gubbins 
to infer recombination history in the core gene alignment and fastGEAR to detect 
recombination in a selected set of individual gene alignments. Gubbins inferred 16 
recombination events; however, only 5 occurred on internal branches. Recombination 
events inferred on terminal branches can be unreliable because they only involve one 
genome. In addition to recombination analysis done on the core genome alignment, 
fastGEAR was used to investigate recombination on individual gene alignments with 
recombination events implicated in Reis and Cunha [6], and to confirm the 
recombination events detected by Gubbins. The fastGEAR analysis confirmed 
recombination events identified in Reis and Cunha [6] in narX and pks12. There is a 
high density of SNPs in a 100 bp locus in narX in18 monophyletic genomes. An 
alignment of the genomes with the narX SNP hotspot is shown in Supplemental Figure 
7A. Four of the SNPs are nonsynonymous, resulting in three amino acid changes. An 
additional potential recombination event involving a 16 SNP hotspot was identified in 
pks12. The pks12 SNP cluster is also found in a monophyletic group, shown in 
Supplemental Figure 8A and B. SNPs occurred at the 3’ end of pks12 directly adjacent 
to regions of consistent poor mapping quality. In fact, the region of pks12 that contained 
 
85 
the SNP hotspot was always located at the end of a contig in our draft genomes 
(Supplemental Figure 8C).  Since the SNP hotspot occurred at the end of contigs, near 
gaps, we could not confirm with high certainty that the SNPs were not called 
erroneously. Gubbins and fastGEAR also identified three additional recombination 
events in PPE 21, ppsB, and esxN. These potential recombination events consisted of 
sets of between five and seven SNPs, each shared by a monophyletic group.  
 
Pangenome openness 
To determine the impact of methodological differences on M. bovis pangenome 
representation, we re-constructed the pangenome using the 70 genomes used in Reis 
and Cunha [5]. The counts of accessory genes in the original analysis done by Reis and 
Cunha along with the Panaroo results before and after filtering are shown in Figure 10. 
Even before post-hoc filtering, the Panaroo estimate of accessory genome size was 
much smaller compared to the estimate obtained previously. Reis and Cunha used the 
get-homologs pipeline to analyze the pangenome [50], which unlike Panaroo does not 
employ a draft genome annotation correction method to reduce false positive accessory 
gene annotations. Using our methodology, we can reject the concept that the M. bovis 
pangenome is open, since we identified only 121 accessory genes in the 70 genome 
dataset, compared to the Reis and Cunha estimate of 3997 gene clusters. The gene 
presence/absence matrix of the pangenome created with the Reis and Cunha dataset is 
available in Supplemental File 3. 
 
 
 
86 
 
 
Figure 10. Methodological differences result in drastically different pangenome sizes. 
Panaroo produced a smaller accessory genome than that described in a recent M. bovis 
pangenome study. Soft core genes are present in greater than 95% of samples. Shell 
genes are present in less than 95% of samples but more than 2 samples. Cloud genes are 
present in one or two samples. After filtering redundantly labeled genes, the pangenome 
size was further reduced. The M. bovis pangenome is not open and shows presence and 
absence patterns consistent with clonal evolution.  
 
 
 
87 
Discussion 
 
Since a single genome contains only a portion of the genetic diversity of a species, 
comparing a new set of sequences to just one reference genome alone provides an 
incomplete description of the true genetic diversity in the set. There are currently 
hundreds of thousands of bacterial genomes publicly available on GenBank and other 
sequence repositories, potentially enabling the characterization of the full range of gene 
content diversity in multiple species; however, technical advances in annotating draft 
genomes have not caught up with the enormous number of incomplete genomes being 
uploaded to public databases [51].  One reason draft genomes are difficult to annotate 
is simply that genomes are incompletely sequenced and contain gaps between contigs. 
If the average prokaryotic gene is on the order of 1kb long, any contig with length on 
the same order of magnitude will likely contain fragmented genes. The ends of longer 
contigs also likely contain fragmented genes. Unfortunately, the only solution to this 
problem is to improve the sequence quality by closing the gaps between contigs, which 
perhaps will be done with future genomes sequenced using long read technology but 
may be too time intensive and costly for the thousands of existing draft genomes to be 
re-sequenced [51].   
Methods for characterizing the pangenome typically involve identifying and 
classifying clusters of orthologous genes into core and accessory genes, without 
specifically accounting for erroneous annotations. Panaroo, attempts to correct 
annotation errors by creating a graphical representation of the pangenome and 
employing several quality control steps to remove poorly supported nodes [21]. With 
these additional quality control steps, Panaroo was able to more accurately characterize 
 
88 
the M. tuberculosis pangenome compared to several other pangenome pipelines [21], 
however correctly characterizing gene fragments remains an issue.  
Another perhaps unavoidable issue in pangenome studies is sampling bias. Since 
bacterial population sizes are so vast, even a relatively large sample may not truly 
represent the diversity in gene content present in a species [52]. Additionally, although 
we downloaded thousands of M. bovis genomes available from the NCBI Sequence 
Read Archive, we only retained the highest quality assemblies for analysis to minimize 
errors described above caused by higher contig number assemblies. The resulting 
sample still represents a convenience sample of M. bovis outbreaks in countries with 
well-funded surveillance programs, and may not fully represent M. bovis variants 
present in cattle globally. However, our sample includes genomes from all known major 
M. bovis clonal complexes and may be representative of the true diversity of the 
geographic regions included in this study because of intensive sampling programs in 
those regions.  
Despite the limitations in data quality created by short read sequencing, we were 
able to determine that M. bovis likely evolves clonally. After correcting for redundant 
gene annotations, we found a small pangenome consistent with a general pattern of 
ongoing gene deletion resulting in an increasingly compact genome. This gene deletion 
appears to be driven by random indel-causing pseudogenization. We also show that the 
M. bovis pangenome is likely closed, contradicting recent work by Reis and Cunha 
suggesting the M. bovis pangenome is open [5]. 
Additionally, we determined that accessory gene content is generally stable 
during outbreaks, although some variability in gene content is observed even between 
 
89 
closely related individuals. Since M. bovis and other MTBC species have a slow 
mutation rate (less than one substitution per genome per year) [53], genomes sampled 
in outbreak investigation often have very similar or identical SNP patterns. This low 
SNP diversity makes inferring who infected whom to be difficult or impossible using 
SNP data. Gene content variation could occasionally be useful in outbreak analysis to 
discriminate between samples with very similar SNP patterns, although variability in 
accessory genomes among closely related outbreak sequences is low. One important 
caveat in using gene content variation in outbreak analysis is that if accessory genes are 
not carefully examined, inferences on transmission patterns based on gene content 
variation are likely to lead to incorrect conclusions because of the inherent problem of 
gene detection in incomplete genomes. Indel presence/absence may be a more reliable 
and more common source of sequence variation in outbreak investigations. Using both 
indel and SNP data to infer transmission patterns may result in higher resolution 
transmission networks.   
We also identified frameshift mutations in glpK and in similar homopolmeric 
tracts in other genes that may be related to antimicrobial resistance in M. tuberculosis 
[54]. Recently, Safi et al. demonstrated that nucleotide insertions in homopolymeric C 
tract in glpK at lead to the gene transient gene inactivation, which led to reduced growth 
under nutrient deficient conditions or antibiotic treatment [48]. After relieving the stress 
caused by nutrient deficiency or antibiotic treatment, the inserted base pair was removed 
and colonies were able to restart growth. Therefore, Safi et al. conclude that this phase 
variation represents an additional mechanism for the evolution of antibiotic tolerance 
and could bias evolutionary trajectories toward antimicrobial resistance. Interestingly, 
 
90 
this same C537, 7Cà8C insertion was identified in our dataset. Since cattle are not 
treated with antimicrobials for bovine tuberculosis, glpK phase variation in M. bovis 
may not arise as a compensatory mechanism for antibiotic tolerance, but instead may 
result from drift. In this case, under conditions of antibiotic use, bacteria with the 
frameshift mutation would have a selective advantage and increased survival. The glpK 
gene encodes glycerol kinase, which is necessary for glycerol catabolism; however, 
most M. bovis strains have an inactivating mutation in pyk, the gene that encodes 
pyruvate kinase, which prevents M. bovis from being able to use glycerol as a sole 
carbon source [55]. Therefore, glpK may be dispensable in M. bovis and inactivating 
mutations may arise due to relaxed negative selection or even positive selection to limit 
the energetic cost of protein production for an enzyme that is not required [56]. More 
work is necessary to understand the role of phase variation in MTBC evolution, but 
evidence of glpK phase variation in M. bovis in cattle not treated with antibiotics for 
bovine tuberculosis suggests phase variation, especially in non-essential genes, could 
be a basal evolutionary process that occurs in the absence of antibiotic selection 
pressure. Indel mutations, generally, may be a random mutation process that selection 
can act on in addition to SNPs. Recently indels were implicated in antimicrobial 
resistance in M. tuberculosis [57], and it is possible that those resistance-associated 
indels first arose due to drift, and then were selected due to antimicrobial use.  
Lastly, we show that recombination is not an important driver of evolution in 
the M. bovis core genome. Recently Reis and Cunha present compelling evidence that 
a low level of recombination does occur in M. bovis [6], and we were able to replicate 
their findings in one gene, narX, and detected potential recombination events in an 
 
91 
additional three genes; however, the events in PPE 21, esxN, and ppsB only contained 
five to seven SNPs. Although it is possible that these mutations were caused by 
homologous recombination, other evolutionary forces such as recent positive selection 
could create a mutation hotspot [58]. Furthermore, all genomes with potential 
recombination event  are part of monophyletic lineages, so it is likely that all mutations 
were caused by a single event. However, the mutation pattern in narX is unusual, and 
from this analysis alone we cannot rule out that it was caused by homologous 
recombination.  
Together, the small accessory genome and lack of significant evidence of 
recombination, provides strong evidence against the hypothesis that there is ongoing 
HGT in M. bovis, and instead, M. bovis indeed likely evolves strictly clonally. This 
finding has important implications both in understanding the basic mechanisms of 
MTBC evolution and in the practice of using genomic data in outbreak investigation. 
Strict clonality restricts the ability of M. bovis, and likely all MTBC species, to rapidly 
acquire genes that would allow it to adapt to new environments, including genes that 
would promote antibiotic tolerance and resistance. Although antimicrobial resistance is 
a growing global concern in MTBC and M. tuberculosis in particular, strict clonality 
implies that all antimicrobial resistance must evolve through de novo mutations. In 
outbreak analysis, SNP based bifurcating phylogenetic trees can be used reliably, 
without the need for complex graph structures to understand transmission.   
 
 
 
 
92 
Conclusions 
 
M. bovis is likely a strictly clonally evolving pathogen, and most variation in gene 
content is driven by gene deletion. M. bovis pangenome evolution is trending toward an 
increasingly compact genome, leaving only genes that are necessary for life inside the 
host. We also show that most recombination events and accessory genes identified by 
bioinformatic programs were false positives, although Panaroo’s graph connection 
mechanism appears to outperform other pangenome characterization algorithms in 
reducing the number of false positive genes. Because of the inherent problem of missing 
data created by using incomplete sequences, and the inability for bioinformatic 
programs to detect and rectify all errors in input data, both accessory gene and 
recombination events detected using incomplete MTBC genomes must be scrutinized 
to avoid incorrect inference based on pangenome data. We also found potentially 
reversible homopolymeric tract mutations in multiple genes, providing evidence that 
phase variation may be a general mechanism of gene regulation in MTBC. Lastly, we 
show that a small amount of gene content variation evolves over short time periods 
among closely related sequences, and this gene content variation is driven by indels. 
Indels may be a useful source of sequence variation in outbreak analysis when 
discriminating between genomes with similar SNP patterns.  
 
Acknowledgements 
 
The authors gratefully acknowledge the USDA-NIFA fellowship grant number 2020-
67034-31732, which funded PhD studies for KMC. This work was also funded by 
USDA-NIFA AFRI grant number 2014-67015-2240 as part of the joint USDA-NSF-
 
93 
NIH-BBSRC-BSF Ecology and Evolution of Infectious Diseases Program. The authors 
would like to acknowledge the National Veterinary Service Laboratories at the United 
States Department of Agriculture for their extensive M. bovis sampling and sequencing 
efforts and would like to specifically acknowledge Dr. Suelee Robbe-Austerman, Dr. 
Kathy Orloski, Dr. Tyler Thacker, and Dr. Jason Lombard for helpful discussions 
related to the use of M. bovis pangenomics in outbreak investigation. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94 
References 
 
1. Boritsch EC, Khanna V, Pawlik A, Honore N, Navas VH et al. Key 
experimental evidence of chromosomal DNA transfer among selected tuberculosis-
causing mycobacteria. Proc Natl Acad Sci U S A 2016;113(35):9876-9881. 
2. Chiner-Oms Á, Sánchez-Busó L, Corander J, Gagneux S, Harris SR et al. 
Genomic determinants of speciation and spread of the Mycobacterium tuberculosis 
complex. Science Advances 2019;5. 
3. Patane JS, Martins J, Beatriz Castelao A, Nishibe C, Montera L et al. 
Patterns and processes of Mycobacterium bovis evolution revealed by phylogenomic 
analyses. Genome Biol Evol 2017. 
4. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning 
and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic 
signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 
5. Reis AC, Cunha MV. The open pan-genome architecture and virulence 
landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 
6. Reis AC, Cunha MV. Genome-wide estimation of recombination, mutation and 
positive selection enlightens diversification drivers of Mycobacterium bovis. Sci Rep 
2021;11(1):18789. 
7. Namouchi A, Didelot X, Schöck U, Gicquel B, Rocha EP. After the 
bottleneck: Genome-wide diversification of the Mycobacterium tuberculosis complex 
by mutation, recombination, and natural selection. Genome Res 2012;22(4):721-734. 
8. Godfroid M, Dagan T, Kupczok A. Recombination Signal in Mycobacterium 
tuberculosis Stems from Reference-guided Assemblies and Alignment Artefacts. 
Genome Biol Evol 2018;10(8):1920-1926. 
9. Almaw G, Mekonnen GA, Mihret A, Aseffa A, Taye H et al. Population 
structure and transmission of Mycobacterium bovis in Ethiopia. Microb Genom 
2021;7(5). 
10. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al.  
An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49-
59. 
11. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole 
Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United 
States, 1989-2018. Front Vet Sci 2018;5:253. 
12. Rodrigues RA, Ribeiro Araújo F, Rivera Dávila AM, Etges RN, Parkhill J 
et al. Genomic and temporal analyses of Mycobacterium bovis in southern Brazil. 
Microb Genom 2021;7(5). 
13. Zwyer M, Çavusoglu C, Ghielmetti G, Pacciarini ML, Scaltriti E et al. A 
new nomenclature for the livestock-associated Mycobacterium tuberculosis complex 
based on phylogenomics. Open Res Europe; 2021. 
14. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using 
SPAdes De Novo Assembler. Curr Protoc Bioinformatics 2020;70(1):e102. 
15. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A et al. Pilon: an integrated 
tool for comprehensive microbial variant detection and genome assembly improvement. 
PLoS One 2014;9(11):e112963. 
 
95 
16. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool 
for genome assemblies. Bioinformatics 2013;29(8):1072-1075. 
17. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: 
assessing the quality of microbial genomes recovered from isolates, single cells, and 
metagenomes. Genome Res 2015;25(7):1043-1055. 
18. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 
2014;30(14):2068-2069. 
19. Malone KM, Farrell D, Stuber TP, Schubert OT, Aebersold R et al. Updated 
Reference Genome Sequence and Annotation of. Genome Announc 2017;5(14). 
20. Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. Vcflib and 
tools for processing the VCF variant call format. bioRxiv 2021. 
21. Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G et al. 
Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 
2020;21(1):180. 
22. Mi H, Ebert D, Muruganujan A, Mills C, Albou LP et al. PANTHER version 
16: a revised family classification, tree-based classification tool, enhancer regions and 
extensive API. Nucleic Acids Res 2021;49(D1):D394-D403. 
23. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE et al. Comprehensive 
Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating 
Transposon Mutagenesis. mBio 2017;8(1). 
24. Farrell D, Crispell J, Gordon SV. Updated functional annotation of the 
Mycobacterium bovis AF2122/97 reference genome. Access Microbiol 
2020;2(7):acmi000129. 
25. Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid 
core-genome  alignment and visualization of thousands of  intraspecific microbial 
genomes. Genome Biol 2014;15(11):524. 
26. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and 
high throughput. Nucleic Acids Res 2004;32(5):1792-1797. 
27. Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-
likelihood trees for large alignments. PLoS One 2010;5(3):e9490. 
28. Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA et al. Rapid 
phylogenetic analysis of large samples of recombinant bacterial whole genome 
sequences using Gubbins. Nucleic Acids Res 2015;43(3):e15. 
29. Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome 
alignment with gene gain, loss and rearrangement. PLoS One 2010;5(6):e11147. 
30. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for 
phylogenetic tree display and annotation. Nucleic Acids Res 2021;49(W1):W293-
W296. 
31. Mostowy R, Croucher NJ, Andam CP, Corander J, Hanage WP et al. 
Efficient Inference of Recent and Ancestral Recombination within Bacterial 
Populations. Mol Biol Evol 2017;34(5):1167-1182. 
32. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid 
multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 
2002;30(14):3059-3066. 
 
96 
33. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES et 
al. Integrative genomics viewer. Nat Biotechnol 2011;29(1):24-26. 
34. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T et al. A program for 
annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs 
in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 
2012;6(2):80-92. 
35. Konishi T, Matsukuma S, Fuji H, Nakamura D, Satou N et al. Principal 
Component Analysis applied directly to Sequence Matrix. Sci Rep 2019;9(1):19297. 
36. Bittinger K. abdiv: Alpha and Beta Diversity Measures. R package version 
0.2.0., 2020. https://CRAN.R-project.org/package=abdiv. 
37. Team R. RStudio: Integrated Development for R. Boston, MA; 2020. 
http://www.rstudio.com/. 
38. Team RC. R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing, Vienna, Austria; 2021. 
39. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et 
al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak 
of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern 
Minnesota. PLoS One 2016;11(1):e0145735. 
40. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. 
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the 
Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 
41. Gupta A, Alland D. Reversible gene silencing through frameshift indels and 
frameshift scars provide adaptive plasticity for Mycobacterium tuberculosis. Nat 
Commun 2021;12(1):4702. 
42. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-
analysis of large phylogenies. Bioinformatics 2014;30(9):1312-1313. 
43. Zimpel CK, Brandao PE, de Souza Filho AF, de Souza RF, Ikuta CY et al. 
Complete Genome Sequencing of Mycobacterium bovis SP38 and Comparative 
Genomics of Mycobacterium bovis and M. tuberculosis Strains. Front Microbiol 
2017;8:2389. 
44. Zakham F, Sironen T, Vapalahti O, Kant R. Pan and Core Genome Analysis 
of 183 Mycobacterium tuberculosis Strains Revealed a High Inter-Species Diversity 
among the Human Adapted Strains. Antibiotics (Basel) 2021;10(5). 
45. Dar HA, Zaheer T, Ullah N, Bakhtiar SM, Zhang T et al. Pangenome 
Analysis of Mycobacterium tuberculosis Reveals Core-Drug Targets and Screening of 
Promising Lead Compounds for Drug Discovery. Antibiotics (Basel) 2020;9(11). 
46. Yang T, Zhong J, Zhang J, Li C, Yu X et al. Pan-Genomic Study of 
Mycobacterium tuberculosis Reflecting the Primary/Secondary Genes, 
Generality/Individuality, and the Interconversion Through Copy Number Variations. 
Front Microbiol 2018;9:1886. 
47. Meehan CJ, Goig GA, Kohl TA, Verboven L, Dippenaar A et al. Whole 
genome sequencing of Mycobacterium tuberculosis: current standards and open issues. 
Nat Rev Microbiol 2019;17(9):533-545. 
 
97 
48. Safi H, Gopal P, Lingaraju S, Ma S, Levine C et al. Phase variation 
in Mycobacterium tuberculosis glpK produces transiently heritable drug tolerance. Proc 
Natl Acad Sci U S A 2019;116(39):19665-19674. 
49. Fan X, Abd Alla AA, Xie J. Distribution and function of prophage phiRv1 and 
phiRv2 among Mycobacterium tuberculosis complex. J Biomol Struct Dyn 
2016;34(2):233-238. 
50. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software 
package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 
2013;79(24):7696-7701. 
51. Salzberg SL. Next-generation genome annotation: we still struggle to get it 
right. Genome Biol 2019;20(1):92. 
52. Brockhurst MA, Harrison E, Hall JPJ, Richards T, McNally A et al. The 
Ecology and Evolution of Pangenomes. Curr Biol 2019;29(20):R1094-R1103. 
53. Menardo F, Duchêne S, Brites D, Gagneux S. The molecular clock of 
Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 
54. Bellerose MM, Baek SH, Huang CC, Moss CE, Koh EI et al. Common 
Variants in the Glycerol Kinase Gene Reduce Tuberculosis Drug Efficacy. mBio 
2019;10(4). 
55. Keating LA, Wheeler PR, Mansoor H, Inwald JK, Dale J et al. The pyruvate 
requirement of some members of the Mycobacterium tuberculosis complex is due to an 
inactive pyruvate kinase: implications for in vivo growth. Mol Microbiol 
2005;56(1):163-174. 
56. Vargas R, Farhat MR. Antibiotic treatment and selection for. Proc Natl Acad 
Sci U S A 2020;117(8):3910-3912. 
57. Godfroid M, Dagan T, Merker M, Kohl TA, Diel R et al. Insertion and 
deletion evolution reflects antibiotics selection pressure in a Mycobacterium 
tuberculosis outbreak. PLoS Pathog 2020;16(9):e1008357. 
58. Chattopadhyay S, Weissman SJ, Minin VN, Russo TA, Dykhuizen DE et al. 
High frequency of hotspot mutations in core genes of Escherichia coli due to short-term 
positive selection. Proc Natl Acad Sci U S A 2009;106(30):12412-12417. 
59.  Ceres KC, Stanhope MJ, Gröhn YT, Mycobacterium bovis pangenomics. 
Figshare 2022. https://figshare.com/s/c9665c733243e68326f7 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
98 
Supplementary Materials  
  
 
  
  
Supplemental Figure 1 A. Distribution of contig counts and B. N50 among 
genomes used in this study.  C. Percent completeness and percent 
contamination of each genome as detected by checkM.   
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
99 
  
Supplemental Figure 2. EsxI paralog is absent from reference genome 
AF2122.97 but present in M. tuberculosis H37Rv, M. bovis Mb3601 and 173 
genomes in this study. Of the 173 genomes with the EsxI paralog present, 75 
were found in La1.2, 63 in La1.3 and 43 in La1.4.   
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
100 
  
  
  
  
Supplemental Figure 3. The distribution of redundantly annotated accessory 
genes. The circular core phylogeny shown in Figure 5 is labeled with a gene 
presence absence matrix of redundant gene groups consisting of 6 or more 
genes labeled by mapped reference gene on the circumference of the 
phylogeny. Accessory genes are either found on the same genome such as 
genes mapped to PPE8, or are more diffusely present or absent.  
  
  
  
  
  
  
  
  
  
  
 
101 
  
  
  
  
  
  
  
Supplemental Figure 4. The distribution of glpK homopolymeric tract insertions. A. 
The distribution of 7C->8C insertions in glpK distributed on a circular representation 
of the core phylogeny shown in Figure 5. B. Variable homopolymeric tract lengths 
were found in closely related sequences from an outbreak in New Mexico labeled in 
teal, which occurred between 2007 and 2008. The homoplomyeric tract mutations 
appear to be transient.  
  
  
  
  
  
  
  
  
 
102 
 
Supplemental Figure 5. PCA of the Jaccard distance of the filtered gene 
presence absence matrix. In A. points are colored by Lineage, and in B. points 
are colored by the presence or absence of the phage PhiRv1 genes. The percent 
variation described by each principal component is shown on the x and y axes.   
  
 
103 
 
Supplemental Figure 5. PCA of the Jaccard distance of the filtered gene presence 
absence matrix. In A. points are colored by Lineage, and in B. points are colored by 
the presence or absence of the phage PhiRv1 genes. The percent variation described 
by each principal component is shown on the x and y axes.  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
 
  
  
 
 
104 
Reference Previously Gubbins FastGEAR SNPs Event 
gene detected  detected detected validate
d 
narX Yes No Yes Yes 14 SNPs (18 
monophyletic 
genomes) 
pks12 Yes Yes Yes Yes 17 SNPs (151 
monophyletic 
genomes) 
pks12 Yes Yes Yes Yes 17 SNPs (4 
monophyletic 
genomes 
espa T No Yes No Ends of contigs 
tatA Yes No No NA NA 
PE PGRS22 Yes No No NA NA 
rrs Yes No No NA  NA 
PPE21 No Yes Yes Yes 5 SNPs (7 
monophyletic 
genomes) 
esxN No Yes Yes Yes 5 SNPs (56 
monophyletic 
genomes) 
ppsB No Yes No Yes 7 SNPs (2 
monophyletic 
genomes) 
  
Supplemental Table 1. Recombination events identified by Gubbins or fastGEAR. 
SNPs were validated if they were detected both by a recombination detection program 
and were identified SNPs by the vSNP pipeline after quality control filtering.  
Previously detected refers to the recombination events detected in Reis & Cunha, 
Scientific Reports 2021. 
  
  
  
  
  
  
  
  
  
  
  
  
  
 
105 
  
  
  
A B
420 430 440 450 460 470 480 490 500 510
10 SNPs
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T89G6GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T89G5GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T90G4GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T75G8GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG 1629 1650
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T80G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T81G3GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG 10 SNPs
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T84G0GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T79G4GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617619 AACTCGGGCGATCTCAACACCGGATTGTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G5GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617476 AACTCGGGCGATCTCAACACCGGATTGTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G5GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617524 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T37G1GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G6GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617679 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G2GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617479 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T84G9GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617535 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G1GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617690
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G4GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G0GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617651 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617463 AACTCGGGCAACATCAACACCGGCTTCTTCAACT
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G6GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SYN
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G7GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG Variant effect NSYN
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G3GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG
CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T34G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG
Variant effect SYNNSYN
C D
10 20 62 93
1809 1834
GCTCATGGCGCCATGATCSRCRG7C8G51C3T16CAGGCGGCGTCGCTTGAGGCGGAGCATCAGGCCATC 50 SNPs
GCTCATGGCGCCATGATCSRCRG1C6G27C8T28C0AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC
GCTCATGGCGCCATGATCSRCRG1C6G27C8T25C1AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1657061 CTTCTCGTTGCACGACGTGCTGGCTAACGGCGAGGAGCTA
GCTCATGGCGCCATGATCSRCRG7C8G51C3T17CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1791816 CTTCTCGTTGCACGACGTGCTGGCTAACGGCGAGGAGCTA
GCTCATGGCGCCATGATCSRCRG7C8G51C3T81CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR6467887 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG7C8G51C3T65CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792438 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG7C8G51C3T47CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792343 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG5C4G86C0T79CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792330 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG5C4G85C7T29CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792329 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG1C7G92C4T85CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1791723 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG1C3G19C9T75C2AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792065 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
GCTCATGGCGCCATGATCSRCRG1C3G19C9T73C5AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792046 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA
1G0C STNCPAsTGGCGCCATGATCSRCRG1C3G19C9T76C2AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SYNGCTCATGGCGCCATGATCSRCRG1C3G19C9T75C8AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC Variant effect NSYN
GCTCATGGCGCCATGATCSRCRG1C3G19C9T75C6AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC
Variant effect SYNNSYN
   
   Supplemental Figure 7. Potential recombination events confirmed by fastGEAR in A. narX, B. 
  PPE 21, C. esxN, and D. ppsB. A. The recombinant segment in narX contains a hotspot of 14 
 SSuNppPsle smhaerendt ably  F1i8g suerqeu e7n.c ePso ftreonmt iaa sl inregcleo omutbbirneakt.i o Fno uerv SeNnPtss  caroen nfoirnmsyendon byym ofuas taGndE rAesRul ti n A. 
nainrX th,r Bee.  aPmPinEo  2ac1i,d  Cch. aensgxeNs. ,B a. nInd P DPE.  p2p1,s tBh.r eAe m. Tuthaeti ornesc womereb ninonasnytn soengymoeunst.  Cin.  Tnwaor Xou ct onf tains 
a hthoet fsipvoe tm ouft a1ti4on Ss NwePrse  snhonasryendo nbyym 1o8us s, eanqdu ienn Dce. sfi vfer oomut  oaf  sthine gselev eonu mtburtaetaiokn.s   wFeorue r SNPs are 
nonnosnysynnoonnyymmoouus.s    and result in three amino acid changes. B. In PPE 21, three mutations 
we re nonsynonymous. C. Two out of the five mutations were nonsynonymous, and in 
D.  five out of the seven mutations were nonsynonymous.  
   
   
   
   
   
   
    
  
  
  
  
 
.Arg154Gln
Ala22Gl .y Arg1
S 5e 6r2 S3 eL reu
.Arg161His
Arg605H Ai ss p544Asn
Leu545Ile
Ile608L
I ele u608Met
Thr610Asn
   Leu
L 5y 4s 96 P1 h2 eGlu
106 
 
Supplemental Figure 8. SNP hotspot in pks12 is not verified because of low mapping 
quality. The distribution of pks12 SNPs is shown in A. SNPs are shown in B and 
cluster in lineage La1.8. The location of the SNPs in pks12 is shown in C (arrow). 
SNPs occur at the 3’ end of pks12. pks12 generally had poor mapping quality in our 
sample, and the SNP hotspot was always located at the end of a contig.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
107 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER 4 
Characterizing the effects of drift, progeny skew and population dynamics on M. bovis 
within host evolution  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
108 
Abstract 
 
Rapid evolution can lead to evolutionary trajectories that differ from those predicted 
using standard models of molecular evolution based on the Kingman’s coalescent. Since 
molecular epidemiology relies on population genetic theory to link pathogen genetic 
data to epidemiological inference, thorough understanding of a pathogen’s evolutionary 
biology in relation to basic model assumptions is necessary to ensure models are used 
appropriately. Here we study the within host evolution of Mycobacterium bovis, the 
cause of bovine tuberculosis, using the empirical distribution mutations that arise de 
novo within an animal host and forward genetic simulation to test hypotheses for 
evolutionary mechanisms within the host. We found that the empirical distribution of 
mutations does not deviate from a null expectation of random mutation, suggesting drift, 
not selection, is the primary evolutionary force driving mutation during infection in 
animal hosts. We also found that a within host evolutionary scenario involving skewed 
offspring distributions and panmictic within host populations and a rapid mutation rate 
was best supported by the data. The diversity generating effect of faster mutation rate is 
counteracted by the diversity reducing effect of large variance in offspring distributions, 
where a minority of individuals produces offspring in the next generation. Our 
simulation framework can be applied to other aspects of M. bovis molecular 
epidemiology, including transmission analysis, and studying the evolution of important 
phenotypes like antimicrobial resistance.  
 
 
 
 
109 
Introduction 
 
Pathogens can evolve rapidly under the right conditions, and this rapid evolution can 
produce evolutionary trajectories that significantly deviate from standard models that 
assume relatively slow evolution. One aspect of rapid evolution is that offspring 
distributions may be highly skewed, with fit individuals producing more offspring than 
less fit individuals. This variance in reproductive success may be driven by selection 
induced by antimicrobial treatment or host immune response [1], and by serial 
population bottlenecks [2, 3]. The Mycobacterium tuberculosis complex (MTBC), 
including M. tuberculosis, the cause of tuberculosis in humans, and M. bovis, the cause 
of bovine tuberculosis, is one example of a pathogen system that’s evolutionary 
trajectory is shaped by serial population bottlenecks associated with transmission 
events. Additionally, successful treatment for tuberculosis requires months of 
antimicrobial therapy, and substandard treatment [4] or low patient adherence [5] can 
create strong selective forces for antimicrobial resistance to rapidly evolve. 
Standard models of evolution based on the Kingman’s coalescent [6] that 
assume narrow offspring distributions may not appropriately model the clonal evolution 
of MTBC species dominated by clonal reproduction, frequent population bottlenecks, 
and potential differences in reproductive success [7]. Recent studies have found 
multiple-merger coalescent models [2] and explicit modeling of skewed offspring 
distributions [8, 9] describe M. tuberculosis evolution better than standard Kingman’s 
coalescent based models. Additionally, since the MTBC is highly clonal with little to 
no recombination, the dynamics of neutral mutations in a population may not depend 
 
110 
only on neutral drift, but on genetic draft due to linkage with any site under selection in 
the genome [10]. This genetic hitchhiking may be particularly important in 
understanding the dynamics of mutations linked to resistance-conferring mutations, 
which can sweep to high population frequencies during anti-tuberculosis drug treatment 
[11].  
These theoretical deviations of MTBC evolution from standard coalescent 
models highlight the importance of understanding the biology of the host-pathogen 
system when applying epidemiological models, but also demonstrate the potential 
increased utility of genomic data, once appropriate models are described. For example, 
using standard coalescent models, the evolutionary rate of M. tuberculosis lineages has 
been estimated to between 1e-10 and 1e-7 substitutions per site per year [12] and 
between 1e-8 and 1e-7 substitutions per site per year in M. bovis lineages [12, 13] or 
between about 0.04 to 0.4 substitutions per genome per year assuming a genome length 
of about 4 Mbp. The basal mutation rate of an organism determines the amount of 
genetic variation produced in a population but does not perfectly correlate with 
estimates of evolutionary rate [14]. After mutations arise, positive and negative 
selection and population bottlenecks caused by transmission events, alter the 
distribution of mutations in a population. The combination of these extrinsic effects and 
intrinsic mutation rate produces the observed differences among a sample of genomes. 
In a simulation framework that relaxes the assumptions of narrow offspring 
distributions, Morales-Arce et al found that the combination of a relatively fast mutation 
rate of 1e-7 mutations per site per generation, or about 40 mutations per genome per 
100 generations, and the genetic diversity reducing effect of skewed progeny 
 
111 
distributions, where a subset of individuals produced more offspring than others, 
produced the numbers of segregating sites commonly observed between serial patient 
samples [8]. This higher basal mutation rate could have important implications for the 
potential within host adaptation that is possible at a given timescale, including the 
probability of de novo antimicrobial resistance mutation evolution.  
Here we study the within host microevolution of three geographically distinct 
M. bovis outbreaks using a combination of empirical analysis and forward genetic 
simulation. We describe the empirical distribution of mutations in functional categories 
and use simulation to estimate the de novo mutation rate, infection population bottleneck 
size, and the degree of progeny skew. We find that in the context of within host 
evolution, drift, not selection, appears to be the driving force determining the 
distribution of mutations. We also find that when considering both fixed and variable 
SNPs instead of fixed SNPs alone and the diversity reducing effect of clonal progeny 
skew, the underlying mutation rate is much faster than previous estimates suggest, 
mirroring a recent finding in human M. tuberculosis infections [8]. Our findings can be 
used to develop epidemiological frameworks that better incorporate all variant 
information available in population sequencing samples and may aid in developing 
higher resolution transmission networks in outbreak analysis.  
 
Methods 
 
Sequence selection and variant calling  
Whole genome sequences from Glaser et al.[15], Rodrigues et al. [16],  and Almaw et 
al. [17] were selected because each study included samples from multiple tissue types 
 
112 
within individual hosts. Sequence reads were downloaded from the NCBI Sequence 
Read Archive. SNPs were called using vSNP pipeline (https://github.com/USDA-
VS/vSNP, lofreq [18] and VarScan 2 [19] using AF2122/97 as a reference (NCBI 
accession: NC_002945v4). The bam files produced from step 1 of the vSNP pipeline 
were used to call SNPs using LoFreq and VarScan 2. We retained variants if they were 
called by both LoFreq and VarScan 2, and had at least 10 reads supporting them, a 
coverage of at least 50x, a minimum Phred score of 20, and a minimum allele frequency 
of 0.5 %. SNPs in repetitive regions of the genome including PE/PPE genes mobile 
genetic elements were also removed. Variants were labeled as variable SNPs (vSNPs) 
if the allele frequency was between 5% and 90%. SNPs were considered fixed (fSNPs), 
if the allele frequency was greater than or equal to 90%.  
 
Population structure analysis 
A phylogenetic tree was created using IQtree2 [20] using the GTR+G substitution 
model. The tree was rooted at the AF2122/97 reference. RhierBAPS was used to 
identify up to five levels phylogenetic clusters [21]. Similar to the criteria used in 
Moreno-Molina et al. [22], an infection was determined to be polyclonal if SNPs 
clustered into different RhierBAPS phylogenetic clusters, or if there were 30 or more 
SNPs different between samples. The correlation in allele frequencies between distinct 
samples in the same host was also calculated and plotted using the R package corrplot 
[23]. Pairs of samples were predicted to have highly correlated allele frequencies if they 
were derived from the same source infection population. Samples from polyclonal 
infections were predicted to have less correlated allele frequencies. 
 
113 
Characterization mutation effects 
The functional effects of SNPs (synonymous or nonsynonymous) were predicted using 
snpEff [24] with a custom reference database created with the same reference genome 
used to call SNPs. Genes were categorized into nonessential and essential groups based 
on characterization of gene essentiality in M. tuberculosis. First, genes were labeled as 
essential or non-essential using labels developed by a random saturating transposon 
mutagenesis study in H37Rv M. tuberculosis [25]. Genes were further were further 
characterized into distinct functional categories (antibiotic resistance, antigens, mobile 
genetic elements, other) using the classification used in Vargas et al. for classifying M. 
tuberculosis genes [26]. Genes in the “other” category contained essential and 
nonessential genes that were not otherwise categorized as antibiotic resistance genes, 
antigens, or mobile genetic elements. The gene coordinates of the H37Rv reference gene 
were translated to coordinates in M. bovis AF2122/97 using the annotations provided 
by Farrell et al. [27]. Genotypic resistance was predicted using TB-Profiler [28]. 
 
Mutational density test 
To identify genes with significant mutation burden, we used a mutational density test 
similar to the mutational density test described in Vargas et al.[26]. We applied the 
mutational density test to individual genes, and to gene categories, to determine if any 
gene or gene category contained more mutations than expected under the null model, 
where any location in the genome was equally likely to be mutated. The mutational 
density test assumes the number of SNPs, N, that occur across all samples in gene j, is 
modeled as a Poisson random variable 𝑁#~𝑃𝑜𝑖𝑠(𝜆). For each gene, 𝑁# 	was compared to 
 
114 
a null rate 𝑁5 was assumed to be the mean number of SNPs across all genes analyzed, 
weighted by gene length 𝑔5: 
M
𝑔
𝜆5 = ab𝑁5cd
5 e 
∑M
5:3 5
𝑔5
A gene was determined to be significantly enriched mutations if the p value (𝑃(𝑁# >
𝑁5)) was less than a Bonferroni corrected threshold (p < 0.05 / n_genes), corrected for 
the number of genes analyzed (n_genes = 3,758). When applying the mutational density 
test to gene categories instead of individual genes, genes were classified into functional 
categories described in Section 2.3, and 𝑔5 represented the total length of all genes in 
category i. The p value threshold was adjusted for the six categories tested (p < 0.5 / 6).  
 
Forward genetic simulation 
A forward genetic agent-based model was developed in SLiM 3 v 3.7.1 [29] to model 
the evolution of M. bovis genomes inside animal hosts. The goal of the simulation 
analysis was to estimate the best fitting parameter values for critical parameters that 
describe the within host evolution process: mutation rate 𝜇, infection bottleneck size 𝑏, 
and degree of within clonal progeny skew.  
 
Model structure 
We created four different simulation structures with different combinations of 
parameters. A visual description of the simulation framework is shown in Figure 11 and 
parameter combinations are shown in Table 4.  
 
 
115 
 
A C No progeny skew Progeny skew
 μ
B
Gi Gi+1 Gi Gi+1
B
D
λ
p  μ 0.75 0.1
p
p 0.50 0.5B 1.0
p 0.25
0.00
0 2 4
Number of offspring
 
Figure 11. Within host population simulation schematic. 
Four model structures were created. Models varied in the presence or absence of within 
host population structure (A-B) and the presence or absence of skewed offspring 
distributions (C). For models with progeny skew, at each generation the number of 
offspring produced by an individual was drawn from a Poisson distribution. Example 
distributions in the range of parameter values,	𝜆, explored is shown in D.  
Table 4. Components of each model structure tested  
Model Number Progeny skew Population structure 
1 No No 
2 Yes No 
3 No Yes 
4 Yes Yes 
 
Simulation details 
The generic simulation framework is described below according to the agent-based 
model “PARTE” framework [30]. 
 
Density
116 
Properties: The agents modeled are M. bovis bacteria. Each individual has a haploid 
genome with length 4.3e6 base pairs which can acquire neutral, or deleterious with 
mutation rate 𝜇. Mutations in 2/3 of the genome are considered nonsynonymous, and 
nonsynonymous mutations have a gamma distribution of fitness effects with mean of -
0.3 and shape parameter of 0.2, making nonsynonymous mutations weakly deleterious. 
Synonymous mutations have a fixed fitness effect of zero (no effect). The recombination 
rate is set to zero, to simulate strictly clonal evolution.  
 
Actions: Throughout the course of simulation, individuals may reproduce, die or in the 
structured simulations, migrate to a new subpopulation.  
 
Rules: Each simulated infection begins with 10,000 generations of neutral burn-in. At 
generation 10,001, a population bottleneck occurs, which reduces the population size to 
the value 𝑏, to simulate the infection bottleneck. In the structured simulations, at the 
beginning of simulation, a predefined number of empty subpopulations are created. 
These subpopulations are the in silico representation of potential infection sites within 
the host. During the simulation, individual bacteria from the source subpopulation can 
migrate to empty subpopulations, and probability of migration at each generation is 
determined by the value of 𝑝. In the unstructured simulations, one single subpopulation 
is simulated. Within a subpopulation, fitness of individuals is proportion to population 
size, whereas if the population size reaches the carrying capacity, or the maximum 
number of individuals that can live within a host, the fitness of all individuals in the 
subpopulation decreases. The carrying capacity of the subpopulation is determined by 
 
117 
the within host population size parameter, 𝐾. In the simulations with clonal progeny 
skew, at each generation, the number of offspring generated by a single individual is 
drawn from a Poisson distribution with parameter 𝜆. If 𝜆 < 1, on average some 
individuals will produce 0 offspring, whereas others will reproduce 1 or more.  
 
Time: Time is measured in M. bovis generations, with new reproduction events 
happening at each tick. Each simulation was run for 365 generations after burn-in, 
assuming a generation time of 24 hours.  
 
Environment: Infection sites within the host were represented by a set of 
subpopulations which within SLiM is a potential space where M. bovis individuals may 
reside.  
 
Parameter values 
Each model contained parameters that remained constant between simulations, and 
parameters that were estimated. When comparing model structures, the same model 
parameters were used for each structure. Prior distributions for each parameter are 
shown in Table 5. 
 
 
 
 
 
 
118 
Table 5. Prior distributions for simulation parameters 
Parameter Definition Distribution 
𝜇 Mutation rate 𝑈𝑛𝑖𝑓~(1𝑒*N, 1𝑒*O) 
𝐵 Bottleneck size 𝑈𝑛𝑖𝑓~(10, 1𝑒L) 
𝑝 Probability of subpopulation formation 𝑈𝑛𝑖𝑓~(0, 1𝑒*L) 
𝜆 Parameter of Poisson distribution 𝑈𝑛𝑖𝑓~(0.1, 1) 
controlling offspring count 
 
Fixed model parameters are shown in Table 6, and parameter values used in model 
comparison and model fit analyses are shown in Table 7. 
 
Table 6. Fixed model parameter values.  
Parameter Definition Value 
K0 Initial population size 1000 
𝐾3& Post-bottleneck carrying capacity per 500 
subpopulation, structured model  
𝐾3$& Post-bottleneck carrying capacity, non- 1000 
structured model 
𝜌 Recombination rate 0 
 
Table 7. Parameter values used in model assessment.  
Parameter Definition Value 
𝜇& Mutation rate 2.5𝑒*N 
𝐵& Bottleneck size 100 
𝑝& Probability of subpopulation formation 0.05 
𝜆& Parameter of Poisson distribution 0.5 
controlling offspring count 
 
Model selection and parameter 
The goodness of fit of each simulation structure, and the pairwise comparative fit of all 
models was determined was determined using the R package ABC [31]. First, for each 
model structure 1000 simulations were run using parameter values in Table 7. The mean 
number of neutral mutations, mean number of deleterious mutations, and the mean 
 
119 
number of mutation differences between two samples of 10 genomes was recorded at 
each simulation. These summary statistics were compared to the mean number of 
synonymous and nonsynonymous mutations, and the mean number of differences 
between pairs of samples from the same host in the empirical data. The goodness of fit 
of each model was assessed using the gfit function in ABC. The posterior probabilities 
of each model scenario were calculated using the postpr function and the "mnlogistic” 
method in ABC, and models were compared Bayes Factors. A log10 Bayes Factor value 
greater than 1 indicated strong support for one model compared to the other. Parameter 
values were estimated for the best fitting model using Markov Chain Monte Carlo-
approximate Bayesian computation (MCMC-ABC) using the Wegmann method [32] in 
the R package EasyABC [33]. 
 
Results 
 
Phylogeny and population structure  
159 M. bovis whole genome sequences were analyzed, consisting of 62 samples from 
Glaser et al, 29 from Brazil study, and 68 from Ethiopia, which will now be referred to 
as the US, Brazil, and Ethiopian samples, respectively. The US samples were from a 
monophyletic outbreak in Minnesota in cattle and white-tailed deer (Odocoileus 
virginianus), whereas the Brazil and Ethiopia datasets consisted of multiple different 
outbreaks in different geographic regions in cattle hosts. The phylogenetic tree of all 
samples included in the study labeled by study set and RhierBAPS phylogenetic cluster 
is shown in Figure 12.  
 
120 
 
Figure 12. Phylogeny of study samples colored by geographic region and three levels 
of RhierBAPS phylogenetic clustering.   
          
 
121 
Single infections vs polyclonal infections 
10, 13, and 26 cattle from the US, Brazil, and Ethiopia datasets, respectively, had more 
than one sample from a single host. To determine whether a cow was infected by a 
single clone or multiple clones, each set of samples from a given cow was evaluated by 
calculating a correlation matrix of allele frequencies between samples, phylogenetic 
clustering and a between-sample SNP cutoff. An example correlation matrix from 
polyclonal and single infections is shown in Figure 13. The correlation matrices 
comparing all samples within a group is shown in Supplemental Figure 1. 31 samples 
were determined to be polyclonal (1 in the US dataset, 11 in the Brazil dataset and 19 
in the Ethiopia dataset). Only single source infections were used in subsequent analyses.  
 
 
 
 
 
 
 
122 
 
Figure 13. Distinguishing polyclonal from single source infections using allele 
frequencies. 
Since single source infections began from just population, allele frequencies of 
mutations derived from the infection source tend to be correlated (A). In polyclonal 
infections, allele frequencies between distinct populations within the host are less 
correlated (B).  
 
Distribution of mutation types 
The distribution of mutation types and mutational density was evaluated among de novo 
mutations that arose during single source infections. De novo mutations were defined 
as mutations present in a single individual host, but absent from all other hosts within a 
phylogenetic cluster. Sixty percent of all de novo mutations were nonsynonymous 
mutations and 73% of intragenic mutations occurred nonessential genes, which is 
consistent with random mutation in a highly gene-dense genome; however, both 
synonymous and nonsynonymous mutations also occurred in essential genes and 
 
123 
antigens. Mutations in genes known to be associated with antimicrobial resistance are 
shown in Table 8. 
 
Table 8. Within host mutations in genes associated with antimicrobial resistance 
Gene Nucleotide Amino acid Effect Essential Fixed Associated 
change change with 
resistance 
embB G1486GA Asp496Asn NS ES Yes no 
gyrB C774T Val258Val S ES Yes no 
iniA T771C Val257Val S NES Yes no 
rpoB G1723C Val575Leu NS ES No no 
alr C221T Ala74Val NS ES Yes no 
Effect: S=synonymous mutation, NS=nonsynonymous mutation; Essential: 
ES=essential gene, NES=nonessential gene 
 
Five de novo mutations occurred in genes associated with antibiotic resistance; 
however, no mutations occurred at positions known to cause antibiotic resistance. 
Therefore, these mutations may have no effect on antibiotic resistance phenotypes. 
        
Mutational density test  
The mutational density test was applied to genes individually, and to gene categories to 
test whether the distribution of de novo mutations deviates from the null distribution. 
Deviation from the null distribution suggests regions of the genome are under positive 
or negative selection.  One gene, pcaA had more mutations than expected (p=4.2e-7). 
This gene contained four different vSNPs; however, all mutations were found in a single 
genome in one host animal, J_47. If this animal was removed, no genes had mutation 
counts different from the null distribution. The distribution of de novo SNP counts 
among gene categories is shown in Table 9. 
 
124 
 
Table 9. Distribution of de novo SNPs among gene categories 
Gene category Nj Ni p value p<0.008 
Antigen, ES 3 3.7 0.21 No 
Antigen, NES 18 15.8 0.08 No 
AMR, ES 3 2.1 0.19 No 
AMR, NES 1 1.0 0.37 No 
Other, ES 50 52.5 0.05 No 
Other, NES 167 166.9 0.03 No 
ES= essential; NES = nonessential; AMR = antimicrobial resistance, Ni and Nj = 
expected and observed number of de novo mutations, respectively. 
 
No gene category was found to have mutation densities different than expected under 
the null model of random mutation over the length of the genome. 
 
Simulation results  
 
Model selection 
All four model structures fit are supported by the data (Goodness-of-fit p value > 0.5, 
Supplemental Figure 2). The relative support of model structures given the summary 
statistics of the observed de novo mutations is shown in Figure 14. 
 
 
125 
0.00 −0.72 0.71 −0.21 1.50
1.20
0.90
0.72 0.00 1.42 0.51 0.60
0.30
0.00
−0.30
−0.71 −1.42 0.00 −0.91 −0.60
−0.90
−1.20
0.21 −0.51 0.91 0.00 −1.50
1 2 3 4
Model structure B
 
Figure 14. Model comparison using log10 Bayes factors. 
Bayes factors comparing models A and B were calculated by comparing the posterior 
probability of model A to that of model B. Model 2 was more strongly supported by the 
data than any of the other models.  
 
Generally, models with clonal progeny skew had more support than those without 
(Models 2 and 4 compared to 1 and 3). Model 2 had moderately more support by the 
data than model 4, and model 2 had fewer parameters than model 4, so model 2 was 
chosen as the best fitting model. Posterior distributions for fitted parameter values of 
Model 2 are shown in Figure 15. 
  
 
 
 
 
Model structure A
4 3 2 1
126 
 
Figure 15. Model parameter estimates calculated with Approximate Bayesian 
Computation. 
Bivariate density plots are shown for pairs of parameters in A. Lighter colors indicate 
higher density. Marginal density plots for each parameter values are shown in B.   
 
127 
The mutation rate 𝜇 was estimated to be about 5e-9 mutations per site per generation, 
or about 2 mutations per genome 100 generations. The posterior distributions of the 
bottleneck population size and Poisson offspring distribution parameter, 𝜆 were more 
diffuse. Population bottleneck sizes between 250 and 750 individuals were plausible, as 
were lambda values around 0.4 and 0.8.  
 
Discussion  
 
Selection vs drift 
By studying the mutations that arise de novo within hosts during M. bovis infection, we 
found that M. bovis within host evolution appears to be driven by drift. Most de novo 
mutations were in nonsynonymous genes, and may affect protein function, however, the 
observed proportion nonsynonymous mutations was roughly equal to what would be 
expected for random mutation on a highly gene-dense genome (60% nonsynonymous). 
Although we considered M. bovis samples from three distinct geographic regions and 
two host species (white-tailed deer and cattle), the distribution of mutations in individual 
genes and in gene categories what would be expected if each segment of the genome 
were equally likely to be mutated. Even in gene categories that are potential targets for 
positive selection, like antigens and genes associated with antimicrobial resistance, we 
did not find evidence of mutational patterns associated with selection. Instead, in the 
context of within host evolution over short time scales, the distribution of mutations 
appears to be driven by drift.  
 
 
128 
The MTBC contains six human adapted lineages, and four animal adapted lineages [34]. 
The most recent common ancestor to M. bovis and all animal adapted MTBC lineages, 
was likely a human pathogen, and human to cattle transmission may have been related 
to the domestication of cattle [34]. Using molecular dating analyses, Loiseau et al. dated 
the most recent common ancestor of M. bovis to between about 1000-2000 years ago, 
which suggests M. bovis has been adapted to cattle hosts for at least a millennium [13]. 
Additionally, when comparing pairwise nucleotide diversity between human and animal 
adapted M. bovis lineages, Brites et al found that, on average, there was a similar number 
pairwise differences between strains from two distinct human linages as between 
lineages adapted to two different animal host species, suggesting that the genetic 
diversity in animal adapted strains is not well correlated with the genetic distance 
between host species [34]. From this finding, Brites et al. suggest that host factors may 
not define host tropism in animal adapted MTBC lineages, and instead infection may 
depend on contact and infection opportunity [34]. Although it is expected that animal 
adapted strains would have less nucleotide diversity than human adapted strains if the 
ancestor of the MTBC complex was human adapted, and the animal adapted strains 
went through a population bottleneck, reducing genetic diversity, the animal adapted 
strains have a surprisingly wide host range. Since M. bovis is well adapted to cattle 
hosts, it is possible that the effect of ongoing adaptation is not measurable in short time 
scales within individual hosts. Additionally, cattle are not treated for bovine tuberculosis 
using human important drugs like isoniazid and rifampin, so there is possibly no strong 
selection pressure for M. bovis to rapidly evolve during a single infection.  
 
 
129 
Apart from selection pressures driven by antimicrobial use or recent host 
changes, different conditions in the infection microenvironment may influence mutation 
rates and distributions. A recent study in human tuberculosis patients found a higher 
number of mutations in surgical granulomas compared to sputum samples and suggested 
that increased reactive oxygen species (ROS) activity within granulomas may be linked 
to increased mutation rates, with a bias towards increased transitions [22]. In our study, 
all samples were from granulomatous tissue; however, among samples with three or 
more vSNPs of fSNPs, the transition to transversion ratio (ti:tv) ranged from 0.5 to 5.1 
(Supplementary Table 1). Additionally, samples from one animal, J_47 had the largest 
number of mutations (133 fSNPs and 39 vSNPs), and a mutational bias toward 
transitions (ti:tv = 1.8). It is possible that this individual was infected for a longer period 
than most other animals but could also be possible that this individual experienced a 
higher mutation rate induced by ROS. When excluding mutations found in J_49 samples 
from the mutational density test, no genes were found to be significantly enriched in de 
novo mutations. If we were to compare M. bovis samples within and outside granulomas 
we may observe differences in mutation profiles.  
 
Simulation results 
Despite finding no clear evidence of positive selection, the best fitting models 
describing within host evolution included skewed offspring distributions. It is possible 
that negative selection creates skewed offspring distributions in the host, where 
individuals with more deleterious mutations have reduced reproductive success 
compared to individuals with relatively neutral mutations. The mutation rate was 
 
130 
estimated to be about 5e-9 mutations per site per generation or about 2 mutations per 
genome per 100 generations. This mutation rate estimate is about a factor 10 faster than 
the mutation rate estimated by Morales-Arce et al. [8], although we found the same trend 
that the diversity reducing effect of skewed offspring distributions counteracts the 
diversity generating mechanism of a relatively fast mutation rate. Morales-Arce et al 
also used a different method of modeling progeny distributions, where instead of 
drawing the number of offspring from a Poisson distribution in a SLiM reproduction 
callback in a non-Wright Fisher modeling framework, they modeled variance in 
reproductive success using offspring distribution using two population migration model 
in a Wright-Fisher framework, so differences in mutation rates estimated here and in 
Morales-Arce et al may be driven by this methodological difference. We were not able 
to find clear estimates for bottleneck size or the offspring distribution parameter 𝜆, 
however, the parameter value combinations with the highest support included relatively 
small bottleneck sizes of about 100 individuals. Additional summary statistics are likely 
needed to resolve true parameter values.  
 
Implications on the evolution of antimicrobial resistance 
Our finding that drift dominates M. bovis within host evolution in the context of well 
adapted host species is significant because it highlights the utility of M. bovis in animal 
hosts as a model for evolution of MTBC in the absence of the selective pressure of 
antimicrobial use. Nonsynonymous mutations occurred in genes associated with 
antimicrobial resistance in M. tuberculosis, however, no known resistance-causing 
mutations were identified. Mutations in three AMR-associated genes were fixed within-
 
131 
host, implying the mutation was not deleterious enough to prevent growth in the host. 
Since mutations evolve in genes associated with antimicrobial resistance at 
approximately the same rate as genes in other functional categories, it is possible that a 
combination of drift and population dynamic changes that occur during the course of 
infection provide the necessary genetic diversity in a population that selection can act 
upon, when a human tuberculosis patient is treated with antimicrobials. M. bovis animal 
infections then may provide a useful model for studying the likelihood of developing 
resistance conferring mutations and could be used to describe the basic evolutionary 
processes that influence the trajectory of resistance mutations.  
 
Implications for epidemiological analysis 
Additionally, although it is common to sequence a whole population of bacteria from a 
lesion, the distribution of allele frequencies is often ignored epidemiological studies, or 
variants with low allele frequency are ignored, meaning only fSNPs are analyzed. 
vSNPs that are at low allele frequency are often discarded from population samples in 
epidemiological analysis and phylogenies because each tip in a phylogenetic tree is 
assumed to represent one individual, not a population. We found that using SNP callers 
designed to accurately identify vSNPs, we could characterize not only the variants 
present in a single consensus genome, but also the distribution of alleles in populations 
represented in a whole genome sequence [35], which helped us distinguish single source 
infections from polyclonal infections, and were necessary to generate a more 
comprehensive mutation rate estimate. Using the flexible simulation structure SLiM 
provides, we could include both vSNPs and fSNPs into our simulation and parameter 
 
132 
inference framework. Especially for organisms like M. bovis with relatively slow 
substitution rates compared to other pathogens, analytical frameworks that allow full 
use of the information present in a whole genome sequence could provide a basis for 
developing finer-scale transmission networks or even inferring the time of coalescent 
events with increased resolution.   
 
Limitations 
As we were only interested in de novo mutations that occurred over short time scales, 
our analysis may not provide a complete description of contemporary M. bovis 
evolution. Selection may be an important factor in other contexts, such as adaptation to 
new environments such as new host species, geographic regions, or soil conditions. We 
also did not explicitly test for selection because of the low genetic diversity between 
gene sequences. A larger sample set from a broader genetic background would be more 
appropriate for selection analysis; however, the since the goal of our study was to focus 
specifically on the evolution of mutations that occur inside a host, we were restricted to 
study samples where within host genetic diversity could be assessed, that is, multiple 
samples taken from the same host. Another limitation of our study was the cross-
sectional design, which prevented studying empirical mutational dynamics over time. 
Since there are strict regulations involving culling infected animals upon discovery in 
some countries, a large scale longitudinal analysis of within host M. bovis dynamics in 
animal hosts would be very difficult to accomplish [36]. If longitudinal data became 
available, the simulation structure could be easily adapted to incorporate temporal signal 
into the parameter estimation framework.  
 
133 
Conclusions 
 
Over the relatively short time period of the length of a single infection in cattle and 
white-tailed deer hosts, M. bovis evolution appears to be driven by drift. 
Nonsynonymous mutations were common, even in essential genes; however, the 
number of mutations in different functional categories did not differ from a null model 
of random mutation along the length of the genome. The forward genetic simulation 
framework described here allowed us to test hypotheses for within host dynamics and 
demonstrated the combined effect of skewed offspring distributions and population 
bottlenecks on the observed mutations between individuals. This simulation framework 
could provide useful insights in understanding the pathogen’s epidemiology and the 
evolution of important phenotypes like antimicrobial resistance. 
 
 
 
 
 
 
 
 
 
 
 
 
 
134 
References  
 
1. Wheatley R, Diaz Caballero J, Kapel N, de Winter FHR, Jangir P et al. 
Rapid evolution and host immunity drive the rise and fall of carbapenem resistance 
during an acute Pseudomonas aeruginosa infection. Nat Commun 2021;12(1):2460. 
2. Menardo F, Gagneux S, Freund F. Multiple Merger Genealogies in Outbreaks 
of Mycobacterium tuberculosis. Molecular Biology and Evolution 2020;38(1):290-306. 
3. Irwin KK, Laurent S, Matuszewski S, Vuilleumier S, Ormond L et al. On 
the importance of skewed offspring distributions and background selection in virus 
population genetics. Heredity (Edinb) 2016;117(6):393-399. 
4. Bate R, Jensen P, Hess K, Mooney L, Milligan J. Substandard and falsified 
anti-tuberculosis drugs: a preliminary field analysis. Int J Tuberc Lung Dis 
2013;17(3):308-311. 
5. Nellums LB, Rustage K, Hargreaves S, Friedland JS. Multidrug-resistant 
tuberculosis treatment adherence in migrants: a systematic review and meta-analysis. 
BMC Med 2018;16(1):27. 
6. Kingman JFC. The coalescent. Stochastic Processes and their Applications 
1982;13(3):235-248. 
7. Eldon B, Wakeley J. Coalescent processes when the distribution of offspring 
number among individuals is highly skewed. Genetics 2006;172(4):2621-2633. 
8. Morales-Arce AY, Harris RB, Stone AC, Jensen JD. Evaluating the 
contributions of purifying selection and progeny-skew in dictating within-host 
Mycobacterium tuberculosis evolution. Evolution 2020;74(5):992-1001. 
9. Sabin S, Morales-Arce AY, Pfeifer SP, Jensen JD. The impact of frequently 
neglected model violations on bacterial recombination rate estimation: a case study in 
Mycobacterium canettii and Mycobacterium tuberculosis. G3 (Bethesda) 2022. 
10. Messer PW, Ellner SP, Hairston NG. Can Population Genetics Adapt to Rapid 
Evolution? Trends Genet 2016;32(7):408-418. 
11. Trauner A, Liu Q, Via LE, Liu X, Ruan X et al. The within-host population 
dynamics of Mycobacterium tuberculosis vary with treatment efficacy. Genome Biol 
2017;18(1):71. 
12. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of 
Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 
13. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al.  
An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49-
59. 
14. Peck KM, Lauring AS. Complexities of Viral Mutation Rates. J Virol 
2018;92(14). 
15. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et 
al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak 
of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern 
Minnesota. PLoS One 2016;11(1):e0145735. 
16. Rodrigues RA, Ribeiro Araújo F, Rivera Dávila AM, Etges RN, Parkhill J 
et al. Genomic and temporal analyses of Mycobacterium bovis in southern Brazil. 
Microb Genom 2021;7(5). 
 
135 
17. Almaw G, Mekonnen GA, Mihret A, Aseffa A, Taye H et al. Population 
structure and transmission of Mycobacterium bovis in Ethiopia. Microb Genom 
2021;7(5). 
18. Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH et al. LoFreq: a sequence-
quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity 
from high-throughput sequencing datasets. Nucleic Acids Res 2012;40(22):11189-
11201. 
19. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD et al. VarScan 2: 
somatic mutation and copy number alteration discovery in cancer by exome sequencing. 
Genome Res 2012;22(3):568-576. 
20. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. 
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the 
Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 
21. Tonkin-Hill G, Lees JA, Bentley SD, Frost SDW, Corander J. RhierBAPS: 
An R implementation of the population clustering algorithm hierBAPS. Wellcome Open 
Res 2018;3:93. 
22. Moreno-Molina M, Shubladze N, Khurtsilava I, Avaliani Z, Bablishvili N 
et al. Genomic analyses of Mycobacterium tuberculosis from human lung resections 
reveal a high frequency of polyclonal infections. Nat Commun 2021;12(1):2716. 
23. Wei T, Simko V. R package 'corrplot': Visualization of a Correlation Matrix.  
2021. 
24. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T et al. A program for 
annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs 
in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 
2012;6(2):80-92. 
25. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE et al. Comprehensive 
Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating 
Transposon Mutagenesis. mBio 2017;8(1). 
26. Vargas R, Freschi L, Marin M, Epperson LE, Smith M et al. In-host 
population dynamics of. Elife 2021;10. 
27. Farrell D, Crispell J, Gordon SV. Updated functional annotation of the 
Mycobacterium bovis AF2122/97 reference genome. Access Microbiol 
2020;2(7):acmi000129. 
28. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. 
Integrating informatics tools and portable sequencing technology for rapid detection of 
resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 
29. Haller BC, Messer PW. SLiM 3: Forward Genetic Simulations Beyond the 
Wright-Fisher Model. Mol Biol Evol 2019;36(3):632-637. 
30. Regulation CotAoA-BMtITP, Practice BoPHaPH, Medicine Io. Assessing 
the Use of Agent-Based Models for Tobacco Regulation.   2015. 
31. Csilléry K, François O, Blum MGB. abc: an R package for approximate 
Bayesian computation (ABC). Methods in ecology and evolution 2012;3(3):475-479. 
32. Wegmann D, Leuenberger C, Excoffier L. Efficient approximate Bayesian 
computation coupled with Markov chain Monte Carlo without likelihood. Genetics 
2009;182(4):1207-1218. 
 
136 
33. Jabot F, Faure T, Dumoulin N. Easy ABC: performing efficient approximate 
B ayesian computation sampling schemes using R. Methods in Ecology and Evolution 
2013;4(7):684-687. 
34. Brites D, Loiseau C, Menardo F, Borrell S, Boniotti MB et al. A New 
Phylogenetic Framework for the Animal-Adapted Mycobacterium tuberculosis 
Complex. Front Microbiol 2018;9:2820. 
35. Goossens SN, Heupink TH, De Vos E, Dippenaar A, De Vos M et al. 
Detection of minor variants in Mycobacterium tuberculosis whole genome sequencing 
data. Brief Bioinform 2022;23(1). 
36. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole 
Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United 
States, 1989-2018. Front Vet Sci 2018;5:253. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
137 
Supplementary materials  
 
United States Brazil
1 A_1_156 1
B_2_0 A_1_156
B_2_0 A_8_162 0.8
C_1_44 0.8 A_8_162C_14_166
C_1_44 C_14_166
0.6
Deer_14_3 C_16_362
0.6
C_16_362
Deer_14_3
0.4 C_18_119 0.4
Deer_2_0 C_18_119
Deer_2_0 C_19_455
0.2 C_19_455 0.2
Deer_24_5 C_19_455
Deer_24_5 C_20_434
0 C_20_434 0
Deer_25_4 C_21_327
Deer_25_4 C_21_327
−0.2 C_21_327 −0.2
Deer_26_1 E_30_254
Deer_26_1 E_30_254
−0.4 F_33_261 −0.4Deer_8_0 F_33_261
Deer_8_0 F_35_265
−0.6
E_1_2 F_35_265 −0.6F_35_265
E_1_2
−0.8 I_43_10
J_1_1 I_43_10 −0.8
J_1_1 J_47_3
−1 J_47_3 −1
Ethiopia
AA_1C0001_0 1 1
AA_1C0001_0 SA_2C0007_0
AA_1C0003_1 SA_2C0007_0
AA_1C0003_1 0.8 SA_2C0088_1
AA_1C0003_1 SA_2C0088_1 0.8
B_agr−056_569 SA_2C0217_569
B_agr−056_569 SA_2C0217_569
E_tse−036_2 0.6 SA_2C0217_569 0.6
E_tse−036_2 SA_2C0229_314
E_tse−036_2 SA_2C0229_314
H_pri−063_82 0.4 SA_2C0305_2
H_pri−063_82 SA_2C0305_2 0.4
I_sad−005_282 W_8A0035_347
I_sad−005_282 0.2 W_8A0035_347
I_sad−005_282 X_DAERO_50 0.2
M_ser−002_531 X_DAERO_50
M_ser−002_531 Y_4577_102
M_ser−002_531 0 Y_4577_102 0
M_ser−007_417 Y_4577_102
M_ser−007_417 Y_4589_95
N_ALB720_43 −0.2 Y_4589_95
N_ALB720_43 Y_4589_95 −0.2
N_ALB720_43 Y_4589_95
N_ALB720_43 Y_4594_441
R_211−178_291 −0.4 Y_4594_441 −0.4
R_211−178_291 Y_4594_441
R_25/158_353 Y_4630_194
R_25/158_353 −0.6 Y_4630_194Y_4630_194 −0.6
R_26014_27 Y_6013_5
R_26014_27 Y_6013_5
SA_2C0005_5 −0.8 Z_8A0574_42
SA_2C0005_5 −0.8Z_8A0574_42
SA_2C0007_0 Z_8A0574_42
SA_2C0007_0 −1 Z_8A0574_42 −1  
Supplemental Figure 1. The correlation of allele frequencies of samples within each 
geographic group is shown. Ethiopian samples were broken up into two matrices for 
ease of visualization. Samples within each geographic group are labeled by animal id 
and number of pairwise fSNP differences between all samples taken from one animal. 
For example, label “B_2_0” indicates animal ID B_2 and zero fSNPs. Sample IDs are 
colored in black if all samples were located in the same phylogenetic cluster, and red if 
samples were located in more than one cluster. Polyclonal infections (labeled in red) 
had lower allele frequency correlations compared to single source infection. 
 
 
 
AA_1C0001_0 B_2_0
AA_1C0001_0
AA_1C0003_1 B_2_0
AA_1C0003_1 C_1_44
AA_1C0003_1
B_agr−056_569 C_1_44
B_agr−056_569
E_tse−036_2 Deer_14_3
E_tse−036_2 Deer_14_3
E_tse−036_2
H_pri−063_82 Deer_2_0
H_pri−063_82
I_sad−005_282 Deer_2_0
I_sad−005_282 Deer_24_5
I_sad−005_282
M_ser−002_531 Deer_24_5
M_ser−002_531
M_ser−002_531 Deer_25_4
M_ser−007_417 Deer_25_4
M_ser−007_417
N_ALB720_43 Deer_26_1
N_ALB720_43
N_ALB720_43 Deer_26_1
N_ALB720_43
R_211−178_291 Deer_8_0
R_211−178_291 Deer_8_0
R_25/158_353
R_25/158_353 E_1_2
R_26014_27
R_26014_27 E_1_2
SA_2C0005_5 J_1_1
SA_2C0005_5
SA_2C0007_0 J_1_1
SA_2C0007_0
A_1_156
SA_2C0007_0 A_1_156
SA_2C0007_0
SA_2C0088_1 A_8_162
SA_2C0088_1 A_8_162
SA_2C0217_569 C_14_166
SA_2C0217_569 C_14_166
SA_2C0217_569 C_16_362
SA_2C0229_314
SA_2C0229_314 C_16_362
SA_2C0305_2 C_18_119
SA_2C0305_2 C_18_119
W_8A0035_347 C_19_455
W_8A0035_347 C_19_455
X_DAERO_50
X_DAERO_50 C_19_455
Y_4577_102 C_20_434
Y_4577_102 C_20_434
Y_4577_102 C_21_327
Y_4589_95
Y_4589_95 C_21_327
Y_4589_95 C_21_327
Y_4589_95 E_30_254
Y_4594_441 E_30_254
Y_4594_441 F_33_261
Y_4594_441
Y_4630_194 F_33_261
Y_4630_194 F_35_265
Y_4630_194 F_35_265
Y_6013_5 F_35_265
Y_6013_5 I_43_10
Z_8A0574_42
Z_8A0574_42 I_43_10
Z_8A0574_42 J_47_3
Z_8A0574_42 J_47_3
138 
Model 1 p=0.185 Model 2 p=0.107
Observed distance Observed distance
2 4 6 8 10 2 4 6 8 10
Distance Distance
Model 3 p=0.086 Model 4 p=0.112
Observed distance Observed distance
2 4 6 8 2 4 6 8
Distance Distance
 
Supplemental Figure 2. The null distributions for the test statistic for the goodness-of-
fit assuming each model structure. The test statistic used was the mean of the vector of 
the distance between the observed and expected number of synonymous mutations,  
nonsynonymous mutations, and number of differences between two samples. 
 
 
 
 
 
Density Density
0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8
Density Density
0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0
139 
Supplementary Table 1. Number of transitions, transversions, fSNPs and vSNPs found 
in the collection of samples from each animal.  
ID Transitions Transversion Ti:Tv fSNPs vSNPs 
B_1 0 2 0 0 2 
Deer_12 0 1 0 0 1 
Deer_5 0 1 0 0 1 
Deer_9 0 1 0 1 0 
E_1 0 2 0 0 2 
E_2 0 1 0 0 1 
E_3 0 2 0 2 0 
L_10 0 1 0 0 1 
B_3 1 2 0.5 0 3 
K_1 1 2 0.5 1 2 
Deer_22 2 2 1 0 4 
Deer_24 2 2 1 3 1 
SA_2C0005 4 4 1 5 3 
Y_6013 5 4 1.25 8 1 
B_2 4 3 1.33 0 7 
Deer_10 7 5 1.4 1 11 
Deer_18 3 2 1.5 0 5 
Deer_7 3 2 1.5 0 5 
AA_1C0003 11 6 1.83 9 8 
J_47 110 60 1.83 131 39 
Deer_16 2 1 2 0 3 
G_1 2 1 2 0 3 
SA_2C0088 11 5 2.2 9 7 
R_26014 8 3 2.67 3 8 
AA_1C0001 26 9 2.89 34 1 
Deer_17 3 1 3 0 4 
Deer_26 3 1 3 0 4 
Deer_8 3 1 3 0 4 
SA_2C0007 99 31 3.19 115 15 
Deer_19 4 1 4 2 3 
Deer_23 4 1 4 0 5 
L_2 5 1 5 0 6 
I_43 61 12 5.08 71 2 
Deer_14 1 0  1 0 
Deer_21 1 0  1 0 
Deer_25 1 0  1 0 
Deer_27 1 0  0 1 
Deer_6 1 0  0 1 
 
140 
F_1 1 0  0 1 
H_2 1 0  0 1 
J_2 1 0  0 1 
L_11 1 0  0 1 
L_6 1 0  0 1 
L_7 1 0  1 0 
L_8 1 0  1 0 
L_9 1 0  1 0 
Deer_20 2 0  0 2 
D_1 3 0  0 3 
Deer_15 5 0  1 4 
Deer_2 5 0  4 1 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
141 
 
 
 
 
 
 
 
CHAPTER 5 
The evolution of antimicrobial resistance in the Mycobacteirum tuberculosis complex 
in the absence of antimicrobial use  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
142 
Abstract  
 
Antimicrobial resistant Mycobacterium tuberculosis is a growing threat to global health. 
Antimicrobial treatment selects for resistant mutations to reach high population 
frequencies; however, despite fitness costs associated with protein structure altering 
drug resistance mutations, resistance has also been found in the Mycobacterium bovis 
in animal hosts in the absence of tuberculosis-specific drug treatment. We hypothesized 
that both small effective population sizes during transmission events and the 
coevolution of neutral or low-cost structural enhancing compensatory mutations can 
lead to resistance evolution in the absence of positive selection by antimicrobial 
treatment. Using a global sample of over 3000 M. bovis whole genome sequences, we 
find that genotypic resistance is rare, but resistance to first-line tuberculosis drugs 
occurs in sequences from both human and animal hosts. Using forward genetic 
simulation incorporating predicted structural effects of specific resistance-causing 
mutations, we show that resistance mutations can evolve through mutation and drift and 
can become fixed through population bottlenecks. Our work shows that the epistatic 
interaction of structural stabilizing mutations and resistance-causing destabilizing 
mutations, and the demographic effect of population bottlenecks on allele frequency 
distributions can describe a mechanism for evolution of antimicrobial resistance without 
antimicrobial use. Resistance evolution in M. bovis in the absence of antimicrobial use 
is both theoretically and empirically rare but is possible, and the mechanism described 
here may demonstrate the fundamental processes governing resistance evolution before 
selection in the Mycobacterium tuberculosis complex. 
 
 
143 
Introduction 
 
Antimicrobial resistant tuberculosis is a global health emergency. In 2018 ~1.2 million 
deaths were attributed to tuberculosis, and 10 million new tuberculosis cases were 
reported globally, including over 450,000 multidrug resistant infections [1]. 
Antimicrobial treatment saved 45 million lives between 2000 and 2017 [1], but without 
access to adequate treatment or drugs capable of treating resistant infections, 
tuberculosis can be fatal. Among people with multi drug resistant infections, treatment 
is successful in only 56% of patients [1]. Understanding how antimicrobial resistance 
(AMR) evolves in the Mycobacterium tuberculosis complex (MTBC), the cause of 
tuberculosis in humans and animals, could lead to improved drugs and drug treatment 
plans, and could aid in developing policies that ensure sustainable antimicrobial use.  
The MTBC is monophyletic and includes two main pathogenic ecotypes: 
Mycobacterium bovis the main cause of tuberculosis in cattle and other ungulates, and 
Mycobacterium tuberculosis, the main cause of tuberculosis in humans. M. bovis and 
M. tuberculosis have greater than 99% sequence identity in alignable regions and share 
identical sequences for many genes associated with resistance [2]. Additionally the 
pathology and etiology of bovine tuberculosis in animals is very similar to that of 
tuberculosis in humans, from transmission route, to infection site, and granuloma 
composition [3]; therefore, within-host population dynamics of M. bovis and M. 
tuberculosis are likely to be similar. Although M. bovis and M. tuberculosis are 
genetically similar and susceptible to many of the same drugs, first line antibiotics used 
to treat human tuberculosis infections: isoniazid, rifampin, ethambutol and 
pyrazinamide, are not used in animals. Instead, animals that test positive for M. bovis 
 
144 
are culled and removed from the food chain to prevent zoonotic infection. Since animals 
are not treated with tuberculosis antimicrobials, population dynamics of tuberculosis 
infection without the selection pressure of antimicrobial use can be studied in this 
system. Antimicrobial resistant M. bovis in animal hosts [4-6], but the mechanisms for 
resistance evolution in animal hosts remains unclear. 
AMR tuberculosis results from an individual being infected with an already 
resistant strain (primary infection), or from de novo evolution within the host (acquired 
resistance). Acquired resistance is strongly linked to antimicrobial use; however, 
resistant M. bovis isolates have been found in cattle who have not been treated with 
tuberculosis antimicrobials, so there are likely evolutionary forces other than selection 
by antimicrobial use that results in resistance (2, 3). Unlike many other pathogenic 
bacteria, the MTBC cannot acquire genes through horizontal gene transfer, so resistance 
acquired through single nucleotide polymorphisms (SNPs), or insertions or deletions 
[2]. SNPs that lead to non-synonymous changes are usually deleterious; however, some 
mutations can be low-cost or cost-free [7] and others could co-evolve with 
compensatory mutations that restore fitness lost by the resistance mutations. For 
example, mutations in rpoB, the gene that encodes for the b-subunit of the RNA 
polymerase, specifically those within an 81 base pair rifampin resistance determining 
region (RRDR) lead to rifampin resistance [8]. Although mutations in rpoB cause a 
decrease in fitness associated with decreased RNA polymerase function, secondary 
mutations in rpoA/B/C can reverse the deleterious effect of rifampin resistance, negating 
the fitness cost of resistance [7]. The epistatic interaction of the resistance conferring 
 
145 
rpoB mutation and additional compensatory mutations produces a higher fitness than 
the wild type under rifampin exposure. 
Since AMR mutations occur in cattle hosts where they are likely deleterious in 
the absence of the selection pressure caused by antimicrobial use, we suspect the 
evolution of AMR mutations is also influenced demographic factors, particularly, serial 
population bottlenecks created by transmission. Transmission bottlenecks decrease 
effective population size each time a new host is infected, which could create the 
opportunity for weakly deleterious alleles to expand within a host by drift. If the 
mutations are not deleterious enough to prevent transmission, they could propagate in a 
susceptible host population.  
Our objective is to study evolutionary mechanisms leading to drug resistance 
evolution in M. bovis in cattle. Our central hypothesis is that if drug resistance mutations 
exist in M. bovis isolated from cattle not treated for bovine tuberculosis, then 
mechanisms of drug resistance evolution other than selection by drug use exist. Using 
forward genetic simulation we investigate the impact of transmission bottlenecks on the 
evolution of resistant mutations, incorporating specific amino acid mutation effects 
predicted by Portelli et al.[9]. We compare the simulated results to the empirical 
distribution of genotypic resistance mutations in over 3000 M. bovis whole genome 
sequences to attempt to understand drivers of AMR in an MTBC population that has not 
experienced selection pressure through AMU. Our unique comparative approach can be 
used to better understand the evolutionary mechanisms that drive resistance evolution 
and may be incorporated into treatment plan design to slow the emergence of drug 
resistant MTBC.   
 
146 
Methods 
 
M. bovis sample selection and processing 
We used the collection of M. bovis whole genome sequences described in 
Supplementary Table 1 of Chapter 3 to study the empirical distribution of resistance 
genotypes. This sample contains 3438 genomes and is representative of all known M. 
bovis lineages. Raw fastq files were downloaded from the NCBI Sequence Read 
Archive. SNPs were called using the vSNP pipeline (https://github.com/USDA-VS) 
using the M. tuberculosis H37Rv reference, and SNPs were removed if they had a read 
depth less than 10, an allele frequency less than 0.9, and a mapping quality less than 30 
using vcffilter in vcflib v1.0.2 [10]. Lineages were assigned using SNP markers 
described in Zwyer et al. [11] and genotypic resistance was predicted using TB-Profiler 
[12] run in parallel using GNU Parallel [13]. TB-Profiler predicts genotypic resistance 
to four first-line drugs: isoniazid, rifampicin, ethambutol and pyrazinamide, in addition 
to second-line drugs: aminoglycosides (kanamycin, streptomycin, amikacin, 
capreomycin), fluoroquinolones (ciprofloxacin, levofloxacin, ofloxacin, moxifloxacin), 
ethionamide, delamanid, linezolid, clofazimine, bedaquiline, para aminosalicylic acid 
(PAS), and cycloserine [12, 14]. Pyrazinamide is not considered a first-line drug for 
human M. bovis infections, because most M. bovis lineages are resistant [15]. To remove 
low-coverage or otherwise poor SNP calls associated with resistance prediction, TB-
Profiler resistance predictions were filtered if they were not present in the corresponding 
filtered vcf files.  
 In addition to fixed SNPs present in greater than 90% of reads, we were also 
interested in studying variable SNP (vSNP) dynamics within genes associated with 
 
147 
resistance. vSNPs were called using lofreq [18], and variants were filtered such that 
positions had least 10 reads supporting them, a coverage of at least 50x, a minimum 
Phred score of 20, and a minimum allele frequency of 0.5 %. Variants were labeled as 
vSNPs if the allele frequency was between 5% and 90% and fixed SNPs (fSNPs) if 
allele frequency was greater than or equal to 90%.  
 
Phylogenetic analysis  
Individual vcf files were merged using bcftools [16], and the merged vcf file was 
translated into a SNP alignment using custom scripts in RStudio 2021.09.0 [17] using 
R v 4.1.2 [18]. To better visualize the position of drug resistant samples in a phylogeny, 
susceptible samples were randomly down sampled. Within each lineage the same 
number of pan-susceptible or pyrazinamide resistant genomes and samples resistant to 
pyrazinamide and at least one other drug were selected. A phylogenetic tree was created 
from the SNP alignment using IQ-Tree 2 using a GTR+G substitution model [19]. Trees 
were visualized using iTOL [20]. 
 
Genome-wide epistasis analysis  
We conducted a genome-wide epistasis analysis using SpydrPick to detect signatures of 
epistatic interactions between resistance-conferring mutations and potential 
compensatory mutations [21]. The merged vcf file was fed into SpydrPick, and the 
resulting SNP-SNP relationships were filtered to include only mutations in genes with 
known resistance mutations detected by TB-Profiler, or potential target genes for 
compensatory mutations. The list of compensatory mutations was curated using known 
 
148 
mutations reviewed in Alame Emane et al. [22]. in addition to genes found in the same 
biological pathway as genes with known resistance mutations. Biological pathways 
were predicted using WikiPathways [23]. The candidate set of epistatic interactions was 
filtered to only contain only drug resistance mutations and non-synonymous mutations 
in compensatory genes.  
 
Simulation structure 
We created a nucleotide-based non-Wright Fisher model to simulate the evolution of 
katG during a single infection using SLiM 3 v 3.7.1 [24]. Mutations in katG cause 
isoniazid resistance, and katG was chosen because the in vitro mutational target size for 
isoniazid resistance is larger than that of other first-line drugs [25, 26]. The simulation 
was initialized using the M. bovis AF2122.97 reference nucleotide sequence for katG, 
which is identical to the M. tuberculosis H37Rv katG sequence. At initialization, the 
nucleotide sequence is converted to amino acids in SLiM, and the initial codon sequence 
is recorded. As in chapter 4, simulations consisted of three stages. The first stage 
consisted of 1000 generations of neutral burn-in. The second stage consisted of a 
population bottleneck to simulate a transmission event, and the third stage consisted of 
a population expansion to a constant within-host size for 365 generations. 10,000 
replicates of each simulation structure were run. The effective population size is 
governed by a density dependent function that decreases fitness if the population size 
exceeds carrying capacity. At each generation, every individual has the potential to 
reproduce. The number of offspring generated by a single individual is drawn from a 
Poisson distribution with parameter 𝜆. The mutation rate, bottleneck size, and carrying 
 
149 
capacity parameter were rescaled by a factor of 10 to improve computational efficiency, 
but still approximate the same within-host evolutionary trends as described in chapter 
4. A diagram of the simulation structure is shown in Figure 16. 
 
A
A C
B C
B C
B C
A
B C
C
C C
C
C C
Haplotypes in bottleneck
B C2
D C3
B C
Haplotypes pre-bottleneck Haplotypes post-bottleneck
B C
1.0 C
1
Deleterious
Deleterious + 
Beneficial 
0.5
C
0
Weak Moderate Strong A
Destabilizing effect of mutation A0
Pre-bottleneck Post-bottleneck
 
Figure 16. Simulation structure and graphical hypothesis. 
A. Example haplotypes before, during, and after an infection bottleneck. By chance, low 
frequency haplotypes are sampled in the bottleneck. B. Hypothetical fitness effects of 
individuals with different types of mutations is shown relative to the fitness of wildtype 
individuals. Structurally stabilizing mutations can reverse some of the deleterious effect 
of destabilizing mutations. A,C. One haplotype fixes in the post-bottleneck population, 
whereas the other haplotype is lost.  
 
Relative fitness of haplotype
Allele frequency
150 
 
Mutation effect prediction 
Recently, Portelli et al. predicted and summarized the fitness effect of mutations in three 
genes associated with resistance to isoniazid, rifampicin and D-cycloserine using a 
series protein structural analysis [9]. Mutations were assigned a selection coefficient 
proportional to their predicted effect listed in Supplementary Table 2 of Portelli et al. to 
simulate the effect of specific katG amino acid mutations on bacterial fitness. At each 
iteration, the number the combined effect of all mutations on the genome was calculated 
as the additive effect of the number of nonsynonymous mutations n multiplied by the 
effect e of each mutation. Most mutations are deleterious, but some have stabilizing 
effects when they occur on the same genome as deleterious mutations. Therefore, we 
model both deleterious mutations and beneficial mutations. We used a custom fitness 
callback function to alter the fitness depending on the number and type of mutations: 
𝑠 = 	1 − (𝑛>@ ∗ 𝑒>@ + 𝑛P@ ∗ 𝑒>@ + 𝑛Q@ ∗ 𝑒Q@ + 𝑛=4 ∗ 𝑒=4)
+ (𝑛>J ∗ 𝑒>J + 𝑛PJ ∗ 𝑒>J + 𝑛QJ ∗ 𝑒QJ)	 
Since stabilizing mutations only have an effect in the presence of destabilizing 
mutations, we fixed the maximum value for s to be 1. Model parameters are shown in 
Table 10. Note, the number of mutations in each effect category is given the same 
subscript as the effect category. Since we did not know the actual value of mutations on 
fitness in the absence of antimicrobials (MIC can be used to indicate fitness in the 
presence of antimicrobials), we tested two fitness rescaling parameters, which modified 
the effect of each mutation.  
 
 
151 
Table 10. Simulation model parameters. 
Parameter Description Value 
B Infection bottleneck size 100 (10 rescaled) 
K Carrying capacity 1000 (100 
rescaled) 
𝜇 Mutation rate 5e-8 (5e-7 
rescaled) 
𝜌 Recombination rate 0 
𝜆 Parameter of Poisson distribution 0.3 
controlling offspring count 
𝑟@ Scaling parameter for deleterious [0.01, 0.001] 
mutations 
𝑟J Scaling parameter for beneficial mutations [0.01, 0.001] 
𝑒>@ Effect of weakly deleterious mutation 1 ∗ 𝑟@ 
𝑒>J Effect of weakly beneficial mutation 1 ∗ 𝑟J 
𝑒P@ Effect of moderately deleterious mutation 5 ∗ 𝑟@ 
𝑒PJ Effect of moderately beneficial mutation 5 ∗ 𝑟J 
𝑒Q@ Effect of highly deleterious mutation 10 ∗ 𝑟@ 
𝑒QJ Effect of highly beneficial mutation 10 ∗ 𝑟J 
𝑒=4 Effect of otherwise not categorized 1 ∗ 𝑟@ 
nonsynonymous mutation 
B, K, and 𝜇	were rescaled to improve computational efficiency. 
 
Analysis of simulation results 
The final amino acid sequence for each simulation was recorded and compared to the 
reference sequence. Nonsynonymous mutations were compared to the katG amino acid 
changing mutations listed in the database of known resistance causing mutations utilized 
by TB-profiler [12]. The proportion of resistance mutations among 10,000 iterations of 
different simulation structures was compared using Pearson’s chi squared tests using 
the chisq.test function in R with simulated p-values.   
 
 
 
152 
Results  
 
Genotypic resistance was predicted for 3729 M. bovis whole genome sequences. The 
presence or absence of resistant genotypes is shown with a phylogeny of a reduced 
subset of samples in Figure 17. All samples resistant to pyrazinamide and at least one 
other drug and a subset of pan-susceptible and pyrazinamide monoresistant samples are 
shown in the phylogeny in Figure 17. The prevalence of drug resistance to any drug 
besides pyrazinamide was 4.6 % and was and was zero for five drugs (delaminid, 
clofazimine, bedaquiline, PAS, and cycloserine). Lineage La1.1 was pyrazinamide 
susceptible, and all other lineages were resistant. Although prevalence for all drugs 
tested was low, we detected resistance to all first line drugs except ethambutol (0.64 % 
isoniazid resistance, 0.15 % and rifampicin resistance). Correlations in drug resistance 
genotypes is shown in Supplemental Figure 1.  
 
153 
 
Figure 17. Phylogeny of TB-Profiler predicted resistant samples and closely related 
samples. 
The phylogeny was created using a GTR+G substitution model. M. bovis lineages, host 
species, and drug resistance genotypic predictions are labeled. Tree scale units are in 
substitutions per site.  
 
154 
 
The prevalence of resistance to each drug among human, cattle and deer samples is 
shown in Figure 18. As expected, the resistance prevalence for first-line drugs was 
higher in humans than animals, and the difference in prevalence between human and 
animal host groups is likely directly related to the selection pressure of antimicrobial 
treatment in human hosts. Among second-line drugs, the prevalence of each of the 
fluoroquinolones was highly correlated, likely because of the shared mechanism of 
action of all fluroquinolones, whereas the prevalence of each aminoglycoside was less 
correlated. Among aminoglycosides, the prevalence of streptomycin was high overall 
(1.3 % overall prevalence) and was especially high in cattle (1.6 % in cattle vs 0.88% 
in human).  
 
 
 
 
155 
 
Figure 18. TB-Profiler predicted resistance genotypes for M. bovis samples from 
human, cattle, and deer hosts.  
Human samples had the highest prevalence of resistance for all drugs except 
streptomycin. Drug resistance was generally rare (less than 1% per host species), but 
resistance to first-line tuberculosis drugs was found in both human and animal hosts.   
 
Epistatic associations between resistance causing mutations and potential compensatory 
mechanisms were predicted using SpydrPick. Correlation matrices of mutations located 
in resistance-causing genes or potential compensatory genes are shown in Figure 19. 
 
 
 
 
156 
 
Figure 19. Correlation between resistance-causing mutations and potential 
compensatory mutations.  
Drug resistance mutations are colored in red. Mutations in genes related to isoniazid and 
streptomycin resistance were uncorrelated, whereas some mutations in genes related to 
rifampicin and fluoroquinolone resistance were correlated with other mutations in the 
same drug category. Mutations in rpoB and rpoC were correlated with rpoB mutations 
associated with rifampicin resistance.   
 
157 
Mutations in RNA polymerase genes were correlated with rifampicin resistance-
causing mutations. The distribution of SNPs correlated with rifampicin resistance is 
shown in Figure 20.  
Figure 20. Distribution of mutations associated with rifampicin 
resistant, and other correlated mutations in RNA polymerase 
subunit genes rpoB and rpoC. Clades without correlated rpoB/C 
mutations were collapsed for clarity. 
 
 
158 
In addition to fSNPs, we analyzed vSNPs in RNA polymerase subunit genes to better 
understand the dynamic evolution of rifampicin resistance. Variable and fixed mutations 
in rpoB, rpoC, and rpoA are shown in Figure 21. 
 
 
 
 
 
 
 
 
Figure 21. Variable and fixed mutations in RNA polymerase subunit genes plotted 
along a radial depiction of the M. bovis phylogeny shown in Figure 17.  
Genomes with mutations associated with resistance are labeled in blue, and genomes 
with fixed or variable SNPs in rpoB, rpoC, and rpoA are shown in yellow, green, and 
teal respectively.  
 
Although only five M. bovis samples, all from human hosts, had fSNPs associated with 
rifampicin resistance (Figure 17, Figure 20), an additional ten samples had vSNPs 
associated with rifampicin resistance including eight samples from cattle hosts, and one 
sample from a deer host. All genomes with vSNPS associated with resistance also had 
potentially compensatory vSNPs or fSNPs in rpoA and rpoC. Allele frequencies of a 
known compensatory mutation, Val431Met in RpoC, and linked resistance mutations 
are shown in Figure 22. 
 
159 
0.639175
0.550898
0.470588
0.408805
0.360544
0.313253
0.260274
0.251142
0.239631
0.214876
0.210526
0.202532 RifR
0.200000 R
0.168950
0.166667 S
0.163934
0.144654
0.131707
0.114754
0.114583
0.095652
0.094488
0.090909
0.089286
0.062827
0.033163
Va
l er hr la et
804 39
S 8T 1A M
e 4 r42 r4
4 l43
1
p.I
l
_ p.P
ro
p.S
e
p.S
e
p.V
a
oB B_ B_rp o o oB
_ C_
rp rp rp rpo
gene_mut  
Figure 22. Allele frequencies of variable SNPs associated with resistance and linked 
rpoC compensatory mutations. 
Solid lines connect mutations found on the same genome. Allele frequencies of 
mutations associated with resistance are colored in red, and the allele frequencies of the 
putative compensatory mutation is colored in blue.  
 
 
 
AF
160 
In genomes with both the RpoC Val431Met compensatory mutation and rifampicin-
resistance associated mutations in RpoB, the compensatory mutation was usually at 
higher allele frequency than the resistance-associated mutation (5/6 genomes). Since 
there are low recombination rates in M. bovis it is likely that the compensatory mutation 
evolved before the resistance-associated mutation. It is possible that the compensatory 
mutation decreased the fitness cost for resistance mutations to evolve.  
 
Simulation results  
We developed codon-explicit forward genetic simulations were to simulate the effect of 
population bottlenecks and linked structurally stabilizing mutations on the evolution of 
resistance mutations in katG. Two different mutation effect scaling factors were used 
(0.01 or 0.001) and two demographic structures (population bottleneck or constant 
population size) were tested. The prevalence of resistance mutations in each simulation 
structure is shown in Figure 23.  
 
 
161 
 
Figure 23. Prevalence of resistance-causing mutations katG that emerged during 
forward genetic simulations. 
More resistance mutations evolved in scenarios with population bottlenecks, and more 
resistance evolved in scenarios with a smaller mutation scaling effect.  
We compared the proportion of resistance mutations that evolved between simulations 
with the same scaling factor and between simulations with and without population 
bottlenecks. Population bottlenecks caused an extra 0.3 percent of resistance mutations 
in simulations with a scaling factor of 0.01 and an extra 0.4 percent of resistance 
mutations in simulations with a scaling factor of 0.001. 
 
Discussion 
 
All knowledge gained on M. tuberculosis genome dynamics since whole genome 
sequencing became available was discovered either using clinical isolates that have been 
exposed to 50 years of antimicrobial therapy, or laboratory strains. By studying the 
evolution of antimicrobial resistance in M. bovis sequences using a combination of 
 
162 
genotypic resistance prediction and forward genetic simulation we show the resistance 
in an MTBC population that has not experienced selection pressure through AMU. We 
found that genotypic resistance is rare among a global dataset of M. bovis sample, but 
generally more prevalent in samples from human hosts than from samples from animal 
hosts. Although prevalence of resistance was higher in human hosts, resistance to first-
line tuberculosis drugs was found in animal hosts including wildlife.  
The higher prevalence of resistance in human hosts was expected because 
tuberculosis treatment selects for drug resistance to evolve. Since animals are not treated 
for bovine tuberculosis with first-line drugs, resistance in genotypes could have either 
been transmitted from human to animal hosts or have evolved de novo. Human to animal 
MTBC transmission is well documented [27-30], but to date, only one study has 
documented animal to human transmission [31]. Human M. bovis infections are usually 
caused by zoonotic transmission or by eating contaminated food sources; however, 
person to person transmission has been documented in immunocompromised people but 
is very rare [32]. Since human to animal transmission is likely very rare, at least some 
of the resistance to first-line tuberculosis drugs here likely evolved de novo. We also 
only evaluated genotypic resistance predicted by TB-Profiler, which has high sensitivity 
and specificity for most drugs (see Table 2 in Pelhan et al. [12]); however, since we do 
not have phenotypic confirmation for most isolates, we cannot rule out that resistance 
predictions are not altered by contamination.   
Probability of resistance evolution is related to mutational target size [25]. 
Because of the larger mutational target size, isoniazid resistance often evolves before 
rifampicin resistance. Since more mutations can cause isoniazid resistance, we 
 
163 
hypothesized that we would find a higher isoniazid resistance prevalence in animal hosts 
than other first line drugs. Isoniazid resistance was 14.9 times more common among 
cattle and deer hosts than rifampin resistance, and 7.5 times more common than 
ethambutol resistance. Additionally, KatG may not be essential in vivo [33], so slightly 
deleterious mutations may be tolerated. Epistatic interactions between resistance 
conferring mutations and protein stabilizing mutations may also increase the probability 
of resistance evolution. In our simulations, stabilizing mutations were extremely rare 
and did not influence the dynamics of resistance evolution considerably; however, we 
chose arbitrary values for the fitness effect of beneficial mutations based on the effects 
described in Portelli et al.[9], and it is possible that with increased selection coefficients 
the simulation dynamics would change. In the empirical dataset we found two separate 
fixed putative stabilizing mutations in rpoC associated with rifampin resistance-causing 
mutations in rpoB. Additionally, we found different rpoC mutations correlated with 
rpoB mutations in animal hosts, however samples with these mutations were predicted 
to be rifampicin susceptible. We also found multiple vSNPs associated with rifampicin 
resistance and linked mutations in rpoA and rpoC in both human and animal hosts and 
show that putative compensatory mutations may evolve before resistance-associated 
mutations. Random rpoB/C mutation combinations may be tolerated in the absence of 
antimicrobial use and could potentially prime individuals for rapid resistance 
development upon exposure to antimicrobials.  
In addition to allelic interactions, resistance dynamics may be shaped by 
transmission bottlenecks. If resistance alleles are present at low frequency in one host 
and by chance are transmitted to another host, they may expand to high frequencies in 
 
164 
the new host. We consistently found more resistance mutations in simulations that 
modeled a population bottleneck than those with a constant population size. Although 
the resistance mutations we simulated decreased the overall fitness of the individual, the 
resistance mutations could still fix in the post-bottleneck population because of the high 
variance in allele frequencies in small populations. The combined effect of a relatively 
fast mutation rate, skewed offspring distributions, and small population sizes during 
transmission events can lead to high variance in mutation distributions within host. 
Under the presence of antimicrobial use, selection can act to sweep these mutations, and 
any mutations linked to the resistance-conferring mutations, to high population 
frequency.  
Lastly, the distribution of second-line resistant genotypes suggests that there 
may be selection for aminoglycoside and fluoroquinolone resistant M. bovis in farmed 
animal hosts. The difference in prevalence between isoniazid resistance and second-line 
drug resistance may be related to off-target selection pressure driven by drug use for 
purposes other than M. bovis infection. Streptomycin resistance was the most common 
resistant genotype in cattle (1.3% prevalence), followed by fluoroquinolone resistance 
(0.87 % prevalence), isoniazid (0.64% prevalence), and other aminoglycosides (0.03 % 
prevalence). Additionally, despite perfect correlation in resistance genotypes among all 
fluroquinolones tested, which is expected for members of a drug class with the same 
mechanism of action, there was a much higher resistance prevalence in streptomycin 
than other aminoglycosides in animal hosts. Although some mutations confer resistance 
to all aminoglycosides [34], the number of known mutations associated with resistance 
 
165 
in the TB-Profiler database is larger for streptomycin than other aminoglycosides (193 
for streptomycin, 88 for other aminoglycosides) [12], so the mutational target size for 
streptomycin may be larger than that of other aminoglycosides. Streptomycin is used in 
veterinary medicine to treat gram-negative infections in large animals medicine to treat 
gram negative enteric infections in poultry, cattle and swine [35], and is also used as a 
pesticide to treat fire blight (Erwinia amylorva) on apple and pear trees [36], so animals 
infected with M. bovis may be exposed to streptomycin through treatment or through 
environmental sources.  
Fluoroquinolones are also used in veterinary medicine to treat and control 
respiratory disease in cattle [35], and like with streptomycin use, fluoroquinolone use 
for purposes other than M. bovis may select for resistance. Fluoroquinolones target DNA 
gyrase and II topoisomerases, and act to block the DNA replication apparatus and 
generate double stranded breaks in bacterial DNA [37]. At sub-lethal concentrations, 
fluoroquinolone exposure has been shown to induce a faster mutation rate in 
Escherichia coli [38] Staphylococcus aureus [39], and M. fortuitum [40], and to induce 
increased expression of DNA protection and repair genes in M. tuberculosis H37Rv in 
vitro [41]. If the use of fluoroquinolones in animal hosts is responsible for resistant M. 
bovis, and if sub-lethal doses can promote resistance evolution generally by increasing 
the mutation rate, the use of fluoroquinolones in animals in regions with a high M. bovis 
prevalence should be cautioned.  
 
 
 
166 
Conclusions  
 
Genotypic resistance is rare among M. bovis sequences, but we observed resistance to 
all first-line tuberculosis drugs in sequences from both human and animal hosts and 
found resistance to aminoglycosides and fluroquinolones in production animals that 
may be related to drug exposure for non-M. bovis infections. Using forward genetic 
simulation incorporating structural effects of specific amino acid mutations, we show 
that resistance mutations can evolve through mutation and drift and can become fixed 
in the population through population bottlenecks. Studying the evolutionary dynamics 
of M. bovis in animal hosts is not only necessary to understand the risk of antimicrobial 
resistance in zoonotic M. bovis infections but can also elucidate the underlying 
evolutionary mechanisms that cause resistance alleles to emerge in MTBC populations.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167 
References  
1. Global tuberculosis report: World Health Organization2018. 
2. Gillespie SH. Evolution of drug resistance in Mycobacterium tuberculosis: 
clinical and molecular perspective. Antimicrob Agents Chemother 2002;46(2):267-274. 
3. Cassidy JP. The pathogenesis and pathology of bovine tuberculosis with 
insights from studies of tuberculosis in humans and laboratory animal models. Vet 
Microbiol 2006;112(2-4):151-161. 
4. Vázquez-Chacón CA, Rodríguez-Gaxiola FJ, López-Carrera CF, Cruz-
Rivera M, Martínez-Guarneros A et al. Identification of drug resistance mutations 
among Mycobacterium bovis lineages in the Americas. PLoS Negl Trop Dis 
2021;15(2):e0009145. 
5. Franco MMJ, Ribeiro MG, Pavan FR, Miyata M, Heinemann MB et al. 
Genotyping and rifampicin and isoniazid resistance in Mycobacterium bovis strains 
isolated from the lymph nodes of slaughtered cattle. Tuberculosis (Edinb) 2017;104:30-
37. 
6. Sweetline Anne N, Ronald BSM, Senthil Kumar TMA, Thangavelu A. 
Conventional and molecular determination of drug resistance in Mycobacterium 
tuberculosis and Mycobacterium bovis isolates in cattle. Tuberculosis (Edinb) 
2019;114:113-118. 
7. Comas I, Borrell S, Roetzer A, Rose G, Malla B et al. Whole-genome 
sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies 
compensatory mutations in RNA polymerase genes. Nat Genet 2011;44(1):106-110. 
8. Zaw MT, Emran NA, Lin Z. Mutations inside rifampicin-resistance 
determining region of rpoB gene associated with rifampicin-resistance in 
Mycobacterium tuberculosis. J Infect Public Health 2018;11(5):605-610. 
9. Portelli S, Phelan JE, Ascher DB, Clark TG, Furnham N. Understanding 
molecular consequences of putative drug resistant mutations in Mycobacterium 
tuberculosis. Sci Rep 2018;8(1):15356. 
10. Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. Vcflib and 
tools for processing the VCF variant call format. bioRxiv 2021. 
11. Zwyer M, Çavusoglu C, Ghielmetti G, Pacciarini ML, Scaltriti E et al. A 
new nomenclature for the livestock-associated Mycobacterium tuberculosis complex 
based on phylogenomics. Open Res Europe; 2021. 
12. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. 
Integrating informatics tools and portable sequencing technology for rapid detection of 
resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 
13. Tange O. Gnu parallel-the command-line power tool. 
14. Peloquin CA, Davies GR. The Treatment of Tuberculosis. Clin Pharmacol 
Ther 2021;110(6):1455-1466. 
15. Lan Z, Bastos M, Menzies D. Treatment of human disease due to 
Mycobacterium bovis: a systematic review. Eur Respir J 2016;48(5):1500-1503. 
16. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V et al. Twelve years 
of SAMtools and BCFtools. Gigascience 2021;10(2). 
17. Team R. RStudio: Integrated Development for R. Boston, MA; 2020. 
http://www.rstudio.com/. 
 
168 
18. Team RC. R: A Language and Environment for Statistical Computing. R 
Foundation for Statistical Computing, Vienna, Austria; 2021. 
19. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. 
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the 
Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 
20. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for 
phylogenetic tree display and annotation. Nucleic Acids Res 2021;49(W1):W293-
W296. 
21. Pensar J, Puranen S, Arnold B, MacAlasdair N, Kuronen J et al. Genome-
wide epistasis and co-selection study using mutual information. Nucleic Acids Res 
2019;47(18):e112. 
22. Alame Emane AK, Guo X, Takiff HE, Liu S. Drug resistance, fitness and 
compensatory mutations in Mycobacterium tuberculosis. Tuberculosis (Edinb) 
2021;129:102091. 
23. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN et al. 
WikiPathways: connecting communities. Nucleic Acids Res 2021;49(D1):D613-D621. 
24. Haller BC, Messer PW. SLiM 3: Forward Genetic Simulations Beyond the 
Wright-Fisher Model. Mol Biol Evol 2019;36(3):632-637. 
25. Castro RAD, Borrell S, Gagneux S. The within-host evolution of antimicrobial 
resistance in Mycobacterium tuberculosis. FEMS Microbiol Rev 2021;45(4). 
26. Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB et al. 
Mycobacterium tuberculosis mutation rate estimates from different lineages predict 
substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet 
2013;45(7):784-790. 
27. Romero B, Rodríguez S, Bezos J, Díaz R, Copano MF et al. Humans as 
source of Mycobacterium tuberculosis infection in cattle, Spain. Emerg Infect Dis 
2011;17(12):2393-2395. 
28. Ameni G, Tadesse K, Hailu E, Deresse Y, Medhin G et al. Transmission of 
Mycobacterium tuberculosis between farmers and cattle in central Ethiopia. PLoS One 
2013;8(10):e76891. 
29. Mittal M, Chakravarti S, Sharma V, Sanjeeth BS, Churamani CP et al. 
Evidence of presence of Mycobacterium tuberculosis in bovine tissue samples by 
multiplex PCR: possible relevance to reverse zoonosis. Transbound Emerg Dis 
2014;61(2):97-104. 
30. Adesokan HK, Akinseye VO, Streicher EM, Van Helden P, Warren RM et 
al. Reverse zoonotic tuberculosis transmission from an emerging Uganda I strain 
between pastoralists and cattle in South-Eastern Nigeria. BMC Vet Res 2019;15(1):437. 
31. Lombard JE, Patton EA, Gibbons-Burgener SN, Klos RF, Tans-Kersten JL 
et al. Human-to-Cattle. Front Vet Sci 2021;8:691192. 
32. Evans JT, Smith EG, Banerjee A, Smith RM, Dale J et al. Cluster of human 
tuberculosis caused by Mycobacterium bovis: evidence for person-to-person 
transmission in the UK. Lancet 2007;369(9569):1270-1276. 
33. Pym AS, Saint-Joanis B, Cole ST. Effect of katG mutations on the virulence 
of Mycobacterium tuberculosis and the implication for transmission in humans. Infect 
Immun 2002;70(9):4955-4960. 
 
169 
34. Reeves AZ, Campbell PJ, Sultana R, Malik S, Murray M et al. 
Aminoglycoside cross-resistance in Mycobacterium tuberculosis due to mutations in the 
5' untranslated region of whiB7. Antimicrob Agents Chemother 2013;57(4):1857-1865. 
35. US Food and Drug Administration: Approved Animal Drug Products (Green 
Book). 
36. Aćimović SG, Zeng Q, McGhee GC, Sundin GW, Wise JC. Control of fire 
blight (Erwinia amylovora) on apple trees with trunk-injected plant resistance inducers 
and antibiotics and assessment of induction of pathogenesis-related protein genes. Front 
Plant Sci 2015;6:16. 
37. Hooper DC, Jacoby GA. Topoisomerase Inhibitors: Fluoroquinolone 
Mechanisms of Action and Resistance. Cold Spring Harb Perspect Med 2016;6(9). 
38. Phillips I, Culebras E, Moreno F, Baquero F. Induction of the SOS response 
by new 4-quinolones. J Antimicrob Chemother 1987;20(5):631-638. 
39. Fung-Tomc J, Kolek B, Bonner DP. Ciprofloxacin-induced, low-level 
resistance to structurally unrelated antibiotics in Pseudomonas aeruginosa and 
methicillin-resistant Staphylococcus aureus. Antimicrob Agents Chemother 
1993;37(6):1289-1296. 
40. Gillespie SH, Basu S, Dickens AL, O'Sullivan DM, McHugh TD. Effect of 
subinhibitory concentrations of ciprofloxacin on Mycobacterium fortuitum mutation 
rates. J Antimicrob Chemother 2005;56(2):344-348. 
41. O'Sullivan DM, Hinds J, Butcher PD, Gillespie SH, McHugh TD. 
Mycobacterium tuberculosis DNA repair in response to subinhibitory concentrations of 
ciprofloxacin. J Antimicrob Chemother 2008;62(6):1199-1202. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
170 
Supplementary Materials  
 
Supplementary Figure 1. Pairwise correlation of resistance patterns among drugs tested. 
A. Correlation of resistance patterns among all samples. B. Non-human samples, C. 
Human samples  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171 
CONCLUSIONS AND FUTURE DIRECTIONS 
Conclusions 
 
The Mycobacterium avium complex and Mycobacterium tuberculosis complex (MTBC) 
have infected human and animal populations for thousands of years and remain an 
important cause of morbidity and mortality today [1]. Both Mycobacterium avium 
subsp. paratuberculosis (MAP), the cause of Johne’s disease in ruminants and 
Mycobacterium bovis, the cause of bovine tuberculosis impact cattle production in the 
United States and globally; however, the policy for disease control is distinct. The 
combined impact of M. bovis on public health and animal created an impetus for 
national eradication in the United States and many other countries around the world 
[2][14]; however, eradication efforts are complicated by wildlife reservoirs and by 
animal movement from endemic regions into M. bovis free regions. Johne’s disease 
policy is aimed toward control rather than eradication, with emphasis on mitigating the 
economic costs of reduced production. 
 Johne’s disease is difficult to control because of environmental persistence and 
long latency periods, which prevent rapid detection of infected animals based on clinical 
signs [3, 4]. Some sub-clinically infected animals will progress to a high-shedding 
disease state, at which time they may increase environmental contamination, 
transmission and persistence on the farm [5]. In chapter 2 we developed a hidden 
Markov model that classifies animals into progression categories based on fecal 
shedding. We found most MAP infected animals (81-93%) progress to a high shedding 
disease state, and about 9% of animals maintain a low shedding disease state. In practice 
our model could be used to predict which animals will become high shedders and which 
 
172 
will maintain a low shedding disease state, which could aid in management decision 
making.  
 Because M. bovis is not endemic in the United States, disease control relies on 
outbreak identification followed by testing and culling. Whole genome sequencing is 
used to identifying the likely source of an outbreak by comparing outbreak sequences 
to a global database of whole genome sequences. Genetic diversity of M. bovis outbreak 
sequences could also be useful in understanding local transmission patterns [6], 
however, M. bovis has a low substitution rate of less than one mutation per genome per 
year [7]. Therefore, multiple animals may be infected with M. bovis with identical or 
very similar whole genome sequences, limiting the ability to decipher transmission 
pathways from whole genome sequences [8]. Previous pangenome studies found that 
MTBC species have large accessory genomes, ranging from 36 to 69% of all genes in 
the pangenome, and a recent study suggested that M. bovis had an open pangenome [9-
15]. If M. bovis has a large accessory genome, gene content variation could be useful as 
an additional source of genetic diversity in outbreak analysis; however, there is no 
known mechanism for gene acquisition through horizontal gene transfer, so a large 
accessory genome is inconsistent with current knowledge of MTBC evolutionary 
biology [16]. In chapter 3 we identified methodological issues inherent in pangenome 
analysis using draft genome sequences constructed with short read data, and developed 
filtering procedures to accurately characterize the M. bovis pangenome. We found that 
PE/PPE genes which contain highly repetitive DNA sequence were often located at the 
ends of contigs and were annotated as distinct genes even though the fragmented 
sequences mapped to a single reference gene. After removing these erroneous 
 
173 
annotations, we found that M. bovis has a compact pangenome and that gene content 
variation was driven by pseudogenization and gene deletion. We also used our 
pangenome analysis pipeline to analyze the pangenome constructed with genomes used 
in the recent that suggested the M. bovis pangenome was open and found just 70 
accessory genes, compared to the nearly 4000 genes identified in the original study [15]. 
Although gene content is highly consistent among M. bovis genomes, we found that 
there is indel variation that evolves during outbreaks. Indels may be useful in 
discriminating between genomes that have identical single nucleotide polymorphisms.  
In chapter 4 we analyze the microevolution that occurs within hosts during 
infection. In contrast to similar studies of M. tuberculosis in human hosts where 
selection induced by antimicrobial treatment affects evolutionary trajectories [17, 18], 
we found that within host evolution was dominated by random genetic drift. 
Nonsynonymous mutations were common and occurred in both essential and 
nonessential genes and genes related to antimicrobial resistance, suggesting slightly 
deleterious mutations are able to persist within hosts. We also used forward genetic 
simulation and approximate Bayesian computation to estimate the effective population 
size of M. bovis infections, and the parameter value governing offspring distribution 
size, and the M. bovis mutation rate. We found that a relatively fast mutation rate of 5e-
7 mutations per position per generation combined with the diversity reducing effect of 
clonal progeny skew and population bottlenecks caused by transmission reduce 
diversity, create the slow apparent substitution rate that results in low genetic diversity 
measured in closely related outbreak sequences. In chapter 5 we studied the effect of a 
rapid mutation rate on the evolution of antimicrobial resistance genes in M. bovis 
 
174 
genomes that were not exposed to tuberculosis-specific antimicrobial treatment. We 
found that genotypic resistance is rare in M. bovis; however, we identified highly 
sensitive markers associated with resistance [19] to first line tuberculosis drugs in both 
in human and animal hosts. We also showed that resistance can evolve through mutation 
and drift and can become fixed in demonstrated a potential mechanism of resistance 
evolution in the absence of antimicrobial use. Studying the evolutionary dynamics of 
M. bovis in animal hosts is not only necessary to understand the risk of antimicrobial 
resistance in zoonotic M. bovis infections but can also elucidate the underlying 
evolutionary mechanisms that cause resistance alleles to emerge in MTBC populations. 
The mechanism we described for the evolution of resistance without antibiotic treatment 
selection may describe the underlying evolutionary processes that result in the 
generation of antimicrobial resistant-associated alleles in populations, which selection 
can act upon.  
 
Future directions 
 
Building on this knowledge developed in this thesis on M. bovis evolution from the 
microevolution that occurs within hosts, to the long-term evolutionary history that 
shaped pangenomes, we are developing a modeling framework that uses deep learning 
to predict characteristics of an outbreak from whole genome sequencing data. We are 
developing a convolutional neural network (CNN) to predict the relationship between 
herd infection population whole genome alignments and outputs of interest including 
the time since a herd was infected. The CNNs are a class of deep learning models 
commonly used in image processing and are designed to recognize patterns in data 
 
175 
without a predefined feature vector. We selected a CNN framework was selected for 
this problem because of the ability of CNNs to accurately identify evolutionary 
processes from labeled population genetic data, with equal or greater accuracy than 
other methods [20]. Additionally, CNNs take matrices as inputs, and sequence 
alignments can be easily formatted as matrices, and unlike other modeling approaches 
based on coalescent theory, CNNs can incorporate all information in genomes, 
including low frequency alleles, indels and gene content variation into the same 
prediction framework. We believe our model will outperform existing methodologies 
in describing M. bovis outbreak dynamics because we explicitly model forces that shape 
the evolution of this intracellular, highly structured pathogen population. Combining 
the evolutionary theory developed in this dissertation with new modeling approaches 
that incorporate all genetic variation present in outbreak sequences may enable the 
development of high resolution transmission networks that can better describe the 
transmission within herds, between herds, or between host species.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
176 
References  
 
1. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al.  
An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49-
59. 
2. Scott C, Cavanaugh JS, Pratt R, Silk BJ, LoBue P et al. Human Tuberculosis 
Caused by Mycobacterium bovis in the United States, 2006-2013. Clin Infect Dis 
2016;63(5):594-601. 
3. Whittington RJ, Sergeant ES. Progress towards understanding the spread, 
detection and control of Mycobacterium avium subsp paratuberculosis in animal 
populations. Aust Vet J 2001;79(4):267-278. 
4. Grewal SK, Rajeev S, Sreevatsan S, Michel FC, Jr. Persistence of 
Mycobacterium avium subsp. paratuberculosis and other zoonotic pathogens during 
simulated composting, manure packing, and liquid storage of dairy manure. Appl 
Environ Microbiol 2006;72(1):565-574. 
5. Smith RL, Schukken YH, Grohn YT. A new compartmental model of 
Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet 
Med 2015;122(3):298-305. 
6. Salvador LCM, O'Brien DJ, Cosgrove MK, Stuber TP, Schooley AM et al. 
Disease management at the wildlife-livestock interface: Using whole-genome 
sequencing to study the role of elk in Mycobacterium bovis transmission in Michigan, 
USA. Mol Ecol 2019;28(9):2192-2205. 
7. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of 
Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 
8. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et 
al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak 
of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern 
Minnesota. PLoS One 2016;11(1):e0145735. 
9. Zakham F, Sironen T, Vapalahti O, Kant R. Pan and Core Genome Analysis 
of 183 Mycobacterium tuberculosis Strains Revealed a High Inter-Species Diversity 
among the Human Adapted Strains. Antibiotics (Basel) 2021;10(5). 
10. Dar HA, Zaheer T, Ullah N, Bakhtiar SM, Zhang T et al. Pangenome 
Analysis of Mycobacterium tuberculosis Reveals Core-Drug Targets and Screening of 
Promising Lead Compounds for Drug Discovery. Antibiotics (Basel) 2020;9(11). 
11. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning 
and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic 
signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 
12. Yang T, Zhong J, Zhang J, Li C, Yu X et al. Pan-Genomic Study of 
Mycobacterium tuberculosis Reflecting the Primary/Secondary Genes, 
Generality/Individuality, and the Interconversion Through Copy Number Variations. 
Front Microbiol 2018;9:1886. 
13. Patane JS, Martins J, Beatriz Castelao A, Nishibe C, Montera L et al. 
Patterns and processes of Mycobacterium bovis evolution revealed by phylogenomic 
analyses. Genome Biol Evol 2017. 
 
177 
14. Zimpel CK, Brandao PE, de Souza Filho AF, de Souza RF, Ikuta CY et al. 
Complete Genome Sequencing of Mycobacterium bovis SP38 and Comparative 
Genomics of Mycobacterium bovis and M. tuberculosis Strains. Front Microbiol 
2017;8:2389. 
15. Reis AC, Cunha MV. The open pan-genome architecture and virulence 
landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 
16. Boritsch EC, Khanna V, Pawlik A, Honore N, Navas VH et al. Key 
experimental evidence of chromosomal DNA transfer among selected tuberculosis-
causing mycobacteria. Proc Natl Acad Sci U S A 2016;113(35):9876-9881. 
17. Trauner A, Liu Q, Via LE, Liu X, Ruan X et al. The within-host population 
dynamics of Mycobacterium tuberculosis vary with treatment efficacy. Genome Biol 
2017;18(1):71. 
18. Moreno-Molina M, Shubladze N, Khurtsilava I, Avaliani Z, Bablishvili N 
et al. Genomic analyses of Mycobacterium tuberculosis from human lung resections 
reveal a high frequency of polyclonal infections. Nat Commun 2021;12(1):2716. 
19. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. 
Integrating informatics tools and portable sequencing technology for rapid detection of 
resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 
20. Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of 
Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 
2019;36(2):220-238.