EVOLUTION AND POPULATION DYNAMICS OF MYCOBACTERIAL PATHOGENS WITH APPLICATIONS FOR DISEASE CONTROL A Dissertation Presented to the Faculty of the Graduate School of Cornell University In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Kristina Ceres August 2022 © 2022 Kristina Ceres EVOLUTION AND POPULATION DYNAMICS OF MYCOBACTERIAL PATHOGENS WITH APPLICATIONS FOR DISEASE CONTROL Kristina Ceres, Ph.D. Cornell University 2022 The field of mathematical epidemiology aims to understand the distribution of diseases in populations using mathematical models. Building models involves distilling the biological facts of a disease system and using mathematical theory to describe the system dynamics. Similarly, molecular epidemiology relies on population genetic theory to link pathogen genetic data to epidemiological inference. This thesis aims to both understand the biological principles that drive the evolution of mycobacterial pathogens, but to also determine which biological processes can be described by current theory, and which questions require new analytic approaches to fully describe. Here we describe a new hidden Markov model to characterize the progression of Mycobacterium avium subsp. paratuberculosis infection, the cause of Johne’s disease in ruminants through disease states. We find a minority of infected animals remain low- shedders throughout the duration of infection, whereas others progress to a high shedding disease state. Our modeling framework could be used to predict future progression patterns based on curing shedding state, which could aid in decision making on the farm. Next we study the pangenome of Mycobacterium bovis, the cause of bovine tuberculosis. Recent studies suggest that M. bovis has a large accessory genome, yet there is no known mechanism for horizontal gene transfer in M. bovis. We found significant errors in accessory gene classification, correcting for errors, we show that M. bovis has a much smaller accessory genome than previously described with little gene content variation generated over outbreaks. Finally, we study the within-host evolution of M. bovis and employ flexible forward genetic simulation tools to evolutionary processes that deviate from standard evolutionary models. We found that the distribution of mutations that evolve de novo within animal hosts is random, suggesting M. bovis evolution is driven by drift at short time scales. We also find that the combined effect relatively rapid mutation rates and diversity reducing infection bottlenecks and skewed offspring distributions best describe within host evolution M. bovis. Lastly, we explore the effects of a rapid mutation rate on the evolution of antimicrobial resistance genotypes in the absence of antimicrobial use. BIOGRAPHICAL SKETCH Kristina pursued a bachelor’s in animal science degree at Cornell University, with the intention becoming a small animal veterinarian after graduation. During her time as an undergraduate, Kristina had the opportunities to take international courses in field research in Kenya, and evolutionary biology in the Galapagos, and studied conservation medicine in South Africa experience. These experiences shaped her interests in wildlife conservation and sparked her curiosity for research. Kristina participated in undergraduate research in the Behling-Kelly laboratory studying novel immunohistochemistry markers for canine lymphoma. After starting veterinary school in 2015, Kristina joined the Gröhn laboratory first as a part time research assistant, and then as a summer student through the veterinary leadership program to develop skills in epidemiology. During her summer in the Gröhn laboratory, Kristina developed skills in mathematical modeling and a passion for mathematical biology and decided to apply for the Cornell combined DVM-PhD program to pursue a PhD in infectious disease epidemiology. Before she was accepted into the combined degree program, Kristina rotated through the Ivanek laboratory, where she was involved in a scoping review on Listeria monocytogenes sampling strategies in food processing facilities, and in the design of a survey used to understand perceptions of antimicrobial use among dairy farmers. In 2017 Kristina officially joined the Gröhn laboratory as a PhD student, and began working on developing mathematical modeling of Mycobacterium avium subsp. paratuberculosis infections in dairy cattle. Kristina later transitioned into molecular epidemiology work with the assistance of collaborators at the USDA National Veterinary Services Laboratories and was awarded a USDA NIFA predoctoral fellowship to study M. bovis evolution and epidemiology. In addition to Kristina’s dissertation work, she has had the pleasure to work with the Stanhope and Goodman laboratories on the epidemiology, zoonotic transmission and gene distribution of v Escherichia coli infections in dogs and human hosts, and on the temporal dating of coronaviruses in a multi-host system. Kristina also collaborated with the USDA National Veterinary Services Laboratories on studying the phylogeographic distribution and temporal dating of M. bovis introductions into the Dominican Republic. Lastly, Kristina has greatly enjoyed working with Michael Stanhope on shark genomics projects, which allowed her to develop skills in vertebrate genomics and impact wildlife conservation through studying the historical distributions of shark species including the critically endangered great hammerhead. In the fall of 2022 Kristina will re-join the veterinary program to complete her doctorate in veterinary medicine degree. During her remaining time in school, Kristina plans to continue work in M. bovis model development and plans to further develop skills in genomics, evolutionary biology and conservation medicine. After graduation she intends to pursue a research career at the intersection of infectious disease epidemiology, and host-pathogen evolution with an emphasis on wildlife conservation and One Health. vi This work is dedicated to my family, whose never-ending support has allowed me to develop my perfect career, and for the animals, particularly Oreo, Timbi and Olive who have set me on this path. vii ACKNOWLEDGMENTS First, I would like to thank my advisor Yrjö Gröhn, who first took a chance on me when I had no experience in mathematical biology, statistics or epidemiology. His support and gentile mentoring style over the past six years has allowed me to explore multiple disciplines and gain expertise in a variety of fields to develop my own unique research niche. Although my research plan was not always clear, Yrjö consistently worked hard to ensure I could develop a project that I was passionate about, and I am very thankful for his consistent guidance and for always being available for advice. I would also like to thank my committee members for their support during my PhD. I would like to thank Michael Stanhope for not only helping me learn the ropes in comparative genomics and, but for also treating me as a collaborator, which helped me develop confidence in bioinformatics and as a scientist. Our many conversations about pathogen evolution, electroreception and interesting shark genes helped me realize the broad applications for the skills I was developing in molecular evolution, and has enriched my PhD experience. I would also like to thank Brian VanderVen for pushing me to dig into the basic biology of M. bovis, and for using his expertise in tuberculosis biology to provide helpful feedback. Lastly, I would like to thank Marty Wells for his guidance in statistics, and for his course Theory of Statistics, which allowed me to develop an intuitive understanding of statistics. I would also like to thank Dr. Suelee Robbe-Austerman and her colleagues at the USDA National Veterinary Services Laboratories for brainstorming ideas for M. bovis research, helping me transition into bioinformatics work, and providing exceptional support in helping secure a large travel grant, and in supporting my fellowship application. viii Next I would like to thank my non-research mentors in the veterinary school and in the BBS graduate program, whose career advice has helped me develop my dream career, and my wonderful peers in the veterinary and graduate programs, for their support. I would particularly like to thank my friends in the combined degree program, especially Frances Chen and Erica Lachenauer who immediately made me feel welcome in the combined degree program, and Eileen Troconis and Kieran Koch- Laskowski for both their friendship and their leadership in our program. I would also like to thank Alison Stout, Casey Cazer and Scarlett Lee for their friendship, career advice and shared research interests that provided me with a sense of community in the BBS program. Lastly, I would like to thank my friends outside of Cornell: Maggie Meurer, Leah Oliver, Pam Skelton, and especially Biagio DiSalvo for always being there for me – thank you for being a (best) friend. Lastly, I would like to thank my family for their exceptionally valuable support outside of school, which enabled my success in school. I would like to particularly thank my parents and my sister and brother-in-law (and my newest family member, my niece Harper), for giving me the opportunity to pursue my dreams and for their love and support. Finally, I would like to thank Shay and our pets Timbi and Olive for their everyday support and love, which keeps me pushing forward. I would like to thank Shay for not only the providing the typical support system expected of a great partner, but also for teaching me the basics of writing code, which gave me a foundation in computer science that helped me in every aspect of my PhD work. Without him, I would not have had the confidence to pursue computational biology research and for his unique support, I will always be grateful. ix TABLE OF CONTENTS CHAPTER 1 ........................................................................................................................................... 16 PATHOGENIC MYCOBACTERIA .............................................................................................................. 16 IMPACT OF MYCOBACTERIAL DISEASES OF CATTLE .............................................................................. 16 GENOMIC EPIDEMIOLOGY OF M. BOVIS ................................................................................................ 19 WITHIN HOST EVOLUTION AND THE EMERGENCE OF RESISTANCE ........................................................ 21 DISSERTATION OBJECTIVES ................................................................................................................. 22 REFERENCES ........................................................................................................................................ 24 CHAPTER 2 ........................................................................................................................................... 27 ABSTRACT ............................................................................................................................................ 28 INTRODUCTION .................................................................................................................................... 29 METHODS ............................................................................................................................................. 31 RESULTS AND DISCUSSION ................................................................................................................... 39 FUTURE DIRECTIONS AND CONCLUSIONS ............................................................................................. 47 ACKNOWLEDGEMENTS ......................................................................................................................... 50 REFERENCES ........................................................................................................................................ 51 SUPPLEMENTARY MATERIALS ............................................................................................................. 53 CHAPTER 3 ........................................................................................................................................... 59 ABSTRACT ............................................................................................................................................ 60 IMPACT STATEMENT ............................................................................................................................. 61 DATA SUMMARY .................................................................................................................................. 61 INTRODUCTION .................................................................................................................................... 62 METHODS ............................................................................................................................................. 65 RESULTS .............................................................................................................................................. 70 DISCUSSION ......................................................................................................................................... 87 CONCLUSIONS ...................................................................................................................................... 92 ACKNOWLEDGEMENTS ......................................................................................................................... 92 REFERENCES ........................................................................................................................................ 94 SUPPLEMENTARY MATERIALS ............................................................................................................. 98 CHAPTER 4 ......................................................................................................................................... 107 ABSTRACT .......................................................................................................................................... 108 INTRODUCTION .................................................................................................................................. 109 METHODS ........................................................................................................................................... 111 RESULTS ............................................................................................................................................ 119 SIMULATION RESULTS ........................................................................................................................ 124 DISCUSSION ....................................................................................................................................... 127 CONCLUSIONS .................................................................................................................................... 133 REFERENCES ...................................................................................................................................... 134 SUPPLEMENTARY MATERIALS ............................................................................................................ 137 CHAPTER 5 ......................................................................................................................................... 141 ABSTRACT .......................................................................................................................................... 142 INTRODUCTION .................................................................................................................................. 143 METHODS ........................................................................................................................................... 146 RESULTS ............................................................................................................................................ 152 DISCUSSION ....................................................................................................................................... 161 CONCLUSIONS .................................................................................................................................... 166 REFERENCES ...................................................................................................................................... 167 SUPPLEMENTARY MATERIALS ........................................................................................................... 170 x CONCLUSIONS AND FUTURE DIRECTIONS ............................................................................. 171 CONCLUSIONS .................................................................................................................................... 171 FUTURE DIRECTIONS .......................................................................................................................... 174 REFERENCES ...................................................................................................................................... 176 xi LIST OF FIGURES Figure 1. MAP fecal shedding patterns from five seven cows. .................................... 33 Figure 2. Diagram of hidden Markov model 1. ............................................................ 36 Figure 3. Transient distributions and emission distributions in two and three hidden state models. ................................................................................................................. 42 Figure 4. Progression patterns through disease states. ................................................. 45 Figure 5 M. bovis core phylogeny and sample geography. .......................................... 71 Figure 6. Population structure assessed in SNP variation and gene presence absence patterns. ........................................................................................................................ 73 Figure 7. Accessory gene labels are created through redundant gene annotation. ....... 77 Figure 8. Filtered gene presence absence patterns in the M. bovis pangenome. .......... 80 Figure 9. Gene content variation in a Minnesota outbreak. ......................................... 83 Figure 10. Methodological differences result in drastically different pangenome sizes. ...................................................................................................................................... 86 Figure 11. Within host population simulation schematic. .......................................... 115 Figure 12. Phylogeny of study samples colored by geographic region and three levels of RhierBAPS phylogenetic clustering. ..................................................................... 120 Figure 13. Distinguishing polyclonal from single source infections using allele frequencies. ................................................................................................................. 122 Figure 14. Model comparison using log10 Bayes factors. ........................................... 125 Figure 15. Model parameter estimates calculated with Approximate Bayesian Computation. .............................................................................................................. 126 Figure 16. Simulation structure and graphical hypothesis. ........................................ 149 Figure 17. Phylogeny of TB-Profiler predicted resistant samples and closely related samples. ...................................................................................................................... 153 xii Figure 18. TB-Profiler predicted resistance genotypes for M. bovis samples from human, cattle, and deer hosts. ..................................................................................... 155 Figure 19. Correlation between resistance-causing mutations and potential compensatory mutations. ............................................................................................ 156 Figure 20. Distribution of mutations associated with rifampicin resistant, and other correlated mutations in RNA polymerase subunit genes rpoB and rpoC. Clades without correlated rpoB/C mutations were collapsed for clarity. ............................... 157 Figure 21. Variable and fixed mutations in RNA polymerase subunit genes plotted along a radial depiction of the M. bovis phylogeny shown in Figure 17. .................. 158 Figure 22. Allele frequencies of variable SNPs associated with resistance and linked rpoC compensatory mutations. ................................................................................... 159 Figure 23. Prevalence of resistance-causing mutations katG that emerged during forward genetic simulations. ...................................................................................... 161 xiii LIST OF TABLES Table 1 Descriptions and initial values of parameters in three hidden state models. ... 38 Table 2. Model comparison using AIC. ....................................................................... 39 Table 3. Transition and emission probability estimates. .............................................. 40 Table 4. Components of each model structure tested ................................................. 115 Table 5. Prior distributions for simulation parameters ............................................... 118 Table 6. Fixed model parameter values. ..................................................................... 118 Table 7. Parameter values used in model assessment. ............................................... 118 Table 8. Within host mutations in genes associated with antimicrobial resistance .... 123 Table 9. Distribution of de novo SNPs among gene categories ................................. 124 Table 10. Simulation model parameters. .................................................................... 151 xiv LIST OF ABBREVIATIONS MAC Mycobacterium avium complex MTBC Mycobacterium tuberculosis complex MAP Mycobacterium avium subsp paratuberculosis SFS Site frequency spectrum AMR Antimicrobial resistance ROS Reactive oxygen species RDQMA Regional Dairy Quality Alliance Management CFU Colony forming units HMM Hidden Markov Model SNP Single nucleotide polymorphism fSNP Fixed single nucleotide polymorphism vSNP Variable single nucleotide polymorphism indel Insertion/deletion MCMC Markov Chain Monte Carlo ABC Approximate Bayesian Computation PAS para-aminosalicylic acid CNN Convolutional Neural Network xv CHAPTER 1 Introduction Pathogenic mycobacteria The genus Mycobacteria contains medically important human and animal pathogens including the Mycobacterium avium complex (MAC) and the Mycobacterium tuberculosis complex (MTBC) and environmental, opportunistic pathogens including Mycobacterium canetti, and environmental non-pathogenetic species. The MAC and MTBC fall within two distinct clades within the slow growing mycobacteria, that evolved from a rapidly growing ancestor [1]. The MTBC, including M. tuberculosis, the cause of tuberculosis in humans, and M. bovis, the cause of bovine tuberculosis evolved from a free-living ancestor, which, through a combination of gene gain and gene loss adapted to an obligate intracellular lifestyle characterized by clonal growth and little to no recombination [2]. Similarly, Mycobacterium avium subsp. paratuberculosis (MAP), the cause of Johne’s disease in ruminants evolves clonally; however horizontal gene transfer is common in other M. avium species [3]. Impact of mycobacterial diseases of cattle Both MAP and M. bovis cause granulomatous disease in ruminants with long latency periods, which prevents rapid infection detection by clinical signs. MAP causes Johne’s disease, which is characterized by a severe granulomatous enteritis. MAP infection causes a progressive loss of milk production that precedes clinical signs of disease driven by intestinal malabsorption including diarrhea, wasting and eventually death [4]. 16 17 Johne’s disease is endemic in the United States and is responsible for severe economic losses to the dairy industry through decreased milk production [5, 6]. Once a farm is infected with MAP, it is difficult to control because of its environmental persistence and because of the long latency period of 2-5 years before clinical signs of disease appear [7-9]. Because of its endemic status in the United States, much of the current disease control efforts are toward disease control rather than eradication. Additionally, there is variability in fecal shedding patterns among infected animals, and high shedding animals increase environmental contamination and MAP persistence on dairy farms [10]. Mathematical modeling has been used extensively to aid in disease management on farms [11-13] and has also elucidated the impact of high-shedding animals on the MAP infection on dairy farms [12]. In contrast, M. bovis, the cause of bovine tuberculosis, has been the object of a National Eradication Program since its inception in 1917. Although eradication efforts successfully reduced the prevalence of M. bovis infection from 5% to less than 0.001% [14], bovine tuberculosis remains an important agricultural disease from both an economic and a public health perspective. M. bovis has a wide host range including cattle, white-tailed deer (Odocoileus virginianus), and humans [15], and causes decreased milk and meat production in cattle. Humans can become infected by contact with infected animals or, more commonly through consumption of unpasteurized milk [16]. Risk of zoonotic M. bovis infection has been greatly reduced in developed countries by a combination of effective test and slaughter control programs and milk pasteurization; however, M. bovis is still responsible for 1-2% of human tuberculosis cases in the US 18 [17]. Despite over 100 years of directed national control, outbreaks still occur in cattle every year. There are two explanations for the difficulty in eliminating bovine tuberculosis. First, M. bovis infection in cattle remains subclinical, and potentially undetected for months to years until lesions become large enough to impair organ function [18]. Second, there are multiple sources of M. bovis infection in cattle including spillover from wildlife, animal movement, zoonotic transmission, and fomite transmission. The only known wildlife reservoir of bovine tuberculosis in the US is a population of white- tailed deer in Michigan, and cross species transmission remains an important source of cattle infection in this region [19]. Cattle trade between Mexico and the US is common and bovine tuberculosis prevalence is 14.2% in some regions in Mexico [20]. Movement of animals from Mexico to the US could result in transmission; however, animal movement and wildlife do not account for all outbreaks in the US. Remaining cases may be linked to zoonotic transmission [21, 22]. Due to the potential lag time between infection and detection, as well as the multiple possible sources of infection, identifying the source of infection in a herd is difficult. Although thorough sampling is routine in traceback investigations, animal movement records are often incomplete, and other potential sources of infection, including zoonotic transmission, are difficult to characterize. To maximize the information available from outbreak investigations, whole genome sequencing of M. bovis isolates from infected herds was incorporated into the bovine tuberculosis eradication program by the National Veterinary Services Laboratories in 2013 [21]. 19 Genomic epidemiology of M. bovis Pathogen whole genome sequencing can provide useful information on outbreak dynamics that is difficult or impossible to discern using non-molecular epidemiologic methods including determining single vs multiple introductions of pathogen into a population [23], understanding the role of different host species in multi-host transmission events [19], characterizing the role of within-host evolution on transmission dynamics [24] and estimating estimate the time of transmission events [25]. Molecular epidemiology, at its core is based on the foundational mathematical theory developed by Wright [26], Fisher [27], Kimura [28] and others, that describes how genomes evolve through mutation, drift and selection. Models including the Wright-Fisher model and the Kingman’s coalescent theory [29] enable linking evolutionary theory to real polymorphism data, enabling inference of epidemic history through the evolution of pathogens. The power of molecular data to understand how pathogens evolve and move through host populations has exploded with the increased accessibility of pathogen whole genome sequencing, however, model assumptions must be evaluated to ensure proper application to host-pathogen systems. In particular, methods that rely on the Kingman coalescent theory based on the Wright-Fisher population model for inference on epidemiologically relevant evolutionary processes may be inappropriate for some host-pathogen systems. Specifically, strong linkage disequilibrium in Mycobacterium tuberculosis complex (MTBC) species including M. bovis due to low recombination rates violates the independent sites assumption of the Wright Fisher model. Additionally, population bottlenecks caused by transmission and 20 potential differences in reproductive success of individuals violates the small variance offspring distribution assumption of the Kingman’s coalescent [30]. Violations of these basic assumptions in the evolution of MTBC can be visualized in the site frequency spectrum (SFS), where an excess of rare variants is seen compared to the neutral SFS calculated under the infinite mutations model and the Kingman’s coalescent [30, 31]. Both clonal progeny skew caused by wide variance offspring distribution and exponential population growth, consistent with demography in an outbreak setting, have been demonstrated to cause a shift in the SFS towards an excess of rare variants [32]. Additionally, the effects of progeny skew and exponential growth can be distinguished by differing right tail distributions in the SFS. Lastly, negative selection can also cause a left shift in the SFS because deleterious alleles are prevented from reaching high frequencies in the population. Recent studies have found multiple-merger coalescent models [33] and explicit modeling of skewed offspring distributions [34] describe M. tuberculosis evolution better than standard Kingman’s coalescent based models. Additionally, since both M. bovis and MAP evolve clonally with no horizontal gene transfer, the pangenome of both species should reflect this trend toward genome reduction with compact accessory genomes. Recently, the MAP pangenome was found to have a compact pangenome with no detected recombination [35] however MTBC pangenome estimates suggest both M. tuberculosis and M. bovis have rather large pangenomes, inconsistent with an inability to acquire new genes through horizontal gene transfer [36, 37]. Applying models that do not adequately account for the evolutionary biology of a pathogen system may lead to erroneous conclusions. For example, using a simulation structure that incorporated high variance 21 offspring distributions, Morales-Arce et al. found that the observed distribution of genetic variants within hosts was best described by a mutation rate orders of magnitude greater than that described using coalescent methods [34]. Within host evolution and the emergence of resistance A consequence of a rapid mutation rate may be that antimicrobial resistant genotypes may evolve by drift more commonly than expected assuming a slow mutation rate. Antimicrobial resistance (AMR) is a global health crisis in the control of tuberculosis in humans, and since there is no known mechanism for horizontal gene transfer, all new resistance evolution occurs through mutation. Although it is well understood that mutations arising through drift or a combination of drift and selection cause AMR, the cellular mechanisms of mutation generation in the context of host and pathogen interaction are less understood. Bacterial mutations may arise due to random errors in replication but may also result from host driven interactions. For example, upon phagocytosis, the macrophage produces reactive oxygen species (ROS) which can cause double stranded DNA breaks and can kill engulfed pathogens. If a bacterial cell survives the macrophage ROS defense, it may need to repair in double stranded breaks. MTBC species can repair double stranded breaks through homologous recombination or non- homologous ends joining, however these mechanisms, particularly non-homologous ends joining, are error prone and can create mutations. Additionally, if ROS driven DNA damage occurs in DNA repair genes, a strain can become hypermutable [38]. Therefore, ROS driven DNA damage could be an additional source of potential AMR causing mutations. Variability in ROS production by the host or ROS tolerance by bacteria could 22 also influence intracellular survival and the evolution of AMR, and mutations have been identified in M. tuberculosis strains related to ROS tolerance [39]. Similarly, variability in T cell response may also influence bacterial mutations and within host diversity. In a mouse model of tuberculosis, comparing strain diversity in T cell knock out mice to wild type mice, Copin et al. found that mice without T cells had clonal infections, whereas mice with T cells contained strains with multiple different mutations; therefore, T cells may play a role in generating within host diversity [40]. These random or host- driven diversity generating mechanisms coupled with population dynamics influenced by small population sizes during transmission events may lead to the evolution of AMR genotypes without targeted positive selection by antimicrobial treatment. A low level of AMR may evolve in M. bovis in animal hosts in the absence of antimicrobial use, and this basal AMR prevalence may reflect the underlying distribution of AMR genotypes in M. tuberculosis that selection acts upon under exposure to antimicrobials. Dissertation Objectives In this dissertation we aim to both understand the evolutionary mechanisms that drive genetic variation in mycobacterial pathogens and to develop tools to aid in disease control. In chapter 2 we discuss a time-series analysis of MAP fecal shedding data to predict future progression patterns by classing individual animals into MAP shedding groups based on current and past colony forming unit counts. In chapter 3 we dive into the methodological challenges associated with using short read data in constructing accurate pangenome estimates, and find that M. bovis has a compact genome, consistent with clonal evolution. In chapter 4 we investigate the distinct roles of selection, 23 demography, and progeny skew in M. bovis evolution over the course of an outbreak in a cattle herd and find that the combined diversity reducing effects of skewed offspring distributions and population bottlenecks and the diversity generating effect of a relatively rapid mutation rate describe the within-host evolution of M. bovis in animal hosts. Lastly, in chapter 5 we explore the effects of a rapid mutation rate and population bottlenecks on the evolution of antimicrobial resistance in the absence of antimicrobial treatment. 24 References 1. Bachmann NL, Salamzade R, Manson AL, Whittington R, Sintchenko V et al. Key Transitions in the Evolution of Rapid and Slow Growing. Front Microbiol 2019;10:3019. 2. Veyrier FJ, Dufort A, Behr MA. The rise and fall of the Mycobacterium tuberculosis genome. Trends Microbiol 2011;19(4):156-161. 3. Reva O, Korotetskiy I, Ilin A. Role of the horizontal gene exchange in evolution of pathogenic Mycobacteria. BMC Evol Biol 2015;15 Suppl 1:S2. 4. Rathnaiah G, Zinniel DK, Bannantine JP, Stabel JR, Gröhn YT et al. Pathogenesis, Molecular Genetics, and Genomics of. Front Vet Sci 2017;4:187. 5. USDA-APHIS. NAHMS. Johne’s Disease on U.S. Dairies, 1991-2007. 2008. 6. Ott SL, Wells SJ, Wagner BA. Herd-level economic losses associated with Johne's disease on US dairy operations. Prev Vet Med 1999;40(3-4):179-192. 7. Whittington RJ, Sergeant ES. Progress towards understanding the spread, detection and control of Mycobacterium avium subsp paratuberculosis in animal populations. Aust Vet J 2001;79(4):267-278. 8. Cocito C, Gilot P, Coene M, de Kesel M, Poupart P et al. Paratuberculosis. Clin Microbiol Rev 1994;7(3):328-345. 9. Whitlock RH, Buergelt C. Preclinical and clinical manifestations of paratuberculosis (including pathology). Vet Clin North Am Food Anim Pract 1996;12(2):345-356. 10. Slater N, Mitchell RM, Whitlock RH, Fyock T, Pradhan AK et al. Impact of the shedding level on transmission of persistent infections in Mycobacterium avium subspecies paratuberculosis (MAP). Vet Res 2016;47:38. 11. Mitchell RM, Whitlock RH, Grohn YT, Schukken YH. Back to the real world: connecting models with data. Prev Vet Med 2015;118(2-3):215-225. 12. Smith RL, Schukken YH, Grohn YT. A new compartmental model of Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet Med 2015;122(3):298-305. 13. Al-Mamun MA, Smith RL, Schukken YH, Grohn YT. Use of an Individual- based Model to Control Transmission Pathways of Mycobacterium avium Subsp. paratuberculosis Infection in Cattle Herds. Sci Rep 2017;7(1):11845. 14. Humphrey HM, Orloski KA, Olea-Popelka FJ. Bovine tuberculosis slaughter surveillance in the United States 2001-2010: assessment of its traceback investigation function. BMC Vet Res 2014;10:182. 15. Fitzgerald SD, Kaneene JB. Wildlife reservoirs of bovine tuberculosis worldwide: hosts, pathology, surveillance, and control. Vet Pathol 2013;50(3):488-499. 16. Pérez-Lago L, Comas I, Navarro Y, González-Candelas F, Herranz M et al. Whole genome sequencing analysis of intrapatient microevolution in Mycobacterium tuberculosis: potential impact on the inference of tuberculosis transmission. J Infect Dis 2014;209(1):98-108. 17. Scott C, Cavanaugh JS, Pratt R, Silk BJ, LoBue P et al. Human Tuberculosis Caused by Mycobacterium bovis in the United States, 2006-2013. Clin Infect Dis 2016;63(5):594-601. 25 18. Domingo M, Vidal E, Marco A. Pathology of bovine tuberculosis. Res Vet Sci 2014;97 Suppl:S20-29. 19. Salvador LCM, O'Brien DJ, Cosgrove MK, Stuber TP, Schooley AM et al. Disease management at the wildlife-livestock interface: Using whole-genome sequencing to study the role of elk in Mycobacterium bovis transmission in Michigan, USA. Mol Ecol 2019;28(9):2192-2205. 20. Álvarez JBD, Clifford JR. United States – Mexico Joint Strategic Plan for Collaboration on Bovine Tuberculosis In: APHIS, editor.: USDA; 2013. 21. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United States, 1989-2018. Front Vet Sci 2018;5:253. 22. Lombard JE, Patton EA, Gibbons-Burgener SN, Klos RF, Tans-Kersten JL et al. Human-to-Cattle. Front Vet Sci 2021;8:691192. 23. Pandey P, Bhatnagar AK, Mohan A, Sachdeva KS, Samantaray JC et al. Mycobacterium tuberculosis polyclonal infections through treatment and recurrence. PLoS One 2020;15(8):e0237345. 24. Theys K, Libin P, Pineda-Peña AC, Nowé A, Vandamme AM et al. The impact of HIV-1 within-host evolution on transmission dynamics. Curr Opin Virol 2018;28:92-101. 25. Salje H, Wesolowski A, Brown TS, Kiang MV, Berry IM et al. Reconstructing unseen transmission events to infer dengue dynamics from viral sequences. Nat Commun 2021;12(1):1810. 26. Wright S. Evolution in Mendelian Populations. Genetics 1931;16(2):97-159. 27. Fisher RA. Darwinian evolution of mutations. Eugen Rev 1922;14(1):31-34. 28. Kimura M. The neutral theory of molecular evolution. Sci Am 1979;241(5):98- 100, 102, 108 passim. 29. Kingman JFC. The coalescent. Stochastic Processes and their Applications 1982;13(3):235-248. 30. Eldon B, Wakeley J. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 2006;172(4):2621-2633. 31. Li LM, Grassly NC, Fraser C. Genomic analysis of emerging pathogens: methods, application and future trends. Genome Biol 2014;15(11):541. 32. Irwin KK, Laurent S, Matuszewski S, Vuilleumier S, Ormond L et al. On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity (Edinb) 2016;117(6):393-399. 33. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 34. Morales-Arce AY, Harris RB, Stone AC, Jensen JD. Evaluating the contributions of purifying selection and progeny-skew in dictating within-host Mycobacterium tuberculosis evolution. Evolution 2020;74(5):992-1001. 35. Richards VP, Nigsch A, Pavinski Bitar P, Sun Q, Stuber T et al. Evolutionary genomic and bacteria GWAS analysis of Mycobacterium avium subsp. paratuberculosis and dairy cattle Johne's disease phenotypes. Appl Environ Microbiol 2021. 36. Reis AC, Cunha MV. The open pan-genome architecture and virulence landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 26 37. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 38. Gorna AE, Bowater RP, Dziadek J. DNA repair systems and the pathogenesis of Mycobacterium tuberculosis: varying activities at different stages of infection. Clin Sci (Lond) 2010;119(5):187-202. 39. Mestre O, Hurtado-Ortiz R, Dos Vultos T, Namouchi A, Cimino M et al. High throughput phenotypic selection of Mycobacterium tuberculosis mutants with impaired resistance to reactive oxygen species identifies genes important for intracellular growth. PLoS One 2013;8(1):e53486. 40. Copin R, Wang X, Louie E, Escuyer V, Coscolla M et al. Within Host Evolution Selects for a Dominant Genotype of Mycobacterium tuberculosis while T Cells Increase Pathogen Genetic Diversity. PLoS Pathog 2016;12(12):e1006111. 27 CHAPTER 2 Characterizing infectious disease progression through discrete states using hidden Markov models Kristina M. Ceres1, Ynte H. Schukken2, Yrjö T. Gröhn1 1 Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 2 Department of Animal Sciences, Wageningen University, Wageningen, the Netherlands Chapter 2 is published in PLOS ONE 28 Abstract Infectious disease management relies on accurate characterization of disease progression so that transmission can be prevented. Slowly progressing infectious diseases can be difficult to characterize because of a latency period between the time an individual is infected and when they show clinical signs of disease. The introduction of Mycobacterium avium ssp. paratuberculosis (MAP), the cause of Johne’s disease, onto a dairy farm could be undetected by farmers for years before any animal shows clinical signs of disease. In this time period infected animals may shed thousands of colony forming units. Parameterizing trajectories through disease states from infection to clinical disease can help farmers to develop control programs based on targeting individual disease state, potentially reducing both transmission and production losses due to disease. We suspect that there are two distinct progression pathways; one where animals progress to a high-shedding disease state, and another where animals maintain a low-level of shedding without clinical disease. We fit continuous-time hidden Markov models to multi-year longitudinal fecal sampling data from three US dairy farms, and estimated model parameters using a modified Baum-Welch expectation maximization algorithm. Using posterior decoding, we observed two distinct shedding patterns: cows that had observations associated with a high-shedding disease state, and cows that did not. This model framework can be employed prospectively to determine which cows are likely to progress to clinical disease and may be applied to characterize disease progression of other slowly progressing infectious diseases. 29 Introduction Slowly progressing infectious diseases, like tuberculosis in humans and animals, HIV/AIDS in humans, and Johne’s disease in cattle, are difficult to characterize because they are often associated with a latency period between the time an individual is infected and when they show clinical signs or symptoms of disease. Treatment efficacy and transmission dynamics are dependent on the state of infection, and understanding an individual’s infection state, and probability of transitioning among infectious states at the population level, can help provide better treatment protocols and targeted epidemiologic interventions. For example, only patients with active tuberculosis infections are likely to transmit Mycobacterium tuberculosis to others; however, targeting latently infected individuals that have a high probability to progress to active disease with preventative therapeutics could prevent transmission from occurring [1]. Johne's disease is caused by Mycobacterium avium subsp. paratuberculosis (MAP) and is characterized by a chronic granulomatous enteritis leading to diarrhea, wasting and eventually death. Johne’s disease is considered an important production disease, especially in the dairy industry, and is estimated to cost 200 million dollars per year in production losses and early culling in the United States [2]. MAP has also been implicated as a possible cause or contributing factor to Crohn’s disease in humans; therefore, zoonotic MAP transmission may be relevant to public health [3]. Johne's Disease is pervasive in dairy farms in the United States. In 2007, 68 percent of dairy operations tested positive for MAP in at least one environmental sample [4]. MAP is mainly transmitted among cattle through the fecal-oral route and animals can become infected by eating contaminated material from their environment [5]. Calves are more 30 susceptible than other age groups, and often become infected either in-utero or soon after birth; however, they may not show clinical signs of disease for 2-5 years [5-7]. MAP can also survive for up to a year in certain environmental conditions, which could allow new infections even after an infected animal is removed from the herd [5, 8]. Due to, among other reasons, persistence in the environment and a long latency period before infected animals show clinical signs, Johne's disease is very difficult to control. In order to effectively control Johne's disease, it is essential to understand how the infection progresses. It is currently hypothesized that animals in certain disease states transiently shed MAP in their feces, and it is known that some animals shed much more MAP than others [9]. Schukken et al. observed two distinct patterns of disease progression among infected animals [10]; animals that progressed to a high shedding disease state, and those that controlled infection. Furthermore, Mitchell et al. observed that most animals that shed MAP intermittently never progress to a high shedding state, but those that do progress maintain a consistent high level of shedding [11]. Johne's disease infection dynamics have been studied extensively using compartmental and agent-based models [12-14]; however, these models contain a large number of parameters, and it is sometimes necessary to rely on assumptions instead of data in order to fully parameterize the model. In order to study Johne's and other disease dynamics it is usually necessary for the investigator to divide disease progression into a series of discrete states and determine reasonable rates of movement between the states to parameterize the model. Some model parameters can be estimated using field data, but parameters regarding transition among disease states are less obviously estimable, especially in the case of Johne's disease, where animals can maintain an infectious state 31 for years without any visible signs of disease. With slowly progressing diseases like Johne's, it may be difficult or in many animals impossible to observe direct signs of subclinical disease states, so determining parameters for rates of transition among disease states is non-trivial. Hidden Markov models (HMMs) offer one solution to the problem of estimating these parameters by using unsupervised machine learning to estimate model parameters including disease state transition rates. A validated HMM can also be used as a diagnostic to predict disease progression outcomes in infected cows. We constructed sequences of MAP fecal culture results from repeated samples of individual animals sampled biannually up to 6 years from three Northeast US dairy herds, and used these sequences to infer the underlying hidden Johne's disease states in a HMM framework. Our objectives were to estimate disease state and transition parameters among infected cows, and to determine if two or more types of disease progression existed. We hypothesized that the most probable path through hidden states identified by posterior decoding would show two distinct most probable state progression paths that represent separate MAP infection progression types. Methods Data description Fecal samples were taken from all adult animals on three northeastern farms between 2004 and 2009 biannually, totaling 6530 samples from 1714 cows as part of the Regional Dairy Quality Alliance Management (RDQMA) program [15]. The overall period prevalence was 5.9%, and fecal shedding was variable among infected cows 32 ranging from 0 to 6360 colony forming units (CFU). Cows were sampled between 2 and 11 times with only one cow sampled 11 times over 5.5 years. Fecal samples from each cow were placed in four 15 ml solid medium slant tubes, and were processed and cultured as described previously [15, 16]. The sum of CFU cultured from the four tubes was multiplied by 5.3 to account for dilutions, and the log base 10 of the resulting quantity was used as the observed CFU count for model fitting. Positive samples were defined as greater than 0 CFU in a fecal sample. Samples were taken on average 15 weeks apart with a median of 14 weeks apart; however, sampling intervals ranged from 1 to 40 weeks. To avoid misclassification of healthy cows as infected, we only included cows that had at least two positive fecal culture samples out of three consecutive samples. Samples that were not readable due to contamination with other bacterial growth or fungal growth were removed from the analysis. After removing samples that did not meet inclusion criteria, we were left with 129 observations. Fecal shedding patterns of seven representative cows are shown in Figure 1. 33 Figure 1. MAP fecal shedding patterns from five seven cows. Shedding patterns were variable among individuals. The number of samples taken from each cow was also variable, ranging from 2 to 11 samples. The variation in both CFU count and number of samples may be related to disease progression, as animals with high CFU counts may be culled early due to clinical MAP infection related production losses. 34 Model description A HMM is a probabilistic model which describes the transition through unobserved states in a Markovian system. Although the hidden state of the system cannot be observed directly, information about the state can be inferred indirectly through model emissions. This framework is useful for modeling MAP infection progression because the underlying disease state is not readily observable, but MAP shedding in feces can be easily measured and can be used in the HMM framework as model emissions. We developed four candidate continuous time HMMs to model MAP infection progression in dairy cows. Since samples were taken at irregular intervals, we chose to model disease progression using a continuous time HMM framework as opposed to a discrete time framework. In a continuous time HMM, both the hidden state at the time a sample was taken and the number and types of transitions between hidden states between two consecutive observations are unknown [17]. Continuous time HMMs are defined by a vector of initial hidden state probabilities, 𝝅𝟎 that define the probability of being in each state at time 0, a state transition rate matrix Q that defines the rate of transition among hidden states, and emission probability distributions, 𝑬. The emission probability distributions, which are defined as the probability of observing a log10 CFU given an underlying hidden disease state k, were modeled using gamma distributions, 𝐸"~𝐺𝑎𝑚𝑚𝑎(𝛼" , 𝜃"), where 𝛼" and 𝜃"are gamma distribution shape and scale parameters, respectively. The probability of transitioning from state j to state k depends on the interval of time t and is defined as 𝑃#"(𝑡) = 𝑒$%. The number of true disease states is unknown, but models with very large state spaces may not provide a useful representation of MAP infection from veterinary 35 standpoint since as the number of states increases the differences in disease phenotype between states may diminish, and states without a remarkable associated change in disease phenotype do not have obvious clinical relevance. Additionally, the potential model state space that can be explored is limited by our relatively low sample size of cows with MAP infection. Therefore, we chose to create a candidate model set including models with two to five hidden states to capture clinically relevant HMM structures. A graphical description of an example three hidden state model is shown in Figure 2. In Figure 2, the state labels “0”, “1” and “2” represent disease states associated with low, moderate, and high CFU counts, respectively. 36 Figure 2. Diagram of hidden Markov model 1. This diagram represents a hidden Markov model with three hidden states labeled 0, 1, and 2. The model is shown with initial state probabilities, π&, time dependent transition probabilities, P and emissions, E, which have gamma distributions. 37 Parameter learning Maximum likelihood estimates for model parameters, 𝜋&, 𝑄, 𝛼, 𝜃, were computed using a modified Baum-Welch expectation maximization (EM) algorithm developed for continuous time HMMs by Liu et al. [17]. We used the “Expm” method, which relies on the matrix exponential of an auxiliary matrix A to calculate EM estimates of model parameters. The auxiliary matrix, 𝐴 = 7𝑄 𝐵0 𝑄:, was calculated for each state i, and edge i,j. B is a square matrix with the same dimensions of Q and has 1 in the (i,j) position and 0 elsewhere. This method was determined to be the most robust EM algorithm of the three presented by Liu et al. At each iteration of Expm EM, gamma shape and scale parameters were updated using the Newton-Raphson method, which is further described in the supplemental S1 file. The algorithm guarantees an increase in model likelihood with each iteration, and was terminated when the change in likelihood was less than 0.001. All algorithms were implemented in Python 3.7.0 using open source libraries: Scipy [18], NumPy[19], and pandas[20]. Initial parameterization and model selection Each model was initialized with 250 sets of parameter values drawn from uniform distributions shown in Table 1. Initial state probabilities 𝝅𝟎 were normalized such that each set of initial state probabilities summed to 1. The fit of each model state structure was assessed using Akaike’s information criterion (AIC) calculated as 2𝑝 − 2𝑙𝑜𝑔(𝐿), where L is the model likelihood and p is the number of parameters estimated in a specific model. 38 Table 1 Descriptions and initial values of parameters in three hidden state models. Parameter Description Initial Value 𝜋&! Probability of starting in each hidden state k Unif(0,1) 𝛼" Gamma distribution shape parameter for state k Unif(1,20) 𝜃" Gamma distribution scale parameter for state k Unif(0.01, 1) 𝑞#" Transition rate from state j to state k Unif(0.01,5) Four hidden Markov model structures with two, three, four or five hidden states were initialized with 250 sets of random parameter values drawn from uniform distributions. We chose parameter values for transition rates and emission distributions that would produce reasonable state transition probabilities and gamma distributions to fit the observed CFU values. We allowed the initial state probabilities to range from 0 to 1 reflecting our limited knowledge on the proportion of animals in each underlying disease state. Bootstrap confidence intervals The CFU shedding dataset of 129 cows was sampled with replacement to generate 1000 bootstrap samples of the same size as the original dataset using the resample function from sklearn[21]. Model transition and emission parameters were estimated for each sample using the initialization parameter combination that produced the lowest AIC score. 95% confidence intervals were calculated for each transition and emission probability using the 2.5 and 97.5 percentiles of the parameter values generated using the bootstrap samples. 39 Results and discussion Model likelihood and AIC The AIC values for each model type (two, three, four, five states) is shown in Table 2. Table 2. Model comparison using AIC. Model pi log(Li) AICi Rel. Li w(AIC)i 3 State 14 -95.83 219.66 1 0.993 2 State 7 -107.86 229.72 6.5 E -3 0.006 4 State 23 -106.24 258.48 3.7 E -9 3.7 E -9 5 State 34 -104.38 276.76 4.0 E -13 4.0 E -13 Rel. Li represents the relative log likelihood for model i and is calculated as 𝑒𝑥𝑝 E'()"#$*'()#F where is AICmin the minimum AIC value in the candidate set. w(AIC)+ i is the AIC weight for model i and is calculated as ,-..0#∑ . pi is the number of parameters !,-. 0! in model i, and log(Li) is the log likelihood for model i. The three state model had the lowest AIC and had the largest AIC weight so it was determined to be the best fitting model out of the four types tested to minimize information loss. This suggests that the three state model fits the data best with our limited sample size, but it is possible that with a larger dataset a different state structure would be preferred. For example, if there were more consecutive samples taken for each cow over a longer period of time, there may be stronger support for a model with a larger number of states as this dataset may contain finer scale changes in shedding patterns that were not evident in our comparatively sparse dataset. On the other hand, factors such as individual heterogeneity and outliers may lead AIC to favor models with 40 larger state spaces when smaller state spaces are more biologically realistic[22]. Although the three state model had the lowest AIC, since AIC may favor larger state spaces, and since both the two and three state models had AIC weights greater than zero, for the remainder of this paper will focus on results for both the two and three state models. The initial parameter combination that produced the lowest AIC value for the two, three, four and five state models are shown in supplementary S1 Table. Parameter estimates for the four and five state models are shown in supplementary S2 Table. Parameter estimates EM estimates for two and three hidden state model parameters are shown in Table 3. Table 3. Transition and emission probability estimates. Parameter Description Estimate 95% CI Two state models 𝜋& Probability of starting in state 0 0.45 (0.30, 0.64) 𝛼& Gamma distribution shape parameter 17.56 (4.81, 26.33) for state 0 𝛼3 Gamma distribution shape parameter 17.86 (13.86, 184.37) for state 1 𝜃& Gamma distribution scale parameter 0.14 (0.09, 0.82) for state 0 𝜃3 Gamma distribution scale parameter 0.39 (0.04, 0.48) for state 1 𝛼&𝜃& Gamma distribution mean for state 0 2.44 (2.30, 3.95) 𝛼3𝜃3 Gamma distribution mean for state 1 6.87 (6.49, 8.03) 𝑞&3 Transition rate from state 0 to 1 0.03 (0.01, 2.90) 𝑞3& Transition rate from state 1 to 0 2.00e-3 (1.22e-07, 1.02) Three state models 𝜋& Probability of starting in state 0 0.43 (0.29, 0.62) 𝜋3 Probability of starting in state 1 0.21 (4.11e-36, 0.39) 𝛼& Gamma distribution shape parameter 19.41 (4.79, 30.07) for state 0 41 𝛼3 Gamma distribution shape parameter 16.24 (12.29, 15307.72) for state 1 𝛼+ Gamma distribution shape parameter 173.20 (36.00, 239.35) for state 2 𝜃& Gamma distribution scale parameter 0.12 (0.08, 0.78) for state 0 𝜃3 Gamma distribution scale parameter 0.35 (3.27e-4,0.39) for state 1 𝜃+ Gamma distribution scale parameter 0.05 (0.03, 0.20) for state 2 𝛼&𝜃& Gamma distribution mean for state 0 2.40 (2.17, 3.75) 𝛼3𝜃3 Gamma distribution mean for state 1 5.64 (3.83, 6.18) 𝛼+𝜃+ Gamma distribution mean for state 2 7.80 (7.14, 8.12) 𝑞&3 Transition rate from state 0 to 1 0.03 (0.01, 0.98) 𝑞&+ Transition rate from state 0 to 2 1.13e-40 (1.89e-46, 0.48) 𝑞3& Transition rate from state 1 to 0 4.65e-39 (1.86e-131, 1.46) 𝑞3+ Transition rate from state 1 to 2 0.03 (0.02, 6.15) 𝑞+& Transition rate from state 2 to 0 3.00e-3 (5.12e-143, 0.60) 𝑞+3 Transition rate from state 2 to 1 1.50e-06 (1.50e-8, 1.49) Parameter estimates for the two and three hidden disease models were generated using the modified Baum-Welch Expectation Maximization algorithm and bootstrap confidence intervals, using initial values in supplementary Table S1. Transient distributions of the Markov chain are shown in Figure 3A. A histogram of the transformed CFU data is shown in Figure 3B and fitted emission distributions for the two and three state models are shown in Figure 3C. The CFU count data shows a large number of low CFU observations, fewer mid-range CFU counts between 3 and 5 CFU and a larger number of observations between 5 and 9 CFU. The CFU count data are well approximated by the fitted gamma emissions, which show higher densities for low and high CFU counts than for mid-range CFU counts. 42 Figure 3. Transient distributions and emission distributions in two and three hidden state models. The transient distribution for each Markov chain was calculated for time t=0,..,300 weeks (A). Fitted emission distributions are shown in (C), and the observed data are shown in grey (B). The number of random samples used to create each plotted state emission distribution was proportional to the stationary distribution value for that hidden state. The stationary distributions 𝜋4 are calculated by solving 𝜋4𝑄 = 0, and can be approximated by the equilibrium distribution shown in Figure 3A, where for large t, 𝜋*$%& ≈ 𝜋4. The stationary distribution for the two state model was 𝜋4 = [0.07, 0.93], and for the three state model was 𝜋4 = [0.09, 0.09, 0.81]. For both the stationary 43 distributions in the two and three state models, there was a very high probability of being in the highest disease state (state 1 in the two state model and state 2 in the three state model). Thus, the majority of cows are expected to reach a disease state associated with high CFU shedding; however, a minority of cows are expected to have observations associated with a low-shedding disease state. The sojourn time before transitioning between hidden states is exponentially distributed with parameter -𝑞55, where 𝑞55 is a diagonal element of the Q matrix. In both the two and three state models, the probability of remaining within a state is much higher than transitioning between states. The average sojourn time for state 0 in the two state model was 37.7 weeks, and the average sojourn time in the three state model was 38.3 weeks for state 0 and 37.8 weeks for state 2; therefore, on average cows remained in a low CFU-emitting disease hidden state for about 38 weeks before transitioning to a higher CFU-emitting disease state. Unlike the two state model, the three state model contains an intermediate state (state 1) that is associated with a moderate level of MAP shedding. In the three state model, the transition rate from state 0 to 1 is much higher than the transition rate from state 2 to 1, and the transition rate from state 1 to 2 is much higher than the transition rate from state 1 to 0. Similarly, in both the two and three state models, the transition rate between lowest shedding disease states to the highest shedding disease state was much higher that the transition rate from a high shedding disease state to lower shedding disease states. Together, this indicates that transition from any higher shedding state to a lower shedding state is less likely than either remaining in the current state or transitioning to a higher shedding disease state, which is consistent with the known progression of MAP infection from subclinical infection 44 associated with low MAP shedding to clinical Johne’s disease associated with high MAP shedding. In addition to transitioning from low shedding to high shedding states, transition rate estimates in the three state model suggest that it is possible to transition from state 2 to state 0, and that this transition is more likely than transitioning from state 2 to 1. This type of transition may exist in an intermittent high shedding progression pattern, where cows have observations associated with a high shedding disease state flanked by observations associated with a low shedding disease state; however, this pattern is only clearly demonstrated by one cow, C_106, shown in Figure 4B, where between 50 and 65 weeks there is a transition from state 2 to state 0. The maximum posterior probabilities of a cow being in a particular disease state at each sampling time were constructed to visualize progression patterns of individual cows, and these individual progression patterns are shown in Figure 4A. The observed and predicted log(CFU) MAP values for representative cows is shown in Figure 4B. The predicted log(CFU) MAP values were generated by taking 1000 random samples from the fitted gamma emission distributions from the state with the highest posterior probability for each observation. The majority of the predicted log(MAP) 10-90 percentiles overlapped the observed log(MAP) values; however, the three state model 10-90 percentile range predictions were narrower and overlapped observed values more often than the two state model predictions indicating that the three state model fit the data better, especially for intermediate log(CFU) values. 45 Figure 4. Progression patterns through disease states. Progression patterns of individual cows through disease states estimated using posterior decoding are shown for both the two state and the three state models (A). The color of the line indicates the state with the highest posterior probability at the last observation. 46 The lines were jittered vertically so that overlapping lines could be distinguished. The empirical and predicted log(CFU) MAP value from seven representative cows shows variability in fecal shedding over the sampling period, with some cows shedding very high CFU counts over the sampling period and others maintaining a lower CFU count (B) The median, and 10-90 percentile range for the predicted log(CFU) distributions are shown as points and bars, respectively. The predicted CFU values are colored according to the fitted state emission distribution for each sample. The empirical log(MAP) CFU counts are shown in black. In both the two and three state model structures, the majority of cows had increasing fecal CFU counts over the sampling period; however, a small group of cows maintained a low CFU count. This group, labeled as having an observation at the last sampling interval in the lowest disease state in Figure 4A, represents the proportion of cows with consecutive low CFU counts that did not progress to a high-shedding disease state. The stationary distributions of the Markov chains and the differential posterior decoded patterns demonstrate that there are at least two types of MAP infection progression patterns, where most cows progress to a disease state associated with higher MAP shedding in both the two and three state models. These “progressors” shed higher MAP loads over the course of their infection than “non-progressors” that remain in lower CFU-emitting disease states. This suggests that there may be a minority of infected cattle that do not shed large amounts of MAP, and thus may play a smaller role in the transmission chain within a farm. If these animals are high producers, it may be reasonable to keep them on the farm despite their positive MAP infection status. 47 Potential factors related to disease progression path Our study provides additional evidence to support that at least two progression patterns exist among cows infected with MAP. Previous studies have suggested that age at time of infection plays a role in disease progression type [11]; however, we were not able to determine the age at the time of infection since all animals were infected naturally, and we only had samples from adult animals. There is also evidence that host immune factors are related to Johne’s disease progression in cattle and other species. De Silva et al. found that high fecal MAP DNA quantity and lower interferon gamma response early in life were positively associated with progressing to clinical disease in sheep, suggesting that early immune response may play a role in disease progression path [23]. Koets et al. found that cows with mutations in the Toll-like receptor 2 gene may show an increased macrophage activity, increased T cell activation and reduced susceptibility to MAP infection [24]. Future research incorporating disease path progression using HMM and immune monitoring could lead to breeding cattle that are resistant to progression to the state of high shedding. Future directions and Conclusions Mathematical modeling has been an important tool to study Johne’s disease control due to the complex nature of the slowly progressing disease dynamics and farm dynamics. Some previous modeling studies have included non-progressing and progressing patterns through discrete disease states [13, 25]; however no study has estimated probabilities of transition among those disease states, or the probability of 48 shedding CFU count given membership in a specific disease state. The addition of state transition probabilities from our model can improve the accuracy of predicted progression patterns in mathematical models, and the addition of emission probability estimates to simulation models can more accurately model disease transmission dynamics. In addition to the utility of our model estimates in simulation studies, our HMM can also be used for disease progression path prediction on infected farms. After validation in an external population, the model can be incorporated into Johne’s disease control programs to predict which progression paths are more likely given a series of fecal culture samples from individual cows. With this knowledge, and with knowledge from future simulation studies that take advantage of a multi-progression path disease structure, farmers can make more informed decisions on which animals to cull depending on their predicted disease paths. The proposed model includes simplifying assumptions that do not fully characterize the complex nature of disease progression. One such simplification is creating discrete disease states and fecal shedding states, where in reality disease progression is a continuous process through many possible disease states. Our ability to resolve fine-scale differences between disease states is limited by the intervals between samples, and by the number of samples taken from each cow. It is possible that including covariates such as milk production data or sampling cows more frequently and over a longer period of time could provide a more accurate description of disease progression. Our analysis may also underestimate the true prevalence of MAP infected cows at each sampling interval because some positive animals may not have been identified using fecal culture. Other diagnostic tests exist relying on serological data, but fecal culture 49 may be the most reliable marker for disease progression [26]. Another potential limitation is that expectation maximization may identify local maxima, and may not find global maxima; however, initializing the model with a range of parameter values allowed us to explore a variety of potential maxima. Additionally, our analysis may be influenced by survival bias because cows with more advanced MAP infection may produce less milk than cows with subclinical infection, which results in higher culling rates in lower producing cows. This could result in an underrepresentation of severe cases with higher MAP CFU counts than would be expected if no culling decisions were related to MAP infection stage. Although we expect this survival bias to reduce the expected sojourn time of the hidden states associated with higher CFU counts, we believe our data are representative of MAP infection patterns on typical dairy farms that make culling decisions based on milk production statistics. Lastly, the parameter estimates generated by the model reflect progression patterns among cows included in our study, and the model needs to be validated in an external population to determine if the progression patterns are true of all infected animals. Johne’s disease is difficult to control due to a number of factors including its slow progression and is extremely difficult to eliminate from a farm due to environmental persistence. We determined that there are at least two distinct progressing patterns, non-progressing and progressing, among infected animals in the three herds studied. Non-progressors are likely to have low fecal MAP CFU counts associated with a low-shedding disease state, whereas progressors are more likely to transition from low to high disease state, which is associated with a high (> 600) CFU count. Parameters from this model can be used to inform disease simulation models to test control 50 strategies that include methods targeted at managing progressors. After model validation, this model can also be used on infected farms to predict future disease states after a relatively small number of observed fecal culture counts. This modeling framework can be used to classify progression in other slowly progressing infectious diseases. Acknowledgements We would like to acknowledge the RDQMA research team for their support in collecting the samples and providing farm data. We would also like to acknowledge Dr. Annete Nigsch for her assistance in the MAP database curation. 51 References 1. Ehlers S, Schaible UE. The granuloma in tuberculosis: dynamics of a host- pathogen collusion. Front Immunol. 2012;3:411. Epub 2013/01/12. doi: 10.3389/fimmu.2012.00411. 2. Ott SL, Wells SJ, Wagner BA. Herd-level economic losses associated with Johne's disease on US dairy operations. Prev Vet Med. 1999;40(3-4):179-92. Epub 1999/07/29. doi: 10.1016/s0167-5877(99)00037-9. 3. McNees AL, Markesich D, Zayyani NR, Graham DY. Mycobacterium paratuberculosis as a cause of Crohn's disease. Expert Rev Gastroenterol Hepatol. 2015;9(12):1523-34. Epub 2015/10/17. doi: 10.1586/17474124.2015.1093931. 4. USDA-APHIS. NAHMS. Johne’s Disease on U.S. Dairies, 1991-2007. 2008. 5. Whittington RJ, Sergeant ES. Progress towards understanding the spread, detection and control of Mycobacterium avium subsp paratuberculosis in animal populations. Aust Vet J. 2001;79(4):267-78. Epub 2001/05/15. doi: 10.1111/j.1751- 0813.2001.tb11980.x. 6. Cocito C, Gilot P, Coene M, de Kesel M, Poupart P, Vannuffel P. Paratuberculosis. Clin Microbiol Rev. 1994;7(3):328-45. Epub 1994/07/01. doi: 10.1128/cmr.7.3.328. 7. Whitlock RH, Buergelt C. Preclinical and clinical manifestations of paratuberculosis (including pathology). Vet Clin North Am Food Anim Pract. 1996;12(2):345-56. Epub 1996/07/01. doi: 10.1016/s0749-0720(15)30410-2. 8. Grewal SK, Rajeev S, Sreevatsan S, Michel FC, Jr. Persistence of Mycobacterium avium subsp. paratuberculosis and other zoonotic pathogens during simulated composting, manure packing, and liquid storage of dairy manure. Appl Environ Microbiol. 2006;72(1):565-74. Epub 2006/01/05. doi: 10.1128/AEM.72.1.565- 574.2006. 9. Slater N, Mitchell RM, Whitlock RH, Fyock T, Pradhan AK, Knupfer E, et al. Impact of the shedding level on transmission of persistent infections in Mycobacterium avium subspecies paratuberculosis (MAP). Vet Res. 2016;47:38. Epub 2016/03/02. doi: 10.1186/s13567-016-0323-3. 10. Schukken YH, Whitlock RH, Wolfgang D, Grohn Y, Beaver A, VanKessel J, et al. Longitudinal data collection of Mycobacterium avium subspecies Paratuberculosis infections in dairy herds: the value of precise field data. Vet Res. 2015;46:65. Epub 2015/06/21. doi: 10.1186/s13567-015-0187-y. 11. Mitchell RM, Schukken Y, Koets A, Weber M, Bakker D, Stabel J, et al. Differences in intermittent and continuous fecal shedding patterns between natural and experimental Mycobacterium avium subspecies paratuberculosis infections in cattle. Vet Res. 2015;46:66. Epub 2015/06/21. doi: 10.1186/s13567-015-0188-x. 12. Mitchell RM, Whitlock RH, Grohn YT, Schukken YH. Back to the real world: connecting models with data. Prev Vet Med. 2015;118(2-3):215-25. Epub 2015/01/15. doi: 10.1016/j.prevetmed.2014.12.009. 13. Smith RL, Schukken YH, Grohn YT. A new compartmental model of Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet Med. 2015;122(3):298-305. Epub 2015/11/02. doi: 10.1016/j.prevetmed.2015.10.008. 52 14. Al-Mamun MA, Smith RL, Schukken YH, Grohn YT. Use of an Individual- based Model to Control Transmission Pathways of Mycobacterium avium Subsp. paratuberculosis Infection in Cattle Herds. Sci Rep. 2017;7(1):11845. Epub 2017/09/21. doi: 10.1038/s41598-017-12078-z. 15. Pradhan AK, Van Kessel JS, Karns JS, Wolfgang DR, Hovingh E, Nelen KA, et al. Dynamics of endemic infectious diseases of animal and human importance on three dairy herds in the northeastern United States. J Dairy Sci. 2009;92(4):1811-25. Epub 2009/03/25. doi: 10.3168/jds.2008-1486. 16. Pradhan AK, Mitchell RM, Kramer AJ, Zurakowski MJ, Fyock TL, Whitlock RH, et al. Molecular epidemiology of Mycobacterium avium subsp. paratuberculosis in a longitudinal study of three dairy herds. J Clin Microbiol. 2011;49(3):893-901. Epub 2011/01/07. doi: 10.1128/JCM.01107-10. 17. Liu YY, Li S, Li F, Song L, Rehg JM. Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression. Adv Neural Inf Process Syst. 2015;28:3599-607. Epub 2015/01/01. 18. Jones Eea. Open source scientific tools for Python. 2001. 19. van der Walt S, Colbert SC, Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering. 2011;13(2):22-30. doi: 10.1109/mcse.2011.37. 20. W. M. Data Structures for Statistical Computing in Python. 2010:51-6. 21. Pedregosa al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825-30. 22. Pohle J, Langrock R, van Beest FM, Schmidt NM. Selecting the Number of States in Hidden Markov Models: Pragmatic Solutions Illustrated Using Animal Movement. Journal of Agricultural, Biological and Environmental Statistics. 2017;22:270-93. 23. de Silva K, Begg DJ, Plain KM, Purdie AC, Kawaji S, Dhand NK, et al. Can early host responses to mycobacterial infection predict eventual disease outcomes? Prev Vet Med. 2013;112(3-4):203-12. Epub 2013/09/17. doi: 10.1016/j.prevetmed.2013.08.006. 24. Koets A, Santema W, Mertens H, Oostenrijk D, Keestra M, Overdijk M, et al. Susceptibility to paratuberculosis infection in cattle is associated with single nucleotide polymorphisms in Toll-like receptor 2 which modulate immune responses against Mycobacterium avium subspecies paratuberculosis. Prev Vet Med. 2010;93(4):305-15. Epub 2009/12/17. doi: 10.1016/j.prevetmed.2009.11.008. 25. Marquetoux N, Mitchell R, Ridler A, Heuer C, Wilson P. A synthesis of the patho-physiology of Mycobacterium avium subspecies paratuberculosis infection in sheep to inform mathematical modelling of ovine paratuberculosis. Vet Res. 2018;49(1):27. Epub 2018/03/09. doi: 10.1186/s13567-018-0522-1. 26. Magombedze G, Shiri T, Eda S, Stabel JR. Inferring biomarkers for Mycobacterium avium subsp. paratuberculosis infection and disease progression in cattle using experimental data. Sci Rep. 2017;7:44765. Epub 2017/03/21. doi: 10.1038/srep44765. 53 Supplementary Materials File S1. Description of gamma parameter estimation Newton-Raphson Estimation of gamma emission parameters. MAP CFU emissions 𝑥, were modeled with gamma distributions, with the following pdf: 𝑥6*3𝑒*7/9 𝑓(𝑥|𝛼, 𝜃) = Γ(𝛼)𝜃6 And log likelihood function: 𝑥 log(𝐿) = Π;5:3𝑓(𝑥|𝛼, 𝜃) = −𝑁𝑙𝑜𝑔Γ(𝛼) − 𝛼𝑁𝑙𝑜𝑔𝜃 + (𝛼 − 1) log(∑𝑥 5 5) − ∑ 𝜃 The maximum likelihood estimator for the scale parameter 𝜃X is 7̅, however there is no 6 closed form maximum likelihood estimate for the shape parameter 𝛼. Therefore, an estimate was generated using Newton Raphson estimation. The methods of moments estimator was used as the initial estimate for 𝛼: 𝑁𝑥+ 𝛼& = ∑; (𝑥 − ?̅?)+ 5:3 5 Following the initial estimation of 𝛼&, Newton Raphson Estimation was used to iteratively calculate 𝛼[" for each state k: log(𝛼 ) − 𝜓(𝛼 ) − log(?̅?) + 𝑙]𝑜]]𝑔]]𝑥] 𝛼=-> = 𝑎?.@ − ?.@ ?.@ 1 − 𝜓A𝛼 (𝛼?.@)?.@ where 𝑥]]] = ∑ B#,!7# , 𝑙]𝑜]]𝑔]]𝑥] = ∑ B#,!CDE 7#" 5 ∑ 5 ∑ and 𝑝 G!(5)J!(5) B B 5," = . ! #,! ! #,! ∑!G!(5)J!(5) Estimation was stopped when the 𝛼 value changed by less than 0.0001 or if the number of Newton-Raphson iterations exceeded 10000. If the number of iterations exceeded this threshold, the run was terminated. 54 Table S1. Initial parameter values Parameter Description Initial Value Two state models 𝜋& Estimate for probability of starting in state 0 0.44 𝛼& Gamma distribution shape estimate for state 0 4.42 𝛼3 Gamma distribution shape estimate for state 1 10.18 𝜃& Gamma distribution scale estimate for state 0 0.84 𝜃3 Gamma distribution scale estimate for state 1 0.45 𝑞&3 Transition rate estimate from state 0 to 1 4.35 𝑞3& Transition rate estimate from state 1 to 0 0.38 Three state models 𝜋& Estimate for probability of starting in state 0 0.28 𝜋3 Estimate for probability of starting in state 1 0.40 𝛼& Gamma distribution shape estimate for state 0 1.09 𝛼3 Gamma distribution shape estimate for state 1 9.88 𝛼+ Gamma distribution shape estimate for state 2 19.8 𝜃& Gamma distribution scale estimate for state 0 0.42 𝜃3 Gamma distribution scale estimate for state 1 0.76 𝜃+ Gamma distribution scale estimate for state 2 0.51 𝑞&3 Transition rate estimate from state 0 to 1 0.60 𝑞&+ Transition rate estimate from state 0 to 2 0.20 𝑞3& Transition rate estimate from state 1 to 0 2.26 𝑞3+ Transition rate estimate from state 1 to 2 4.60 𝑞+& Transition rate estimate from state 2 to 0 0.99 𝑞+3 Transition rate estimate from state 2 to 1 0.66 Four state models 𝜋& Estimate for probability of starting in state 0 0.23 𝜋3 Estimate for probability of starting in state 1 0.29 𝜋+ Estimate for probability of starting in state 2 0.45 𝛼& Gamma distribution shape estimate for state 0 2.83 𝛼3 Gamma distribution shape estimate for state 1 8.65 𝛼+ Gamma distribution shape estimate for state 2 13.4 𝛼K Gamma distribution shape estimate for state 3 18.9 𝜃& Gamma distribution scale estimate for state 0 0.62 𝜃3 Gamma distribution scale estimate for state 1 0.97 55 𝜃+ Gamma distribution scale estimate for state 2 0.96 𝜃K Gamma distribution scale estimate for state 3 0.45 𝑞&3 Transition rate estimate from state 0 to 1 2.13 𝑞&+ Transition rate estimate from state 0 to 2 1.15 𝑞&K Transition rate estimate from state 0 to 3 2.45 𝑞3& Transition rate estimate from state 1 to 0 1.71 𝑞3+ Transition rate estimate from state 1 to 2 1.11 𝑞3K Transition rate estimate from state 1 to 3 1.36 𝑞+& Transition rate estimate from state 2 to 0 4.12 𝑞+3 Transition rate estimate from state 2 to 1 1.02 𝑞+K Transition rate estimate from state 2 to 3 3.82 𝑞K& Transition rate estimate from state 3 to 0 4.83 𝑞K3 Transition rate estimate from state 3 to 1 4.01 𝑞K+ Transition rate estimate from state 3 to 2 1.26 Five state models 𝜋& Estimate for probability of starting in state 0 0.05 𝜋3 Estimate for probability of starting in state 1 0.14 𝜋+ Estimate for probability of starting in state 2 0.31 𝜋K Estimate for probability of starting in state 3 0.13 𝛼& Gamma distribution shape estimate for state 0 9.69 𝛼3 Gamma distribution shape estimate for state 1 11.4 𝛼+ Gamma distribution shape estimate for state 2 15.0 𝛼K Gamma distribution shape estimate for state 3 15.0 𝛼L Gamma distribution shape estimate for state 4 18.5 𝜃& Gamma distribution scale estimate for state 0 0.74 𝜃3 Gamma distribution scale estimate for state 1 0.60 𝜃+ Gamma distribution scale estimate for state 2 0.36 𝜃K Gamma distribution scale estimate for state 3 0.22 𝜃L Gamma distribution scale estimate for state 4 0.88 𝑞&3 Transition rate estimate from state 0 to 1 4.79 𝑞&+ Transition rate estimate from state 0 to 2 3.66 𝑞&K Transition rate estimate from state 0 to 3 3.29 𝑞&L Transition rate estimate from state 0 to 4 2.05 𝑞3& Transition rate estimate from state 1 to 0 0.12 𝑞3+ Transition rate estimate from state 1 to 2 3.53 56 𝑞3K Transition rate estimate from state 1 to 3 1.58 𝑞3L Transition rate estimate from state 1 to 4 1.13 𝑞+& Transition rate estimate from state 2 to 0 2.88 𝑞+3 Transition rate estimate from state 2 to 1 0.84 𝑞+K Transition rate estimate from state 2 to 3 4.81 𝑞+L Transition rate estimate from state 2 to 4 1.93 𝑞K& Transition rate estimate from state 3 to 0 0.68 𝑞K3 Transition rate estimate from state 3 to 1 2.54 𝑞K+ Transition rate estimate from state 3 to 2 0.59 𝑞KL Transition rate estimate from state 3 to 4 3.72 𝑞L& Transition rate estimate from state 4 to 0 1.46 𝑞L3 Transition rate estimate from state 4 to 1 4.36 𝑞L+ Transition rate estimate from state 4 to 2 2.58 𝑞LK Transition rate estimate from state 4 to 3 4.11 These input parameter values produced the lowest AIC values for the two, three, four and five state models. 57 S2 Table. Four and five state model parameter estimates. Parameter Description 4 state model 5 state model Four state models 𝜋& Probability of starting in state 0 0.448 1.728e-51 𝜋3 Probability of starting in state 1 0.196 0.211 𝜋+ Probability of starting in state 2 0.195 0.173 𝜋K Probability of starting in state 3 0.160 0.427 𝜋L Probability of starting in state 4 -- 0.190 𝛼& Gamma distribution shape parameter 17.67 81742 for state 0 𝛼3 Gamma distribution shape parameter 21.72 229.0 for state 1 𝛼+ Gamma distribution shape parameter 4923 19.09 for state 2 𝛼K Gamma distribution shape parameter 336.8 19.29 for state 3 𝛼L Gamma distribution shape parameter -- 5055 for state 4 𝜃& Gamma distribution scale parameter 0.139 7.526e-5 for state 0 𝜃3 Gamma distribution scale parameter 0.261 0.033 for state 1 𝜃+ Gamma distribution scale parameter 0.002 0.271 for state 2 𝜃K Gamma distribution scale parameter 0.023 0.124 for state 3 𝜃L Gamma distribution scale parameter -- 0.002 for state 4 𝛼&𝜃& Gamma distribution mean for state 0 2.448 6.152 𝛼3𝜃3 Gamma distribution mean for state 1 5.674 7.581 𝛼+𝜃+ Gamma distribution mean for state 2 8.643 5.182 𝛼K𝜃K Gamma distribution mean for state 3 7.642 2.400 𝛼L𝜃L Gamma distribution mean for state 4 -- 8.646 𝑞&3 Transition rate from state 0 to 1 2.373 6.419 𝑞&+ Transition rate from state 0 to 2 1.561 5.271 𝑞&K Transition rate from state 0 to 3 3.263 3.628 𝑞&L Transition rate from state 0 to 4 -- 2.566 𝑞3& Transition rate from state 1 to 0 1.537 0.086 𝑞3+ Transition rate from state 1 to 2 1.351 3.798 𝑞3K Transition rate from state 1 to 3 1.626 1.298 58 𝑞3L Transition rate from state 1 to 4 -- 1.057 𝑞+& Transition rate from state 2 to 0 3.036 1.997 𝑞+3 Transition rate from state 2 to 1 0.838 0.784 𝑞+K Transition rate from state 2 to 3 3.755 3.683 𝑞+L Transition rate from state 2 to 4 -- 1.679 𝑞K& Transition rate from state 3 to 0 3.628 0.619 𝑞K3 Transition rate from state 3 to 1 3.348 3.087 𝑞K+ Transition rate from state 3 to 2 1.283 0.777 𝑞KL Transition rate from state 3 to 4 -- 4.209 𝑞L& Transition rate from state 4 to 0 -- 1.165 𝑞L3 Transition rate from state 4 to 1 -- 4.681 𝑞L+ Transition rate from state 4 to 2 -- 2.974 𝑞LK Transition rate from state 4 to 3 -- 3.633 Parameter estimates for the two and three hidden disease models were generated using the modified Baum-Welch Expectation Maximization algorithm 59 CHAPTER 3 A critical evaluation of Mycobacterium bovis pangenomics, with reference to its utility in outbreak investigation Kristina M. Ceres1,2, Michael J. Stanhope1,2 & Yrjö T. Gröhn1,2 1. Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University 2. Population and Ecosystem Health, College of Veterinary Medicine, Cornell University Chapter 3 is accepted for publication in Microbial Genomics 60 Abstract The increased accessibility of next generation sequencing has allowed enough genomes from a given bacterial species to be sequenced to describe the distribution of genes in the pangenome, without limiting analyses to genes present in reference strains. Although some taxa have thousands of whole genome sequences available on public databases, most genomes were sequenced with short read technology, resulting in incomplete assemblies. Studying pangenomes could lead to important insights into adaptation, pathogenicity, or molecular epidemiology, however given the known information loss inherent in analyzing contig-level assemblies, these inferences may be biased or inaccurate. In this study we describe the pangenome of a clonally evolving pathogen, Mycobacterium bovis, and examine the utility of gene content variation in M. bovis outbreak investigation. We constructed the M. bovis pangenome using 1463 de novo assembled genomes. We tested the assumption of strict clonal evolution by studying evidence of recombination in core genes and analyzing the distribution of accessory genes among core monophyletic groups. To determine if gene content variation could be utilized in outbreak investigation, we carefully examined accessory genes detected in a well described M. bovis outbreak in Minnesota. We found significant errors in accessory gene classification. After accounting for these errors, we show that M. bovis has a much smaller accessory genome than previously described and provide evidence supporting ongoing clonal evolution and a closed pangenome, with little gene content variation generated over outbreaks. We also identified frameshift mutations in multiple genes, including a mutation in glpK, which has recently been associated with 61 antibiotic tolerance in Mycobacterium tuberculosis. A pangenomic approach enables a more comprehensive analysis of genome dynamics than is possible with reference- based approaches; however, without critical evaluation of accessory gene content, inferences of transmission patterns employing these loci could be misguided. Impact statement Dissecting true gene content variation, from noise generated by analysis of incomplete genomes, is a necessary step to understand the role of horizontal gene transfer in the evolution of the M. bovis pangenome. Here we show that analysis of incomplete genomes can lead to incorrect identification of accessory genes, which artificially inflates the pangenome size. When correcting for these annotation errors, we show that M. bovis evolves clonally, and has a very compact accessory genome. We also show that there is limited variation in gene content among outbreak sequences and that this variation is driven by indels, suggesting indel variation could be useful in distinguishing between highly similar SNP profiles in epidemiological analysis. Data summary Sequence reads are available on the National Centre for Biotechnology Information (NCBI) Sequence Reads Archive. A full list of accession numbers for sequences used in this study are available in Supplemental File 1. The M. bovis gene presence/absence matrix is available in Supplemental File 2. The quality of genome assemblies was assessed using contig count, N50 and checkM completeness and contamination measures in addition to manual screening. Data processing scripts are publicly available 62 on https://github.com/kmceres/Mbovis_pangenome. Supplementary materials are available on Figshare: https://figshare.com/s/c9665c733243e68326f7 Introduction Mycobacterium bovis is a member of the Mycobacterium tuberculosis complex (MTBC), the collection of pathogenic bacteria including M. tuberculosis, which causes tuberculosis in humans and M. bovis, which causes bovine tuberculosis in cattle and other mammalian species. MTBC species are unique among bacterial taxa because unlike many species, they have highly conserved genomes and are thought to evolve strictly clonally [1, 2]. A consequence of strict clonality is relatively little gene content variation compared to species that can generate diversity through horizontal gene transfer (HGT). In addition to HGT, variability in gene content can arise through de novo gene formation, or through gene deletion. The competing forces of gene gain and gene loss create a U-shaped gene frequency spectrum, with many genes shared by most genomes on one end of a gene frequency spectrum, and a similar number of rare genes on the other end. Since MTBC genomes are thought to evolve clonally, without HGT, MTBC accessory genomes are hypothesized to be very small, and have fewer rare genes compared to other taxa. Recent MTBC studies, however, challenge the idea of a small accessory MTBC genome, with estimates of MTBC accessory genome size ranging from 2086 genes (36% of all genes) [3] to 7620 genes (69% of all genes) [4]. Importantly, a recent M. bovis study by Reis and Cunha suggests that M. bovis has an open pangenome [5], implying new M. bovis genes would be discovered with each new 63 genome sequenced. If M. bovis does have an open pangenome, it would suggest M. bovis does not evolve clonally. Similarly, studies investigating homologous recombination in MTBC have yielded conflicting results. Experimental evidence has shown that while the closely related free-living Mycobacterium canettii could accept donor DNA via homologous recombination, MTBC could not [1]. However, despite this lack of experimental evidence, computational methods sometimes identify recombination in MTBC genomes [3, 6, 7]. Some of the recombination in MTBC species identified through computational methods stems from poor quality data or unreliable alignments. For example, Godfroid et al. identified an imbalance in the number of recombination events between terminal branches and internal nodes, with a skew towards more recombination events on terminal branches and an unequal distribution of recombination events among genomes [8]. Furthermore, phylogenies created with these erroneous regions conflicted with the established M. tuberculosis phylogenies, leading the authors to conclude that any true recombination signal that was identified was greatly outweighed by noise stemming from poor data quality [8]. Resolving the conflict between bioinformatic detection of recombination and large accessory genomes and experimental evidence suggesting clonality is necessary to understand MTBC genome dynamics and their implication for outbreak analysis and the evolution of important phenotypes like antimicrobial resistance. Whole genome sequencing is an indispensable tool in outbreak investigation, enabling the identification of epidemiological links between individuals or groups that would otherwise be difficult to ascertain by comparing pathogen sequences. However, accurate reconstruction of a 64 phylogenetic tree relies on identifying regions of the genome that evolve clonally because phylogenetic trees are usually bifurcating trees where each node has exactly one parent. To accurately model evolution of a pathogen with proportions of the sequence evolving through HGT, nodes must be allowed to have more than one parent reflecting the inheritance of genetic material from more than one parental source, leading to a network structure or ancestral recombination graph, rather than a bifurcating tree structure. Therefore, to infer transmission relationships using a bifurcating tree, regions of the genome that are affected by recombination are often excised from alignments. If there is ongoing HGT in M. bovis and other MTBC species, alignments used to create trees relevant to transmission analysis must be carefully curated to ensure only clonally evolving genome segments are included. Additionally, if MTBC species have true accessory genomes gene content variation could be used in outbreak analysis to discriminate between highly similar core genome sequences. Here we investigate the evolutionary forces that shape the M. bovis pangenome and explore evolutionary scenarios that could have resulted in observed gene content variation using de novo assembled draft M. bovis. We look for evidence of homologous recombination and true gene content variation, and carefully examine accessory genes for assembly or annotation errors. Since there is no known mechanism for HGT in the MTBC, we suspect that gene deletion and gene duplication are the major drivers of pangenomic diversity in the MTBC. Consequently, after accounting for technical errors created during sequencing or assembly or bioinformatic misclassification, we expect that gene content variation patterns will closely resemble core genome single nucleotide polymorphism (SNP) variation patterns. We show that M. bovis has a much smaller 65 accessory genome than previously described and provide evidence supporting ongoing clonal evolution. This finding suggests careful examination of pangenome and recombination results from bioinformatic analyses is necessary to accurately describe MTBC genome dynamics. Methods Sequence assembly and annotation Raw M. bovis sequence files from Almaw et al. [9], Loiseau et al [10], Orloski et al. [11], Reis & Cunha [6], Rodrigues et al. [12], Zwyer et al. [13] and from BioProject PRJNA769553 were downloaded from the NCBI Sequence Read Archive. A full list of downloaded samples can be found in Supplemental File 1. Sequences were assembled using SPAdes genome assembler v3.14.0[14] in careful mode, and polished using Pilon v1.23 [15]. The quality of assemblies was assessed using QUAST v5.0.2 [16] and checkM v1.1.3 [17]. To minimize annotation issues inherent with draft genomes, only sequences with less than or equal to 150 contigs and less than 5% contamination were used for further analysis. It was noted upon visual inspection that some contigs contained duplicate gene sequences that were the reverse compliment of each other, such that the entire contig was one palindromic sequence. Our view is this was likely caused by assembly error, and therefore systematically identified all such palindromic sequences (81 contigs total) by searching for contigs that contained reverse compliments of the same gene with identical DNA sequences. Each identified contig was visually inspected, and sequences that had any confirmed palindromes were removed. Sequences 66 were annotated using prokka v1.14.16 [18] using the AF2122/97 (NCBI accession NC_002945.4) M. bovis reference genome as a guide [19]. Sequence selection Before assembling genomes, we determined which lineage each sequence belonged to according to the new M. bovis lineage classification scheme suggested by Zwyer et al. [13]. We called SNPs for all downloaded fastq using the vSNP pipeline (https://github.com/USDA-VS/vSNP) against the Mycobacterium tuberculosis H37Rv reference genome (NCBI accession NC_000962.3). SNPs were filtered out if they had a read depth less than 10, an allele frequency less than 0.9, and a mapping quality less than 30 using vcffilter in vcflib v1.0.2 [20]. Lineages were assigned using H37Rv genome positions in Table 4 of Zwyer et al. [13]. Pangenome analysis Panaroo was used to characterize the M. bovis pangenome [21]. To identify the reference coordinates of partial genes undergoing pseudogenization, the consensus gene sequence created by Panaroo for each gene was blasted to the AF2122/97 M. bovis reference, using a similarity e-value cutoff of 1x10-10. A network of strong reference gene – accessory gene BLASTn matches was created using the IRanges R package v2.24.1, and this network was filtered to contain only gene-reference matches where at least 75% of the length of the Panaroo-identified gene overlapped with a reference gene. The resulting network was used to identify gene fragments that all mapped to the same reference gene, indicating that each fragment was part of the same original reference 67 gene undergoing pseudogenization. Since we expected to identify gene content variation driven by deletion, we investigated whether certain functional categories of genes were more likely to be pseudogenized. We used PANTHER and the PANTHER GO Slim annotation set to analyze whether certain Gene Ontology groups were overrepresented in genes undergoing pseudogenization compared to core genes [22]. Additionally, as a negative control for detecting pseudogenization, we determined whether any of the pseudogenes identified in this analysis were essential in the H37Rv M. tuberculosis reference strain [23]. Recently updated AF2122/97 functional annotations were used to map AF2122/97 annotated genes to orthologous coding sequences in M. tuberculosis H37Rv [24]. Core genome recombination detection A core genome alignment was constructed using parsnp v 1.2 [25], using the M. tuberculsosis H37Rv genome as a reference and outgroup. Parsnp uses MUSCLE [26] to create gene alignments and FastTree2 [27] to create a core phylogeny. Recombination was detected using Gubbins [28] with default parameters and a starting tree created by parsnp using a GTR substitution model. For each gene with detected recombination, whole-contig alignments were created using progressiveMauve [29] and reviewed in Geneious Prime v 2021.2.2 to rule out detection by assembly or alignment error. All regions of the genome that Gubbins determined to be recombinant were excised from the core gene alignment to ensure only regions of clonal evolution were included in the population structure analysis. Trees visualizations were created using iTOL v5 [30] and FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). In the process of creating a 68 core genome alignment, parsnp excises regions of the alignment that are missing from at least one genome. Since this resulting alignment is conservative, some potential recombination events may be excised from the alignment before it is fed to Gubbins. Therefore, in addition to the recombination analysis on the core genome alignment, individual gene alignments that had previously been implicated in recombination events in M. bovis [6] were also assessed for recombination using fastGEAR [31]. The individual gene alignments were created using the built-in MAFFT aligner included in Panaroo [21, 32]. SNP validation analysis Filtered SNPs were used to validate potential recombination hotspots identified by Gubbins and fastGEAR. SNPs were considered valid if they passed the filtering criteria described in section 5.2. SNP quality was also verified by manually inspecting read mapping files using the Integrative Genome Viewer [33]. The functional consequence of SNPs was assessed using snpEff [34]. Population structure assessed by core and accessory genes M. bovis population structure was compared using core SNPs and accessory gene content. To analyze population structure using the core genome, principal component analysis (PCA) was applied directly to Boolean vector transformed SNP vectors following the methods presented in Konishi et al. [35]. In accessory genes population structure was assessed using PCA on the Jaccard distance calculated using binary gene presence absence patterns using the R package abdiv v0.2.0 [36]. A dendrogram was 69 created from accessory gene clusters using Ward’s hierarchical clustering on the Jaccard distance matrix. All post-processing analysis was done in RStudio 2021.09.0 [37] using R v 4.1.2 [38]. Outbreak analysis To investigate the utility of gene presence absence data in outbreak investigation, we analyzed the pangenome of 62 M. bovis genomes from well-described outbreak that occurred in Minnesota from 2005-2012 and was documented in Glaser et al. [39]. The outbreak pangenome was assessed using Panaroo and the methods described in section 5.3. In addition to gene presence absence patterns, we also created a core genome phylogenetic tree using the SNP matrix from Table S2 in Glaser et al [39] and IQ-TREE- 2 v 2.1.1 [40] with a GTR substitution model. We also used ScarTrek [41] to call indels. Indels were removed if they occurred in both the outgroup, sequences from a Texas outbreak (13-3900, 12-5094 and 12-5090), and the Minnesota sequences. Binary indel presence/absence information was used to create an indel tree with a JC2 substitution model in IQ-TREE-2 [40]. Investigating pangenome openness To compare the impact of methodological artifacts on M. bovis pangenome analysis, we reanalyzed the genomes included in a recent M. bovis pangenome study by Reis and Cunha [5]. We downloaded raw or complete sequence files listed in Supplementary Table 1 of Reis and Cunha [5] from NCBI. We processed the non-complete genomes and annotated all genomes using the sequence assembly and annotation methods 70 described in section 5.1. We assembled the pangenome using Panaroo and the post-hoc filtering methods described in section 5.3. All R scripts and parameters used for other bioinformatic programs are available on https://github.com/kmceres/Mbovis_pangenome. Results Sequence assembly 3626 raw fastq files were downloaded from the NCBI Sequence Read Archive. SNPs were called on all download samples and were used to assign clonal complex lineage labels. After lineage labels were determined, 1781 samples were chosen to be assembled to maximize representation of the known M. bovis clonal complexes in our final dataset. After assembly and quality filtering, 1463 whole genome sequences remained. All sequences were near complete (>= 90% completeness), and all sequences had low or moderate contamination (<10% contamination). 949 sequences had no contamination. Assembly quality figures are shown in Supplemental Figure 1. Core phylogeny The core phylogeny shown in Figure 5 contained eight clonal complexes labeled using the lineage designations, described in Zwyer et al 2018 [13]. The recombination free core phylogeny was created using RAxML [42] with a GTR substitution model in Gubbins and was rooted at the M. tuberculosis reference, H37Rv. Each core phylogenetic group contained samples from multiple geographic regions, reflecting multiple transmission events between geographic regions over the complex history of 71 M. bovis evolution [43]. Our sample includes genomes from every continent except Antarctica (Figure 5C and D), and in a variety of host species (Figure 5C), however, most samples were from cattle hosts. Lineages La1.7 and La1.8 and samples from the United States, Mexico, and New Zealand were over-represented, reflecting the intensive ongoing bovine tuberculosis molecular surveillance programs in those regions. Figure 5 M. bovis core phylogeny and sample geography. A. A recombination-free core phylogenetic tree constructed in Gubbins with a GTR substitution model is shown with clonal complex labels. Sample collection continent distribution is shown in B. and host distribution is shown in C. The number of samples included from each country is shown in D. 72 Population structure analysis Panaroo identified 3681 core genes present in greater than 95% of isolates, and 1441 accessory genes including 970 genes present in less than 15% of isolates. The Panaroo estimate of accessory genome size was smaller than previous studies [3-5, 43-46], which is more consistent with a hypothesis of clonal evolution. The population structure assessed using core SNPs and accessory gene presence absence patterns is shown in Figure 6. Principal components analysis of the core SNPs and pangenome content are shown in Figure 6B and 6C, respectively. A dendrogram was created to further visualize the similarity between gene content variation and core SNP variation. If the pangenome evolved clonally we would expect to see high concordance between core phylogenetic groups and accessory gene content patterns, although incomplete concordance was observed. 73 Figure 6. Population structure assessed in SNP variation and gene presence absence patterns. A. Gene content similarity dendrogram created using the Jaccard distance of the pangenome gene presence absence matrix. Core clonal complexes are labeled along the circumference of the tree. B-C. Core SNP and gene content variation principal components analysis colored by core clonal complexes created with core SNP principal components. SNP variation is shown in B and gene content variation is shown in C. Gene content variation is not concordant with SNP variation. 74 To test whether this lack of perfect correlation between genome content variation and core phylogenetic grouping was true evidence of a departure from clonality or if there was classification error generating the discordant signal, we investigated if and where accessory gene consensus sequences mapped to the M. bovis reference genome AF2122/97. Only 38 genes did not map to the AF2122/97 reference genome. Two out of the 38 had no BLAST matches, 29 mapped to Bos taurus reference genomes and are most likely host genome contamination, five mapped to non-MTBC microbes, including the Escherichia phage PhiX174 and one, espI_2 mapped to Esx-1 protein EspI on several MTBC genomes including M. bovis Mb3601. Similarly, esxM did not blast to AF2122/97, but did map to several other MTBC reference genomes. An alignment of espI_2 between genomes included in this study, AF2122/97, Mb3601 and M. tuberculosis H37rv shows that AF2122/97, and many samples included in this study have a 2049 base pair deletion of a duplicated copy of espI (Supplemental Figure 2). This one gene deletion is the only example we found of a gene present in our sample but not in AF2122/97, indicating that gene content variation is not generated by the addition of genes from external sources. We cannot rule out that the phage PhiX174 infected some M. bovis genomes, however all contigs that contained genes that mapped to PhiX174 did not contain any M. bovis genes, suggesting the sequence fragments containing PhiX174 genes were not part of M. bovis genomes. Since M. bovis does not contain plasmids, it is more likely that these partial phage genes were remnants of the standard Illumina phiX internal sequencing control that did not get fully removed as part of their automated demultiplexing process. Because we did not find genes contained 75 in M. bovis genomes that did not map to a M. bovis reference genome, it is unlikely that M. bovis acquires new genes from other prokaryotes via HGT. Redundant labeling of accessory genes Further inspection of alignments of contigs with and without accessory genes showed that most accessory genes were redundant annotations of the same gene undergoing pseudogenization. A subset of redundant gene groups with at least six genes mapped to the same reference genome are shown in Supplemental Figure 3. The gene presence absence pattern of each redundant gene group is plotted along the circumference of the core phylogeny and is colored by the reference gene to which the redundant gene group maps. There are two distinct patterns of gene content variation among redundant accessory genes. Either the redundant genes are mostly found on the same genome, or the genes within a group are more diffusely present or absent in genomes. This gene content variation pattern was further explored in Figure 7A, which shows the distribution of gene groups with at least two redundant genes mapped to a single reference genome, colored by the fraction of the time that all of the genes in a given group were found on the same genome. A total of 29 out of the 60 reference genes that had three or more mapped redundant gene groups were PE and PPE family genes. PE/PPE genes are highly repetitive and have an increased number of indel mutations compared to genes with higher sequence complexity [41]; therefore, it is likely some of the redundant annotation is due to frameshift mutations caused by indels. The low sequence complexity of PE/PPE genes also makes them harder to sequence accurately, so a fraction of the redundant annotation may also be attributed to sequencing or 76 assembly error. Because PE and PPE gene families are difficult to sequence accurately, they are often excluded from molecular epidemiological analyses [47]. Example gene alignments demonstrating the two causes of redundant annotation are shown in Figure 7B and C. The accessory genes in B were caused by annotation error, likely due to low coverage at the 3’ end of pks7. In C, the difference in accessory gene labeling was caused by a deletion in BQ2027_MB2105 in a subset of genomes including SAMN03300658. This deletion led to the annotation of an additional gene, labeled 369. An additional source of annotation error resulted from palindromic contigs, which likely resulted from assembly error. Although we removed palindromic contigs through visual inspection described in section 5.1, additional palindromic contigs remained in the analysis. The result of palindrominc contigs was identical sets of genes sequence annotated with different gene names. On non-palindromic contig genes from the identical sets received either the original gene annotation or the reverse compliment annotation. This resulted in an inflation of accessory genome count because core genes were designated as accessory due to inconsistent annotation. When we removed redundant gene labels generated by mislabeling at the ends of contigs, and palindromic contig labels, the size of the accessory genome decreased from 494 accessory genes present in between 15 and 95% of genomes and 970 rare genes present in less than 15% of genomes to 52 accessory and 85 rare genes (Figure 7D). The remaining accessory genes were either paralogs (33 genes) or pseudogenes (104 genes). 77 Figure 7. Accessory gene labels are created through redundant gene annotation. A. The distribution of all redundantly labeled genes, colored by the fraction of the time all genes in a redundant gene annotation group were found in the same genome. Within each reference genome, the number of mapped genes that were found within 1000 bp of the end of a contig are shown (black circles). B and C. Alignments of two example genes, pks7 and BQ2027_MB2105 (purple) with redundantly annotated accessory genes (blue). D. After removing redundant gene annotations, the accessory genome size (genes present in less than 95% of genomes) decreased from 1464 to 137 genes. 78 Although these redundantly annotated genes do not represent true accessory gene content variation, they do represent true differences in genotypes. This set of redundant genes contains genes with frameshift mutations leading to premature stop codons and loss of gene function, including mutations in single base repeats creating homopolymeric tracts, that may be reversible [41, 48]. We found insertions and deletions in homopolymeric tracts in multiple genes including frdBC and a C537, 7Cà8C insertion in glpK, which was shown to cause reversible antibiotic tolerance in M. tuberculosis [48]. Interestingly, the frameshift-causing homopolymeric tract mutations in glpK are roughly uniformly distributed across the phylogeny, indicating that these mutations occur in the absence of a clear selective force (Supplemental Figure 4A). An example alignment of the variable homopolymeric region in sequences from an outbreak that occurred in New Mexico, which had variable presence or absence indicating that variation in homopolymer tract length occurred within relatively short timespans (Supplemental Figure 4B). Attributes of accessory genes The gene presence absence patterns of the filtered set of accessory genes are shown in Figure 8. Three patterns of gene content variation were observed. First, it is evident that 8 accessory genes have perfectly correlated presence and absence patterns, shown in Figure 8. These genes are a group of 8 pseudogenes that are remnants of a phage, PhiRv1, also identified in the M. bovis accessory genome by Zimpel et al.[43], which was thought to be acquired when the ancestor of the MTBC transitioned to an intracellular lifestyle. This phage is absent in the M. bovis BCG vaccine strain, and it is 79 hypothesized that the loss of this phage is associated with the reduced virulence of the vaccine strain [49]. PhiRv1 genes are present most lineages but are notably absent from a monophyletic group within La1.8 consisting of 49% of the La1.8 samples included in this study. Absence of PhiRv1 genes from this subgroup could be used as a phylogenetic marker. We ran an additional PCA analysis on the Jaccard distance of the filtered presence/absence matrix and found that there were two main clusters of accessory genes. Although accessory genes do not appear to cluster by core genome lineage (Supplemental Figure 5A), the two distinct PCA clusters are separated by the presence and absence of the PhiRv1 genes (Supplemental Figure 5B). The second group consists of genes identified to be paralogs by Panaroo. This group consists mainly of genes involved in transposition (transposases, endonucleases), but also contains PE/PPE family genes and four genes that form a unique cluster in M. bovis La1.7. These genes originated from a single, duplication event of four genes (bpa, ragH, glfR, and a hypothetical protein) within rfbB (Supplemental Figure 6). The last pattern of accessory gene content is more diffuse presence or absence, and this pattern is driven by accessory genes that are undergoing pseudeogenization. We were not able to identify any functional category that was more likely to be undergoing pseudogenization in the PANTHER overrepresentation test, suggesting that the process of gene pseudogenization may be somewhat random; however, essential genes seem to be protected from pseudogenization. Only one detected pseudogene mapped to an essential M. tuberculosis H37Rv gene, glmS. Although 14 genomes contained this pseudogene, each genome had a different insertion sequence all located in a repetitive region of glmS; This region of glmS may either be hypervariable or difficult to sequence accurately. 80 Figure 8. Filtered gene presence absence patterns in the M. bovis pangenome. The core genome phylogeny from Figure 5 is shown, labeled with clonal complexes. The phage PhiRv1 genes are also labeled. Branch lengths are in SNP units. Gene content variation in outbreaks Variable gene content over an outbreak in Minnesota is shown in Figure 9A next to a core genome phylogenetic tree. The tree is rooted at the common ancestor to a closely related outgroup of sequences from a Texas outbreak (13-3900, 12-5094 and 12-5090) as in Glaser et al. [39]. Only three genes were found to be variable over this outbreak, group_2069, frdC and frdD. Upon further inspection of contig alignments, group_2069 was found to be annotated in genomes that lacked a 60 bp deletion between thrB and thrC compared to the reference genome AF2122/97. Due to the sparse distribution of group_2069 among outbreak samples, a more parsimonious explanation may be that the genomes with group_2069 have a 60 bp insertion relative to the other outbreak genomes, 81 which created a new open reading frame. The inserted sequence is identical to an adjacent 57 base pairs in the thrB/C intergenic region, suggesting this sequence was duplicated in genomes with group_2069 annotated. The other two genes, frdC and frdD were components of the AF2122/97 genes frdBC and frdD which were annotated differently depending on the insertion or deletion of base pairs in the homopolomyeric tract of frdBC. An alignment of the sequence differences in the frdBC homopolymeric tract are shown next to the gene presence absence matrix in Figure 9A. All gene content variation that was identified in the pangenome of the outbreak sequences was driven by indels; however, not all indels that occurred during the outbreak were identified in the pangenome analysis since only indels that changed an open reading frame were noted as variable gene content. Figure 9B shows a the relationships of indel presence and absence of outbreak sequences detected by ScarTrek [41]. As expected, there is more variation in indels than was captured in the gene presence absence analysis. Glaser et al. included two different phylogenetic analyses [39]. The first (Glaser et al. Figure 6) was a maximum likelihood phylogeny resulting in the same tree as shown in Figure 9A, where two cattle samples from the same herd (05-9955 and 05-9963) were placed in a clade with a deer sample (08-0863) because these three samples shared one SNP absent in other samples. The second tree included temporal information and produced a Bayesian time calibrated tree which placed two samples from the same cattle herd (05-9955 and 05-9963) at the base of the tree (Glaser et al. Figure 5), and deer sample in a distant clade. The second tree would suggest that the cattle and deer samples were not closely related and perhaps were not linked in a transmission event. Interestingly, when comparing the SNP tree to the indel tree in Figure 9, there is another 82 difference in these three samples shown in these highlighted samples in Figure 9. The positions of these three samples on the indel tree in Figure 9B adds an additional dimension to the two phylogenetic analyses presented in Glaser et al, showing cow sample 05-9963 and deer sample 08-0863 share an additional indel not present in 05- 9955. The combined analysis of SNPs and indels suggests that these samples may be more closely related than was suggested in the timed tree. In conclusion, although true gene content variability may occasionally occur during outbreaks in the form of duplication or larger gene deletion events, indels are likely more common, and may be useful in outbreak investigation when few SNP differences exist among samples. 83 Figure 9. Gene content variation in a Minnesota outbreak. Figure 9A shows a core genome phylogeny created with IQ-TREE2 using a GTR substitution model, and a gene presence absence matrix. Variation in gene content is observed in three genes, not including PE/PPE genes, however all gene content variation was driven by indels. An alignment of a variable homopolymeric tract region of M. bovis AF2122/97 fdrBC is shown next to the gene presence absence matrix. This homopolymeric tract variation resulted in differential gene presence absence annotation of frdD and frdB genes among outbreak samples. Indel variation is shown in a tree 84 created with binary indel presence/absence data in IQ-TREE2 with a JC2 substitution model in B. Three samples are highlighted to show disparate topologies of the SNP and indel trees. Three samples that had conflicting phylogenetic positions are highlighted in green. Recombination To determine if there is evidence of ongoing HGT in the core genome, we used Gubbins to infer recombination history in the core gene alignment and fastGEAR to detect recombination in a selected set of individual gene alignments. Gubbins inferred 16 recombination events; however, only 5 occurred on internal branches. Recombination events inferred on terminal branches can be unreliable because they only involve one genome. In addition to recombination analysis done on the core genome alignment, fastGEAR was used to investigate recombination on individual gene alignments with recombination events implicated in Reis and Cunha [6], and to confirm the recombination events detected by Gubbins. The fastGEAR analysis confirmed recombination events identified in Reis and Cunha [6] in narX and pks12. There is a high density of SNPs in a 100 bp locus in narX in18 monophyletic genomes. An alignment of the genomes with the narX SNP hotspot is shown in Supplemental Figure 7A. Four of the SNPs are nonsynonymous, resulting in three amino acid changes. An additional potential recombination event involving a 16 SNP hotspot was identified in pks12. The pks12 SNP cluster is also found in a monophyletic group, shown in Supplemental Figure 8A and B. SNPs occurred at the 3’ end of pks12 directly adjacent to regions of consistent poor mapping quality. In fact, the region of pks12 that contained 85 the SNP hotspot was always located at the end of a contig in our draft genomes (Supplemental Figure 8C). Since the SNP hotspot occurred at the end of contigs, near gaps, we could not confirm with high certainty that the SNPs were not called erroneously. Gubbins and fastGEAR also identified three additional recombination events in PPE 21, ppsB, and esxN. These potential recombination events consisted of sets of between five and seven SNPs, each shared by a monophyletic group. Pangenome openness To determine the impact of methodological differences on M. bovis pangenome representation, we re-constructed the pangenome using the 70 genomes used in Reis and Cunha [5]. The counts of accessory genes in the original analysis done by Reis and Cunha along with the Panaroo results before and after filtering are shown in Figure 10. Even before post-hoc filtering, the Panaroo estimate of accessory genome size was much smaller compared to the estimate obtained previously. Reis and Cunha used the get-homologs pipeline to analyze the pangenome [50], which unlike Panaroo does not employ a draft genome annotation correction method to reduce false positive accessory gene annotations. Using our methodology, we can reject the concept that the M. bovis pangenome is open, since we identified only 121 accessory genes in the 70 genome dataset, compared to the Reis and Cunha estimate of 3997 gene clusters. The gene presence/absence matrix of the pangenome created with the Reis and Cunha dataset is available in Supplemental File 3. 86 Figure 10. Methodological differences result in drastically different pangenome sizes. Panaroo produced a smaller accessory genome than that described in a recent M. bovis pangenome study. Soft core genes are present in greater than 95% of samples. Shell genes are present in less than 95% of samples but more than 2 samples. Cloud genes are present in one or two samples. After filtering redundantly labeled genes, the pangenome size was further reduced. The M. bovis pangenome is not open and shows presence and absence patterns consistent with clonal evolution. 87 Discussion Since a single genome contains only a portion of the genetic diversity of a species, comparing a new set of sequences to just one reference genome alone provides an incomplete description of the true genetic diversity in the set. There are currently hundreds of thousands of bacterial genomes publicly available on GenBank and other sequence repositories, potentially enabling the characterization of the full range of gene content diversity in multiple species; however, technical advances in annotating draft genomes have not caught up with the enormous number of incomplete genomes being uploaded to public databases [51]. One reason draft genomes are difficult to annotate is simply that genomes are incompletely sequenced and contain gaps between contigs. If the average prokaryotic gene is on the order of 1kb long, any contig with length on the same order of magnitude will likely contain fragmented genes. The ends of longer contigs also likely contain fragmented genes. Unfortunately, the only solution to this problem is to improve the sequence quality by closing the gaps between contigs, which perhaps will be done with future genomes sequenced using long read technology but may be too time intensive and costly for the thousands of existing draft genomes to be re-sequenced [51]. Methods for characterizing the pangenome typically involve identifying and classifying clusters of orthologous genes into core and accessory genes, without specifically accounting for erroneous annotations. Panaroo, attempts to correct annotation errors by creating a graphical representation of the pangenome and employing several quality control steps to remove poorly supported nodes [21]. With these additional quality control steps, Panaroo was able to more accurately characterize 88 the M. tuberculosis pangenome compared to several other pangenome pipelines [21], however correctly characterizing gene fragments remains an issue. Another perhaps unavoidable issue in pangenome studies is sampling bias. Since bacterial population sizes are so vast, even a relatively large sample may not truly represent the diversity in gene content present in a species [52]. Additionally, although we downloaded thousands of M. bovis genomes available from the NCBI Sequence Read Archive, we only retained the highest quality assemblies for analysis to minimize errors described above caused by higher contig number assemblies. The resulting sample still represents a convenience sample of M. bovis outbreaks in countries with well-funded surveillance programs, and may not fully represent M. bovis variants present in cattle globally. However, our sample includes genomes from all known major M. bovis clonal complexes and may be representative of the true diversity of the geographic regions included in this study because of intensive sampling programs in those regions. Despite the limitations in data quality created by short read sequencing, we were able to determine that M. bovis likely evolves clonally. After correcting for redundant gene annotations, we found a small pangenome consistent with a general pattern of ongoing gene deletion resulting in an increasingly compact genome. This gene deletion appears to be driven by random indel-causing pseudogenization. We also show that the M. bovis pangenome is likely closed, contradicting recent work by Reis and Cunha suggesting the M. bovis pangenome is open [5]. Additionally, we determined that accessory gene content is generally stable during outbreaks, although some variability in gene content is observed even between 89 closely related individuals. Since M. bovis and other MTBC species have a slow mutation rate (less than one substitution per genome per year) [53], genomes sampled in outbreak investigation often have very similar or identical SNP patterns. This low SNP diversity makes inferring who infected whom to be difficult or impossible using SNP data. Gene content variation could occasionally be useful in outbreak analysis to discriminate between samples with very similar SNP patterns, although variability in accessory genomes among closely related outbreak sequences is low. One important caveat in using gene content variation in outbreak analysis is that if accessory genes are not carefully examined, inferences on transmission patterns based on gene content variation are likely to lead to incorrect conclusions because of the inherent problem of gene detection in incomplete genomes. Indel presence/absence may be a more reliable and more common source of sequence variation in outbreak investigations. Using both indel and SNP data to infer transmission patterns may result in higher resolution transmission networks. We also identified frameshift mutations in glpK and in similar homopolmeric tracts in other genes that may be related to antimicrobial resistance in M. tuberculosis [54]. Recently, Safi et al. demonstrated that nucleotide insertions in homopolymeric C tract in glpK at lead to the gene transient gene inactivation, which led to reduced growth under nutrient deficient conditions or antibiotic treatment [48]. After relieving the stress caused by nutrient deficiency or antibiotic treatment, the inserted base pair was removed and colonies were able to restart growth. Therefore, Safi et al. conclude that this phase variation represents an additional mechanism for the evolution of antibiotic tolerance and could bias evolutionary trajectories toward antimicrobial resistance. Interestingly, 90 this same C537, 7Cà8C insertion was identified in our dataset. Since cattle are not treated with antimicrobials for bovine tuberculosis, glpK phase variation in M. bovis may not arise as a compensatory mechanism for antibiotic tolerance, but instead may result from drift. In this case, under conditions of antibiotic use, bacteria with the frameshift mutation would have a selective advantage and increased survival. The glpK gene encodes glycerol kinase, which is necessary for glycerol catabolism; however, most M. bovis strains have an inactivating mutation in pyk, the gene that encodes pyruvate kinase, which prevents M. bovis from being able to use glycerol as a sole carbon source [55]. Therefore, glpK may be dispensable in M. bovis and inactivating mutations may arise due to relaxed negative selection or even positive selection to limit the energetic cost of protein production for an enzyme that is not required [56]. More work is necessary to understand the role of phase variation in MTBC evolution, but evidence of glpK phase variation in M. bovis in cattle not treated with antibiotics for bovine tuberculosis suggests phase variation, especially in non-essential genes, could be a basal evolutionary process that occurs in the absence of antibiotic selection pressure. Indel mutations, generally, may be a random mutation process that selection can act on in addition to SNPs. Recently indels were implicated in antimicrobial resistance in M. tuberculosis [57], and it is possible that those resistance-associated indels first arose due to drift, and then were selected due to antimicrobial use. Lastly, we show that recombination is not an important driver of evolution in the M. bovis core genome. Recently Reis and Cunha present compelling evidence that a low level of recombination does occur in M. bovis [6], and we were able to replicate their findings in one gene, narX, and detected potential recombination events in an 91 additional three genes; however, the events in PPE 21, esxN, and ppsB only contained five to seven SNPs. Although it is possible that these mutations were caused by homologous recombination, other evolutionary forces such as recent positive selection could create a mutation hotspot [58]. Furthermore, all genomes with potential recombination event are part of monophyletic lineages, so it is likely that all mutations were caused by a single event. However, the mutation pattern in narX is unusual, and from this analysis alone we cannot rule out that it was caused by homologous recombination. Together, the small accessory genome and lack of significant evidence of recombination, provides strong evidence against the hypothesis that there is ongoing HGT in M. bovis, and instead, M. bovis indeed likely evolves strictly clonally. This finding has important implications both in understanding the basic mechanisms of MTBC evolution and in the practice of using genomic data in outbreak investigation. Strict clonality restricts the ability of M. bovis, and likely all MTBC species, to rapidly acquire genes that would allow it to adapt to new environments, including genes that would promote antibiotic tolerance and resistance. Although antimicrobial resistance is a growing global concern in MTBC and M. tuberculosis in particular, strict clonality implies that all antimicrobial resistance must evolve through de novo mutations. In outbreak analysis, SNP based bifurcating phylogenetic trees can be used reliably, without the need for complex graph structures to understand transmission. 92 Conclusions M. bovis is likely a strictly clonally evolving pathogen, and most variation in gene content is driven by gene deletion. M. bovis pangenome evolution is trending toward an increasingly compact genome, leaving only genes that are necessary for life inside the host. We also show that most recombination events and accessory genes identified by bioinformatic programs were false positives, although Panaroo’s graph connection mechanism appears to outperform other pangenome characterization algorithms in reducing the number of false positive genes. Because of the inherent problem of missing data created by using incomplete sequences, and the inability for bioinformatic programs to detect and rectify all errors in input data, both accessory gene and recombination events detected using incomplete MTBC genomes must be scrutinized to avoid incorrect inference based on pangenome data. We also found potentially reversible homopolymeric tract mutations in multiple genes, providing evidence that phase variation may be a general mechanism of gene regulation in MTBC. Lastly, we show that a small amount of gene content variation evolves over short time periods among closely related sequences, and this gene content variation is driven by indels. Indels may be a useful source of sequence variation in outbreak analysis when discriminating between genomes with similar SNP patterns. Acknowledgements The authors gratefully acknowledge the USDA-NIFA fellowship grant number 2020- 67034-31732, which funded PhD studies for KMC. This work was also funded by USDA-NIFA AFRI grant number 2014-67015-2240 as part of the joint USDA-NSF- 93 NIH-BBSRC-BSF Ecology and Evolution of Infectious Diseases Program. The authors would like to acknowledge the National Veterinary Service Laboratories at the United States Department of Agriculture for their extensive M. bovis sampling and sequencing efforts and would like to specifically acknowledge Dr. Suelee Robbe-Austerman, Dr. Kathy Orloski, Dr. Tyler Thacker, and Dr. Jason Lombard for helpful discussions related to the use of M. bovis pangenomics in outbreak investigation. 94 References 1. Boritsch EC, Khanna V, Pawlik A, Honore N, Navas VH et al. Key experimental evidence of chromosomal DNA transfer among selected tuberculosis- causing mycobacteria. Proc Natl Acad Sci U S A 2016;113(35):9876-9881. 2. Chiner-Oms Á, Sánchez-Busó L, Corander J, Gagneux S, Harris SR et al. Genomic determinants of speciation and spread of the Mycobacterium tuberculosis complex. Science Advances 2019;5. 3. Patane JS, Martins J, Beatriz Castelao A, Nishibe C, Montera L et al. Patterns and processes of Mycobacterium bovis evolution revealed by phylogenomic analyses. Genome Biol Evol 2017. 4. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 5. Reis AC, Cunha MV. The open pan-genome architecture and virulence landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 6. Reis AC, Cunha MV. Genome-wide estimation of recombination, mutation and positive selection enlightens diversification drivers of Mycobacterium bovis. Sci Rep 2021;11(1):18789. 7. Namouchi A, Didelot X, Schöck U, Gicquel B, Rocha EP. After the bottleneck: Genome-wide diversification of the Mycobacterium tuberculosis complex by mutation, recombination, and natural selection. Genome Res 2012;22(4):721-734. 8. Godfroid M, Dagan T, Kupczok A. Recombination Signal in Mycobacterium tuberculosis Stems from Reference-guided Assemblies and Alignment Artefacts. Genome Biol Evol 2018;10(8):1920-1926. 9. Almaw G, Mekonnen GA, Mihret A, Aseffa A, Taye H et al. Population structure and transmission of Mycobacterium bovis in Ethiopia. Microb Genom 2021;7(5). 10. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al. An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49- 59. 11. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United States, 1989-2018. Front Vet Sci 2018;5:253. 12. Rodrigues RA, Ribeiro Araújo F, Rivera Dávila AM, Etges RN, Parkhill J et al. Genomic and temporal analyses of Mycobacterium bovis in southern Brazil. Microb Genom 2021;7(5). 13. Zwyer M, Çavusoglu C, Ghielmetti G, Pacciarini ML, Scaltriti E et al. A new nomenclature for the livestock-associated Mycobacterium tuberculosis complex based on phylogenomics. Open Res Europe; 2021. 14. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics 2020;70(1):e102. 15. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 2014;9(11):e112963. 95 16. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 2013;29(8):1072-1075. 17. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25(7):1043-1055. 18. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014;30(14):2068-2069. 19. Malone KM, Farrell D, Stuber TP, Schubert OT, Aebersold R et al. Updated Reference Genome Sequence and Annotation of. Genome Announc 2017;5(14). 20. Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. Vcflib and tools for processing the VCF variant call format. bioRxiv 2021. 21. Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 2020;21(1):180. 22. Mi H, Ebert D, Muruganujan A, Mills C, Albou LP et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res 2021;49(D1):D394-D403. 23. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE et al. Comprehensive Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating Transposon Mutagenesis. mBio 2017;8(1). 24. Farrell D, Crispell J, Gordon SV. Updated functional annotation of the Mycobacterium bovis AF2122/97 reference genome. Access Microbiol 2020;2(7):acmi000129. 25. Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 2014;15(11):524. 26. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792-1797. 27. Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum- likelihood trees for large alignments. PLoS One 2010;5(3):e9490. 28. Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res 2015;43(3):e15. 29. Darling AE, Mau B, Perna NT. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 2010;5(6):e11147. 30. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 2021;49(W1):W293- W296. 31. Mostowy R, Croucher NJ, Andam CP, Corander J, Hanage WP et al. Efficient Inference of Recent and Ancestral Recombination within Bacterial Populations. Mol Biol Evol 2017;34(5):1167-1182. 32. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002;30(14):3059-3066. 96 33. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES et al. Integrative genomics viewer. Nat Biotechnol 2011;29(1):24-26. 34. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6(2):80-92. 35. Konishi T, Matsukuma S, Fuji H, Nakamura D, Satou N et al. Principal Component Analysis applied directly to Sequence Matrix. Sci Rep 2019;9(1):19297. 36. Bittinger K. abdiv: Alpha and Beta Diversity Measures. R package version 0.2.0., 2020. https://CRAN.R-project.org/package=abdiv. 37. Team R. RStudio: Integrated Development for R. Boston, MA; 2020. http://www.rstudio.com/. 38. Team RC. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2021. 39. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern Minnesota. PLoS One 2016;11(1):e0145735. 40. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 41. Gupta A, Alland D. Reversible gene silencing through frameshift indels and frameshift scars provide adaptive plasticity for Mycobacterium tuberculosis. Nat Commun 2021;12(1):4702. 42. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 2014;30(9):1312-1313. 43. Zimpel CK, Brandao PE, de Souza Filho AF, de Souza RF, Ikuta CY et al. Complete Genome Sequencing of Mycobacterium bovis SP38 and Comparative Genomics of Mycobacterium bovis and M. tuberculosis Strains. Front Microbiol 2017;8:2389. 44. Zakham F, Sironen T, Vapalahti O, Kant R. Pan and Core Genome Analysis of 183 Mycobacterium tuberculosis Strains Revealed a High Inter-Species Diversity among the Human Adapted Strains. Antibiotics (Basel) 2021;10(5). 45. Dar HA, Zaheer T, Ullah N, Bakhtiar SM, Zhang T et al. Pangenome Analysis of Mycobacterium tuberculosis Reveals Core-Drug Targets and Screening of Promising Lead Compounds for Drug Discovery. Antibiotics (Basel) 2020;9(11). 46. Yang T, Zhong J, Zhang J, Li C, Yu X et al. Pan-Genomic Study of Mycobacterium tuberculosis Reflecting the Primary/Secondary Genes, Generality/Individuality, and the Interconversion Through Copy Number Variations. Front Microbiol 2018;9:1886. 47. Meehan CJ, Goig GA, Kohl TA, Verboven L, Dippenaar A et al. Whole genome sequencing of Mycobacterium tuberculosis: current standards and open issues. Nat Rev Microbiol 2019;17(9):533-545. 97 48. Safi H, Gopal P, Lingaraju S, Ma S, Levine C et al. Phase variation in Mycobacterium tuberculosis glpK produces transiently heritable drug tolerance. Proc Natl Acad Sci U S A 2019;116(39):19665-19674. 49. Fan X, Abd Alla AA, Xie J. Distribution and function of prophage phiRv1 and phiRv2 among Mycobacterium tuberculosis complex. J Biomol Struct Dyn 2016;34(2):233-238. 50. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 2013;79(24):7696-7701. 51. Salzberg SL. Next-generation genome annotation: we still struggle to get it right. Genome Biol 2019;20(1):92. 52. Brockhurst MA, Harrison E, Hall JPJ, Richards T, McNally A et al. The Ecology and Evolution of Pangenomes. Curr Biol 2019;29(20):R1094-R1103. 53. Menardo F, Duchêne S, Brites D, Gagneux S. The molecular clock of Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 54. Bellerose MM, Baek SH, Huang CC, Moss CE, Koh EI et al. Common Variants in the Glycerol Kinase Gene Reduce Tuberculosis Drug Efficacy. mBio 2019;10(4). 55. Keating LA, Wheeler PR, Mansoor H, Inwald JK, Dale J et al. The pyruvate requirement of some members of the Mycobacterium tuberculosis complex is due to an inactive pyruvate kinase: implications for in vivo growth. Mol Microbiol 2005;56(1):163-174. 56. Vargas R, Farhat MR. Antibiotic treatment and selection for. Proc Natl Acad Sci U S A 2020;117(8):3910-3912. 57. Godfroid M, Dagan T, Merker M, Kohl TA, Diel R et al. Insertion and deletion evolution reflects antibiotics selection pressure in a Mycobacterium tuberculosis outbreak. PLoS Pathog 2020;16(9):e1008357. 58. Chattopadhyay S, Weissman SJ, Minin VN, Russo TA, Dykhuizen DE et al. High frequency of hotspot mutations in core genes of Escherichia coli due to short-term positive selection. Proc Natl Acad Sci U S A 2009;106(30):12412-12417. 59. Ceres KC, Stanhope MJ, Gröhn YT, Mycobacterium bovis pangenomics. Figshare 2022. https://figshare.com/s/c9665c733243e68326f7 98 Supplementary Materials Supplemental Figure 1 A. Distribution of contig counts and B. N50 among genomes used in this study. C. Percent completeness and percent contamination of each genome as detected by checkM. 99 Supplemental Figure 2. EsxI paralog is absent from reference genome AF2122.97 but present in M. tuberculosis H37Rv, M. bovis Mb3601 and 173 genomes in this study. Of the 173 genomes with the EsxI paralog present, 75 were found in La1.2, 63 in La1.3 and 43 in La1.4. 100 Supplemental Figure 3. The distribution of redundantly annotated accessory genes. The circular core phylogeny shown in Figure 5 is labeled with a gene presence absence matrix of redundant gene groups consisting of 6 or more genes labeled by mapped reference gene on the circumference of the phylogeny. Accessory genes are either found on the same genome such as genes mapped to PPE8, or are more diffusely present or absent. 101 Supplemental Figure 4. The distribution of glpK homopolymeric tract insertions. A. The distribution of 7C->8C insertions in glpK distributed on a circular representation of the core phylogeny shown in Figure 5. B. Variable homopolymeric tract lengths were found in closely related sequences from an outbreak in New Mexico labeled in teal, which occurred between 2007 and 2008. The homoplomyeric tract mutations appear to be transient. 102 Supplemental Figure 5. PCA of the Jaccard distance of the filtered gene presence absence matrix. In A. points are colored by Lineage, and in B. points are colored by the presence or absence of the phage PhiRv1 genes. The percent variation described by each principal component is shown on the x and y axes. 103 Supplemental Figure 5. PCA of the Jaccard distance of the filtered gene presence absence matrix. In A. points are colored by Lineage, and in B. points are colored by the presence or absence of the phage PhiRv1 genes. The percent variation described by each principal component is shown on the x and y axes. 104 Reference Previously Gubbins FastGEAR SNPs Event gene detected detected detected validate d narX Yes No Yes Yes 14 SNPs (18 monophyletic genomes) pks12 Yes Yes Yes Yes 17 SNPs (151 monophyletic genomes) pks12 Yes Yes Yes Yes 17 SNPs (4 monophyletic genomes espa T No Yes No Ends of contigs tatA Yes No No NA NA PE PGRS22 Yes No No NA NA rrs Yes No No NA NA PPE21 No Yes Yes Yes 5 SNPs (7 monophyletic genomes) esxN No Yes Yes Yes 5 SNPs (56 monophyletic genomes) ppsB No Yes No Yes 7 SNPs (2 monophyletic genomes) Supplemental Table 1. Recombination events identified by Gubbins or fastGEAR. SNPs were validated if they were detected both by a recombination detection program and were identified SNPs by the vSNP pipeline after quality control filtering. Previously detected refers to the recombination events detected in Reis & Cunha, Scientific Reports 2021. 105 A B 420 430 440 450 460 470 480 490 500 510 10 SNPs CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T89G6GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T89G5GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T90G4GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T75G8GGCCGACATCCAGGCCGACCCGCGGCGGCGCCGCCGCTACCAGCGCGCCCGCGGCAAGGGCGGGCTGGTCCGGGTCAG 1629 1650 CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T80G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T81G3GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG 10 SNPs CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T84G0GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T79G4GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617619 AACTCGGGCGATCTCAACACCGGATTGTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G5GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617476 AACTCGGGCGATCTCAACACCGGATTGTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G5GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617524 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T37G1GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G6GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617679 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G2GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617479 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C1T84G9GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617535 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G1GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617690 CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G4GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G0GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617651 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SRR7617463 AACTCGGGCAACATCAACACCGGCTTCTTCAACT CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G6GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG SYN CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T36G7GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG Variant effect NSYN CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T41G3GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG CACGTTTGGGTGATCCGGTGGSCRGRG17C9C2T34G8GGCCGACATTCAGGCGGATCCCGAGCGCAGACGCCGCTATCAACAGGCCCGCGGCAAGGGTGGGCTGGTCCGGGTCAG Variant effect SYNNSYN C D 10 20 62 93 1809 1834 GCTCATGGCGCCATGATCSRCRG7C8G51C3T16CAGGCGGCGTCGCTTGAGGCGGAGCATCAGGCCATC 50 SNPs GCTCATGGCGCCATGATCSRCRG1C6G27C8T28C0AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC GCTCATGGCGCCATGATCSRCRG1C6G27C8T25C1AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1657061 CTTCTCGTTGCACGACGTGCTGGCTAACGGCGAGGAGCTA GCTCATGGCGCCATGATCSRCRG7C8G51C3T17CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1791816 CTTCTCGTTGCACGACGTGCTGGCTAACGGCGAGGAGCTA GCTCATGGCGCCATGATCSRCRG7C8G51C3T81CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR6467887 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG7C8G51C3T65CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792438 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG7C8G51C3T47CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792343 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG5C4G86C0T79CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792330 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG5C4G85C7T29CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792329 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG1C7G92C4T85CAGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1791723 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG1C3G19C9T75C2AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792065 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA GCTCATGGCGCCATGATCSRCRG1C3G19C9T73C5AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SRR1792046 CTTCTCGCTGCGCGACGTGATCGCCACCGGCAAGGAGCTA 1G0C STNCPAsTGGCGCCATGATCSRCRG1C3G19C9T76C2AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC SYNGCTCATGGCGCCATGATCSRCRG1C3G19C9T75C8AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC Variant effect NSYN GCTCATGGCGCCATGATCSRCRG1C3G19C9T75C6AGGCCGGGTTGCTGGAGGCCGAGCATCAGGCCATC Variant effect SYNNSYN Supplemental Figure 7. Potential recombination events confirmed by fastGEAR in A. narX, B. PPE 21, C. esxN, and D. ppsB. A. The recombinant segment in narX contains a hotspot of 14 SSuNppPsle smhaerendt ably F1i8g suerqeu e7n.c ePso ftreonmt iaa sl inregcleo omutbbirneakt.i o Fno uerv SeNnPtss caroen nfoirnmsyendon byym ofuas taGndE rAesRul ti n A. nainrX th,r Bee. aPmPinEo 2ac1i,d Cch. aensgxeNs. ,B a. nInd P DPE. p2p1,s tBh.r eAe m. Tuthaeti ornesc womereb ninonasnytn soengymoeunst. Cin. Tnwaor Xou ct onf tains a hthoet fsipvoe tm ouft a1ti4on Ss NwePrse snhonasryendo nbyym 1o8us s, eanqdu ienn Dce. sfi vfer oomut oaf sthine gselev eonu mtburtaetaiokn.s wFeorue r SNPs are nonnosnysynnoonnyymmoouus.s and result in three amino acid changes. B. In PPE 21, three mutations we re nonsynonymous. C. Two out of the five mutations were nonsynonymous, and in D. five out of the seven mutations were nonsynonymous. .Arg154Gln Ala22Gl .y Arg1 S 5e 6r2 S3 eL reu .Arg161His Arg605H Ai ss p544Asn Leu545Ile Ile608L I ele u608Met Thr610Asn Leu L 5y 4s 96 P1 h2 eGlu 106 Supplemental Figure 8. SNP hotspot in pks12 is not verified because of low mapping quality. The distribution of pks12 SNPs is shown in A. SNPs are shown in B and cluster in lineage La1.8. The location of the SNPs in pks12 is shown in C (arrow). SNPs occur at the 3’ end of pks12. pks12 generally had poor mapping quality in our sample, and the SNP hotspot was always located at the end of a contig. 107 CHAPTER 4 Characterizing the effects of drift, progeny skew and population dynamics on M. bovis within host evolution 108 Abstract Rapid evolution can lead to evolutionary trajectories that differ from those predicted using standard models of molecular evolution based on the Kingman’s coalescent. Since molecular epidemiology relies on population genetic theory to link pathogen genetic data to epidemiological inference, thorough understanding of a pathogen’s evolutionary biology in relation to basic model assumptions is necessary to ensure models are used appropriately. Here we study the within host evolution of Mycobacterium bovis, the cause of bovine tuberculosis, using the empirical distribution mutations that arise de novo within an animal host and forward genetic simulation to test hypotheses for evolutionary mechanisms within the host. We found that the empirical distribution of mutations does not deviate from a null expectation of random mutation, suggesting drift, not selection, is the primary evolutionary force driving mutation during infection in animal hosts. We also found that a within host evolutionary scenario involving skewed offspring distributions and panmictic within host populations and a rapid mutation rate was best supported by the data. The diversity generating effect of faster mutation rate is counteracted by the diversity reducing effect of large variance in offspring distributions, where a minority of individuals produces offspring in the next generation. Our simulation framework can be applied to other aspects of M. bovis molecular epidemiology, including transmission analysis, and studying the evolution of important phenotypes like antimicrobial resistance. 109 Introduction Pathogens can evolve rapidly under the right conditions, and this rapid evolution can produce evolutionary trajectories that significantly deviate from standard models that assume relatively slow evolution. One aspect of rapid evolution is that offspring distributions may be highly skewed, with fit individuals producing more offspring than less fit individuals. This variance in reproductive success may be driven by selection induced by antimicrobial treatment or host immune response [1], and by serial population bottlenecks [2, 3]. The Mycobacterium tuberculosis complex (MTBC), including M. tuberculosis, the cause of tuberculosis in humans, and M. bovis, the cause of bovine tuberculosis, is one example of a pathogen system that’s evolutionary trajectory is shaped by serial population bottlenecks associated with transmission events. Additionally, successful treatment for tuberculosis requires months of antimicrobial therapy, and substandard treatment [4] or low patient adherence [5] can create strong selective forces for antimicrobial resistance to rapidly evolve. Standard models of evolution based on the Kingman’s coalescent [6] that assume narrow offspring distributions may not appropriately model the clonal evolution of MTBC species dominated by clonal reproduction, frequent population bottlenecks, and potential differences in reproductive success [7]. Recent studies have found multiple-merger coalescent models [2] and explicit modeling of skewed offspring distributions [8, 9] describe M. tuberculosis evolution better than standard Kingman’s coalescent based models. Additionally, since the MTBC is highly clonal with little to no recombination, the dynamics of neutral mutations in a population may not depend 110 only on neutral drift, but on genetic draft due to linkage with any site under selection in the genome [10]. This genetic hitchhiking may be particularly important in understanding the dynamics of mutations linked to resistance-conferring mutations, which can sweep to high population frequencies during anti-tuberculosis drug treatment [11]. These theoretical deviations of MTBC evolution from standard coalescent models highlight the importance of understanding the biology of the host-pathogen system when applying epidemiological models, but also demonstrate the potential increased utility of genomic data, once appropriate models are described. For example, using standard coalescent models, the evolutionary rate of M. tuberculosis lineages has been estimated to between 1e-10 and 1e-7 substitutions per site per year [12] and between 1e-8 and 1e-7 substitutions per site per year in M. bovis lineages [12, 13] or between about 0.04 to 0.4 substitutions per genome per year assuming a genome length of about 4 Mbp. The basal mutation rate of an organism determines the amount of genetic variation produced in a population but does not perfectly correlate with estimates of evolutionary rate [14]. After mutations arise, positive and negative selection and population bottlenecks caused by transmission events, alter the distribution of mutations in a population. The combination of these extrinsic effects and intrinsic mutation rate produces the observed differences among a sample of genomes. In a simulation framework that relaxes the assumptions of narrow offspring distributions, Morales-Arce et al found that the combination of a relatively fast mutation rate of 1e-7 mutations per site per generation, or about 40 mutations per genome per 100 generations, and the genetic diversity reducing effect of skewed progeny 111 distributions, where a subset of individuals produced more offspring than others, produced the numbers of segregating sites commonly observed between serial patient samples [8]. This higher basal mutation rate could have important implications for the potential within host adaptation that is possible at a given timescale, including the probability of de novo antimicrobial resistance mutation evolution. Here we study the within host microevolution of three geographically distinct M. bovis outbreaks using a combination of empirical analysis and forward genetic simulation. We describe the empirical distribution of mutations in functional categories and use simulation to estimate the de novo mutation rate, infection population bottleneck size, and the degree of progeny skew. We find that in the context of within host evolution, drift, not selection, appears to be the driving force determining the distribution of mutations. We also find that when considering both fixed and variable SNPs instead of fixed SNPs alone and the diversity reducing effect of clonal progeny skew, the underlying mutation rate is much faster than previous estimates suggest, mirroring a recent finding in human M. tuberculosis infections [8]. Our findings can be used to develop epidemiological frameworks that better incorporate all variant information available in population sequencing samples and may aid in developing higher resolution transmission networks in outbreak analysis. Methods Sequence selection and variant calling Whole genome sequences from Glaser et al.[15], Rodrigues et al. [16], and Almaw et al. [17] were selected because each study included samples from multiple tissue types 112 within individual hosts. Sequence reads were downloaded from the NCBI Sequence Read Archive. SNPs were called using vSNP pipeline (https://github.com/USDA- VS/vSNP, lofreq [18] and VarScan 2 [19] using AF2122/97 as a reference (NCBI accession: NC_002945v4). The bam files produced from step 1 of the vSNP pipeline were used to call SNPs using LoFreq and VarScan 2. We retained variants if they were called by both LoFreq and VarScan 2, and had at least 10 reads supporting them, a coverage of at least 50x, a minimum Phred score of 20, and a minimum allele frequency of 0.5 %. SNPs in repetitive regions of the genome including PE/PPE genes mobile genetic elements were also removed. Variants were labeled as variable SNPs (vSNPs) if the allele frequency was between 5% and 90%. SNPs were considered fixed (fSNPs), if the allele frequency was greater than or equal to 90%. Population structure analysis A phylogenetic tree was created using IQtree2 [20] using the GTR+G substitution model. The tree was rooted at the AF2122/97 reference. RhierBAPS was used to identify up to five levels phylogenetic clusters [21]. Similar to the criteria used in Moreno-Molina et al. [22], an infection was determined to be polyclonal if SNPs clustered into different RhierBAPS phylogenetic clusters, or if there were 30 or more SNPs different between samples. The correlation in allele frequencies between distinct samples in the same host was also calculated and plotted using the R package corrplot [23]. Pairs of samples were predicted to have highly correlated allele frequencies if they were derived from the same source infection population. Samples from polyclonal infections were predicted to have less correlated allele frequencies. 113 Characterization mutation effects The functional effects of SNPs (synonymous or nonsynonymous) were predicted using snpEff [24] with a custom reference database created with the same reference genome used to call SNPs. Genes were categorized into nonessential and essential groups based on characterization of gene essentiality in M. tuberculosis. First, genes were labeled as essential or non-essential using labels developed by a random saturating transposon mutagenesis study in H37Rv M. tuberculosis [25]. Genes were further were further characterized into distinct functional categories (antibiotic resistance, antigens, mobile genetic elements, other) using the classification used in Vargas et al. for classifying M. tuberculosis genes [26]. Genes in the “other” category contained essential and nonessential genes that were not otherwise categorized as antibiotic resistance genes, antigens, or mobile genetic elements. The gene coordinates of the H37Rv reference gene were translated to coordinates in M. bovis AF2122/97 using the annotations provided by Farrell et al. [27]. Genotypic resistance was predicted using TB-Profiler [28]. Mutational density test To identify genes with significant mutation burden, we used a mutational density test similar to the mutational density test described in Vargas et al.[26]. We applied the mutational density test to individual genes, and to gene categories, to determine if any gene or gene category contained more mutations than expected under the null model, where any location in the genome was equally likely to be mutated. The mutational density test assumes the number of SNPs, N, that occur across all samples in gene j, is modeled as a Poisson random variable 𝑁#~𝑃𝑜𝑖𝑠(𝜆). For each gene, 𝑁# was compared to 114 a null rate 𝑁5 was assumed to be the mean number of SNPs across all genes analyzed, weighted by gene length 𝑔5: M 𝑔 𝜆5 = ab𝑁5cd 5 e ∑M 5:3 5 𝑔5 A gene was determined to be significantly enriched mutations if the p value (𝑃(𝑁# > 𝑁5)) was less than a Bonferroni corrected threshold (p < 0.05 / n_genes), corrected for the number of genes analyzed (n_genes = 3,758). When applying the mutational density test to gene categories instead of individual genes, genes were classified into functional categories described in Section 2.3, and 𝑔5 represented the total length of all genes in category i. The p value threshold was adjusted for the six categories tested (p < 0.5 / 6). Forward genetic simulation A forward genetic agent-based model was developed in SLiM 3 v 3.7.1 [29] to model the evolution of M. bovis genomes inside animal hosts. The goal of the simulation analysis was to estimate the best fitting parameter values for critical parameters that describe the within host evolution process: mutation rate 𝜇, infection bottleneck size 𝑏, and degree of within clonal progeny skew. Model structure We created four different simulation structures with different combinations of parameters. A visual description of the simulation framework is shown in Figure 11 and parameter combinations are shown in Table 4. 115 A C No progeny skew Progeny skew μ B Gi Gi+1 Gi Gi+1 B D λ p μ 0.75 0.1 p p 0.50 0.5B 1.0 p 0.25 0.00 0 2 4 Number of offspring Figure 11. Within host population simulation schematic. Four model structures were created. Models varied in the presence or absence of within host population structure (A-B) and the presence or absence of skewed offspring distributions (C). For models with progeny skew, at each generation the number of offspring produced by an individual was drawn from a Poisson distribution. Example distributions in the range of parameter values, 𝜆, explored is shown in D. Table 4. Components of each model structure tested Model Number Progeny skew Population structure 1 No No 2 Yes No 3 No Yes 4 Yes Yes Simulation details The generic simulation framework is described below according to the agent-based model “PARTE” framework [30]. Density 116 Properties: The agents modeled are M. bovis bacteria. Each individual has a haploid genome with length 4.3e6 base pairs which can acquire neutral, or deleterious with mutation rate 𝜇. Mutations in 2/3 of the genome are considered nonsynonymous, and nonsynonymous mutations have a gamma distribution of fitness effects with mean of - 0.3 and shape parameter of 0.2, making nonsynonymous mutations weakly deleterious. Synonymous mutations have a fixed fitness effect of zero (no effect). The recombination rate is set to zero, to simulate strictly clonal evolution. Actions: Throughout the course of simulation, individuals may reproduce, die or in the structured simulations, migrate to a new subpopulation. Rules: Each simulated infection begins with 10,000 generations of neutral burn-in. At generation 10,001, a population bottleneck occurs, which reduces the population size to the value 𝑏, to simulate the infection bottleneck. In the structured simulations, at the beginning of simulation, a predefined number of empty subpopulations are created. These subpopulations are the in silico representation of potential infection sites within the host. During the simulation, individual bacteria from the source subpopulation can migrate to empty subpopulations, and probability of migration at each generation is determined by the value of 𝑝. In the unstructured simulations, one single subpopulation is simulated. Within a subpopulation, fitness of individuals is proportion to population size, whereas if the population size reaches the carrying capacity, or the maximum number of individuals that can live within a host, the fitness of all individuals in the subpopulation decreases. The carrying capacity of the subpopulation is determined by 117 the within host population size parameter, 𝐾. In the simulations with clonal progeny skew, at each generation, the number of offspring generated by a single individual is drawn from a Poisson distribution with parameter 𝜆. If 𝜆 < 1, on average some individuals will produce 0 offspring, whereas others will reproduce 1 or more. Time: Time is measured in M. bovis generations, with new reproduction events happening at each tick. Each simulation was run for 365 generations after burn-in, assuming a generation time of 24 hours. Environment: Infection sites within the host were represented by a set of subpopulations which within SLiM is a potential space where M. bovis individuals may reside. Parameter values Each model contained parameters that remained constant between simulations, and parameters that were estimated. When comparing model structures, the same model parameters were used for each structure. Prior distributions for each parameter are shown in Table 5. 118 Table 5. Prior distributions for simulation parameters Parameter Definition Distribution 𝜇 Mutation rate 𝑈𝑛𝑖𝑓~(1𝑒*N, 1𝑒*O) 𝐵 Bottleneck size 𝑈𝑛𝑖𝑓~(10, 1𝑒L) 𝑝 Probability of subpopulation formation 𝑈𝑛𝑖𝑓~(0, 1𝑒*L) 𝜆 Parameter of Poisson distribution 𝑈𝑛𝑖𝑓~(0.1, 1) controlling offspring count Fixed model parameters are shown in Table 6, and parameter values used in model comparison and model fit analyses are shown in Table 7. Table 6. Fixed model parameter values. Parameter Definition Value K0 Initial population size 1000 𝐾3& Post-bottleneck carrying capacity per 500 subpopulation, structured model 𝐾3$& Post-bottleneck carrying capacity, non- 1000 structured model 𝜌 Recombination rate 0 Table 7. Parameter values used in model assessment. Parameter Definition Value 𝜇& Mutation rate 2.5𝑒*N 𝐵& Bottleneck size 100 𝑝& Probability of subpopulation formation 0.05 𝜆& Parameter of Poisson distribution 0.5 controlling offspring count Model selection and parameter The goodness of fit of each simulation structure, and the pairwise comparative fit of all models was determined was determined using the R package ABC [31]. First, for each model structure 1000 simulations were run using parameter values in Table 7. The mean number of neutral mutations, mean number of deleterious mutations, and the mean 119 number of mutation differences between two samples of 10 genomes was recorded at each simulation. These summary statistics were compared to the mean number of synonymous and nonsynonymous mutations, and the mean number of differences between pairs of samples from the same host in the empirical data. The goodness of fit of each model was assessed using the gfit function in ABC. The posterior probabilities of each model scenario were calculated using the postpr function and the "mnlogistic” method in ABC, and models were compared Bayes Factors. A log10 Bayes Factor value greater than 1 indicated strong support for one model compared to the other. Parameter values were estimated for the best fitting model using Markov Chain Monte Carlo- approximate Bayesian computation (MCMC-ABC) using the Wegmann method [32] in the R package EasyABC [33]. Results Phylogeny and population structure 159 M. bovis whole genome sequences were analyzed, consisting of 62 samples from Glaser et al, 29 from Brazil study, and 68 from Ethiopia, which will now be referred to as the US, Brazil, and Ethiopian samples, respectively. The US samples were from a monophyletic outbreak in Minnesota in cattle and white-tailed deer (Odocoileus virginianus), whereas the Brazil and Ethiopia datasets consisted of multiple different outbreaks in different geographic regions in cattle hosts. The phylogenetic tree of all samples included in the study labeled by study set and RhierBAPS phylogenetic cluster is shown in Figure 12. 120 Figure 12. Phylogeny of study samples colored by geographic region and three levels of RhierBAPS phylogenetic clustering. 121 Single infections vs polyclonal infections 10, 13, and 26 cattle from the US, Brazil, and Ethiopia datasets, respectively, had more than one sample from a single host. To determine whether a cow was infected by a single clone or multiple clones, each set of samples from a given cow was evaluated by calculating a correlation matrix of allele frequencies between samples, phylogenetic clustering and a between-sample SNP cutoff. An example correlation matrix from polyclonal and single infections is shown in Figure 13. The correlation matrices comparing all samples within a group is shown in Supplemental Figure 1. 31 samples were determined to be polyclonal (1 in the US dataset, 11 in the Brazil dataset and 19 in the Ethiopia dataset). Only single source infections were used in subsequent analyses. 122 Figure 13. Distinguishing polyclonal from single source infections using allele frequencies. Since single source infections began from just population, allele frequencies of mutations derived from the infection source tend to be correlated (A). In polyclonal infections, allele frequencies between distinct populations within the host are less correlated (B). Distribution of mutation types The distribution of mutation types and mutational density was evaluated among de novo mutations that arose during single source infections. De novo mutations were defined as mutations present in a single individual host, but absent from all other hosts within a phylogenetic cluster. Sixty percent of all de novo mutations were nonsynonymous mutations and 73% of intragenic mutations occurred nonessential genes, which is consistent with random mutation in a highly gene-dense genome; however, both synonymous and nonsynonymous mutations also occurred in essential genes and 123 antigens. Mutations in genes known to be associated with antimicrobial resistance are shown in Table 8. Table 8. Within host mutations in genes associated with antimicrobial resistance Gene Nucleotide Amino acid Effect Essential Fixed Associated change change with resistance embB G1486GA Asp496Asn NS ES Yes no gyrB C774T Val258Val S ES Yes no iniA T771C Val257Val S NES Yes no rpoB G1723C Val575Leu NS ES No no alr C221T Ala74Val NS ES Yes no Effect: S=synonymous mutation, NS=nonsynonymous mutation; Essential: ES=essential gene, NES=nonessential gene Five de novo mutations occurred in genes associated with antibiotic resistance; however, no mutations occurred at positions known to cause antibiotic resistance. Therefore, these mutations may have no effect on antibiotic resistance phenotypes. Mutational density test The mutational density test was applied to genes individually, and to gene categories to test whether the distribution of de novo mutations deviates from the null distribution. Deviation from the null distribution suggests regions of the genome are under positive or negative selection. One gene, pcaA had more mutations than expected (p=4.2e-7). This gene contained four different vSNPs; however, all mutations were found in a single genome in one host animal, J_47. If this animal was removed, no genes had mutation counts different from the null distribution. The distribution of de novo SNP counts among gene categories is shown in Table 9. 124 Table 9. Distribution of de novo SNPs among gene categories Gene category Nj Ni p value p<0.008 Antigen, ES 3 3.7 0.21 No Antigen, NES 18 15.8 0.08 No AMR, ES 3 2.1 0.19 No AMR, NES 1 1.0 0.37 No Other, ES 50 52.5 0.05 No Other, NES 167 166.9 0.03 No ES= essential; NES = nonessential; AMR = antimicrobial resistance, Ni and Nj = expected and observed number of de novo mutations, respectively. No gene category was found to have mutation densities different than expected under the null model of random mutation over the length of the genome. Simulation results Model selection All four model structures fit are supported by the data (Goodness-of-fit p value > 0.5, Supplemental Figure 2). The relative support of model structures given the summary statistics of the observed de novo mutations is shown in Figure 14. 125 0.00 −0.72 0.71 −0.21 1.50 1.20 0.90 0.72 0.00 1.42 0.51 0.60 0.30 0.00 −0.30 −0.71 −1.42 0.00 −0.91 −0.60 −0.90 −1.20 0.21 −0.51 0.91 0.00 −1.50 1 2 3 4 Model structure B Figure 14. Model comparison using log10 Bayes factors. Bayes factors comparing models A and B were calculated by comparing the posterior probability of model A to that of model B. Model 2 was more strongly supported by the data than any of the other models. Generally, models with clonal progeny skew had more support than those without (Models 2 and 4 compared to 1 and 3). Model 2 had moderately more support by the data than model 4, and model 2 had fewer parameters than model 4, so model 2 was chosen as the best fitting model. Posterior distributions for fitted parameter values of Model 2 are shown in Figure 15. Model structure A 4 3 2 1 126 Figure 15. Model parameter estimates calculated with Approximate Bayesian Computation. Bivariate density plots are shown for pairs of parameters in A. Lighter colors indicate higher density. Marginal density plots for each parameter values are shown in B. 127 The mutation rate 𝜇 was estimated to be about 5e-9 mutations per site per generation, or about 2 mutations per genome 100 generations. The posterior distributions of the bottleneck population size and Poisson offspring distribution parameter, 𝜆 were more diffuse. Population bottleneck sizes between 250 and 750 individuals were plausible, as were lambda values around 0.4 and 0.8. Discussion Selection vs drift By studying the mutations that arise de novo within hosts during M. bovis infection, we found that M. bovis within host evolution appears to be driven by drift. Most de novo mutations were in nonsynonymous genes, and may affect protein function, however, the observed proportion nonsynonymous mutations was roughly equal to what would be expected for random mutation on a highly gene-dense genome (60% nonsynonymous). Although we considered M. bovis samples from three distinct geographic regions and two host species (white-tailed deer and cattle), the distribution of mutations in individual genes and in gene categories what would be expected if each segment of the genome were equally likely to be mutated. Even in gene categories that are potential targets for positive selection, like antigens and genes associated with antimicrobial resistance, we did not find evidence of mutational patterns associated with selection. Instead, in the context of within host evolution over short time scales, the distribution of mutations appears to be driven by drift. 128 The MTBC contains six human adapted lineages, and four animal adapted lineages [34]. The most recent common ancestor to M. bovis and all animal adapted MTBC lineages, was likely a human pathogen, and human to cattle transmission may have been related to the domestication of cattle [34]. Using molecular dating analyses, Loiseau et al. dated the most recent common ancestor of M. bovis to between about 1000-2000 years ago, which suggests M. bovis has been adapted to cattle hosts for at least a millennium [13]. Additionally, when comparing pairwise nucleotide diversity between human and animal adapted M. bovis lineages, Brites et al found that, on average, there was a similar number pairwise differences between strains from two distinct human linages as between lineages adapted to two different animal host species, suggesting that the genetic diversity in animal adapted strains is not well correlated with the genetic distance between host species [34]. From this finding, Brites et al. suggest that host factors may not define host tropism in animal adapted MTBC lineages, and instead infection may depend on contact and infection opportunity [34]. Although it is expected that animal adapted strains would have less nucleotide diversity than human adapted strains if the ancestor of the MTBC complex was human adapted, and the animal adapted strains went through a population bottleneck, reducing genetic diversity, the animal adapted strains have a surprisingly wide host range. Since M. bovis is well adapted to cattle hosts, it is possible that the effect of ongoing adaptation is not measurable in short time scales within individual hosts. Additionally, cattle are not treated for bovine tuberculosis using human important drugs like isoniazid and rifampin, so there is possibly no strong selection pressure for M. bovis to rapidly evolve during a single infection. 129 Apart from selection pressures driven by antimicrobial use or recent host changes, different conditions in the infection microenvironment may influence mutation rates and distributions. A recent study in human tuberculosis patients found a higher number of mutations in surgical granulomas compared to sputum samples and suggested that increased reactive oxygen species (ROS) activity within granulomas may be linked to increased mutation rates, with a bias towards increased transitions [22]. In our study, all samples were from granulomatous tissue; however, among samples with three or more vSNPs of fSNPs, the transition to transversion ratio (ti:tv) ranged from 0.5 to 5.1 (Supplementary Table 1). Additionally, samples from one animal, J_47 had the largest number of mutations (133 fSNPs and 39 vSNPs), and a mutational bias toward transitions (ti:tv = 1.8). It is possible that this individual was infected for a longer period than most other animals but could also be possible that this individual experienced a higher mutation rate induced by ROS. When excluding mutations found in J_49 samples from the mutational density test, no genes were found to be significantly enriched in de novo mutations. If we were to compare M. bovis samples within and outside granulomas we may observe differences in mutation profiles. Simulation results Despite finding no clear evidence of positive selection, the best fitting models describing within host evolution included skewed offspring distributions. It is possible that negative selection creates skewed offspring distributions in the host, where individuals with more deleterious mutations have reduced reproductive success compared to individuals with relatively neutral mutations. The mutation rate was 130 estimated to be about 5e-9 mutations per site per generation or about 2 mutations per genome per 100 generations. This mutation rate estimate is about a factor 10 faster than the mutation rate estimated by Morales-Arce et al. [8], although we found the same trend that the diversity reducing effect of skewed offspring distributions counteracts the diversity generating mechanism of a relatively fast mutation rate. Morales-Arce et al also used a different method of modeling progeny distributions, where instead of drawing the number of offspring from a Poisson distribution in a SLiM reproduction callback in a non-Wright Fisher modeling framework, they modeled variance in reproductive success using offspring distribution using two population migration model in a Wright-Fisher framework, so differences in mutation rates estimated here and in Morales-Arce et al may be driven by this methodological difference. We were not able to find clear estimates for bottleneck size or the offspring distribution parameter 𝜆, however, the parameter value combinations with the highest support included relatively small bottleneck sizes of about 100 individuals. Additional summary statistics are likely needed to resolve true parameter values. Implications on the evolution of antimicrobial resistance Our finding that drift dominates M. bovis within host evolution in the context of well adapted host species is significant because it highlights the utility of M. bovis in animal hosts as a model for evolution of MTBC in the absence of the selective pressure of antimicrobial use. Nonsynonymous mutations occurred in genes associated with antimicrobial resistance in M. tuberculosis, however, no known resistance-causing mutations were identified. Mutations in three AMR-associated genes were fixed within- 131 host, implying the mutation was not deleterious enough to prevent growth in the host. Since mutations evolve in genes associated with antimicrobial resistance at approximately the same rate as genes in other functional categories, it is possible that a combination of drift and population dynamic changes that occur during the course of infection provide the necessary genetic diversity in a population that selection can act upon, when a human tuberculosis patient is treated with antimicrobials. M. bovis animal infections then may provide a useful model for studying the likelihood of developing resistance conferring mutations and could be used to describe the basic evolutionary processes that influence the trajectory of resistance mutations. Implications for epidemiological analysis Additionally, although it is common to sequence a whole population of bacteria from a lesion, the distribution of allele frequencies is often ignored epidemiological studies, or variants with low allele frequency are ignored, meaning only fSNPs are analyzed. vSNPs that are at low allele frequency are often discarded from population samples in epidemiological analysis and phylogenies because each tip in a phylogenetic tree is assumed to represent one individual, not a population. We found that using SNP callers designed to accurately identify vSNPs, we could characterize not only the variants present in a single consensus genome, but also the distribution of alleles in populations represented in a whole genome sequence [35], which helped us distinguish single source infections from polyclonal infections, and were necessary to generate a more comprehensive mutation rate estimate. Using the flexible simulation structure SLiM provides, we could include both vSNPs and fSNPs into our simulation and parameter 132 inference framework. Especially for organisms like M. bovis with relatively slow substitution rates compared to other pathogens, analytical frameworks that allow full use of the information present in a whole genome sequence could provide a basis for developing finer-scale transmission networks or even inferring the time of coalescent events with increased resolution. Limitations As we were only interested in de novo mutations that occurred over short time scales, our analysis may not provide a complete description of contemporary M. bovis evolution. Selection may be an important factor in other contexts, such as adaptation to new environments such as new host species, geographic regions, or soil conditions. We also did not explicitly test for selection because of the low genetic diversity between gene sequences. A larger sample set from a broader genetic background would be more appropriate for selection analysis; however, the since the goal of our study was to focus specifically on the evolution of mutations that occur inside a host, we were restricted to study samples where within host genetic diversity could be assessed, that is, multiple samples taken from the same host. Another limitation of our study was the cross- sectional design, which prevented studying empirical mutational dynamics over time. Since there are strict regulations involving culling infected animals upon discovery in some countries, a large scale longitudinal analysis of within host M. bovis dynamics in animal hosts would be very difficult to accomplish [36]. If longitudinal data became available, the simulation structure could be easily adapted to incorporate temporal signal into the parameter estimation framework. 133 Conclusions Over the relatively short time period of the length of a single infection in cattle and white-tailed deer hosts, M. bovis evolution appears to be driven by drift. Nonsynonymous mutations were common, even in essential genes; however, the number of mutations in different functional categories did not differ from a null model of random mutation along the length of the genome. The forward genetic simulation framework described here allowed us to test hypotheses for within host dynamics and demonstrated the combined effect of skewed offspring distributions and population bottlenecks on the observed mutations between individuals. This simulation framework could provide useful insights in understanding the pathogen’s epidemiology and the evolution of important phenotypes like antimicrobial resistance. 134 References 1. Wheatley R, Diaz Caballero J, Kapel N, de Winter FHR, Jangir P et al. Rapid evolution and host immunity drive the rise and fall of carbapenem resistance during an acute Pseudomonas aeruginosa infection. Nat Commun 2021;12(1):2460. 2. Menardo F, Gagneux S, Freund F. Multiple Merger Genealogies in Outbreaks of Mycobacterium tuberculosis. Molecular Biology and Evolution 2020;38(1):290-306. 3. Irwin KK, Laurent S, Matuszewski S, Vuilleumier S, Ormond L et al. On the importance of skewed offspring distributions and background selection in virus population genetics. Heredity (Edinb) 2016;117(6):393-399. 4. Bate R, Jensen P, Hess K, Mooney L, Milligan J. Substandard and falsified anti-tuberculosis drugs: a preliminary field analysis. Int J Tuberc Lung Dis 2013;17(3):308-311. 5. Nellums LB, Rustage K, Hargreaves S, Friedland JS. Multidrug-resistant tuberculosis treatment adherence in migrants: a systematic review and meta-analysis. BMC Med 2018;16(1):27. 6. Kingman JFC. The coalescent. Stochastic Processes and their Applications 1982;13(3):235-248. 7. Eldon B, Wakeley J. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics 2006;172(4):2621-2633. 8. Morales-Arce AY, Harris RB, Stone AC, Jensen JD. Evaluating the contributions of purifying selection and progeny-skew in dictating within-host Mycobacterium tuberculosis evolution. Evolution 2020;74(5):992-1001. 9. Sabin S, Morales-Arce AY, Pfeifer SP, Jensen JD. The impact of frequently neglected model violations on bacterial recombination rate estimation: a case study in Mycobacterium canettii and Mycobacterium tuberculosis. G3 (Bethesda) 2022. 10. Messer PW, Ellner SP, Hairston NG. Can Population Genetics Adapt to Rapid Evolution? Trends Genet 2016;32(7):408-418. 11. Trauner A, Liu Q, Via LE, Liu X, Ruan X et al. The within-host population dynamics of Mycobacterium tuberculosis vary with treatment efficacy. Genome Biol 2017;18(1):71. 12. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 13. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al. An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49- 59. 14. Peck KM, Lauring AS. Complexities of Viral Mutation Rates. J Virol 2018;92(14). 15. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern Minnesota. PLoS One 2016;11(1):e0145735. 16. Rodrigues RA, Ribeiro Araújo F, Rivera Dávila AM, Etges RN, Parkhill J et al. Genomic and temporal analyses of Mycobacterium bovis in southern Brazil. Microb Genom 2021;7(5). 135 17. Almaw G, Mekonnen GA, Mihret A, Aseffa A, Taye H et al. Population structure and transmission of Mycobacterium bovis in Ethiopia. Microb Genom 2021;7(5). 18. Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH et al. LoFreq: a sequence- quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res 2012;40(22):11189- 11201. 19. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22(3):568-576. 20. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 21. Tonkin-Hill G, Lees JA, Bentley SD, Frost SDW, Corander J. RhierBAPS: An R implementation of the population clustering algorithm hierBAPS. Wellcome Open Res 2018;3:93. 22. Moreno-Molina M, Shubladze N, Khurtsilava I, Avaliani Z, Bablishvili N et al. Genomic analyses of Mycobacterium tuberculosis from human lung resections reveal a high frequency of polyclonal infections. Nat Commun 2021;12(1):2716. 23. Wei T, Simko V. R package 'corrplot': Visualization of a Correlation Matrix. 2021. 24. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6(2):80-92. 25. DeJesus MA, Gerrick ER, Xu W, Park SW, Long JE et al. Comprehensive Essentiality Analysis of the Mycobacterium tuberculosis Genome via Saturating Transposon Mutagenesis. mBio 2017;8(1). 26. Vargas R, Freschi L, Marin M, Epperson LE, Smith M et al. In-host population dynamics of. Elife 2021;10. 27. Farrell D, Crispell J, Gordon SV. Updated functional annotation of the Mycobacterium bovis AF2122/97 reference genome. Access Microbiol 2020;2(7):acmi000129. 28. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. Integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 29. Haller BC, Messer PW. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol Biol Evol 2019;36(3):632-637. 30. Regulation CotAoA-BMtITP, Practice BoPHaPH, Medicine Io. Assessing the Use of Agent-Based Models for Tobacco Regulation. 2015. 31. Csilléry K, François O, Blum MGB. abc: an R package for approximate Bayesian computation (ABC). Methods in ecology and evolution 2012;3(3):475-479. 32. Wegmann D, Leuenberger C, Excoffier L. Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics 2009;182(4):1207-1218. 136 33. Jabot F, Faure T, Dumoulin N. Easy ABC: performing efficient approximate B ayesian computation sampling schemes using R. Methods in Ecology and Evolution 2013;4(7):684-687. 34. Brites D, Loiseau C, Menardo F, Borrell S, Boniotti MB et al. A New Phylogenetic Framework for the Animal-Adapted Mycobacterium tuberculosis Complex. Front Microbiol 2018;9:2820. 35. Goossens SN, Heupink TH, De Vos E, Dippenaar A, De Vos M et al. Detection of minor variants in Mycobacterium tuberculosis whole genome sequencing data. Brief Bioinform 2022;23(1). 36. Orloski K, Robbe-Austerman S, Stuber T, Hench B, Schoenbaum M. Whole Genome Sequencing of Mycobacterium bovis Isolated From Livestock in the United States, 1989-2018. Front Vet Sci 2018;5:253. 137 Supplementary materials United States Brazil 1 A_1_156 1 B_2_0 A_1_156 B_2_0 A_8_162 0.8 C_1_44 0.8 A_8_162C_14_166 C_1_44 C_14_166 0.6 Deer_14_3 C_16_362 0.6 C_16_362 Deer_14_3 0.4 C_18_119 0.4 Deer_2_0 C_18_119 Deer_2_0 C_19_455 0.2 C_19_455 0.2 Deer_24_5 C_19_455 Deer_24_5 C_20_434 0 C_20_434 0 Deer_25_4 C_21_327 Deer_25_4 C_21_327 −0.2 C_21_327 −0.2 Deer_26_1 E_30_254 Deer_26_1 E_30_254 −0.4 F_33_261 −0.4Deer_8_0 F_33_261 Deer_8_0 F_35_265 −0.6 E_1_2 F_35_265 −0.6F_35_265 E_1_2 −0.8 I_43_10 J_1_1 I_43_10 −0.8 J_1_1 J_47_3 −1 J_47_3 −1 Ethiopia AA_1C0001_0 1 1 AA_1C0001_0 SA_2C0007_0 AA_1C0003_1 SA_2C0007_0 AA_1C0003_1 0.8 SA_2C0088_1 AA_1C0003_1 SA_2C0088_1 0.8 B_agr−056_569 SA_2C0217_569 B_agr−056_569 SA_2C0217_569 E_tse−036_2 0.6 SA_2C0217_569 0.6 E_tse−036_2 SA_2C0229_314 E_tse−036_2 SA_2C0229_314 H_pri−063_82 0.4 SA_2C0305_2 H_pri−063_82 SA_2C0305_2 0.4 I_sad−005_282 W_8A0035_347 I_sad−005_282 0.2 W_8A0035_347 I_sad−005_282 X_DAERO_50 0.2 M_ser−002_531 X_DAERO_50 M_ser−002_531 Y_4577_102 M_ser−002_531 0 Y_4577_102 0 M_ser−007_417 Y_4577_102 M_ser−007_417 Y_4589_95 N_ALB720_43 −0.2 Y_4589_95 N_ALB720_43 Y_4589_95 −0.2 N_ALB720_43 Y_4589_95 N_ALB720_43 Y_4594_441 R_211−178_291 −0.4 Y_4594_441 −0.4 R_211−178_291 Y_4594_441 R_25/158_353 Y_4630_194 R_25/158_353 −0.6 Y_4630_194Y_4630_194 −0.6 R_26014_27 Y_6013_5 R_26014_27 Y_6013_5 SA_2C0005_5 −0.8 Z_8A0574_42 SA_2C0005_5 −0.8Z_8A0574_42 SA_2C0007_0 Z_8A0574_42 SA_2C0007_0 −1 Z_8A0574_42 −1 Supplemental Figure 1. The correlation of allele frequencies of samples within each geographic group is shown. Ethiopian samples were broken up into two matrices for ease of visualization. Samples within each geographic group are labeled by animal id and number of pairwise fSNP differences between all samples taken from one animal. For example, label “B_2_0” indicates animal ID B_2 and zero fSNPs. Sample IDs are colored in black if all samples were located in the same phylogenetic cluster, and red if samples were located in more than one cluster. Polyclonal infections (labeled in red) had lower allele frequency correlations compared to single source infection. AA_1C0001_0 B_2_0 AA_1C0001_0 AA_1C0003_1 B_2_0 AA_1C0003_1 C_1_44 AA_1C0003_1 B_agr−056_569 C_1_44 B_agr−056_569 E_tse−036_2 Deer_14_3 E_tse−036_2 Deer_14_3 E_tse−036_2 H_pri−063_82 Deer_2_0 H_pri−063_82 I_sad−005_282 Deer_2_0 I_sad−005_282 Deer_24_5 I_sad−005_282 M_ser−002_531 Deer_24_5 M_ser−002_531 M_ser−002_531 Deer_25_4 M_ser−007_417 Deer_25_4 M_ser−007_417 N_ALB720_43 Deer_26_1 N_ALB720_43 N_ALB720_43 Deer_26_1 N_ALB720_43 R_211−178_291 Deer_8_0 R_211−178_291 Deer_8_0 R_25/158_353 R_25/158_353 E_1_2 R_26014_27 R_26014_27 E_1_2 SA_2C0005_5 J_1_1 SA_2C0005_5 SA_2C0007_0 J_1_1 SA_2C0007_0 A_1_156 SA_2C0007_0 A_1_156 SA_2C0007_0 SA_2C0088_1 A_8_162 SA_2C0088_1 A_8_162 SA_2C0217_569 C_14_166 SA_2C0217_569 C_14_166 SA_2C0217_569 C_16_362 SA_2C0229_314 SA_2C0229_314 C_16_362 SA_2C0305_2 C_18_119 SA_2C0305_2 C_18_119 W_8A0035_347 C_19_455 W_8A0035_347 C_19_455 X_DAERO_50 X_DAERO_50 C_19_455 Y_4577_102 C_20_434 Y_4577_102 C_20_434 Y_4577_102 C_21_327 Y_4589_95 Y_4589_95 C_21_327 Y_4589_95 C_21_327 Y_4589_95 E_30_254 Y_4594_441 E_30_254 Y_4594_441 F_33_261 Y_4594_441 Y_4630_194 F_33_261 Y_4630_194 F_35_265 Y_4630_194 F_35_265 Y_6013_5 F_35_265 Y_6013_5 I_43_10 Z_8A0574_42 Z_8A0574_42 I_43_10 Z_8A0574_42 J_47_3 Z_8A0574_42 J_47_3 138 Model 1 p=0.185 Model 2 p=0.107 Observed distance Observed distance 2 4 6 8 10 2 4 6 8 10 Distance Distance Model 3 p=0.086 Model 4 p=0.112 Observed distance Observed distance 2 4 6 8 2 4 6 8 Distance Distance Supplemental Figure 2. The null distributions for the test statistic for the goodness-of- fit assuming each model structure. The test statistic used was the mean of the vector of the distance between the observed and expected number of synonymous mutations, nonsynonymous mutations, and number of differences between two samples. Density Density 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 Density Density 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0 139 Supplementary Table 1. Number of transitions, transversions, fSNPs and vSNPs found in the collection of samples from each animal. ID Transitions Transversion Ti:Tv fSNPs vSNPs B_1 0 2 0 0 2 Deer_12 0 1 0 0 1 Deer_5 0 1 0 0 1 Deer_9 0 1 0 1 0 E_1 0 2 0 0 2 E_2 0 1 0 0 1 E_3 0 2 0 2 0 L_10 0 1 0 0 1 B_3 1 2 0.5 0 3 K_1 1 2 0.5 1 2 Deer_22 2 2 1 0 4 Deer_24 2 2 1 3 1 SA_2C0005 4 4 1 5 3 Y_6013 5 4 1.25 8 1 B_2 4 3 1.33 0 7 Deer_10 7 5 1.4 1 11 Deer_18 3 2 1.5 0 5 Deer_7 3 2 1.5 0 5 AA_1C0003 11 6 1.83 9 8 J_47 110 60 1.83 131 39 Deer_16 2 1 2 0 3 G_1 2 1 2 0 3 SA_2C0088 11 5 2.2 9 7 R_26014 8 3 2.67 3 8 AA_1C0001 26 9 2.89 34 1 Deer_17 3 1 3 0 4 Deer_26 3 1 3 0 4 Deer_8 3 1 3 0 4 SA_2C0007 99 31 3.19 115 15 Deer_19 4 1 4 2 3 Deer_23 4 1 4 0 5 L_2 5 1 5 0 6 I_43 61 12 5.08 71 2 Deer_14 1 0 1 0 Deer_21 1 0 1 0 Deer_25 1 0 1 0 Deer_27 1 0 0 1 Deer_6 1 0 0 1 140 F_1 1 0 0 1 H_2 1 0 0 1 J_2 1 0 0 1 L_11 1 0 0 1 L_6 1 0 0 1 L_7 1 0 1 0 L_8 1 0 1 0 L_9 1 0 1 0 Deer_20 2 0 0 2 D_1 3 0 0 3 Deer_15 5 0 1 4 Deer_2 5 0 4 1 141 CHAPTER 5 The evolution of antimicrobial resistance in the Mycobacteirum tuberculosis complex in the absence of antimicrobial use 142 Abstract Antimicrobial resistant Mycobacterium tuberculosis is a growing threat to global health. Antimicrobial treatment selects for resistant mutations to reach high population frequencies; however, despite fitness costs associated with protein structure altering drug resistance mutations, resistance has also been found in the Mycobacterium bovis in animal hosts in the absence of tuberculosis-specific drug treatment. We hypothesized that both small effective population sizes during transmission events and the coevolution of neutral or low-cost structural enhancing compensatory mutations can lead to resistance evolution in the absence of positive selection by antimicrobial treatment. Using a global sample of over 3000 M. bovis whole genome sequences, we find that genotypic resistance is rare, but resistance to first-line tuberculosis drugs occurs in sequences from both human and animal hosts. Using forward genetic simulation incorporating predicted structural effects of specific resistance-causing mutations, we show that resistance mutations can evolve through mutation and drift and can become fixed through population bottlenecks. Our work shows that the epistatic interaction of structural stabilizing mutations and resistance-causing destabilizing mutations, and the demographic effect of population bottlenecks on allele frequency distributions can describe a mechanism for evolution of antimicrobial resistance without antimicrobial use. Resistance evolution in M. bovis in the absence of antimicrobial use is both theoretically and empirically rare but is possible, and the mechanism described here may demonstrate the fundamental processes governing resistance evolution before selection in the Mycobacterium tuberculosis complex. 143 Introduction Antimicrobial resistant tuberculosis is a global health emergency. In 2018 ~1.2 million deaths were attributed to tuberculosis, and 10 million new tuberculosis cases were reported globally, including over 450,000 multidrug resistant infections [1]. Antimicrobial treatment saved 45 million lives between 2000 and 2017 [1], but without access to adequate treatment or drugs capable of treating resistant infections, tuberculosis can be fatal. Among people with multi drug resistant infections, treatment is successful in only 56% of patients [1]. Understanding how antimicrobial resistance (AMR) evolves in the Mycobacterium tuberculosis complex (MTBC), the cause of tuberculosis in humans and animals, could lead to improved drugs and drug treatment plans, and could aid in developing policies that ensure sustainable antimicrobial use. The MTBC is monophyletic and includes two main pathogenic ecotypes: Mycobacterium bovis the main cause of tuberculosis in cattle and other ungulates, and Mycobacterium tuberculosis, the main cause of tuberculosis in humans. M. bovis and M. tuberculosis have greater than 99% sequence identity in alignable regions and share identical sequences for many genes associated with resistance [2]. Additionally the pathology and etiology of bovine tuberculosis in animals is very similar to that of tuberculosis in humans, from transmission route, to infection site, and granuloma composition [3]; therefore, within-host population dynamics of M. bovis and M. tuberculosis are likely to be similar. Although M. bovis and M. tuberculosis are genetically similar and susceptible to many of the same drugs, first line antibiotics used to treat human tuberculosis infections: isoniazid, rifampin, ethambutol and pyrazinamide, are not used in animals. Instead, animals that test positive for M. bovis 144 are culled and removed from the food chain to prevent zoonotic infection. Since animals are not treated with tuberculosis antimicrobials, population dynamics of tuberculosis infection without the selection pressure of antimicrobial use can be studied in this system. Antimicrobial resistant M. bovis in animal hosts [4-6], but the mechanisms for resistance evolution in animal hosts remains unclear. AMR tuberculosis results from an individual being infected with an already resistant strain (primary infection), or from de novo evolution within the host (acquired resistance). Acquired resistance is strongly linked to antimicrobial use; however, resistant M. bovis isolates have been found in cattle who have not been treated with tuberculosis antimicrobials, so there are likely evolutionary forces other than selection by antimicrobial use that results in resistance (2, 3). Unlike many other pathogenic bacteria, the MTBC cannot acquire genes through horizontal gene transfer, so resistance acquired through single nucleotide polymorphisms (SNPs), or insertions or deletions [2]. SNPs that lead to non-synonymous changes are usually deleterious; however, some mutations can be low-cost or cost-free [7] and others could co-evolve with compensatory mutations that restore fitness lost by the resistance mutations. For example, mutations in rpoB, the gene that encodes for the b-subunit of the RNA polymerase, specifically those within an 81 base pair rifampin resistance determining region (RRDR) lead to rifampin resistance [8]. Although mutations in rpoB cause a decrease in fitness associated with decreased RNA polymerase function, secondary mutations in rpoA/B/C can reverse the deleterious effect of rifampin resistance, negating the fitness cost of resistance [7]. The epistatic interaction of the resistance conferring 145 rpoB mutation and additional compensatory mutations produces a higher fitness than the wild type under rifampin exposure. Since AMR mutations occur in cattle hosts where they are likely deleterious in the absence of the selection pressure caused by antimicrobial use, we suspect the evolution of AMR mutations is also influenced demographic factors, particularly, serial population bottlenecks created by transmission. Transmission bottlenecks decrease effective population size each time a new host is infected, which could create the opportunity for weakly deleterious alleles to expand within a host by drift. If the mutations are not deleterious enough to prevent transmission, they could propagate in a susceptible host population. Our objective is to study evolutionary mechanisms leading to drug resistance evolution in M. bovis in cattle. Our central hypothesis is that if drug resistance mutations exist in M. bovis isolated from cattle not treated for bovine tuberculosis, then mechanisms of drug resistance evolution other than selection by drug use exist. Using forward genetic simulation we investigate the impact of transmission bottlenecks on the evolution of resistant mutations, incorporating specific amino acid mutation effects predicted by Portelli et al.[9]. We compare the simulated results to the empirical distribution of genotypic resistance mutations in over 3000 M. bovis whole genome sequences to attempt to understand drivers of AMR in an MTBC population that has not experienced selection pressure through AMU. Our unique comparative approach can be used to better understand the evolutionary mechanisms that drive resistance evolution and may be incorporated into treatment plan design to slow the emergence of drug resistant MTBC. 146 Methods M. bovis sample selection and processing We used the collection of M. bovis whole genome sequences described in Supplementary Table 1 of Chapter 3 to study the empirical distribution of resistance genotypes. This sample contains 3438 genomes and is representative of all known M. bovis lineages. Raw fastq files were downloaded from the NCBI Sequence Read Archive. SNPs were called using the vSNP pipeline (https://github.com/USDA-VS) using the M. tuberculosis H37Rv reference, and SNPs were removed if they had a read depth less than 10, an allele frequency less than 0.9, and a mapping quality less than 30 using vcffilter in vcflib v1.0.2 [10]. Lineages were assigned using SNP markers described in Zwyer et al. [11] and genotypic resistance was predicted using TB-Profiler [12] run in parallel using GNU Parallel [13]. TB-Profiler predicts genotypic resistance to four first-line drugs: isoniazid, rifampicin, ethambutol and pyrazinamide, in addition to second-line drugs: aminoglycosides (kanamycin, streptomycin, amikacin, capreomycin), fluoroquinolones (ciprofloxacin, levofloxacin, ofloxacin, moxifloxacin), ethionamide, delamanid, linezolid, clofazimine, bedaquiline, para aminosalicylic acid (PAS), and cycloserine [12, 14]. Pyrazinamide is not considered a first-line drug for human M. bovis infections, because most M. bovis lineages are resistant [15]. To remove low-coverage or otherwise poor SNP calls associated with resistance prediction, TB- Profiler resistance predictions were filtered if they were not present in the corresponding filtered vcf files. In addition to fixed SNPs present in greater than 90% of reads, we were also interested in studying variable SNP (vSNP) dynamics within genes associated with 147 resistance. vSNPs were called using lofreq [18], and variants were filtered such that positions had least 10 reads supporting them, a coverage of at least 50x, a minimum Phred score of 20, and a minimum allele frequency of 0.5 %. Variants were labeled as vSNPs if the allele frequency was between 5% and 90% and fixed SNPs (fSNPs) if allele frequency was greater than or equal to 90%. Phylogenetic analysis Individual vcf files were merged using bcftools [16], and the merged vcf file was translated into a SNP alignment using custom scripts in RStudio 2021.09.0 [17] using R v 4.1.2 [18]. To better visualize the position of drug resistant samples in a phylogeny, susceptible samples were randomly down sampled. Within each lineage the same number of pan-susceptible or pyrazinamide resistant genomes and samples resistant to pyrazinamide and at least one other drug were selected. A phylogenetic tree was created from the SNP alignment using IQ-Tree 2 using a GTR+G substitution model [19]. Trees were visualized using iTOL [20]. Genome-wide epistasis analysis We conducted a genome-wide epistasis analysis using SpydrPick to detect signatures of epistatic interactions between resistance-conferring mutations and potential compensatory mutations [21]. The merged vcf file was fed into SpydrPick, and the resulting SNP-SNP relationships were filtered to include only mutations in genes with known resistance mutations detected by TB-Profiler, or potential target genes for compensatory mutations. The list of compensatory mutations was curated using known 148 mutations reviewed in Alame Emane et al. [22]. in addition to genes found in the same biological pathway as genes with known resistance mutations. Biological pathways were predicted using WikiPathways [23]. The candidate set of epistatic interactions was filtered to only contain only drug resistance mutations and non-synonymous mutations in compensatory genes. Simulation structure We created a nucleotide-based non-Wright Fisher model to simulate the evolution of katG during a single infection using SLiM 3 v 3.7.1 [24]. Mutations in katG cause isoniazid resistance, and katG was chosen because the in vitro mutational target size for isoniazid resistance is larger than that of other first-line drugs [25, 26]. The simulation was initialized using the M. bovis AF2122.97 reference nucleotide sequence for katG, which is identical to the M. tuberculosis H37Rv katG sequence. At initialization, the nucleotide sequence is converted to amino acids in SLiM, and the initial codon sequence is recorded. As in chapter 4, simulations consisted of three stages. The first stage consisted of 1000 generations of neutral burn-in. The second stage consisted of a population bottleneck to simulate a transmission event, and the third stage consisted of a population expansion to a constant within-host size for 365 generations. 10,000 replicates of each simulation structure were run. The effective population size is governed by a density dependent function that decreases fitness if the population size exceeds carrying capacity. At each generation, every individual has the potential to reproduce. The number of offspring generated by a single individual is drawn from a Poisson distribution with parameter 𝜆. The mutation rate, bottleneck size, and carrying 149 capacity parameter were rescaled by a factor of 10 to improve computational efficiency, but still approximate the same within-host evolutionary trends as described in chapter 4. A diagram of the simulation structure is shown in Figure 16. A A C B C B C B C A B C C C C C C C Haplotypes in bottleneck B C2 D C3 B C Haplotypes pre-bottleneck Haplotypes post-bottleneck B C 1.0 C 1 Deleterious Deleterious + Beneficial 0.5 C 0 Weak Moderate Strong A Destabilizing effect of mutation A0 Pre-bottleneck Post-bottleneck Figure 16. Simulation structure and graphical hypothesis. A. Example haplotypes before, during, and after an infection bottleneck. By chance, low frequency haplotypes are sampled in the bottleneck. B. Hypothetical fitness effects of individuals with different types of mutations is shown relative to the fitness of wildtype individuals. Structurally stabilizing mutations can reverse some of the deleterious effect of destabilizing mutations. A,C. One haplotype fixes in the post-bottleneck population, whereas the other haplotype is lost. Relative fitness of haplotype Allele frequency 150 Mutation effect prediction Recently, Portelli et al. predicted and summarized the fitness effect of mutations in three genes associated with resistance to isoniazid, rifampicin and D-cycloserine using a series protein structural analysis [9]. Mutations were assigned a selection coefficient proportional to their predicted effect listed in Supplementary Table 2 of Portelli et al. to simulate the effect of specific katG amino acid mutations on bacterial fitness. At each iteration, the number the combined effect of all mutations on the genome was calculated as the additive effect of the number of nonsynonymous mutations n multiplied by the effect e of each mutation. Most mutations are deleterious, but some have stabilizing effects when they occur on the same genome as deleterious mutations. Therefore, we model both deleterious mutations and beneficial mutations. We used a custom fitness callback function to alter the fitness depending on the number and type of mutations: 𝑠 = 1 − (𝑛>@ ∗ 𝑒>@ + 𝑛P@ ∗ 𝑒>@ + 𝑛Q@ ∗ 𝑒Q@ + 𝑛=4 ∗ 𝑒=4) + (𝑛>J ∗ 𝑒>J + 𝑛PJ ∗ 𝑒>J + 𝑛QJ ∗ 𝑒QJ) Since stabilizing mutations only have an effect in the presence of destabilizing mutations, we fixed the maximum value for s to be 1. Model parameters are shown in Table 10. Note, the number of mutations in each effect category is given the same subscript as the effect category. Since we did not know the actual value of mutations on fitness in the absence of antimicrobials (MIC can be used to indicate fitness in the presence of antimicrobials), we tested two fitness rescaling parameters, which modified the effect of each mutation. 151 Table 10. Simulation model parameters. Parameter Description Value B Infection bottleneck size 100 (10 rescaled) K Carrying capacity 1000 (100 rescaled) 𝜇 Mutation rate 5e-8 (5e-7 rescaled) 𝜌 Recombination rate 0 𝜆 Parameter of Poisson distribution 0.3 controlling offspring count 𝑟@ Scaling parameter for deleterious [0.01, 0.001] mutations 𝑟J Scaling parameter for beneficial mutations [0.01, 0.001] 𝑒>@ Effect of weakly deleterious mutation 1 ∗ 𝑟@ 𝑒>J Effect of weakly beneficial mutation 1 ∗ 𝑟J 𝑒P@ Effect of moderately deleterious mutation 5 ∗ 𝑟@ 𝑒PJ Effect of moderately beneficial mutation 5 ∗ 𝑟J 𝑒Q@ Effect of highly deleterious mutation 10 ∗ 𝑟@ 𝑒QJ Effect of highly beneficial mutation 10 ∗ 𝑟J 𝑒=4 Effect of otherwise not categorized 1 ∗ 𝑟@ nonsynonymous mutation B, K, and 𝜇 were rescaled to improve computational efficiency. Analysis of simulation results The final amino acid sequence for each simulation was recorded and compared to the reference sequence. Nonsynonymous mutations were compared to the katG amino acid changing mutations listed in the database of known resistance causing mutations utilized by TB-profiler [12]. The proportion of resistance mutations among 10,000 iterations of different simulation structures was compared using Pearson’s chi squared tests using the chisq.test function in R with simulated p-values. 152 Results Genotypic resistance was predicted for 3729 M. bovis whole genome sequences. The presence or absence of resistant genotypes is shown with a phylogeny of a reduced subset of samples in Figure 17. All samples resistant to pyrazinamide and at least one other drug and a subset of pan-susceptible and pyrazinamide monoresistant samples are shown in the phylogeny in Figure 17. The prevalence of drug resistance to any drug besides pyrazinamide was 4.6 % and was and was zero for five drugs (delaminid, clofazimine, bedaquiline, PAS, and cycloserine). Lineage La1.1 was pyrazinamide susceptible, and all other lineages were resistant. Although prevalence for all drugs tested was low, we detected resistance to all first line drugs except ethambutol (0.64 % isoniazid resistance, 0.15 % and rifampicin resistance). Correlations in drug resistance genotypes is shown in Supplemental Figure 1. 153 Figure 17. Phylogeny of TB-Profiler predicted resistant samples and closely related samples. The phylogeny was created using a GTR+G substitution model. M. bovis lineages, host species, and drug resistance genotypic predictions are labeled. Tree scale units are in substitutions per site. 154 The prevalence of resistance to each drug among human, cattle and deer samples is shown in Figure 18. As expected, the resistance prevalence for first-line drugs was higher in humans than animals, and the difference in prevalence between human and animal host groups is likely directly related to the selection pressure of antimicrobial treatment in human hosts. Among second-line drugs, the prevalence of each of the fluoroquinolones was highly correlated, likely because of the shared mechanism of action of all fluroquinolones, whereas the prevalence of each aminoglycoside was less correlated. Among aminoglycosides, the prevalence of streptomycin was high overall (1.3 % overall prevalence) and was especially high in cattle (1.6 % in cattle vs 0.88% in human). 155 Figure 18. TB-Profiler predicted resistance genotypes for M. bovis samples from human, cattle, and deer hosts. Human samples had the highest prevalence of resistance for all drugs except streptomycin. Drug resistance was generally rare (less than 1% per host species), but resistance to first-line tuberculosis drugs was found in both human and animal hosts. Epistatic associations between resistance causing mutations and potential compensatory mechanisms were predicted using SpydrPick. Correlation matrices of mutations located in resistance-causing genes or potential compensatory genes are shown in Figure 19. 156 Figure 19. Correlation between resistance-causing mutations and potential compensatory mutations. Drug resistance mutations are colored in red. Mutations in genes related to isoniazid and streptomycin resistance were uncorrelated, whereas some mutations in genes related to rifampicin and fluoroquinolone resistance were correlated with other mutations in the same drug category. Mutations in rpoB and rpoC were correlated with rpoB mutations associated with rifampicin resistance. 157 Mutations in RNA polymerase genes were correlated with rifampicin resistance- causing mutations. The distribution of SNPs correlated with rifampicin resistance is shown in Figure 20. Figure 20. Distribution of mutations associated with rifampicin resistant, and other correlated mutations in RNA polymerase subunit genes rpoB and rpoC. Clades without correlated rpoB/C mutations were collapsed for clarity. 158 In addition to fSNPs, we analyzed vSNPs in RNA polymerase subunit genes to better understand the dynamic evolution of rifampicin resistance. Variable and fixed mutations in rpoB, rpoC, and rpoA are shown in Figure 21. Figure 21. Variable and fixed mutations in RNA polymerase subunit genes plotted along a radial depiction of the M. bovis phylogeny shown in Figure 17. Genomes with mutations associated with resistance are labeled in blue, and genomes with fixed or variable SNPs in rpoB, rpoC, and rpoA are shown in yellow, green, and teal respectively. Although only five M. bovis samples, all from human hosts, had fSNPs associated with rifampicin resistance (Figure 17, Figure 20), an additional ten samples had vSNPs associated with rifampicin resistance including eight samples from cattle hosts, and one sample from a deer host. All genomes with vSNPS associated with resistance also had potentially compensatory vSNPs or fSNPs in rpoA and rpoC. Allele frequencies of a known compensatory mutation, Val431Met in RpoC, and linked resistance mutations are shown in Figure 22. 159 0.639175 0.550898 0.470588 0.408805 0.360544 0.313253 0.260274 0.251142 0.239631 0.214876 0.210526 0.202532 RifR 0.200000 R 0.168950 0.166667 S 0.163934 0.144654 0.131707 0.114754 0.114583 0.095652 0.094488 0.090909 0.089286 0.062827 0.033163 Va l er hr la et 804 39 S 8T 1A M e 4 r42 r4 4 l43 1 p.I l _ p.P ro p.S e p.S e p.V a oB B_ B_rp o o oB _ C_ rp rp rp rpo gene_mut Figure 22. Allele frequencies of variable SNPs associated with resistance and linked rpoC compensatory mutations. Solid lines connect mutations found on the same genome. Allele frequencies of mutations associated with resistance are colored in red, and the allele frequencies of the putative compensatory mutation is colored in blue. AF 160 In genomes with both the RpoC Val431Met compensatory mutation and rifampicin- resistance associated mutations in RpoB, the compensatory mutation was usually at higher allele frequency than the resistance-associated mutation (5/6 genomes). Since there are low recombination rates in M. bovis it is likely that the compensatory mutation evolved before the resistance-associated mutation. It is possible that the compensatory mutation decreased the fitness cost for resistance mutations to evolve. Simulation results We developed codon-explicit forward genetic simulations were to simulate the effect of population bottlenecks and linked structurally stabilizing mutations on the evolution of resistance mutations in katG. Two different mutation effect scaling factors were used (0.01 or 0.001) and two demographic structures (population bottleneck or constant population size) were tested. The prevalence of resistance mutations in each simulation structure is shown in Figure 23. 161 Figure 23. Prevalence of resistance-causing mutations katG that emerged during forward genetic simulations. More resistance mutations evolved in scenarios with population bottlenecks, and more resistance evolved in scenarios with a smaller mutation scaling effect. We compared the proportion of resistance mutations that evolved between simulations with the same scaling factor and between simulations with and without population bottlenecks. Population bottlenecks caused an extra 0.3 percent of resistance mutations in simulations with a scaling factor of 0.01 and an extra 0.4 percent of resistance mutations in simulations with a scaling factor of 0.001. Discussion All knowledge gained on M. tuberculosis genome dynamics since whole genome sequencing became available was discovered either using clinical isolates that have been exposed to 50 years of antimicrobial therapy, or laboratory strains. By studying the evolution of antimicrobial resistance in M. bovis sequences using a combination of 162 genotypic resistance prediction and forward genetic simulation we show the resistance in an MTBC population that has not experienced selection pressure through AMU. We found that genotypic resistance is rare among a global dataset of M. bovis sample, but generally more prevalent in samples from human hosts than from samples from animal hosts. Although prevalence of resistance was higher in human hosts, resistance to first- line tuberculosis drugs was found in animal hosts including wildlife. The higher prevalence of resistance in human hosts was expected because tuberculosis treatment selects for drug resistance to evolve. Since animals are not treated for bovine tuberculosis with first-line drugs, resistance in genotypes could have either been transmitted from human to animal hosts or have evolved de novo. Human to animal MTBC transmission is well documented [27-30], but to date, only one study has documented animal to human transmission [31]. Human M. bovis infections are usually caused by zoonotic transmission or by eating contaminated food sources; however, person to person transmission has been documented in immunocompromised people but is very rare [32]. Since human to animal transmission is likely very rare, at least some of the resistance to first-line tuberculosis drugs here likely evolved de novo. We also only evaluated genotypic resistance predicted by TB-Profiler, which has high sensitivity and specificity for most drugs (see Table 2 in Pelhan et al. [12]); however, since we do not have phenotypic confirmation for most isolates, we cannot rule out that resistance predictions are not altered by contamination. Probability of resistance evolution is related to mutational target size [25]. Because of the larger mutational target size, isoniazid resistance often evolves before rifampicin resistance. Since more mutations can cause isoniazid resistance, we 163 hypothesized that we would find a higher isoniazid resistance prevalence in animal hosts than other first line drugs. Isoniazid resistance was 14.9 times more common among cattle and deer hosts than rifampin resistance, and 7.5 times more common than ethambutol resistance. Additionally, KatG may not be essential in vivo [33], so slightly deleterious mutations may be tolerated. Epistatic interactions between resistance conferring mutations and protein stabilizing mutations may also increase the probability of resistance evolution. In our simulations, stabilizing mutations were extremely rare and did not influence the dynamics of resistance evolution considerably; however, we chose arbitrary values for the fitness effect of beneficial mutations based on the effects described in Portelli et al.[9], and it is possible that with increased selection coefficients the simulation dynamics would change. In the empirical dataset we found two separate fixed putative stabilizing mutations in rpoC associated with rifampin resistance-causing mutations in rpoB. Additionally, we found different rpoC mutations correlated with rpoB mutations in animal hosts, however samples with these mutations were predicted to be rifampicin susceptible. We also found multiple vSNPs associated with rifampicin resistance and linked mutations in rpoA and rpoC in both human and animal hosts and show that putative compensatory mutations may evolve before resistance-associated mutations. Random rpoB/C mutation combinations may be tolerated in the absence of antimicrobial use and could potentially prime individuals for rapid resistance development upon exposure to antimicrobials. In addition to allelic interactions, resistance dynamics may be shaped by transmission bottlenecks. If resistance alleles are present at low frequency in one host and by chance are transmitted to another host, they may expand to high frequencies in 164 the new host. We consistently found more resistance mutations in simulations that modeled a population bottleneck than those with a constant population size. Although the resistance mutations we simulated decreased the overall fitness of the individual, the resistance mutations could still fix in the post-bottleneck population because of the high variance in allele frequencies in small populations. The combined effect of a relatively fast mutation rate, skewed offspring distributions, and small population sizes during transmission events can lead to high variance in mutation distributions within host. Under the presence of antimicrobial use, selection can act to sweep these mutations, and any mutations linked to the resistance-conferring mutations, to high population frequency. Lastly, the distribution of second-line resistant genotypes suggests that there may be selection for aminoglycoside and fluoroquinolone resistant M. bovis in farmed animal hosts. The difference in prevalence between isoniazid resistance and second-line drug resistance may be related to off-target selection pressure driven by drug use for purposes other than M. bovis infection. Streptomycin resistance was the most common resistant genotype in cattle (1.3% prevalence), followed by fluoroquinolone resistance (0.87 % prevalence), isoniazid (0.64% prevalence), and other aminoglycosides (0.03 % prevalence). Additionally, despite perfect correlation in resistance genotypes among all fluroquinolones tested, which is expected for members of a drug class with the same mechanism of action, there was a much higher resistance prevalence in streptomycin than other aminoglycosides in animal hosts. Although some mutations confer resistance to all aminoglycosides [34], the number of known mutations associated with resistance 165 in the TB-Profiler database is larger for streptomycin than other aminoglycosides (193 for streptomycin, 88 for other aminoglycosides) [12], so the mutational target size for streptomycin may be larger than that of other aminoglycosides. Streptomycin is used in veterinary medicine to treat gram-negative infections in large animals medicine to treat gram negative enteric infections in poultry, cattle and swine [35], and is also used as a pesticide to treat fire blight (Erwinia amylorva) on apple and pear trees [36], so animals infected with M. bovis may be exposed to streptomycin through treatment or through environmental sources. Fluoroquinolones are also used in veterinary medicine to treat and control respiratory disease in cattle [35], and like with streptomycin use, fluoroquinolone use for purposes other than M. bovis may select for resistance. Fluoroquinolones target DNA gyrase and II topoisomerases, and act to block the DNA replication apparatus and generate double stranded breaks in bacterial DNA [37]. At sub-lethal concentrations, fluoroquinolone exposure has been shown to induce a faster mutation rate in Escherichia coli [38] Staphylococcus aureus [39], and M. fortuitum [40], and to induce increased expression of DNA protection and repair genes in M. tuberculosis H37Rv in vitro [41]. If the use of fluoroquinolones in animal hosts is responsible for resistant M. bovis, and if sub-lethal doses can promote resistance evolution generally by increasing the mutation rate, the use of fluoroquinolones in animals in regions with a high M. bovis prevalence should be cautioned. 166 Conclusions Genotypic resistance is rare among M. bovis sequences, but we observed resistance to all first-line tuberculosis drugs in sequences from both human and animal hosts and found resistance to aminoglycosides and fluroquinolones in production animals that may be related to drug exposure for non-M. bovis infections. Using forward genetic simulation incorporating structural effects of specific amino acid mutations, we show that resistance mutations can evolve through mutation and drift and can become fixed in the population through population bottlenecks. Studying the evolutionary dynamics of M. bovis in animal hosts is not only necessary to understand the risk of antimicrobial resistance in zoonotic M. bovis infections but can also elucidate the underlying evolutionary mechanisms that cause resistance alleles to emerge in MTBC populations. 167 References 1. Global tuberculosis report: World Health Organization2018. 2. Gillespie SH. Evolution of drug resistance in Mycobacterium tuberculosis: clinical and molecular perspective. Antimicrob Agents Chemother 2002;46(2):267-274. 3. Cassidy JP. The pathogenesis and pathology of bovine tuberculosis with insights from studies of tuberculosis in humans and laboratory animal models. Vet Microbiol 2006;112(2-4):151-161. 4. Vázquez-Chacón CA, Rodríguez-Gaxiola FJ, López-Carrera CF, Cruz- Rivera M, Martínez-Guarneros A et al. Identification of drug resistance mutations among Mycobacterium bovis lineages in the Americas. PLoS Negl Trop Dis 2021;15(2):e0009145. 5. Franco MMJ, Ribeiro MG, Pavan FR, Miyata M, Heinemann MB et al. Genotyping and rifampicin and isoniazid resistance in Mycobacterium bovis strains isolated from the lymph nodes of slaughtered cattle. Tuberculosis (Edinb) 2017;104:30- 37. 6. Sweetline Anne N, Ronald BSM, Senthil Kumar TMA, Thangavelu A. Conventional and molecular determination of drug resistance in Mycobacterium tuberculosis and Mycobacterium bovis isolates in cattle. Tuberculosis (Edinb) 2019;114:113-118. 7. Comas I, Borrell S, Roetzer A, Rose G, Malla B et al. Whole-genome sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat Genet 2011;44(1):106-110. 8. Zaw MT, Emran NA, Lin Z. Mutations inside rifampicin-resistance determining region of rpoB gene associated with rifampicin-resistance in Mycobacterium tuberculosis. J Infect Public Health 2018;11(5):605-610. 9. Portelli S, Phelan JE, Ascher DB, Clark TG, Furnham N. Understanding molecular consequences of putative drug resistant mutations in Mycobacterium tuberculosis. Sci Rep 2018;8(1):15356. 10. Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. Vcflib and tools for processing the VCF variant call format. bioRxiv 2021. 11. Zwyer M, Çavusoglu C, Ghielmetti G, Pacciarini ML, Scaltriti E et al. A new nomenclature for the livestock-associated Mycobacterium tuberculosis complex based on phylogenomics. Open Res Europe; 2021. 12. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. Integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 13. Tange O. Gnu parallel-the command-line power tool. 14. Peloquin CA, Davies GR. The Treatment of Tuberculosis. Clin Pharmacol Ther 2021;110(6):1455-1466. 15. Lan Z, Bastos M, Menzies D. Treatment of human disease due to Mycobacterium bovis: a systematic review. Eur Respir J 2016;48(5):1500-1503. 16. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V et al. Twelve years of SAMtools and BCFtools. Gigascience 2021;10(2). 17. Team R. RStudio: Integrated Development for R. Boston, MA; 2020. http://www.rstudio.com/. 168 18. Team RC. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2021. 19. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 2020;37(5):1530-1534. 20. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 2021;49(W1):W293- W296. 21. Pensar J, Puranen S, Arnold B, MacAlasdair N, Kuronen J et al. Genome- wide epistasis and co-selection study using mutual information. Nucleic Acids Res 2019;47(18):e112. 22. Alame Emane AK, Guo X, Takiff HE, Liu S. Drug resistance, fitness and compensatory mutations in Mycobacterium tuberculosis. Tuberculosis (Edinb) 2021;129:102091. 23. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN et al. WikiPathways: connecting communities. Nucleic Acids Res 2021;49(D1):D613-D621. 24. Haller BC, Messer PW. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Mol Biol Evol 2019;36(3):632-637. 25. Castro RAD, Borrell S, Gagneux S. The within-host evolution of antimicrobial resistance in Mycobacterium tuberculosis. FEMS Microbiol Rev 2021;45(4). 26. Ford CB, Shah RR, Maeda MK, Gagneux S, Murray MB et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat Genet 2013;45(7):784-790. 27. Romero B, Rodríguez S, Bezos J, Díaz R, Copano MF et al. Humans as source of Mycobacterium tuberculosis infection in cattle, Spain. Emerg Infect Dis 2011;17(12):2393-2395. 28. Ameni G, Tadesse K, Hailu E, Deresse Y, Medhin G et al. Transmission of Mycobacterium tuberculosis between farmers and cattle in central Ethiopia. PLoS One 2013;8(10):e76891. 29. Mittal M, Chakravarti S, Sharma V, Sanjeeth BS, Churamani CP et al. Evidence of presence of Mycobacterium tuberculosis in bovine tissue samples by multiplex PCR: possible relevance to reverse zoonosis. Transbound Emerg Dis 2014;61(2):97-104. 30. Adesokan HK, Akinseye VO, Streicher EM, Van Helden P, Warren RM et al. Reverse zoonotic tuberculosis transmission from an emerging Uganda I strain between pastoralists and cattle in South-Eastern Nigeria. BMC Vet Res 2019;15(1):437. 31. Lombard JE, Patton EA, Gibbons-Burgener SN, Klos RF, Tans-Kersten JL et al. Human-to-Cattle. Front Vet Sci 2021;8:691192. 32. Evans JT, Smith EG, Banerjee A, Smith RM, Dale J et al. Cluster of human tuberculosis caused by Mycobacterium bovis: evidence for person-to-person transmission in the UK. Lancet 2007;369(9569):1270-1276. 33. Pym AS, Saint-Joanis B, Cole ST. Effect of katG mutations on the virulence of Mycobacterium tuberculosis and the implication for transmission in humans. Infect Immun 2002;70(9):4955-4960. 169 34. Reeves AZ, Campbell PJ, Sultana R, Malik S, Murray M et al. Aminoglycoside cross-resistance in Mycobacterium tuberculosis due to mutations in the 5' untranslated region of whiB7. Antimicrob Agents Chemother 2013;57(4):1857-1865. 35. US Food and Drug Administration: Approved Animal Drug Products (Green Book). 36. Aćimović SG, Zeng Q, McGhee GC, Sundin GW, Wise JC. Control of fire blight (Erwinia amylovora) on apple trees with trunk-injected plant resistance inducers and antibiotics and assessment of induction of pathogenesis-related protein genes. Front Plant Sci 2015;6:16. 37. Hooper DC, Jacoby GA. Topoisomerase Inhibitors: Fluoroquinolone Mechanisms of Action and Resistance. Cold Spring Harb Perspect Med 2016;6(9). 38. Phillips I, Culebras E, Moreno F, Baquero F. Induction of the SOS response by new 4-quinolones. J Antimicrob Chemother 1987;20(5):631-638. 39. Fung-Tomc J, Kolek B, Bonner DP. Ciprofloxacin-induced, low-level resistance to structurally unrelated antibiotics in Pseudomonas aeruginosa and methicillin-resistant Staphylococcus aureus. Antimicrob Agents Chemother 1993;37(6):1289-1296. 40. Gillespie SH, Basu S, Dickens AL, O'Sullivan DM, McHugh TD. Effect of subinhibitory concentrations of ciprofloxacin on Mycobacterium fortuitum mutation rates. J Antimicrob Chemother 2005;56(2):344-348. 41. O'Sullivan DM, Hinds J, Butcher PD, Gillespie SH, McHugh TD. Mycobacterium tuberculosis DNA repair in response to subinhibitory concentrations of ciprofloxacin. J Antimicrob Chemother 2008;62(6):1199-1202. 170 Supplementary Materials Supplementary Figure 1. Pairwise correlation of resistance patterns among drugs tested. A. Correlation of resistance patterns among all samples. B. Non-human samples, C. Human samples 171 CONCLUSIONS AND FUTURE DIRECTIONS Conclusions The Mycobacterium avium complex and Mycobacterium tuberculosis complex (MTBC) have infected human and animal populations for thousands of years and remain an important cause of morbidity and mortality today [1]. Both Mycobacterium avium subsp. paratuberculosis (MAP), the cause of Johne’s disease in ruminants and Mycobacterium bovis, the cause of bovine tuberculosis impact cattle production in the United States and globally; however, the policy for disease control is distinct. The combined impact of M. bovis on public health and animal created an impetus for national eradication in the United States and many other countries around the world [2][14]; however, eradication efforts are complicated by wildlife reservoirs and by animal movement from endemic regions into M. bovis free regions. Johne’s disease policy is aimed toward control rather than eradication, with emphasis on mitigating the economic costs of reduced production. Johne’s disease is difficult to control because of environmental persistence and long latency periods, which prevent rapid detection of infected animals based on clinical signs [3, 4]. Some sub-clinically infected animals will progress to a high-shedding disease state, at which time they may increase environmental contamination, transmission and persistence on the farm [5]. In chapter 2 we developed a hidden Markov model that classifies animals into progression categories based on fecal shedding. We found most MAP infected animals (81-93%) progress to a high shedding disease state, and about 9% of animals maintain a low shedding disease state. In practice our model could be used to predict which animals will become high shedders and which 172 will maintain a low shedding disease state, which could aid in management decision making. Because M. bovis is not endemic in the United States, disease control relies on outbreak identification followed by testing and culling. Whole genome sequencing is used to identifying the likely source of an outbreak by comparing outbreak sequences to a global database of whole genome sequences. Genetic diversity of M. bovis outbreak sequences could also be useful in understanding local transmission patterns [6], however, M. bovis has a low substitution rate of less than one mutation per genome per year [7]. Therefore, multiple animals may be infected with M. bovis with identical or very similar whole genome sequences, limiting the ability to decipher transmission pathways from whole genome sequences [8]. Previous pangenome studies found that MTBC species have large accessory genomes, ranging from 36 to 69% of all genes in the pangenome, and a recent study suggested that M. bovis had an open pangenome [9- 15]. If M. bovis has a large accessory genome, gene content variation could be useful as an additional source of genetic diversity in outbreak analysis; however, there is no known mechanism for gene acquisition through horizontal gene transfer, so a large accessory genome is inconsistent with current knowledge of MTBC evolutionary biology [16]. In chapter 3 we identified methodological issues inherent in pangenome analysis using draft genome sequences constructed with short read data, and developed filtering procedures to accurately characterize the M. bovis pangenome. We found that PE/PPE genes which contain highly repetitive DNA sequence were often located at the ends of contigs and were annotated as distinct genes even though the fragmented sequences mapped to a single reference gene. After removing these erroneous 173 annotations, we found that M. bovis has a compact pangenome and that gene content variation was driven by pseudogenization and gene deletion. We also used our pangenome analysis pipeline to analyze the pangenome constructed with genomes used in the recent that suggested the M. bovis pangenome was open and found just 70 accessory genes, compared to the nearly 4000 genes identified in the original study [15]. Although gene content is highly consistent among M. bovis genomes, we found that there is indel variation that evolves during outbreaks. Indels may be useful in discriminating between genomes that have identical single nucleotide polymorphisms. In chapter 4 we analyze the microevolution that occurs within hosts during infection. In contrast to similar studies of M. tuberculosis in human hosts where selection induced by antimicrobial treatment affects evolutionary trajectories [17, 18], we found that within host evolution was dominated by random genetic drift. Nonsynonymous mutations were common and occurred in both essential and nonessential genes and genes related to antimicrobial resistance, suggesting slightly deleterious mutations are able to persist within hosts. We also used forward genetic simulation and approximate Bayesian computation to estimate the effective population size of M. bovis infections, and the parameter value governing offspring distribution size, and the M. bovis mutation rate. We found that a relatively fast mutation rate of 5e- 7 mutations per position per generation combined with the diversity reducing effect of clonal progeny skew and population bottlenecks caused by transmission reduce diversity, create the slow apparent substitution rate that results in low genetic diversity measured in closely related outbreak sequences. In chapter 5 we studied the effect of a rapid mutation rate on the evolution of antimicrobial resistance genes in M. bovis 174 genomes that were not exposed to tuberculosis-specific antimicrobial treatment. We found that genotypic resistance is rare in M. bovis; however, we identified highly sensitive markers associated with resistance [19] to first line tuberculosis drugs in both in human and animal hosts. We also showed that resistance can evolve through mutation and drift and can become fixed in demonstrated a potential mechanism of resistance evolution in the absence of antimicrobial use. Studying the evolutionary dynamics of M. bovis in animal hosts is not only necessary to understand the risk of antimicrobial resistance in zoonotic M. bovis infections but can also elucidate the underlying evolutionary mechanisms that cause resistance alleles to emerge in MTBC populations. The mechanism we described for the evolution of resistance without antibiotic treatment selection may describe the underlying evolutionary processes that result in the generation of antimicrobial resistant-associated alleles in populations, which selection can act upon. Future directions Building on this knowledge developed in this thesis on M. bovis evolution from the microevolution that occurs within hosts, to the long-term evolutionary history that shaped pangenomes, we are developing a modeling framework that uses deep learning to predict characteristics of an outbreak from whole genome sequencing data. We are developing a convolutional neural network (CNN) to predict the relationship between herd infection population whole genome alignments and outputs of interest including the time since a herd was infected. The CNNs are a class of deep learning models commonly used in image processing and are designed to recognize patterns in data 175 without a predefined feature vector. We selected a CNN framework was selected for this problem because of the ability of CNNs to accurately identify evolutionary processes from labeled population genetic data, with equal or greater accuracy than other methods [20]. Additionally, CNNs take matrices as inputs, and sequence alignments can be easily formatted as matrices, and unlike other modeling approaches based on coalescent theory, CNNs can incorporate all information in genomes, including low frequency alleles, indels and gene content variation into the same prediction framework. We believe our model will outperform existing methodologies in describing M. bovis outbreak dynamics because we explicitly model forces that shape the evolution of this intracellular, highly structured pathogen population. Combining the evolutionary theory developed in this dissertation with new modeling approaches that incorporate all genetic variation present in outbreak sequences may enable the development of high resolution transmission networks that can better describe the transmission within herds, between herds, or between host species. 176 References 1. Loiseau C, Menardo F, Aseffa A, Hailu E, Gumi B et al. An African origin for Mycobacterium bovis. Evol Med Public Health 2020;2020(1):49- 59. 2. Scott C, Cavanaugh JS, Pratt R, Silk BJ, LoBue P et al. Human Tuberculosis Caused by Mycobacterium bovis in the United States, 2006-2013. Clin Infect Dis 2016;63(5):594-601. 3. Whittington RJ, Sergeant ES. Progress towards understanding the spread, detection and control of Mycobacterium avium subsp paratuberculosis in animal populations. Aust Vet J 2001;79(4):267-278. 4. Grewal SK, Rajeev S, Sreevatsan S, Michel FC, Jr. Persistence of Mycobacterium avium subsp. paratuberculosis and other zoonotic pathogens during simulated composting, manure packing, and liquid storage of dairy manure. Appl Environ Microbiol 2006;72(1):565-574. 5. Smith RL, Schukken YH, Grohn YT. A new compartmental model of Mycobacterium avium subsp. paratuberculosis infection dynamics in cattle. Prev Vet Med 2015;122(3):298-305. 6. Salvador LCM, O'Brien DJ, Cosgrove MK, Stuber TP, Schooley AM et al. Disease management at the wildlife-livestock interface: Using whole-genome sequencing to study the role of elk in Mycobacterium bovis transmission in Michigan, USA. Mol Ecol 2019;28(9):2192-2205. 7. Menardo F, Duchene S, Brites D, Gagneux S. The molecular clock of Mycobacterium tuberculosis. PLoS Pathog 2019;15(9):e1008067. 8. Glaser L, Carstensen M, Shaw S, Robbe-Austerman S, Wunschmann A et al. Descriptive Epidemiology and Whole Genome Sequencing Analysis for an Outbreak of Bovine Tuberculosis in Beef Cattle and White-Tailed Deer in Northwestern Minnesota. PLoS One 2016;11(1):e0145735. 9. Zakham F, Sironen T, Vapalahti O, Kant R. Pan and Core Genome Analysis of 183 Mycobacterium tuberculosis Strains Revealed a High Inter-Species Diversity among the Human Adapted Strains. Antibiotics (Basel) 2021;10(5). 10. Dar HA, Zaheer T, Ullah N, Bakhtiar SM, Zhang T et al. Pangenome Analysis of Mycobacterium tuberculosis Reveals Core-Drug Targets and Screening of Promising Lead Compounds for Drug Discovery. Antibiotics (Basel) 2020;9(11). 11. Kavvas ES, Catoiu E, Mih N, Yurkovich JT, Seif Y et al. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nat Commun 2018;9(1):4306. 12. Yang T, Zhong J, Zhang J, Li C, Yu X et al. Pan-Genomic Study of Mycobacterium tuberculosis Reflecting the Primary/Secondary Genes, Generality/Individuality, and the Interconversion Through Copy Number Variations. Front Microbiol 2018;9:1886. 13. Patane JS, Martins J, Beatriz Castelao A, Nishibe C, Montera L et al. Patterns and processes of Mycobacterium bovis evolution revealed by phylogenomic analyses. Genome Biol Evol 2017. 177 14. Zimpel CK, Brandao PE, de Souza Filho AF, de Souza RF, Ikuta CY et al. Complete Genome Sequencing of Mycobacterium bovis SP38 and Comparative Genomics of Mycobacterium bovis and M. tuberculosis Strains. Front Microbiol 2017;8:2389. 15. Reis AC, Cunha MV. The open pan-genome architecture and virulence landscape of Mycobacterium bovis. Microb Genom 2021;7(10). 16. Boritsch EC, Khanna V, Pawlik A, Honore N, Navas VH et al. Key experimental evidence of chromosomal DNA transfer among selected tuberculosis- causing mycobacteria. Proc Natl Acad Sci U S A 2016;113(35):9876-9881. 17. Trauner A, Liu Q, Via LE, Liu X, Ruan X et al. The within-host population dynamics of Mycobacterium tuberculosis vary with treatment efficacy. Genome Biol 2017;18(1):71. 18. Moreno-Molina M, Shubladze N, Khurtsilava I, Avaliani Z, Bablishvili N et al. Genomic analyses of Mycobacterium tuberculosis from human lung resections reveal a high frequency of polyclonal infections. Nat Commun 2021;12(1):2716. 19. Phelan JE, O'Sullivan DM, Machado D, Ramos J, Oppong YEA et al. Integrating informatics tools and portable sequencing technology for rapid detection of resistance to anti-tuberculous drugs. Genome Med 2019;11(1):41. 20. Flagel L, Brandvain Y, Schrider DR. The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference. Mol Biol Evol 2019;36(2):220-238.