BU-1141-M Hidden Markov Chains and the Analysis of Genome Structure by Gary A. Churchill October 1991 '. Hidden Markov Chains and the Analysis of Genome Structure Gary A. Churchill October 29, 1991 Abstract In this paper, statistical methods based on a hidden Markov chain model {Churchill1989) are used to study the structure of some small complete genomes and a human genome segment. A variety of discrete compositional domains are discovered and their correlations with genome function are explored. 1 Introduction The total genomic DNA of plants and higher animals shows little variation in total G+C content, with most species falling in the range 40 to 44%. However, when the DNA is fragmented, a wide range of intragenomic heterogeneity in 1 .. G + G content is observed. Bernardi et al. (1985) proposed the isochore model of DNA sequence organization to explain these observations. The DNA sequence is viewed as a. mosaic of large homogeneous segments (> 106 bases) which differ from one another in composition. The banding patterns seen in stained metaphase chromosomes provides a striking example of this higher level of organization in the genomic DNA of eukaryotes (Holmquist, 1989). By contrast, the total G+G content in the genomes of bacterial species varies over a. wide range, 25 to 75%. Intra.genomic variation in G+ G content is relatively small but is still in excess of random expectations. These observations led .· ·Elton. (1970) to sugge8t a. shiiilar segmented ·genome inodef for proka.ryotes~ · · With the increasing availability of large DNA sequences, it has become possible to study compositional heterogeneity directly. Ikemura. et al. (1990) studied variations in the G+G content of human DNA by arranging sequences from the database into large linkage groups. They found patterns of G + G composition that are consistent with the isochore hypothesis and identified a potential isochore boundary. Fickett et al. (1991) investigated the statistical properties of human and E. coli sequences available in the current database. They found that the variation in composition exceeds expectations under a. homogeneous stochastic model of the DNA sequence. However, they find that the local base composition tends to persist over large regions with stronger 2 persistence in human sequences than in E. coli sequences. Pevzner et al. (1989) introduced a measure of heterogeneity for small words in genetic texts, the irregularity coefficient. They used this method to study the "zonal structure" of viral and E. coli sequences. In Kozhukhin and Pevzner (1991), this study was extended to include all large (> 25kb) sequences in the current database. They conclude that the major contribution to genome inhomogeneity is due to the uneven distribution of SS and WW dinucle<>tides. This .heterogeneity cannot be explained by the isochore model because it occurs locally. These studies suggest that compositional heterogeneity is a general phenomenon that is present-In a. vanety· of distinct foriris~· . · . · . The statistical methods used in this work are based on a. stochastic model, the hidden Markov chain model (HMC), in which the DNA is composed of homogeneous segments belonging to a small number of distinct compositional classes (Churchill1989). The probability of observing a. particular base at a given site on the DNA molecule depends on the type of segment it is in. An underlying organization of the DNA is assumed in which the switching from one segment to the next follows a. hidden Markov chain. The states of the hidden process indicate the type of segment and are not directly observable. Parame- +ters which characterize the different states (e.g. G G content and switching rates) are estimated using likelihood methods and a recursive smoothing algo- 3 rithm is then applied to reconstruct the segments. The smoothing algorithm does not utilize a sliding window, thus sharp transitions of state are possible and both small and large segments can be detected in a single pass. Results of the smoothing are displayed graphically to provide a global summary of the organization of sequence composition. Furthermore, the model makes specific predictions about properties of the DNA sequence which are related to i~s base composition, e.g. distribution of restriction enzyme sites, and provides a framework within which hypotheses about genome structure can be formulated and tested. ·we illustrate the use. of HMC methods with four-examples of its.. application to nucleotide sequences obtained from the GenBank database (release 60, Burkset al. 1990). In the first example, the mitochondrial genomes of animals are shown to have a consistent heterogeneous pattern in the distribution of purines and pyrimidines. The purine content in one strand of the DNA is shown to be correlated with the coding sense of the strand and the type of product, protein or rRNA. In the second example, the genome of bacteriophage lambda is shown to also have a strand assymetry which is associated with coding function. However a simple two-state model provides an inadequate description of the compositional variation and a more general state-space model is suggested. In the third example, the two major transcrip- 4 tional domains of simian virus 40 are found to correspond with two genome segments of distinct dinucleotide composition. In the final example, the human alpha globin region is analyzed to detect variation in the distribution of CpG dinucleotides. The CpG-rich islands associated with genes in this region are sharply defined suggesting that HMC methods could be used to screen databases and raw sequence data to search for genes. 2 Hidden Markov Chain Models for DNA Se- que.nce D~ta . .. . . . ... . . .. .. .. . . . ..... . The Model We will consider a single strand of the DNA molecule in 51 to 31 orientation. The base at position t of the sequence will be denoted by Yt and the entire sequence up to t by yt = {y1, ••• , yt}, where Yt will take values in the set {C, T, A, G}. Binary representations of the DNA such as the purine-pyrimidine {AG-CT) or the strong-weak hydrogen bonding (GC-AT) sequences will also be considered. The state of the sequence at position t will be denoted by St and the entire sequence of states up to t by st = { s1, ... , St}. For clarity, we will describe the case of a binary sequence with two distinct states where both Yt and St will take values in the set {0, 1}. The probability distribution of Yt, conditional on the state St, will be hi- 5 nomial a.nd ca.n be written as (1) =where p0 = Pr(yt = 1 I St 0) and p1 = Pr(yt = 11 St = 1) are assumed to be not equal. The segments of the sequence alternate between state 0 a.nd state 1 according to a binary Markov process with transition probability matrix 1- .A A= [ .A ] ' T 1-r The parameter .A defines the probability of switching from state 0 to state 1, i.e. distributed with mea.n 1f .A. A similar interpretation applies to the parameter r. Both .A a.nd r are thought of as being small (< 10-3). The result is a . ........ . sequence of hidden states which consists of long runs of all zeros or all ones. Segments of the sequence are identified with these runs. In the general case, there are m observable outcomes (e.g. A, T, G, C). The probability of observing outcome i given that the current state is k is given by the parameter Pi,k = Pr (Yt = i I St = k). Successive outcomes may also be Markov dependent and the outcome transition probabilities are given =by Pij,k = Pr (Yt+l = j I Yt = i, St k). The hidden process which defines the segments will switch according to a Markov chain on r states, with the r x r transition probability matrix denoted A = [Aij]· The number of free 6 .. parameters required to specify a. model with m outcomes and r states is given by k = (m- 1)r + r(r- 1). If the outcomes are Markov dependent, k = m(m - 1)r + r(r - 1). The one-state model (r = 1) is an important special case that includes the usual independent and Markov chain models for DNA sequences. - Smoothing Algorithm The smoothing algorithm used to reconstruct the homogeneous segments of a sequence is described in detail by Churchill (1989). The outcome of the procedure is, for each site in the sequence, the probability .... ·.. . . th~t the site. "!Jel9ngs. to st~te j condit~onal on the entire observed sequence. •. • • ·I• ., •' ' • •• • • • •', • • • • •, '• This quantity, denoted Pr (st = j I yn), is called the smoothed estimate of St and can be plotted against the index t to provide a graphic summary of compo- sitional structure. The conditional probabilities are computed by a recursive algorithm with two steps. The filtering step is a forward pass through the se- quence which incorporates the information in all 5' bases and the current base into the smoothed estimate. The smoothing step is a reverse pass through the sequence which incorporates the information in 3' bases into the smoothed estimate. This algorithm is related to the Kalman filter and is described in a general setting by Kitagawa (1987). 7 Parameter Estimation The smoothing algorithm requires tha.t the model parameters be specified in a.dva.nce. Typically, these values will not be known a.nd must be estimated from the da.ta.. Churchill (1989) describes a.n EM algorithm which computes a. maximum likelihood estimate (MLE) of the model parameters. In the present work, all smoothed estimates are computed based on the MLE parameter estimate. Model Comparison Given several candidate models, the model with the maximum value of the Ba.yesia.n information criterion (DIC), . . .• .1 ... BIC ~ 1(6)- 2"klogn .... : . ... (2) is taken to be the best model. Here, l(O) is the maximized loglikelihood, k is the number of free parameters in the model, a.nd n is the sequence length. The BIC value is a. large sample approximation to the Bayes factor a.nd provides a. consistent criterion for comparison of different models (Schwarz 1978). It is often more convenient to report changes in the value of BIC (.6.BIC) relative to a. simple model. For example, the independent a.nd equally-likely outcomes model which ha.s zero free parameters a.nd BIC is equal to the loglikelihood lo = nlog !· 8 3 Examples of Genome Analysis using HMC Models Animal Mitochondrial DNAs The mitochondrial genomes of animals are circular double-stranded DNA molecules ranging in size from 16 to 19kb. Their functional organization is highly compact with most of the DNA involved in the coding of proteins or functional RNA' molecules. Complete mitochondrial DNA (mtDNA) sequences have been determined for a variety of organisms including bovine (Anderson et.al. 1982), human (Anderson et.al. 1981), rriouse· (Bibb ·et'~a.L ·l98f), xeriopus· (ROe et~aL 1985) and drosophila (Clary and Wolstenholme 1985). The mammalian and xenopus genomes share identical topographies with all coding regions in the same relative positions. The drosophila genome has a different gene order which can be related to the others by a small number of translocation and inversion events. All of the animal mtDNAs are AT-rich (bovine 60.6%, human 55.5%, mouse 63.2% and xenopus 63.1%). The drosophila mtDNA has an exceptionally high A+ T content of 78.6% and it has been suggested that selection has favored AT base-pairs at all locations where they are compatable with function (Clary and Wolstenholme 1985). Binary purine-pyrimidine sequences derived from the 1-strands of each 9 genome were used to compute model comparison statistics and maximum likelihood parameter estimates (table 1). In every case, the largest value of tlBIC is found for the two-state model. Furthermore, the consistency of the estimated parameters across all five genomes suggests a common organization at the level of purine composition, despite the wide divergence in A+ T content. For bovine mtDNA, the states are characterized by average purine contents of 45.3% (state 0) and 55.4% (state 1). The expected size of a state 0 region is 1/.A ~ llkb and that of a state 1 region is 1/r ~ 2kb. Similar interpretations apply to the other genomes. . .· .The smoothing algorithm wa.S. applied· .to identify the. genome segments involved. The sequence profiles (figure 1) reveal large segments with fairly sharp transitions. The general pattern of mtDNA organization in these animals is a single purine-rich region that contains the sense-strand rRNA genes and a single pyrimidine-rich region that contains the sense-strands of most protein encoding genes. The organization of the drosophila sequence provides a remarkable example of the consistency of this pattern. The profile shows a greater number of segments but the functional correlations are identical. Using the major coding regions as landmarks, we see that the sense strands in rRNA encoding genes are purine rich while the sense strands of protein encoding genes are pyrimidine rich. 10 In order to explain the pyrimidine excess in the sense strand of protein encoding genes, a statistical analysis of base composition by reading frame was carried out (results not shown). The most significant contributing factor was found to be the predominance of T (> 40%) in the second coding po- sition. This in turn is explained by the high proportion of the hydrophobic amino-acids leucine and isoleucine in mitochondrial encoded proteins. Little or no pyrimidine excess is seen in the third coding position. Further calculations confirmed that.the amino-acid composition of mitochondrial proteins is sufficient to produce the observed strand assymetry. # •••••• •· •• ·' Bacteriophage Lambda The complete genome of the bacteriophage lambda is a double-stranded circular DNA molecule of 48502 base pairs (Sanger et al. 1982). The genome is compact with very little non-coding DNA and several overlapping reading frames. Markov chain analyses of the lambda sequence have suggested that there is a long range dependence between neighboring bases (Tavare and Giddings 1989; Churchill 1988). The compositional heterogenity of lambda DNA was observed by Skalka et al (1968) and studied using statistical methods by Pevzner et al. (1989) and Churchill (1988, 1989). A hidden Markov chain analysis is carried out here to identify regions which might correspond to distinct genome modules. HMC models with independent and first-order dependent outcomes and 11 up to four hidden states were fit to the four-base sequence of bacteriophage lambda. According to the 6-BIC criterion (table 2), thebest model has threestates and first-order dependent bases. However, examination of the smoothed profiles suggests that the discrete state model is not an adequate description of heterogeneity in lambda. Similar analyses were carried out for binary AT/GC and AG/TC sequences as well a three outcome sequence with indicators for SS, WW and mixed dinucleotides. In every case, the left half of the lambda genome appears as a single homogeneous segment and the right half fails to show distinct segments. We conclude that the compositional variation in lambda does .. not t~ fall into ~ small numb~r of distinct. states. Thus it w1lll)e Iieccessary to consider more general state-space models before an adequate characterization of this genome can be obtained. The two-state model fit to the four-base sequence of lambda illustrates the pattern. Estimated state transition probabilities are A. = 0.000115 and r = 0.000224 and the estimated base compositions are summarized in table 3. A two-state profile of the smoothed estimates (figure 2) shows a general correspondence between the segments identified by HMC analysis and the direction of transcription of the genes. The reversal around 32kb reflect the poor fit of the two-state model in the right half of lambda. 12 Simian Virus 40 The Simian Virus 40 (SV40) genome is a circular doublestranded DNA molecule of 5243 bases (e.g. Reddy et al. 1978). The expression of SV40 genes is temporally regulated by two major transcripts, early and late. Transcripts are generated in opposite orientations outward from the replication origin region. The early transcript produces the major and minor T-anitgens and the late transcript produces other viral proteins. As in the previous example, a number of HMC models were fit to the SV40 four-outcome sequence. As indicated by the .6.BIC statistics (table 2), the best model has two states and first-order dependent bases. Parameter estimates for this model are showri in table 4. A plot ·of the·sniootlied ·estim:ates·reveals two · large segments which correspond to the the early transcript (state 1) and to the replication origin and late transcript (state 0). These regions are not identified by any of the independent outcome models nor are they apparent from models fit to binary AT/GC or AG/TC sequences. Thus, the states reflect regions of distinct dinucleotide composition. The CpG dinucleotide is rare in both states but especially in state as is reflected in the outcome transition probabilities (Pca,o = 0.0435, PcG,I = 0.0055). Other notable differences between the two states are due the frquencies of TG, GG and GA dinucleotides. Human Alpha Globin Region The dinucleotide CpG is rare in vertebrate DNA, occurrring at only 10 to 20% of its expected frequency. The CpG dinu- 13 cleotide is also the unique site for methylation of vertebrate DNA and thus is potentially important for the regulation of gene expression. Protein encoding regions are often found to be associated with 5' and/or 3' regions of elevated CpG content {Bird 1986). These CpG-rich islands provide easily recognizable landmarks for locating genes in large DNA sequences. Hidden Markov chain models can be adapted to locate CpG-rich islands and any other ty,ee of region characterized by an excess or deficit of specific nucleotide patterns. Thus HMC methods could be used to screen databases or raw sequence data to locate regions of potential interest. This method of screening would be We have examined the distribution of CpG dinucleotides in a 12.5kb region of the human a-globin cluster which include the functional genes a1 and a2 and the pseudogene f/Jal. The four-base DNA sequence was converted to a binary sequence with 1 indicating the C position of a CpG dinucleotide and O's elsewhere. The binary sequence is modelled as a first-order Markov chain with two hidden states, "normal" and "CpG-rich". The CpG dinucleotide cannot self-overlap, thus it is not possible to observe two consecutive 1's in this sequence and the transition from outcome 1 to outcome 0 will occur with probability one in both states. The outcome transition probabilities from 0 to 1 are estimated from the data. 14 The D.BIC statistics for models with two and three states relative to the one-state model are 180.77 and 157.12 respectively, clearly indicating that a tw. = 0.000245, switching into the CpG-rich state and r = 0.001171, switching out of the CpG rich state. Thus, the pattern is long segments of low CpG content interrupted by short segments of relatively high CpG content. The profile plot (figure 3) reveals three CpG-rich islands with fairly sharp boundaries.· The first ·is located 2kb ·upstream froi:ri" the pseu:dogene: ·The other two begin in the 5' regions of the functional genes and contain at least the first two exons. In order to contrast the HMC approach with a standard sliding-window analysis, the proportions of G+C and CpG over windows of size 256bp were plotted. The window size was selected to give the best visual resolution of the CpG-rich islands. Smaller window sizes produced noisier plots and larger window sizes tended to wash-out the features. Although the sliding window method is sufficient to detect the the CpG-rich islands in this example, it has a number of shortcomings: several window sizes must be tested; the boundaries are not sharply defined; estimates of CpG content and feature size are not readily available; it is not possible to make formal comparisons 15 among alternative models. 4 Discussion One of the fundamental questions of modern biology concerns the nature and extent of compositional variation and its relation to the organization of struc- ture and function in genomic DNA. Since random mutation processes woulO. tend to homogenize DNA, it is reasonable to suppose that some constraints are active in creating and maintaining compositional variation. Bernardi and Bernardi (1986) suggested that compositional constraints which affect both .• ... . . . . . . . .· coding and non-coding sequences have resulted from seledive pressures, per- haps at the level of chromosome structure, and that these constraints repre- sent an important subset of the total constraints acting on the evolution of a genome. Analysis of heterogeneity could provide clues about the nature of compositional constraints and different levels of organization in large genomes. An alternative explanation for compositional heterogeneity is that the pattern of mutations varies across regions of the genome.· Understanding the nature of this variation could provide significant insights into the process of point mutations and their role in genomic evolution. Fickett et al. (1991) conclude from their studies that multidomain models are inadequate to describe the observed variation and suggest that models with 16 continuous variation in local G +C content should be considered. Kozhukhin and Pevzner (1991) also raise doubts about the generality of the large homogeneity domains in DNA sequences. These doubts are confirmed by the bacteriophage lambda example above. However, the hidden Markov chain model is readily generalized to include continuous variation. Some computational problems involved with fitting such models are currently being studied. The idea that different functional domains of DNA can be distinguished by their statistical characteristics is not new (e.g. Smith et al. 1983). However, the statistical approach to interpretation of DNA sequences has met with only ..lhnited. success:. dne proble~ ·may be . that· statisti"cal inethods ·a.i:e not wen· suited for the detection of unique features that are often biologically important. In their proper domain, statistical methods can provide very powerful descriptive and inferential tools. When we begin to look at DNA on a global scale, we can expect to overlook some of the important details that are essential to DNA function. What we gain is a new perspective, a view of the higher levels of organization which exist in genomic DNA. Presently it appear that the genomic DNA of eukaryotes is hierarchically organized at several different levels but our understanding of this organization is limited. The patterns of DNA sequence organization which will eventually be observed in the complete nuclear genomes of eukaryotes (and also in prokaryotic genomes) may 17 be very different from the patterns found in the examples presented here, but the hidden Markov chain methods will allow us to observe and analyze these patterns. 5 Literature Cited 1. Anderson, S., A.T. Bankier, B.G. Barrell, M.H.L. de Bruijn, A.R. Coul- son, J. Drouin, I.C. Eperon, D.P. Nierlich, B.A. Roe, F. Sanger, P.H. Schreier, A.J.H. Smith, R. Staden, and I.G. Young. 1981. Se1uence and organization of the human mitochondrial genome. Nature 290:457-465. •.. • •r • • •. • • • •• • • • . • •• • • • . •• •• .• • • • • ·~ • . 2. Anderson, S., M.H.L. de Bruijn, A.R. Coulson, I.C. Eperon, F.Sanger, and I.G. Young. 1982. Complete sequence of bovine mitochondrial DNA: conserved features of the mammalian mitochondrial genome. J. Mol. Evol. 156:683-717. 3. Bernardi, G. and G. Bernardi. 1986. Compositional constraints and genome evolution. J. Mol. Evol. 24:1-11. 4. Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival and F. Rodier. 1985. The mosaic genome of warm blooded vertebrates. Science 228:953-958. 18 5. Bibb, M.J., R.A. Van Etten, C.T. Wright, M.W. Walberg, and D.A. Clayton. 1981. Sequence and gene organization of mouse mitochondrial DNA. Cell 26:167-180. 6. Bird, A.P. 1986. CpG-rich islands and the function of DNA methylation. Nature 321:209-213. 7. Burks, C. et al. 1990. GenBank: Current status and future directions. Meth. Enzymol. 183:1-22. 8. Churchill, G.A. 1988. Stochastic Models for DNA Sequence Data. PhD The8is~· Umversity ·orWasb.ington; Seattle·.···· · · ·· · · ······ ·· 9. Churchill, G.A. 1989. Stochastic models· for heterogeneous DNA sequences. Bull. Math. Biol. 51:79-94. 10. Clary, D.O. and D.R. Wolstenholme. 1985. The mitochondrial DNA molecule of Drosophila yakuba: Nucleotide sequence, gene organization, and genetic code. J. Mol. Evol. 22:252-271. 11. Elton, R.A. 1974. Theoretical models for heterogeneity of base composition in DNA. J.Theor.Biol. 45:533-553. 12. Fickett, J.W., D.C. Torney, D.R. Wolf. 1991. Base compositional structure of genomes. Manuscript. 19 13. Holmquist, G.P. 1989. Evolution of chromosome bands: Molecular ecology of noncoding DNA. J. Mol. Evol. 28:469-486. 14. Ikemura, T., K-N. Wada, S-I Aota. 1990. Giant G+C% mosaic structures of the human genome found by arrangement of GenBank human DNA sequences according to genetic position. Genomics 8:207-216. 15. Kitagawa, G. 1987. Non-Gaussian state-space modeling of nonstationary time series. J. Am. Statist. Assoc. 82:1032-1041. 16. Kozhukhin, C.G. and P.A. Pevzner. 1991. Genome inhomogeneity is detrerin~d·m~~Y by ww ~