MCALIGN2: Faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution
© Wang et al; licensee BioMed Central Ltd. 2006
Received: 07 November 2005
Accepted: 08 June 2006
Published: 08 June 2006
Non-coding DNA sequences comprise a very large proportion of the total genomic content of mammals, most other vertebrates, many invertebrates, and most plants. Unraveling the functional significance of non-coding DNA depends on how well we are able to align non-coding DNA sequences. However, the alignment of non-coding DNA sequences is more difficult than aligning protein-coding sequences.
Here we present an improved pair-hidden-Markov-Model (pair HMM) based method for performing global pairwise alignment of non-coding DNA sequences. The method uses an explicit model of indel length frequency distribution which can be specified, and allows any time reversible model of nucleotide substitution. The method uses a deterministic global optimiser to find the alignment with the highest posterior probability. We test MCALIGN2 in simulations, and compare it to a previous Monte Carlo based method (MCALIGN), to the pair HMM method of Knudsen and Miyamoto, and to a heuristic method (AVID) that performed very well in a previous simulation study. We show that the pair HMM methods have excellent performance for all combinations of parameter values we have considered. MCALIGN2 is up to ten times faster than MCALIGN. MCALIGN2 is more accurate in resolving indels given an accurate explicit model than heuristic methods, but is computationally slower.
MCALIGN2 produces better quality alignments by explicitly using biological knowledge about the indel length distribution and time reversible models of nucleotide substitution. As a result, it can outperform other available sequence alignment methods for the cases we have considered to align non-coding DNA sequences.
The advent of automated DNA sequencing methods has resulted in an enormous growth in the volume of sequence data deposited in public databases. The increasing availability of genome sequence data for many related organisms offers great opportunities to study gene function and genome evolution, but it also presents new challenges for DNA sequence analysis, especially for non-coding DNA sequences.
For much of the past two decades, research in DNA sequence analysis has focused on protein-coding sequences, which account for only a very small proportion of the total genomic content in mammals, most other vertebrates, many invertebrates, and most plants . For example, protein-coding gene sequences comprise as little as 1–2% of the human and mouse genomes [2, 3]. However, there is an increasing body of evidence showing that non-coding DNA sequences contain many functional sequences involved in gene regulation and potentially other unknown functions. For example, it has been estimated that ~50% of bases in intergenic and intronic sequences of Drosophila melanogaster are selectively constrained . In rodents, it has been inferred that the total number of selectively constrained nucleotides in non-coding DNA adjacent to gene sequences is similar to that in coding DNA , Evidence for the presence of a large number of potentially functional non-coding sequences on human chromosome 21 has recently been obtained from a comparative genomics analysis . Determining the fraction of non-coding DNA that is functional and establishing what that function is, is therefore a central problem in genome research.
Accurate inferences about the function of non-coding DNA from comparative methods depends critically on correct alignments of non-coding sequences. However, the alignment of non-coding DNA sequences is more difficult than aligning protein-coding sequences. Protein-coding sequences tend to be highly evolutionarily conserved, so insertions and deletions (indels) are uncommon and rarely cross codon boundaries. However, indel events are common in non-coding DNA, and can occur at most nucleotide sites. Numerous advances in sequence alignment methods for noncoding DNA have been made. Many recently proposed methods are based on heuristic alignment algorithms that can be very fast and accurate in cases where sequences are similar, but perform less well when sequence divergence is high . Furthermore, heuristic scoring functions are not guaranteed to use the correct relationship between the relative penalties for point substitution and indel events, as they have no evolutionary interpretation. Therefore, explicit evolutionary models are desired to address this problem.
True evolutionary models of sequence evolution allow both multiple point substitutions and multiple indel events to affect any site in the sequence. The first true evolutionary model of indel evolution was introduced by Thorne, Kishino, and Felsenstein , the TKF91 model, and allows single-residue indel events. This method uses a maximum likelihood algorithm to estimate the evolutionary distance between two sequences, summing over all possible alignments in the likelihood calculations . It was subsequently improved by allowing longer indel events with a geometric length distribution , by assuming that the sequence contains unbreakable fragments, and that only whole fragments are inserted and deleted. This assumption introduces hidden information in the form of fragment boundaries, and may potentially bias multiple alignment . Knudsen and Miyamoto  presented a pairwise statistical alignment method based on an explicit evolutionary model of indel events. Indel length was assumed to be geometrically distributed, and up to two overlapping events were allowed for indels. A good approximation to such a model was then made using a pair HMM. The geometric distribution parameter, the indel rate, and the evolutionary time were estimated by maximum likelihood. A "long indel" evolutionary model has been introduced recently by Miklos et al. , which allows multiple-residue indels without hidden information such as fragment boundaries. They developed a finite trajectory approximation for computing the likelihood function, producing a method that has very good performance .
Previously, Keightley and Johnson  proposed a non-coding sequence alignment method called MCALIGN. This is based on a simplified evolutionary model that does not allow for any multiple hits or interaction between indel events. A key feature of their approach is that it uses additional data from "unambiguous" alignments (e.g. between sequences from closely related species) to infer the actual distribution of indel lengths, and the relative rate of indels to point substitutions. They used a Monte Carlo (MC) hill-climbing algorithm to search for the most probable alignments. This method has been successfully used for aligning real genomic sequences, such as Drosophila, rodent and hominid non-coding DNA [5, 14, 15]. In a simulation study, Keightley and Johnson  found that MCALTGN was generally superior to the other alignment methods that it was compared to.
Here, we describe an improved non-coding sequence alignment algorithm based on a generalisation of the evolutionary model used by Keightley and Johnson . We show how a combination of a dynamic programming (DP) algorithm and a one dimensional deterministic optimisation – algorithm can be used to find the most probable pairwise sequence alignment. Note that when we assume the Jukes-Cantor  model for nucleotide substitution, the present DP method and the previous MC method are essentially using two different optimisers to attempt to maximise the same "score" function: alignment probability. However, the new optimiser is expected to be better and faster.
We have compared our method to the pair HMM method of Knudsen and Miyamoto (PairHMM_KM hereafter), which is quite similar to the present method in that it explicitly makes use of an evolutionary time parameter . We have also compared our method to the heuristic alignment program AVID of Bray et al.  in simulations that assume a general-time-reversible (GTR) model  that had first been fitted to real Drosophila non-coding DNA sequence data. It has been shown that AVID performs very well compared to other heuristic methods [13, 17], so here we only compare our method to AVID rather than other heuristic methods.
In our tests, the new DP method (MCALIGN2) is up to ten times faster than the previous MC method (MCALIGN), and is also faster than the pairHMM_KM method , although none can compete in speed terms with heuristic methods.
For cases of real non-coding sequence data, we also compared MCALIGN2 with AVID and CLUSTALW , and show that they perform differently for some specific cases.
We use a Bayesian statistical framework [20, 21] to make inference about the pairwise alignment. The aim is to compute the posterior probabilities of different possible alignments, using the observed sequences as data and eliminating other "nuisance" parameters from the analysis. Here we focus on finding the alignment with the highest posterior probability.
Let t be the total divergence time between two sequences, a be an alignment of two sequences, and S be the observed data, which is two non-coding DNA sequences. In a Bayesian framework, the behaviours of all variables are modelled by probability distributions. Joint inference about a and t is accomplished simply via Bayes' theorem
The probability P(S) that appears in the denominator of equation (1) may be difficult to calculate, but because in Bayesian inference the observed data S is held fixed, the unconditional probability P(S) is constant. We can therefore make our inference using only relative probabilities and P(S) need not be calculated. The other unconditional probability that appears in equation (1) is P(t), which is specified as a prior; our method will work for any prior.
To calculate the posterior probability of an alignment, we consider the divergence time t as a nuisance parameter. The posterior probability for an alignment is therefore marginal to the divergence time t, and is calculated using the integral
P(a|S) = ∫ P(a, t|S) dt. (2)
We approximate this integral using Laplace's method, described in detail below.
Probability model of sequence evolution
The transition probabilities for the pair HMM determine the pattern of indels in the alignment. The emission probabilities for the pair HMM determine the sequences that are observed, given the pattern of indels in the alignment. We specify the transition probabilities with an explicit model of insertion and deletion events, and the emission probabilities are specified by a model of nucleotide frequencies and of nucleotide substitutions. We consider the transition and emission probabilities in turn.
We assume that insertions and deletions occur as independent events over time with a total rate θ per interbase site relative to nucleotide substitutions. As we ignore multiple hits for indels, the probability of an indel is 1-e(-θt)per interbase site, which we approximated as θt, an approximation that should be good for small values of t. An indel can correspond to a gap in sequence x or a gap in sequence y. These two events have the same probability, so the probability of a gap in either of the two sequences, x and y, is then θt/2. In Figure 1, this corresponds to the transition probability from the M state to the Ix,1state, or to the Iy,1state. The pair HMM must move through one of these states whatever the length of the indel.
As shown in Figure 1, given that the pair HMM has arrived in state I x,i or I y,i (for any i ≥ 1), the transition probability back to the M state is w i / and the transition probability to state Ix,i+1or Iy,i+1is 1 - w i /. (Here a sum with no terms is understood to be zero.) This produces the desired distribution of indel lengths. We also assume that a gap in sequence x will not be followed directly by a gap in sequence y, and therefore there are no transitions from any of the states I x to any of the states I y , or vice versa. Our approach could be extended to accommodate an indel length distribution that is any mixture of geometric distributions (as used by Miklos et al. ) by duplicating the nodes in the pair HMM for insertions and deletions, and setting transition probabilities according to each component in the mixture. Such an extension may lead to increased accuracy, but at the expense of increased computational demands.
In order to make our pair HMMs describe a probability distribution over all possible alignments, we need to include a Begin state and an End state. We set the transition probability from the Begin state to states M, Ix,1and Iy,1to be the same as those from the M state. We allow all states to make transitions to the End state, with a low transition probability ε. If ε is small enough, we can ignore it in all of our calculations .
The emission probabilities, which determine the sequences given the pattern of indels, are derived from the general time-reversible (GTR) model of nucleotide substitution . The Jukes-Cantor  model and the Kimura-2-parameter  model are two specific cases of the GTR model when certain parameters are fixed.
The emission probabilities and are the equilibrium frequencies of nucleotides m i and n j , which are equal for sequences x and y. The emission probabilities are the probabilities of starting with an unobserved common ancestor nucleotide o, drawn from the equilibrium distribution of nucleotide frequencies, and evolving independently down two lineages, to m i in time t1 along one lineage and to n j in time t2 along the other lineage. (Under a time reversible model, this is the same as the probability of starting with n j and evolving to m i (or vice versa) in time t1+ t2). Since the times t1 and t2 are individually nonindentifiable, we parameterise our model simply by the total divergence time t = t1+ t2. For a given total divergence time, the conditional probability of evolving to m i given starting with n j is , and the matrix of these conditional probabilities, Q(t), can be calculated from the fixed instantaneous rate matrix A by matrix exponentiation , that is,
Q(t) = e tA . (3)
which can be calculated using the eigenvalues and eigenvectors of A. Here we estimate the rate matrix A from the same external data that is used to estimate the parameters for indels, as described below.
Given that P(a,S|t) has been specified by the model, and that a prior P(t) for the divergence time has also been specified, we have developed an algorithm to infer the approximate maximum a-posteriori (MAP) alignment . This is the alignment with highest posterior probability given the observed sequences, with the divergence time eliminated as a nuisance parameter. Thus, is the alignment that maximises P(a|S), which is given by the integral in equation (2). To approximate this integral, we assume that P(a, t|S), when treated as a function of t with both a and S held fixed, is approximately Gaussian. Then, using Laplace's method , we can write
Here is the mode, or value of t that maximises P(a, t|S) (again, when treated as a function of t with a and S held fixed). The quantity
is the modal dispersion, which is the reciprocal of the curvature at the mode . We make a further approximation,
P(a|S) ≈ P(a, |S)*C(S) (6)
where C(S) is a constant that depends on S but not on a or t. This approximation can be made when we wish to maximise P(a|S) over a set of a for which |V a | is approximately constant. The goodness of this approximation is discussed below.
Given that our approximations hold, Equation (6) shows that maximises P(a|S) if and only if maximises P(a,|S). Since by definition (, ) maximises P(a, t|S), we see that can be found by unrestricted optimisation of P(a, t|S). Our algorithm to find exploits the fact that we are free to solve the unrestricted optimisation problem with any manner we choose, and specifically that we can "change the order of maximisation". The statistical argument we presented above says that we should find for each a, and then maximise P(a, |S) over all a. An equivalent solution is to find t , for each t (that is, the best alignment for a given t) and then maximise P( t , t|S) over all t. The second solution is much easier in practice, because t , can be found using a standard dynamic programming algorithm for pair HMMs , and then P( t , t|S) can be maximised using any standard algorithm for maximising a one dimensional function.
The dynamic programming algorithm guarantees to find the global maximising t (with ties broken arbitrarily). We find a straightforward Golden Section Search  to be adequate for maximising P( t , t|S). This assumes that there is a single global optimum to be found. Actually we are able to trap events where local optima are detected. However, no local optima has ever been detected. We terminate the search when the values of t bracketing the maximum differ by less than 0.001. Moreover, we are able to terminate earlier when the optimal alignment is the same at all points within the bracketing area.
Parameterization of models of sequence evolution
Our model of noncoding DNA evolution is parameterized according to the empirical distribution of indel lengths and their overall rate relative to nucleotide substitutions from species for which essentially unambiguous alignments can be made. Here, we consider a parameterization by intronic data of D. simulans and D. sechellia (Shown in Figure 2). For these data, the rate of indels per interbase site, relative to the rate of nucleotide substitution, was previously estimated as θ = 0.225 . We fitted the observed frequencies of different indels lengths to our model as follows. We directly use the observed frequencies of 1-bp and 2-bp indels, that is, 0.455 and 0.182, respectively. For indels of ≥ 3-bp, the frequencies, W x , for the model were obtained by minimizing the sum over ≥ 3-bp indels of the squared differences between the observed frequency distribution and w x = β/α x . Here β is a constant. The estimate for α was 1.170. Our software performs this curve fitting and in fact the whole analysis with a supplied empirical distribution of containing any lengths.
A GTR model of nucleotide substitution was fitted to Drosophila data shown in Table 1. By assuming the GTR model, we can then symmetrise this matrix by averaging the table with its transpose before any of the following calculations were carried out. The estimated equilibrium frequencies of each base are obtained from the normalised column sums, yielding (q A , q G , q C , q T ) = (0.324, 0.197, 0.213, 0.266). The estimated rates of each type of substitution are obtained by dividing the entries in each column by the respective column sums, yielding:
Drosophila intronic data that is used to derive a GTR model of DNA evolution.
Finally, find the matrix A that satisfies Equation (3) when time is measured in units of expected substitutions, to obtain our estimate of the instantaneous rate matrix:
For non-coding sequences, there are few externally verified alignments available to test the performance of alignment methods. As a substitute, we simulate sequence divergence in silico, so that sequences are generated that are related by a known, "correct" alignment . We tested the MCALIGN2 program by examining the posterior probability of the best alignment found by the algorithm, the fraction of correctly aligned sites, an estimate of divergence time calculated from the estimated alignment, and the time taken to compute the alignment.
We compared the dynamic programming approach used here against the Monte-Carlo approach proposed previously  and the pair HMM approach of Knudsen and Miyamoto  in simulations assuming the Jukes-Cantor model of nucleotide evolution. In comparisons of MCALING2 and MCALIGN, for each simulated pair of sequences, we compared the posterior probability, P(a|S) ≃ P(a, |S), of the best alignment found by MCALIGN2 with the best alignments found by MCALIGN.
We also compared MCALIGN2 against AVID of Bray et al.  in simulations assuming a GTR model, parameterised using the Drosophila intronic data as described above. In these comparisons we investigated cases in which the model assumed by MCALIGN2 differed from the simulation model, by using the simpler JC and K2P models to analyse data simulated under a GTR model.
In all comparisons, we calculated the fraction of correctly aligned sites by counting the number of base pairs or bases-to-gaps which were correctly aligned in a comparison to the true alignment. As an alternative measure of alignment quality, we considered the precision of divergence time estimated from the alignments. The estimator of divergence time we used was distance under the GTR model. It is made by estimating the base frequencies q i , and the rates a ij , and finding ones that most closely predict the observed net transition matrix P . This estimator of divergence time uses only the non-indel regions, and does not use the presence of indels to help estimate divergence time. For all the simulations with a given divergence time t and a certain evolutionary model, we observed the mean and variance of the estimator of t calculated from both the true alignment and the alignments estimated by sequence alignment methods we considered here. We express the precision of the estimator as the estimated root mean squared error (e.r.m.s.e.), since none of the estimators examined are perfectly unbiased. For t, this is
when there are N simulations.
Although our program allows any prior for divergence time, for all comparisons we used a relatively diffuse or uninformative prior:
which has the mean 0.75. Because low divergences are more likely than high ones for two homologous sequences, this prior on t seems to be a reasonable one.
Comparison amongst PairHMM_KM, MCALIGN2(DP) and MCALIGN(MC)
We generated non-coding sequence data using a model of non-coding DNA evolution in which gap lengths are parametrized by intronic data of D. simulans and D. sechellia, and point substitutions occur according to the Jukes-Cantor model to compare the performances of PairHMM_KM, MCALIGN2 (DP hereafter) and MCALIGN (MC hereafter). In this setting, the DP and MC methods aim to find the same most probable alignment, since they assume essentially the same model and prior, but use different algorithms.
Performance of MCALIGN2(DP), MCALIGN(MC) and PairHMM_KM compared by the estimator of divergenct time corrected by the Jukes-Cantor model
Performance of MCALIGN2(DP), MCALIGN(MC) and PairHMM_KM compared by examining the proportions of correctly aligned sites.
Proportions of correctly aligned sites.
Comparison between MCALIGN2 and AVID
For each combination of values of t and θ, 200 replicate simulations were performed, each simulating a pair of sequences of length 500 base pairs, evolving under an indel model and a general time reversible (GTR) model of nucleotide substitution, parameterised using real Drosophila data. This model is very different to the simple Jukes-Cantor (JC) model, and quite different to Kimura's 2 parameter (K2P) model. In addition to comparing MCALIGN2 with AVID, it is interesting to explore the effect of the nucleotide substitution model assumed by MCALIGN2. We aligned each simulated pair of sequences using MCALIGN2 under the assumptions of the correct GTR model, a simple K2P model with the ratio of transition events to transversion events equal to 2, and the JC model.
Performance of MCALIGN2 and AVID compared by proportions of correctly aligned sites based on a GTR model
Proportion of matched bases
Performance of MCALIGN2 and AVID compared by estimator of divergence time based on a General Time-Reversible Model
Here, lower accuracy is indicated by a lower proportion of correctly aligned bases, and estimates of divergence time t that are more biased and have larger e.r.m.s.e.. In particular, alignments produced by AVID exhibit consistent upward bias estimates of t, and lower means proportions of correctly aligned bases than alignments produced by MCAL1GN2. This remains true, for most of the cases we considered here, whether MCALIGN2 used the correct GTR model of nucleotide substitution, or the incorrect JC or K2P models. The improvement in alignment quality gained by knowing the correct model of nucleotide substitution is generally modest, but worthwhile.
Test using real data
We also compared MCALIGN2 with AVID and CLUSTALW using real intronic DNA sequences from mouse and rat. Although we do not know the true alignments for real sequence data, we can still judge the alignment performances of different methods by examining the plausibility of the alignments (e.g. positions of gaps in the alignments and proportion of matched bases). Here we show three specific cases in which MCALIGN2 performed quite differently from AVID and CLUSTALW.
Figure 4(b) shows a different fragment from alignments produced by AVID, CLUSTALW and MCALIGN2. In this case, the alignment produced by MCALIGN2 also has a long gap from ~300 bp–~1000 bp, and it has the highest proportion of matched bases compared to other alignments. However, the alignment produced by CLUSTALW has several small gaps and a long gap in the terminal portion, and it has the lowest proportion of matched bases. Although the alignment produced by AVID looks better than the one produced by CLUSTALW, it is still fragmented by several small-length gaps.
However, MCALIGN2 does not always produce less gaps than other methods. As shown in Figure 4(c), the alignment produced by MCALIGN2 has more gaps than the others, but a smaller number of nucleotide differences. Without other information it is impossible to say which is more plausible.
The problem of statistical inference of an alignment can be separated into two parts: specifying a scoring function, and finding an alignment that optimises that scoring function. The scoring function is specified on biological and/or statistical grounds, and determines the biological meaningfulness and accuracy of the inferred alignment. The choice of optimising algorithm determines the speed of the method, and may hamper accuracy if convergence to a global optimum cannot be guaranteed. A useful alignment method must produce biologically meaningful and accurate alignments, and also must do so quickly. There is a trade-off because the most biologically realistic scoring functions are difficult to optimise.
Many scoring functions can essentially be described by the relative contributions for individual nucleotide substitution and indel events, which were traditionally thought of as penalty scores for mismatches and for gaps. However, no general theory guides the selection of these penalties , unless divergence time is known . Although almost all scoring functions have a probabilistic interpretation , only ones in which divergence time is an explicit parameter have an evolutionary interpretation. This inclusion of a time parameter is crucial in allowing us to train or parameterize our model using closely related sequences, in order to improve the accuracy of alignments between more distantly related sequences. Although the idea of training a scoring function on known alignments is an old one (especially with respect to amino acid substitutions (e.g. PAM250 matrix of Dayhoff et al. )), in the past it has generally been necessary to use a training set of sequences at similar evolutionary distance as the sequences that are ultimately to be aligned.
Heuristic scoring functions are often chosen because an algorithm exists to optimise them efficiently. However, without any underlying evolutionary model, the alignments produced by such methods will be biased (at least at some evolutionary distances), in the sense that they will exhibit features that depart in a systematic direction from the true alignment.
The evolutionary model used in our method strikes a balance between biological realism and computational tractability. We ignore multiple hits of indel events, and assume a distribution of indel lengths that corresponds to an improved affine gap penalty scheme. Our model is therefore quite different from more realistic evolutionary models that account properly for multiple hits of indel events [8, 9, 11, 12]. The TKF91 model is particularly unrealistic for non-coding DNA, since it allows only single base indels. Keightley and Johnson  suggest that the present model (ignoring multiple hits for indels) is a better approximation to their simulation model (which allowed multiple hits of multi-base pair indels), for the parameter values used in their simulations. The TKF92 model allows a geometric distribution of indel lengths, but only allows whole insertions to be subsequently deleted, or vice versa. That model has therefore been criticised as introducing non-biological "hidden fragment boundaries". Since our model does not allow insertions to be deleted at all, or vice versa, it could be seen as also introducing "hidden fragment boundaries". Our model allows a more realistic distribution of indel lengths than the TKF92 model. The approach of Knudsen and Miyamoto  could be seen as an extension of the TKF92 model, assuming a geometric distribution of indel lengths and allowing multiple hits involving up to two indel events. Our results suggest that this model (approximated using a three state pair HMM), and our model (using a seven state HMM) offer approximations of very similar quality. Intuitively, we would have expected our model to be superior when multiple hits of indel events were rare, i.e. for relatively smaller evolutionary distances and indel rates. However, it seems that in such cases the performance of both methods is so good that it is hard to detect any difference. The "long indel" model of Miklos et al.  is certainly more realistic than either model, since it allows an arbitrary distribution of indel lengths and accounts almost exactly for multiple hits of indels. However, the finite trajectory algorithm  used to account for multiple hits is computationally expensive (O(L4) in complexity).
MCALIGN2 uses a dynamic programming algorithm that is guaranteed to find the most probable alignment for a given divergence time, whereas the stochastic hill-climbing algorithm used in the Monte Carlo method can only search locally by making heuristically chosen adjustments to an alignment.
MCALIGN2 stops its search when the maximising divergence time is bracketed to high precision, with the bracket length being reduced by a geometric factor at each step of the algorithm. In contrast, the Monte Carlo method must search until no improvement in alignment probability is found during a predetermined number of iterations.
In comparisons of MCALIGN2 against the pair HMM method of Knudsen and Miyamoto, a method with an evolutionary time parameter and an affine gap penalty , we found that the two methods performed very similarly for almost all cases, but MCALIGN2 is computationally faster. When comparing MCALIGN2 against AVID, a time-naive model , we found that MCALIGN2 produced better quality alignments than AVID for almost all combinations of parameters. This shows that, when the evolutionary model is known, this knowledge can be used in in a model based inference method to estimate alignment more accurately.
Despite being substantially faster than our original Monte Carlo approach and the pair HMM method of Knudsen and Miyamoto, MCALIGN2 cannot compete with AVID in terms of execution time, because of the clever heuristics used by AVID. Its general strategy for aligning two sequences is to select anchors using a variant of the Smith-Waterman algorithm  to split long sequences into short sequences, which are aligned by a dynamic programming algorithm, Needleman-Wunsch . A set of maximal matches between sequences is constructed using a suffix tree. This approach is fast and memory efficient, and practical for sequence alignments of large genomic regions up to megabases long . In principle, the fast heuristics used by AVID can be applied for any pair HMM, and therefore could be combined with our approach to give faster, high quality alignments.
In order to examine the robustness of the MCALIGN2 method, we also investigated cases in which the model assumed in the MCALIGN2 analysis was a simpler model (JC or K2P) than the model the data were simulated under (GTR). Generally, the MCALIGN2 method assuming an incorrect model still has good performance for small and medium divergence times, but for larger divergence times and/or higher indel rates, performance suffers slightly compared with when the correct GTR model was assumed. Therefore, when aligning sequences from distant species, it is desirable to use an evolutionary model that is as realistic as possible. However, it is in precisely this situation that it may be most difficult to estimate a model, because the assumption of that the evolutionary process is the same between closely and distantly related species is most likely to break down.
When inferring alignment in a Bayesian framework, divergence time is a nuisance parameter that must be eliminated by integration (Equation 2). The computational implementation of our method relies totally on being able to approximate this integral (Equations 4 and 6) rather than having to calculate it numerically using e.g. quadrature. The approximations we make will be good when P(t|a,S) is approximately normal with constant variance for a certain set of high probability alignments. Because P(t|a,S) is a product of multinomial probabilities, the normality approximation will be good for long sequences under most models of molecular evolution. The assumption of constant variance will be reasonable when high probability alignments differ from each other by only a few indels and substitutions, relative to the total sequence length. As a concrete check of this assumption, we used the Monte Carlo search algorithm of Keightley and Johnson  and retained the set of all alignments visited that had probability at least 0.01 as large as the maximum probability. Within this set, the correlation between P(a|S) computed "exactly" (using quadrature) and P(a|S) exceeded 0.98.
It is worth mentioning that, to our knowledge, no better method has been found for eliminating divergence time as a nuisance parameter when estimating alignment. Most authors concentrate on finding the true MLE for t, summing over all possible alignments, using the EM algorithm [8–10, 35]. The best way to estimate the alignment has not been considered in detail, but a common approach is to use the most probable alignment conditional on the observed sequences and conditional on the MLE for t. Although our method has a more direct Bayesian justification, given the approximations made it is likely that the two approaches will give similar results.
Sequence alignment is a major issue for the evolutionary analysis of non-coding DNA. We developed a model-based method, MCALIGN2, as an improvement to the previous Monte Carlo method MCALIGN. MCALIGN2 uses a deterministic global optimiser to find the alignment with the highest posterior probability. It allows a rich class of evolutionary models of indel length along with any time reversible model of nucleotide substitution. As shown in the test results, MCALIGN2 outperforms other available non-coding DNA sequence alignment methods for all the cases we have considered.
Availability and requirements
Project name: MCALIGN2
Project home page: http://homepages.ed.ac.uk/eang33/
Operating system: Platform independent
Programming language: C++
Other requirements: C++ compiler if downloading and compiling the source code
Licence: FSF GENERAL public licence.
We thank Dr Bjarne Knudsen for providing their pair HMM program, Daniel Halligan for some useful comments and Daniel Gaffney for providing mouse and rat intronic sequences. JW was supported by Dorothy Hodgkin Postgraduate Studentship Award. TJ was supported by the Biotechnology and Biological Sciences Research Council grant #206/D 16977.
- Li WH: Molecular Evolution. Sinauer Associates, Sunderland, MA; 1997.Google Scholar
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062View ArticleGoogle Scholar
- International Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520–562. 10.1038/nature01262View ArticleGoogle Scholar
- Halligan DL, Keightley PD: Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Research 2006. Accepted AcceptedGoogle Scholar
- Keightley PD, Gaffhey DJ: Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc Natl Acad Sci 2003, 100: 13402–13406. 10.1073/pnas.2233252100PubMed CentralView ArticlePubMedGoogle Scholar
- Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, Antonarakis SE: Numerous potentially functional but non-genie conserved sequences on human chromosome 21. Nature 2002, 420: 578–582. 10.1038/nature01251View ArticlePubMedGoogle Scholar
- Pollard DA, Bergman CM, Stoye J, Celniker SE, Eisen MB: Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics 2004, 5: 6. 10.1186/1471-2105-5-6PubMed CentralView ArticlePubMedGoogle Scholar
- Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 1991, 33: 114–124. 10.1007/BF02193625View ArticlePubMedGoogle Scholar
- Thorne JL, Kishino H, Felsenstein J: Inching toward reality-An improved likelihood model of sequence evolution. J Mol Evol 1992, 34: 3–16. 10.1007/BF00163848View ArticlePubMedGoogle Scholar
- Miklos I, Toroczkai Z: An improved model for statistical alignment. In WABI, LNCS 2149 Edited by: Gascuel O, Moret BME. 2001, 1–10.Google Scholar
- Knudsen B, Miyamoto MM: Sequence alignments and pair hidden markov models using evolutionary history. J Mol Biol 2003, 333: 453–460. 10.1016/j.jmb.2003.08.015View ArticlePubMedGoogle Scholar
- Miklos I, Lunter GA, Holmes I: A "long indel" model for evolutionary sequence alignment. Mol Biol Evol 2004, 21(3):529–540. 10.1093/molbev/msh043View ArticlePubMedGoogle Scholar
- Keightley PD, Johnson T: MCALIGN: Stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Res 2004, 14: 442–450. 10.1101/gr.1571904PubMed CentralView ArticlePubMedGoogle Scholar
- Haddrill PR, Charlseworth B, Halligan DL, Andolfatto P: Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content. Genome Biology 2005, 6: R67. 10.1186/gb-2005-6-8-r67PubMed CentralView ArticlePubMedGoogle Scholar
- Keightley PD, Lercher MJ, Eyre-Walker A: Evidence for widespread degradation of gene control regions in hominid genomes. PLoS Biology 2005, 3: 872–877. 10.1371/journal.pbio.0030042View ArticleGoogle Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. In Mammalian protein metabolism. Edited by: Munro HN. Academic Press, New York; 1969:21–123.View ArticleGoogle Scholar
- Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res 2003, 13: 97–102. 10.1101/gr.789803PubMed CentralView ArticlePubMedGoogle Scholar
- Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 1984, 20: 86–93. 10.1007/BF02101990View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W-Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis chapter 1 and 12. Chapman and Hall/CRC Press, New York; 2003.Google Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis: Probabilistic models of proteins and nucleic acids, chapters 2, 3 and 4. Cambridge University Press, Cambridge, UK; 1998.View ArticleGoogle Scholar
- Ewens WJ, Grant GR: Statistical Methods in Bioinformatics. Springer-Verlag, New York; 2001.View ArticleGoogle Scholar
- Lunter GA, Drummond AJ, Miklós I, Hein J: Statistical Alignment: Recent Progress, New Applications, and Challenges. Edited by: Rasmus Nielsen. "Statistical methods in Molecular Evolution", Springer Verlag's Series in Statistics in Health and Medicine; 2004.Google Scholar
- Miller W, Myers EW: Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology 1988, 50: 97–120. 10.1016/S0092-8240(88)80016-8View ArticlePubMedGoogle Scholar
- Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 1980, 16: 111–120. 10.1007/BF01731581View ArticlePubMedGoogle Scholar
- Felsenstein J: Inferring Phylogenies. Volume 13. Sinauer Associates, Sunderland, MA; 2004.Google Scholar
- O'Hagan A, Forster J: Bayesian Inference, volume 2B of Kendall's Advanced Theory of Staistics. Volume 9. 2nd edition. Arnold, London; 2004.Google Scholar
- Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical recipes in C: the art of scientific computing. Volume 10. Cambrige University Press, Cambridge, UK; 1992.Google Scholar
- Zhang ZL, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Research 2003, 31: 5338–5348. 10.1093/nar/gkg745PubMed CentralView ArticlePubMedGoogle Scholar
- Halligan DL, Eyre-Walker A, Andolfatto P, Keightley PD: Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res 2004, 14: 273–279. 10.1101/gr.1329204PubMed CentralView ArticlePubMedGoogle Scholar
- Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18: 1500–1507. 10.1093/bioinformatics/18.11.1500View ArticlePubMedGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by: Dayhoff MO. National Biomedical Research Foundation, Silver Spring, Washington D.C; 1978:345–352.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar
- Holmes I, Bruno WJ: Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 2001, 17: 803–810. 10.1093/bioinformatics/17.9.803View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.