General continuoustime Markov model of sequence evolution via insertions/deletions: local alignment probability computation
 Kiyoshi Ezawa^{1, 2}Email authorView ORCID ID profile
https://doi.org/10.1186/s1285901611676
© The Author(s). 2016
Received: 1 April 2016
Accepted: 9 August 2016
Published: 27 September 2016
Abstract
Background
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a method to reliably calculate the occurrence probabilities of sequence alignments via evolutionary processes on an entire sequence. Previously, we presented a perturbative formulation that facilitates the ab initio calculation of alignment probabilities under a continuoustime Markov model, which describes the stochastic evolution of an entire sequence via indels with quite general rate parameters. And we demonstrated that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) delimited by gapless columns.
Results
Here, using our formulation, we attempt to approximately calculate the probabilities of local alignments under spacehomogeneous cases. First, for each of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs), we numerically computed the total contribution from all parsimonious indel histories and that from all nextparsimonious histories, and compared them. Second, for some common types of local PWAs, we derived two integral equation systems that can be numerically solved to give practically exact solutions. We compared the total parsimonious contribution with the practically exact solution for each such local PWA. Third, we developed an algorithm that calculates the firstapproximate MSA probability by multiplying total parsimonious contributions from all local MSAs. Then we compared the firstapproximate probability of each local MSA with its absolute frequency in the MSAs created via a genuine sequence evolution simulator, Dawg. In all these analyses, the total parsimonious contributions approximated the multiplication factors fairly well, as long as gap sizes and branch lengths are at most moderate. Examination of the accuracy of another indel probabilistic model in the light of our formulation indicated some modifications necessary for the model’s accuracy improvement.
Conclusions
At least under moderate conditions, the approximate methods can quite accurately calculate ab initio alignment probabilities under biologically more realistic models than before. Thus, our formulation will provide other indel probabilistic models with a sound reference point.
Keywords
Background
The evolution of DNA, RNA, and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination, and other genomic rearrangements (e.g., [1–3]). Thus far, analyses on substitutions have predominated in the field of molecular evolutionary study, in particular using the probabilistic (or likelihood) theory of substitutions that is now widely accepted (e.g., [4–6]). This is probably because evolutionary models describing residue substitutions are relatively easier to handle. However, it must be remembered that the study of indels is at least as crucial as the study of substitutions. There are two major reasons for this. First, it is indels but not substitutions that yield the skeletons (or the gap configurations) of the sequence alignments (reviewed, e.g., in [7]), which provide essential inputs to most homologybased analyses in computational biology. And second, some recent comparative genomic analyses have revealed that indels account for more base differences between the genomes of closely related species than substitutions (e.g., [8–12]). These circumstances make it imperative to develop a stochastic model that enables us to reliably calculate the probability of sequence evolution via mutations including insertions and deletions. Since the groundbreaking works by Bishop and Thompson [13] and by Thorne, Kishino and Felsenstein [14], many studies have been done to develop and apply methods to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under the probabilistic models aiming to incorporate the effects of indels. Such methods have greatly improved in terms of the computational efficiency and the scope of application. See excellent reviews for details on this topic (e.g., [15–17]). A majority of these studies are based on hidden Markov models (HMMs) (e.g., [18]) or transducer theories (e.g., [19]). Both of them calculate the indel component of an alignment probability as a product of intercolumn transition probabilities or of blockwise contributions. And the study on these methods is still advancing, strengthening their mathematical and algorithmic bases (e.g., [20, 21]). Unfortunately, most of these methods have at least either of two fundamental problems, one regarding the evolutionary consistency and the other regarding the biological realism. (For details, see Background of [22].) Regarding the evolutionary consistency, it is a priori unclear whether or how a HMM or a transducer is related with any genuine stochastic evolutionary model (or “evolutionary model” for short), which describes the evolution of an entire sequence via indels along a time axis. Regarding the biological realism, the standard HMMs or transducers can at best handle mixed geometric distributions of indel lengths (e.g., [23]) (and usually implement simple geometric distributions), whereas many empirical studies showed that the real indel lengths are distributed according to powerlaws (e.g., [24] and references therein). Besides, very few studies thus far (e.g., [25]) addressed the issue of indel rate variation across regions.
In a previous study [22], we presented a theoretical formulation that facilitates the ab initio calculation of alignment probabilities under a genuine stochastic evolutionary model, specifically, a general continuoustime Markov model of sequence evolution via indels. Our evolutionary model was created as a result of incorporating the idea of positionspecific evolutionary rates [25] into the most general “substitution/insertion/deletion model” [26]. Thus, the model is naturally devoid of the aforementioned two problems. Aided by some techniques of timedependent perturbation theory in quantum mechanics [27–29], we formally expanded the ab initio probability of an alignment into a series of terms with different numbers of indels. This expansion gave an intuitively clearer representation of Feller’s theorems [30]. And it theoretically underpinned the stochastic evolutionary simulation method of Gillespie [31], which provides the foundation for genuine sequence evolution simulators (e.g., [32–34]). And we also showed that, if the indel model parameters satisfy a certain set of conditions, the ab initio probability of an alignment is indeed factorable into the product of an overall factor and contributions from local alignments delimited by preserved ancestral sites (PASs), i.e., gapless columns. This suggested that the evolutionary models satisfying such conditions could provide a sort of generalized HMMs, which extend the space and timehomogeneous “long indel” model [26] to some space and timeheterogeneous situations.
In this paper, we focus on how to concretely calculate the contribution from each local alignment, assuming that the indel model satisfies the conditions for factorable alignment probabilities. (To clearly illustrate the concrete computations, we deal with spacehomogeneous models in the bulk of this manuscript (i.e., in sections R2R6), and briefly discuss extensions to more general cases near the end (i.e., in subsection R7.1).) As noted in [26] and section R1 of Results and discussion of this manuscript, the contribution from each local alignment is a summation over an infinite number of local indel histories. Thus it cannot be computed literally exactly within a finite amount of time. This makes it necessary to devise some approximation methods, each of which sums contributions from a finite number of indel histories (as first proposed in [26]). An auspice is that indel rates (say, λ _{ ID } indels per site per unit time) are known to be at most around 1/10 of the substitution rates (say, λ _{ S } substitutions per site per unit time) (e.g., [24, 35]). And the probability of an indel history involving N _{ ID } indels is roughly \( O\left({\left({\lambda}_{ID}\;t\right)}^{N_{ID}}\right) \) times the probability of a history with no indel, where t is a time scale characteristic of the system under consideration. In conjunction, these suggest that taking account only of histories with minimum and nearminimum required numbers of indels may provide a good approximation to each local alignment probability, as long as the sequence divergences (e.g., λ _{ S } t) are within the scope of phylogenetic analyses (i.e., at most O(1) substitutions per site).
In section R1 of Results and discussion, we briefly review the relevant portion of the theoretical basis that was established in our previous study [22]. We introduce simplified notation so that we can focus on a single local alignment. In sections R2R4, we demonstrate how our perturbative formulation can be concretely used to approximately calculate the contributions to the ab initio alignment probabilities from local alignments, i.e., alignment regions separated by gapless columns. We examine all types of local pairwise alignments (PWAs) in section R2, and some typical types of local multiple sequence alignments (MSAs) in section R4. For each local alignment type, we calculate the total parsimonious contribution and the total nextparsimonious contribution to its probability (more precisely, its multiplication factor). In section R3, we discuss two systems of integral equations that can be numerically solved to give practically exact solutions (or “exact” solutions, for short) for some common types of local PWAs. There, we also study the behaviors of the “exact” solutions. Then, by comparing the total parsimonious contribution with the total nextparsimonious contribution, or with the “exact” solution, we investigate the parameter regions in which the total parsimonious contribution can approximate the alignment probability quite accurately (in sections R2R4). In section R5, we perform simulation analyses with a genuine evolutionary simulator, Dawg [32], to examine whether or not the conclusions from sections R2R4 also apply to local MSAs of more general types. For this purpose, we developed an algorithm to calculate the “firstapproximate” probability of a given MSA under a given parameter setting (including a given tree) by multiplying the overall factor and the total parsimonious contributions from all local MSAs. And we examine the accuracy of the firstapproximate multiplication factors calculated by the algorithm. In section R6, we use our ab initio formulation as a “yardstick” to measure the accuracy of other indel probabilistic models. As a representative model, we chose the generalized HMM of [36], which aims for the biological realism but not fully for the evolutionary consistency. In section R7, we discuss some outstanding issues and possible improvements, extensions and applications of the presented algorithm and methods. The topics include the risks associated with the naïve application of our algorithm or methods to reconstructed alignments. The sections in Methods describe the settings for numerical analyses (M1) and simulation analyses (M2). And the sections in Supplementary methods in Additional file 1 explain methodological details on concrete perturbation calculations and the firstapproximate algorithm.
This paper basically uses the same conventions and notations as used in [22]. Briefly, a sequence state s (∈ S ^{ II }) is represented as an array of sites, each of which is equipped with an ancestry index (υ _{ x } ∈ ϒ).^{1} (In this study, we focus on indels. Hence, we do not consider the residue states of sequences. For the incorporation of residue states and substitutions, see, e.g., [37, 38].) And each indel event is represented as an operator acting on the bravector, 〈s, representing a sequence state. More specifically, the operator \( {\widehat{M}}_I\left(x,\kern0.5em l\right) \) denotes the insertion of l sites between the xth and (x + 1) th sites, and the operator \( {\widehat{M}}_D\left({x}_B,\kern0.5em {x}_E\right) \) denotes the deletion of a subarray between (and including) the x _{ B }th and the x _{ E }th sites. Readers unfamiliar with the braket notation (as adapted from theoretical physics (e.g., [27, 28])) can simply regard a bravector (〈s), a ketvector (s ^{′}〉) and an operator \( \left(\widehat{M}\right) \) as convenient reminders of a row vector, a column vector and a matrix, respectively, just as in the standard representation of a continuoustime Markov model. (See section SA1 in Additional file 2 of [22] for the equivalence between them.) And, also as in [22], the following terminology is used. The term “an indel process” means a series of successive indel events with both the order and the timing specified. And the term “an indel history” means a series of successive indel events with only the order specified.
As a last note, an “alignment,” a “PWA” and a “MSA” in this paper will mean their homology structures [39]. Briefly, the homology structure of an alignment is a set of alignment columns (i.e., sets of homologous sites in the aligned sequences) that are spatially arranged in a looser way than in a usual alignment, i.e., constrained only by the spatial relationships between the sites within each aligned sequence.^{2}
Results and discussion
[Descriptions given here are somewhat sketchy. For methodological details, as well as the relationship with the results of [22], see the relevant sections of Supplementary methods in Additional file 1.]
R1. Perturbation expansion of multiplication factor for local alignment
In this section, we briefly explain some results in [22] that are essential for this paper. Similarly to that of the probabilistic alignment methods in general, one of the main goals of our theoretical formulation (presented in [22]) is to calculate the absolute occurrence probabilities of the alignments and to compare the calculated alignment probabilities. Therefore, unless stated otherwise, the probabilities considered in this paper are not conditioned on a particular alignment (or even on extant sequences). (Once the absolute probabilities are calculated, such conditional probabilities (e.g., of indel histories) could be obtained by dividing the absolute probabilities of the outcomes (e.g., the indel histories) by the absolute probability of the condition (e.g., the resulting alignment), similarly to Eq. (SM5.3.6a) in Additional file 1.)

Condition (i): Each indel rate parameter is independent of the portion of the sequence state outside of the local region where the indel occurred.

Condition (ii): The increment of the exit rate due to each indel event is independent of the portion of the sequence state outside of the local region where the indel occurred. (The “exit rate” of a state is the rate at which the system “exits” the state, that is, the total rate at which the state changes to any of other states.)
Here, N _{ min }[α(s ^{ A }, s ^{ D }); γ _{ κ }] is the minimum number of indels required for the portion of α(s ^{ A }, s ^{ D }) in γ _{ κ }. And the term μ _{ P (N)}[γ _{ κ }; (α(s ^{ A }, s ^{ D }), [t _{ I }, t _{ F }])  (s ^{ A }, t _{ I })] is the portion of the multiplication factor contributed from all localPWAconsistent Nindel local histories in γ _{ κ }.^{3} (For more details, see the first half of SM1 of Supplementary methods in Additional file 1.)
It should be noted that the multiplication factor, \( {\tilde{\mu}}_P\left[\dots \right] \) (e.g., in Eq. (2)), is not a probability; actually, it is not even a conditional probability, and it can exceed 1 (unity) in some cases (in such manners that the entire right hand side of Eq. (2) will always be less than 1 (unity)). In this sense, the “generalized HMM” given by Eq. (2) differs from normal HMMs. The contribution of each local indel history to a multiplication factor is the ratio of the probability of the history (given an ancestral state) to the probability that the ancestral state underwent no indel. (See Eq. (SM1.7) in Additional file 1 for the mathematical definition.) When comparing the contributions from two different sets of histories (potentially giving rise to the same local alignment), the denominator (i.e., the probability of no indel) is usually identical. Therefore, in general, the comparison of two multiplication factor contributions gives the same result as the comparison of the corresponding probabilities. (Similar notes apply also to the analyses of local MSAs below.)

Condition (iii): the (prior) probability of each root state (s ^{ Root }) is given by the probability of s _{0} ^{ Root } multiplied by the product of factors over the local regions, where each factor depends only on the difference between s ^{ Root } and s _{0} ^{ Root } in a local region.
Here, N _{ min }[C _{ Κ }] is the minimum number of indels required for the portion of the MSA in C _{K}. And the term \( {\overset{\smile }{M}}_{P\kern0.5em (N)}\left[\alpha \left[{s}_1,\kern0.5em {s}_2,\dots, \kern0.5em {s}_{N^X}\right];\kern0.5em {s}_0^{Root};\kern0.5em {C}_K\kern0.5em \left\kern0.5em T\right.\right] \) is the fraction of the multiplication factor contributed from all localMSAconsistent Nindel local histories in C _{K}. (For more details, see the second half of SM1 of Supplementary methods in Additional file 1.)
As agued above, under conditions (i) and (ii) (and, in addition, (iii) for a MSA), the probability of a given alignment is factorized into the product of an overall factor and local contributions (as in Eq. (2) and Eq. (5)). This factorization could drastically speed up the computation of the probability. However, each local contribution is still a summation over an infinite number of indel histories (Eq. (3) and Eq. (6)), and its literally exact calculation would take infinitely long. This study examines two kinds of approximation methods. One is the “first approximation,” which approximates each multiplication factor with the “total parsimonious contribution,” i.e., the summation of contributions over all possible parsimonious indel histories. (It corresponds to the term, μ _{ P (N)}[γ _{ κ }; (α(s ^{ A }, s ^{ D }), [t _{ I }, t _{ F }])  (s ^{ A }, t _{ I })] with N = N _{ min }[α(s ^{ A }, s ^{ D }); γ _{ κ }] or \( {\overset{\smile }{M}}_{P\kern0.5em (N)}\left[\alpha \left[{s}_1,\kern0.5em {s}_2,\dots, \kern0.5em {s}_{N^X}\right];\kern0.5em {s}_0^{Root};\kern0.5em {C}_K\kern0.5em \left\kern0.5em T\right.\right] \) with N = N _{ min }[C _{ Κ }].) And the other is calculating a practically exact solution (or an “exact” solution, for short) of each local contribution from a local PWA of a certain type (see section R3). Especially, we examine the accuracy of the first approximation by comparing the total parsimonious contribution either with the total nextparsimonious contribution (sections R2 and R4), with the “exact solution” (section R3) or with the results of simulations (section R5). Here, the “total nextparsimonous contribution” is the summation of contributions over all nextparsimonious indel histories. (It usually corresponds to the above term with N = N _{ min }[α(s ^{ A }, s ^{ D }); γ _{ κ }] + 1 or with N = N _{ min }[C _{ Κ }] + 1).
Hereafter, we will often employ shorthand notations for the aforementioned (fractions of) multiplication factors, e.g., μ _{ P (N)}[γ _{ κ }; (α(s ^{ A }, s ^{ D }), [t _{ I }, t _{ F }])  (s ^{ A }, t _{ I })] for a PWA and \( {\overset{\smile }{M}}_{P(N)}\left[\alpha \left[{s}_1,\kern0.5em {s}_2,\dots, \kern0.5em {s}_{N^X}\right];\kern0.5em {s}_0^{Root};\kern0.5em {C}_K\kern0.5em \left\kern0.5em T\right.\right] \) for a MSA, either by omitting the arguments (like “μ _{ P (N)}” and “\( {\overset{\smile }{M}}_{P\kern0.5em (N)} \)”) or by replacing the arguments with simpler ones representing more concrete situations (like “μ _{ P (N)}[case (ii); ΔL ^{ A }]” or “\( {\overset{\smile }{M}}_{P\kern0.5em (N)}\left[ case\kern0.5em (II);\kern0.5em \varDelta {L}^{D12}\right] \)”). Unless stated otherwise, we consider the sequence evolution during time interval [t _{ I }, t _{ F }] (for a PWA) or along a given tree, T (for a MSA).
Here, \( {\varDelta}^{Dawg}\left[{\lambda}_I,{\lambda}_D,{f}_D(.)\right]\kern0.5em \equiv \kern0.5em {\lambda}_I\kern0.5em +\kern0.5em \left({\overline{l}}_D1\right){\lambda}_D \) is a “universal” constant factor, and \( {\overline{l}}_D\kern0.5em \equiv \kern0.5em {\displaystyle {\sum}_{l=1}^{L_D^{CO}}l\kern0.5em {g}_D\left(l,t\right)} \) is the average deletion length [32]. In this study, we use the powerlaw indel length distribution: \( {f}_I(l)\kern0.5em =\kern0.5em {f}_D(l)\kern0.5em =\kern0.5em {l}^{1.6}/\left[{\displaystyle {\sum}_{k=1}^{L_I^{CO}}{k}^{1.6}}\right] \), which is among the typical ones empirically observed (e.g., [24] and references therein). We also set λ _{ I } = λ _{ D } according to a genomewide data analysis [41], unless otherwise stated. As for the sequence state probabilities at the root, we assume a uniform sequence length distribution hereafter.^{5} See sections M1 and M2 of Methods for more specific settings.
R2. Numerical comparison between parsimonious and nextparsimonious contributions (1): for local PWAs
Perturbation analysis on local PWA probabilities in case (iv)
(∆L ^{ A }, ∆L ^{ D })  0.01 indels/site  0.04 indels/site  0.1 indels/site  0.2 indels/site 

(1, 1)  0.003  0.010  0.024  0.045 
(3, 1)  0.021  0.084  0.204  0.393 
(3, 3)  0.042  0.166  0.402  0.768 
(5, 5)  0.073  0.283  0.672  1.256 
(10, 1)  0.064  0.246  0.572  1.013 
(10, 10)  0.149  0.561  1.292  2.288 
(25, 1)  0.151  0.547  1.112  1.541 
(25, 4)  0.198  0.723  1.519  2.234 
(30, 10)  0.288  1.038  2.164  3.072 
(100, 1)  0.537  1.333  1.507  1.574 
(100, 3)  0.607  1.593  1.894  2.033 
(300, 1)  1.165  1.394  1.427  1.527 
R3. Numerical comparison between parsimonious contribution and "exact solution" for local PWAs
It is difficult to calculate the summed contributions from local histories involving more indels, especially in case (iv). We could exactly calculate the contribution from a single local history involving any number of indels if we use the algorithm for a “trajectory likelihood” given by Miklós et al. [26]. As we exemplified in Appendix A1.2 of [43], however, it is already quite hard to enumerate even all the possible nextparsimonious local indel histories for case (iv). Nevertheless, if we consider only cases (i), (ii), and (iii) under a (locally) spacehomogeneous model, we can work out systems of exact integral equations that could in principle provide the numerical solutions for the total sum of contributions up to a desired level of accuracy, i.e., \( {\tilde{\mu}}_P^{\left\langle {N}_{ID}\right\rangle}\equiv \kern0.5em {\displaystyle {\sum}_{N={N}_{min}}^{N_{ID}}{\mu}_{P(N)}} \) with a desired upperbound indel count (N _{ ID }), at the expense of some time and memory.
Similarly to our ab initio formulation itself, these systems of integral equations can accommodate any practical indel length distributions. Therefore, we could even examine cases where insertions and deletions follow different length distributions and/or models that incorporate transposon insertions (e.g., [44, 45]) as well. Such analyses should be interesting and important.
R4. Numerical comparison between parsimonious and nextparsimonious contributions (2): for local MSAs
Perturbation analysis on local MSA probabilities in case (IV)
(∆L ^{ D1}, ∆L ^{ D2})  0.01 indels/site  0.04 indels/site  0.1 indels/site  0.2 indels/site 

(2, 1)  0.004  0.016  0.037  0.067 
(3, 1)  0.016  0.063  0.150  0.279 
(3, 2)  0.012  0.049  0.118  0.225 
(10, 1)  0.050  0.190  0.432  0.751 
(10, 2)  0.060  0.225  0.509  0.895 
(10, 8)  0.060  0.231  0.543  0.980 
(10, 9)  0.047  0.182  0.421  0.740 
(30, 1)  0.140  0.487  0.915  1.203 
(30, 5)  0.163  0.563  1.118  1.645 
(30, 25)  0.173  0.620  1.251  1.850 
(30, 29)  0.139  0.486  0.909  1.189 
(100, 1)  0.418  0.985  1.180  1.304 
(100, 99)  0.418  0.981  1.170  1.290 
Taken together, the results in sections R2R4 suggest that the first approximation by the parsimonious indel histories alone will estimate the multiplication factor for each local alignment fairly well, as long as the local alignment size and the branch lengths (or the time interval) are at most moderate.
R5. Simulation analyses to see goodness of first approximation for local MSAs
Thus far, we examined all cases of local PWAs and some typical cases of local MSAs. To study a much wider variety of local MSAs, we developed an algorithm that calculates the first approximation of the probability of a given MSA under a given parameter setting including a phylogenetic tree. Briefly, the algorithm first chops the input MSA into gapped and gapless segments. Second, it attempts to enumerate all parsimonious indel histories that can give rise to each gapped segment (i.e. local MSA) via what we call a “local multipath downhill search” algorithm. Third, it computes their contributions to the multiplication factor for each gapped segment. And finally, it computes the firstapproximate MSA probability as the product of an overall factor and the total parsimonious contributions to the multiplication factors from all gapped segments. (For details on the algorithm, see SM5 of Supplementary methods in Additional file 1, as well as Additional file 1: Figures S5S8.)
After manually validating the subalgorithm to enumerate all parsimonious indel histories (described in [48]), we conducted simulation analyses using Dawg [32]. (See section M2 of Methods for the settings of the simulations.) We created five homogeneous sets of simulated MSAs, namely, sets 1A, 1B, 3P, 3M and 3F. Each of sets 1A and 1B consists of 100,000 MSAs simulated along a threeOTU tree that has equally long branches and is rooted at its sole internal node. The expected number of indels per site along each branch (E[n _{ ID }]) is 0.01 (small) for set 1A and 0.04 (medium) for set 1B. Sets 3P, 3M and 3F consist of 10,000 MSAs each, which are simulated along the trees of 12 primates, 15 mammals and 9 fastevolving mammals, respectively (Additional file 1: Figure S9). These sets were designed to mimic typically encountered MSAs of selectively neutral genome sequences whose sequence divergences are small, moderate and large, respectively.^{9}
Every simulation started with a random DNA sequence that is 1000 bases long. For reasons of computational time, we excluded local MSAs containing gaps longer than 100 bases. The numbers of subject local MSAs in sets 1A, 1B, 3P, 3M and 3F were 2,676,332, 7,695,575, 397,455, 935,553 and 984,321, respectively. Among them, 0.15 %, 1.38 %, 0.12 %, 0.23 % and 0.49 %, respectively, exhibited nonparsimonious ancestral sequence states. (See SM6 of Supplementary methods in Additional file 1 for how the MSAs were compared.)
The results in this section suggest that the first approximation would work fairly well also for a majority of local MSAs we are likely to encounter, as long as the local MSA and the branches are at most moderately long.^{10}
R6. Examining other indel models and methods in light of our formulation
One of the major merits of our ab initio perturbative formulation is that it can be applied to considerably realistic evolutionary models of indels [22]. Therefore, it will enable us to examine, e.g., the parameter ranges where other indel probabilistic models can well approximate the ab initio alignment probability under a fairly realistic evolutionary model.
First we briefly study the goodness of approximation by the geometric indel length distribution, which most of commonly used indel models are based on. As in the previous sections, we use the powerlaw distribution of the indel length (l), f _{1.6} ^{ PL } (l) = l ^{− 1.6}/[∑ _{ k = 1} ^{∞} k ^{− 1.6}], as a reference. Then, we fitted the scaled geometric distribution, f ^{ SG }(l; A, q) = A (1 − q) q ^{ l − 1}, to the powerlaw under a leastsquare criterion. The bestfit parameters were A _{ LS } = 0.7125 and q _{ LS } = 0.3957. Calculating the ratio, R _{ LS }(l) ≡ f ^{ SG }(l; A _{ LS }, q _{ LS })/f _{1.6} ^{ PL } (l), for different indel lengths, we find, e.g., R _{ LS }(5) = 0.3168, R _{ LS }(7) = 0.08495 and R _{ LS }(13) = 8.775 × 10^{− 3}. The ratio decreases rapidly as the indel length increases. If, for example, we allow the ratio as small as 1/3, the geometric distribution is regarded as a decent approximation only for l ≤ 4. With f _{1.6} ^{ PL } (l), the indels with l ≥ 5 account for about 30 % of all indels. These mean that, for 30 % of actually occurring indels, the geometric distribution substantially underestimates their frequencies even according to the above lenient criterion. This reconfirms the importance of using biologically realistic indel length distributions, as pointed out, e.g., in [23, 24]. ^{11}
Comparison of KimSinha’s probability with ab initio probability for case (iv) local PWA
(∆L ^{ A }, ∆L ^{ D })  Ratio (= ref/KS) ^{a}  Overlapping ^{b} 

(1, 1)  1.667  0.167 
(3, 1)  1.883  0.383 
(3, 3)  2.449  0.949 
(5, 5)  3.325  1.825 
(10, 1)  2.165  0.665 
(10, 10)  5.572  4.072 
(25, 1)  2.355  0.855 
(25, 4)  4.714  3.214 
(30, 10)  8.300  6.800 
(100, 1)  2.561  1.061 
(100, 3)  4.896  3.396 
(300, 1)  2.659  1.159 
As exemplified above, our ab initio perturbative formulation provides other indel probabilistic models with a sound reference point, under which the models can be examined to improve their accuracy and evolutionary consistency. A related topic is the ChapmanKolmogorov (CK) equation, which must be satisfied by genuine stochastic evolutionary models. Unfortunately, most of the currently common indel probabilistic models violate the CK equation (as argued, e.g., briefly in [41]). Because our perturbative formulation satisfies the CK equation up to any desired order of the perturbation expansion (Appendix A3 of [49]), our formulation could also examine the effects of the violation of the CK equation. For example, under a geometric indel length distribution, the effects become conspicuous only when the indel lengths exceed a “critical value” of O(10), where the geometric distribution substantially underestimates the real indel frequencies. (See, e.g., subsection 2.3 of [43].) This seems to explain the results of past studies (e.g., [50, 51]), which did not detect remarkable effects of the violation of the CK equation (or the effects of its cause, i.e., inadequately incorporating overlapping indels).
R7. Outstanding issues and possible improvements, extensions and applications
Here, we briefly discuss some outstanding issues and their possible solutions in the forms of methodological improvements and extensions, and also possible applications of the (improved/extended) methods to practical problems. (For more details on most of these topics, see Discussion of [48].)
R7.1. Possible improvements and extensions of our computational methods and algorithms
In this study, we successfully constructed two integral equation systems to calculate “exact” multiplication factors for case (i), (ii) and (iii) local PWAs. For case (iv) local PWAs (Additional file 1: Figure S1d), we only provided methods to analytically calculate the total parsimonious (i.e., 2indel) and total nextparsimonious (i.e., 3indel) contributions. Although in principle we could calculate contributions from indel histories with more than 3 indels each, the question should be how we can do this within a reasonable amount of time. Even if we can construct an integral equation system for case (iv), it is expected to contain terms with complex gap configurations, and thus it would be difficult to solve it “exactly.” Therefore, a key for this case should be how we can reasonably quickly obtain approximate multiplication factors each of which estimates the exact factor more accurately than the summation over all parsimonious and nextparsimonious contributions.
The main purpose for the current algorithm to calculate the firstapproximate probability of a given MSA (SM5 in Additional file 1) was to see whether or not the first approximation works also for local MSAs of general types. This algorithm merely constitutes the first step toward an automatic application of our ab initio perturbative formulation. Consequently, the algorithm still has some rooms for improvements. For example, the algorithm could be very slow when applied to a local MSA containing a long gap. Roughly speaking, the length of time consumed by the algorithm applied to an input MSA is the summation of the lengths of the consumed time over all local MSAs in it. We estimated that the algorithm applied to each local MSA has the time complexity roughly greater than \( O\left(B\;{2}^{N_{ShG}}\right) \). (See subsection D1.1 of [48] for details.) Here, B is the number of blocks of distinct gap patterns in the local MSA, and N _{ ShG } is the number of short gaps that are spatially overlapping and phylogenetically neighboring the long gap. For example, if B = N _{ ShG } = 20, the time complexity is greater than O(20 × 2^{20}) ≈ O(10^{7}). B should be roughly on the order of N _{ ShG }. And N _{ ShG } is roughly expected to be around E[N _{ ShG }] ≈ (E[n _{ ID1}] + E[n _{ ID2}]) × ΔL. Here E[n _{ ID1}] and E[n _{ ID2}] are the expected numbers of indels per site along the two neighboring branches of the branch where the indel resulted in the long gap. (These three branches share the node that accommodates the subsequence aligned with the long gap.) And ∆L is the size of the local MSA. For example, if we assume a rather large value, E[n _{ ID1}] = E[n _{ ID2}] = 0.4/8 = 0.05, we have E[N _{ ShG }] ≈ 0.1 × ΔL. In this case, we expect N _{ ShG } ≈ 20 almost always when ΔL = 200. Thus, such local MSAs could virtually stop the current algorithm. ^{12} One way to quickly process a local MSA containing long gaps should be to treat the gaps hierarchically, first long gaps alone and second the remaining short and medium gaps (Additional file 1: Figure S11; see Discussion D1.1 of [48] for more details). If this strategy indeed works, the \( O\left({2}^{N_{ShG}}\right) \) component of the time complexity would reduce to O(2N _{ ShG }), because the short gaps that are phylogenetically neighboring the long gap can be handled independently of one another (panel d of Figure S11). Another very similar strategy should be to narrow down the ancestral sequence states to be searched for, regardless of the presence/absence of gapless columns, by exploiting the “phylogenetic correctness” condition (e.g., [46, 47]). The condition must always be satisfied by the ancestral sequence states consistent with MSAs, and thus it should be very powerful.
Another possible improvement should be to incorporate nonparsimonious indel histories so that we can enhance the accuracy of the probability estimation. As in section R4, we can classify nonparsimonious histories into two broad categories: (A) those each of which shares the set of all ancestral sequence states with a parsimonious history, and (B) those that share the set with no parsimonious history. Each nonparsimonious history in category (A) yields the same ancestordescendant PWAs along all branches as a parsimonious history does (section R7 of [22]). Hence, we could easily incorporate the effects of category (A) histories by using local PWA multiplication factors that take account of nonparsimonious contributions, as we calculated in sections R2 and R3. As the result in section R4 (Fig. 5) indicates, this could considerably improve the accuracy relatively easily. Incorporating histories in category (B) should require an algorithm to systematically enumerate such histories. Some hints may come from the examples in section R4 and SM4 of Supplementary methods in Additional file 1, and the “branchandmerge” operation (SM5.2 of Supplementary methods). The real challenge, though, should be to devise a method to enumerate such histories efficiently. ^{13}
In this paper, we presented the results of computing local alignment probabilities (or multiplication factors) under a spacehomogeneous model implemented in Dawg [32], mainly in order to avoid excessive presentational complications. At least theoretically, however, the computational methods (in Additional file 1) could be extended to spaceheterogeneous situations relatively easily. All we have to do is substitute spaceheterogeneous counterparts for the spacehomogeneous indel rates (and exit rates) in the final formulas in SM2, SM3 and SM4 (in Additional file 1), and replace multiplication by some integer factors (such as (ΔL ^{ A } + 1) in Eq. (SM3.2) and Eq. (SM3.4b) in Additional file 1) with summation over possible positions (of a positionflexible indel event). The time integration can be performed analytically (except in SM3) as long as the model is timehomogeneous. And the computation could be automated as long as the indel rates are specified according to some programmable rules. In some cases, tricks or approximations may be necessary so that the computation (involving the aforementioned summations) can be finished quickly enough. It should be kept in mind, however, that such computation will make sense only if the probabilities of the entire alignments are factorable. This means that the indel rates (and the exit rates) must satisfy the conditions (i) and (ii) (and (iii) for MSAs) explained in section R1, which may bring in some complications. For example, in the most general indel model we currently know to have factorable alignment probabilities (described in subsection R83 of [22]), each locally heterogeneous set of indel rates is confined in a region that does not necessarily coincide perfectly with a gapped segment (i.e., a local alignment). When the region accommodates only one gapped segment (and some gapless columns), there should be no serious problem; although each position between contiguous gapless columns may experience some indels, the effects of such indels should be negligible (as shown in sections R2 and R3), allowing us to focus on the single gapped segment. On the other hand, a serious problem may arise when the region accommodates two or more gapped segments. In this case, the contributions from the gapped segments (overlapping the region) can no longer be factorized, and thus all possible relative orders will have to be considered among indels in different gapped segments overlapping the region (while retaining the order in each segment). This could substantially slow down the computation, especially regarding nonparsimonious indel histories (including practically exact solutions), and some new measures may be necessary for reasonably fast computation. In addition, it should be remembered that, in order to pursue further biological realism, one must also overcome some other hurdles, such as more realistic boundary conditions and mutations other than indels and substitutions (discussed, e.g., in section R9 of [22] and [37]).
R7.2. Risks associated with naive applications to reconstructed alignments
Some readers may consider conducting some evolutionary analyses by applying the algorithm presented here to a MSA reconstructed by one of the stateoftheart aligners (reviewed, e.g., in [7]). We strongly caution the readers that it would be premature to conduct such analyses at this point. What we demonstrated here is that the algorithm estimates the probabilities quite accurately, provided that it is fed a correct MSA. Unfortunately, however, recent analyses (e.g., [38, 52, 53]) showed that reconstructed MSAs are considerably errorprone, even if they were reconstructed via stateoftheart aligners. Thus, a naive application of the algorithm to a reconstructed MSA would likely lead to incorrect predictions. Therefore, the readers should avoid such analyses whenever possible. Even if they need to perform such analyses, the possibility of MSA errors must be fully taken into account when interpreting the results.
R7.3. Possible applications
Originally, we developed our theoretical formulation [22] and the algorithm presented here for the purpose of comparing candidate MSAs in terms of their occurrence probabilities, i.e., their likelihoods. This purpose should be adequately fulfilled considering the accuracy of the predicted probabilities under moderate conditions, as demonstrated in this paper. If the algorithm can be coupled with a sampler that can preferentially explore quite likely regions of the MSA space, we could obtain an approximate probability distribution of MSAs. Such a distribution should be very useful, because a substantial fraction of alignment errors turned out to be due to the stochastic nature of evolutionary processes [38]. In the previous study [22], we showed that the “long indel” model [26] is virtually equivalent to our ab initio formulation under spacehomogeneous indel rates. Hence, their dynamic programming (DP) could be applicable to the problem, possibly with some modifications. Although the full version of their DP is quite slow, a device similar to those used recently (e.g., [21, 41, 54]) might speed up the MSA space exploration. It remains to be seen if such a device could be successfully adapted to our formulation or not. Most of currently available MSA aligners, whether they implement probabilistic or singleoptimumsearch algorithms, are based on geometric distributions. Because biologically realistic indel length distributions were shown to improve the accuracy of pairwise sequence comparisons (e.g., [23, 24]), we expect that this will be the case also with multiple sequence comparisons. (This expectation was partially confirmed in [38].)
Up to here, we assumed that the phylogenetic tree is given. In many cases, however, the phylogenetic trees must also be inferred from the input sequence data. A theoretically ideal way would be to infer the joint distribution of MSAs and phylogenetic trees, as it is expected to minimize possible prediction biases (e.g., [18, 39, 55, 56]). A major problem is that such an analysis would be very timeconsuming in general. In this sense, the traditional method of inferring the phylogenetic tree from an input MSA (e.g., [5]) and the incorporation of indel information into the method (e.g., [57]) would still be useful. When trying to adapt our algorithm or formulation to any of these methods, we will have to further speed up the calculation of approximate alignment probabilities, especially under the moves in the tree space exploration, such as the nearestneighbor interchange (NNI) and the subtree pruning and regrafting (SPR). At present, it is a totally open question whether we can really do this without compromising the accuracy of the predicted probabilities. This should be a challenging, formidable yet crucial problem of phylogeneticlevel molecular evolution.
Conclusions
In the previous study [22], we proposed a theoretical formulation that facilitates the ab initio calculation of the probabilities of given PWAs and MSAs under the general continuoustime Markov model, which describes the evolution of an entire sequence along a time axis via indels. And we explicitly demonstrated that, under a certain set of conditions, each ab initio alignment probability is factorable into the product of an overall factor and multiplication factors originated from local alignments delimited by preserved ancestral sites, thus providing a sort of generalized HMMs (Eq. (2) and Eq. (5)).
In this study (especially in Supplementary methods in Additional file 1), we provided some methods and an algorithm to concretely calculate the total parsimonious contribution and the total nextparsimonious contribution to the multiplication factor, Eq. (3) or Eq. (6), originated from each local alignment, under spacehomogeneous situations for illustration purposes. Our analyses indicated that even the total parsimonious contribution approximates the multiplication factor fairly well as long as (λ _{ I } + λ _{ D })(t _{ F } − t _{ I })ΔL is within an O(1) threshold. Here, (λ _{ I } + λ _{ D })(t _{ F } − t _{ I }) is the expected number of indels per site and ∆L is the local alignment size. Moreover, again under spacehomogeneous situations, we deduced two systems of integral equations that can be numerically solved to give practically exact multiplication factors for local PWAs of cases (i), (ii) and (iii). An inspection of the practically exact factors indicated that the finitetime transition probabilities in these local PWAs keep following the powerlaw, and that the exponent only slightly deviates from the original exponent for the instantaneous indel length distribution. Equipped with these results and new methods, the theoretical formulation we proposed in [22] is expected to provide other indel probabilistic models with a sound reference point, which could suggest necessary modifications to improve the accuracy of the models (as exemplified in section R6).
However, considering that the commonly used aligners are considerably errorprone (as shown e.g., in [38, 52, 53]), it would be very risky to naively apply the presented algorithm or methods to reconstructed MSAs. Thus, it should be preferable to first develop programs that exploit the fruits of the previous study [22] and this study to accurately estimate the uncertainties in, and rectify the errors of, reconstructed alignments under a genuine stochastic model of sequence evolution via indels that is biologically more realistic than almost all models studied in the past.
Methods
M1. Parameter settings for numerical analyses
We performed all numerical analyses in this paper using the spacehomogeneous indel model implemented in Dawg [32] (see Eqs.(79)). Unless otherwise stated, the total insertion rate was set equal to the total deletion rate (that is, λ _{ I } = λ _{ D }), according to a genomewide data analysis [41]. We used the powerlaw indel length distribution for both insertions and deletions: \( {f}_I(l)\kern0.5em =\kern0.5em {f}_D(l)\kern0.5em =\kern0.5em {f}_{1.6}^{PL}\left(l;\kern0.5em {L}^{CO}\right)\kern0.5em \equiv \kern0.5em {l}^{1.6}/\left({\displaystyle {\sum}_{k=1}^{L^{CO}}{k}^{1.6}}\right) \). The powerlaw exponent of 1.6 is among the typical values observed empirically (e.g., [24] and references therein). The cutoff indel length, L ^{ CO }, was set at 500 sites for the perturbation analyses to assess the goodness of the first approximation (Tables 1 and 2, Figs. 2, 3 and 5; Additional file 1: Tables S1, S2 and Figure S4), whose results were almost independent of L ^{ CO }. It was set at 5000 sites when assessing the goodness of approximation by the HMM of Kim and Sinha [36] (Table 3), because the result stabilized around this value of L ^{ CO }. In the perturbation analyses on local MSAs (e.g., Table 2 and Fig. 5), we used a 3OTU tree with equally long branches (Fig. 4a). The tree was rooted at its sole internal node. For the iterative perturbation analysis (Figs. 2 and 3; Additional file 1: Tables S1 and S2), the subtimeinterval for the numerical time integration was set at E[n _{ ID }]/N _{ P } = 0.001 (indels per site). ^{14} For local MSA analyses (both perturbative and simulationbased), the uniform sequence length distribution was employed as the prior probability of the root sequence state.
M2. Simulations to prepare input MSA sets
To validate the entire algorithm described in section SM5 of Supplementary methods in Additional file 1, we prepared five sets of MSAs using a genuine sequence evolution simulator, Dawg [32]. We performed all simulations using the same Zipf powerlaw distribution, \( {f}_D\;(l)\kern0.5em =\kern0.5em {f}_I\;(l)\kern0.5em =\kern1em {l}^{\gamma }/\left({\displaystyle {\sum}_{k=1}^{L^{CO}}{k}^{\gamma }}\right) \), with the exponent γ = 1.6 and the cutoff indel length of L ^{ CO }(=L _{ I } ^{ CO } = L _{ D } ^{ CO } ) = 100 bases. The exponent γ = 1.6 is typical among empirically observed values (e.g., [24] and references therein). The cutoff length was chosen in order to prevent it from taking extremely long to search for parsimonious local indel histories. This is because, with our current implementation, the search could be very timeconsuming when a gapped segment contains at least one long gap (see subsection R7.1 of Results and discussion for a possible solution). Each of the simulations started with a random ancestral DNA sequence that is l000 bases long. In each simulation, we labeled all the internal nodes of the input tree, in order to keep the ancestral sequences aligned with the “extant” sequences (at the external nodes). Other parameters and options were set at default values unless otherwise stated. We created five input MSA sets, namely, 1A, 1B, 3P, 3M and 3F.
Set 1A consists of 100,000 MSAs, each of which was simulated along a 3taxon tree starting at a root with three child branches. The lengths of the three branches were all set at 0.05 (substitutions per base). The total rates of insertions and deletions were set at λ _{ I } = λ _{ D } = 0.1 (per expected substitution), which are close to the upperbounds for neutrally evolving mammalian DNA sequences [24, 35].
Set 1B is nearly the same as Set 1A, expect that all branch lengths were set at 0.2 (substitutions per base).
We prepared 1A and 1B, because validating the theoretically predicted occurrence probabilities of local homology structures necessitated a large number of MSAs simulated under identical parameter settings.
Each of Sets 3P, 3M and 3F consists of 10,000 MSAs. The settings for these three sets differ only in the phylogenetic tree used for the simulations. The MSAs in these sets were simulated along the tree of 12 primates (panel a of Additional file 1: Figure S9), the tree of 15 mammals (panel b) and the tree of 9 fastevolving mammals (panel c), respectively. These three sets, 3P, 3M and 3F, were intended to mimic typically encountered MSAs among selectively neutral DNA sequences with small, moderate and large sequence divergences, respectively. The total indel rates for these three sets were set at λ _{ I } = λ _{ D } = 1/16 = 0.0625 (per expected substitution), according to genomewide data analyses [35, 41]. For more details on these three sets, see [38].
The Dawg control files used to generate these simulated datasets, including the phylogenetic trees and indel model parameters, are available as a part of Additional file 2.
Before the analyses, all simulated MSAs were preprocessed so that the MSAs with an identical homology structure will be replaced with a unique representative MSA. See Methods of [38] for details.
M3. Program implementation
The Perl modules and main Perl scripts used in this study are available (under the GNU General Public License) as a package named “LOLIPOG” (for “LOgLIkelihood for the Pattern Of Gaps (in MSA)”) (version: “FA_LOLIPOG_P.ver0.6.1.6”), which is archived in Additional file 2. The latest version of the package will be available in the “lolipog” directory at the FTP repository of http://Bioinformatics.Org [58].
An ancestry index (or an “ancestry” for short) is assigned to each site, in order to distinguish between the sites of different evolutionary origins. Once an ancestry is assigned to a site, it will not change throughout the evolutionary history. Sharing of the same ancestry between the sites of different sequences indicates that the sites are mutually homologous (more precisely, orthologous). See section R2 of [22] for more details. The ancestry indices also help realize the rate heterogeneity across regions (section R3 of [22]).
In this study, the homology structures we consider will usually be among extant sequences, i.e., excluding ancestral sequences. When dealing with each particular indel history, however, ancestral sequences will also be included.
This “perturbative” calculation of a multiplication factor may look similar to the calculation of a “chopzone” probability proposed by Miklós et al. [26] under their “long indel” model. In fact, our “perturbative” calculation could be regarded as an (modified) extension of their calculation method to somewhat more general evolutionary models. (See section SA3 of Additional file 2 of [22] for their mutual equivalence with caveats.)
For example, the results under the “long indel” model [26] are essentially the same as those obtained here, if the former is freed from the timereversibility.
It should be noted that this setting is just a representative of seemingly common situations. For example, the empirically estimated powerlaw exponents vary between near 1 and near 2, although the values within 1.41.8 are quite common ([24] and references therein), regardless of whether the sequences are genomic or proteincoding. In the past, the deletion rate was often observed as greater than the insertion rate (e.g., [59]). But a simulation study [52] indicated that such observations are likely due to a bias intrinsic to the similaritybased aligners (as classified in [60]). And a recent analysis of mammalian sequences via an evolutionbased probabilistic aligner (expected to be devoid of the above bias) indicated that the deletion rate is nearly equal to the insertion rate [41]. It remains to be seen whether this rate equality applies also to the taxa other than mammals. Regarding the sequence length distribution, we do not know any empirical results. Although we believe that our choice of the uniform distribution should be theoretically very reasonable for neutrally evolving regions sampled randomly from long genomes/chromosomes (see endnote 14 of [22]), distributions of regions under selection may show quite different behaviors. It should be interesting, and important, to examine how the results in this paper will change in response to the deviation of the setting from that given here.
These cases correspond to the “chopzones” having probabilities N _{ ij } [26], with: (i) i = j = 0; (ii) i > 0, j = 0; (iii) i = 0, j > 0; and (iv) i > 0, j > 0.
Because of the symmetry between the finitetime transition probabilities under the time reversal, the latter results become equal to the former when the insertion and deletion parameters are swapped.
On top of them, we also prepared yet another set, Set 2, created by simulations under 33 sets of parameters typical of the structurebased benchmark MSAs. (See section M2.2 of Methods and Figure 28, both of [48], for details on this input MSA set.) The validation analyses on this Set 2 gave nearly as good results as those on the 5 sets described here. (See Results of [48] for details.)
Incidentally, for the analysis on relative probabilities, we only used local MSAs each of which can result from two or more parsimonious local indel histories. The instances of such gapped segments accounted for 4.5 %, 12.0 %, 17.5 %, 19.2 % and 14.2 % of all “parsimonious” instances in sets 1A, 1B, 3P, 3 M and 3 F, respectively. Out of them, the most likely (ML) histories were wrong in 42.3 %, 43.4 %, 36.1 %, 13.1 % and 38.4 % of the instances in the respective sets. Therefore, even if we make an unlikely assumption that the aforementioned “nonparsimonious” instances were all due to the ML histories that are nonparsimonious, any algorithm that searches for a single ML history would have overlooked the true indel history in 1.9 %, 5.2 %, 6.3 %, 2.5 % and 5.4 % of the cases in the respective sets. These frequencies are much larger than those of the “nonparsimonious” instances, i.e., 0.15 %, 1.4 %, 0.12 %, 0.23 % and 0.49 %, respectively. This indicates that, given correct MSAs and correct trees, our algorithm can recover the true indel histories more frequently than any algorithm to search for a single ML history.
Mixed geometric distributions could decently approximate a powerlaw distribution in wider ranges. For example, the approximation by a mixed distribution with two geometric components (used, e.g., in [23]) is fairly good up to a few dozen residues, but it gets poor when indels become as long as hundreds of residues. To decently approximate the realistic distributions of such long indels, more geometric components would be necessary.
In SM8 of Supplementary methods of [38], we roughly estimated the expected mean length \( \left({\overline{L}}_{loc}(T)\right) \) of a correct (i.e., not reconstructed) local MSA as: \( {\overline{L}}_{loc}(T)\kern0.5em \approx \kern0.5em \left[ exp\left(\leftT\right\kern0.5em {\lambda}_D\kern0.5em {\overline{l}}_D\right)\kern0.5em +\kern0.5em exp\left(\leftT\right\kern0.5em {\lambda}_I\kern0.5em {\overline{l}}_I\right)\kern0.5em \kern0.5em 2\right]/\;\left[1 exp\left(\leftT\right\kern0.2em \left({\lambda}_D\kern0.5em +\kern0.5em {\lambda}_I\right)\right)\right] \). Here, T is the total branch length across the tree (T) (in units of the expected number of substitutions per site). λ _{ D } and λ _{ I }, respectively, denote the deletion rate and the insertion rate, both per expected substitution. And \( {\overline{l}}_D \) and \( {\overline{l}}_I \) are the average lengths of a deleted subsequence and an inserted subsequence, respectively. Under the parameter setting for the simulation analyses of this study, we have \( {\overline{l}}_D\kern0.5em =\kern0.5em {\overline{l}}_I\kern0.5em =\kern0.5em 6.35 \). If we set λ _{ D } = λ _{ I } = 1/16 = 0.0625 according to [35, 41], then, \( {\overline{L}}_{loc}(T) \) exceeds 100 when T is greater than 9, which is quite large. Thus, as long as we are dealing with correct MSAs, our firstapproximate algorithm is expected to work on a majority of local MSAs until T becomes this large. It should be noted, however, that reconstructed MSAs could be seriously erroneous even if T is, e.g., less than 3 [38].
Strictly speaking, the current “localmultipath downhill search” algorithm is not perfect, in the sense that it misses some deletiondominated parsimonious local histories that can yield gap configurations belonging to the class of “intersection between cousins” described in [61]. Fortunately, this drawback is not expected to be so serious, because such local histories require at least four indels each and thus should be very rare.
We calculated the multiplication factors, \( {\tilde{\mu}}_P^{\left\langle {N}_{ID}=200\right\rangle}\left[ case\kern0.5em (iii);\kern0.5em \varDelta {L}^D;\kern0.5em \left[{t}_I,\kern0.5em t\right]\right] \) with t ∈ [t _{ I }, t _{ F }], using E[n _{ ID }]/N _{ P } = 0.001, and compared them to those calculated using E[n _{ ID }]/N _{ P } = 0.0005. In each pair of values of ∆L ^{ D } and t, the difference was within 0.2 % of the multiplication factor itself. Thus we concluded that they are virtually exact.
Declarations
Acknowledgements
This study is dedicated to the late Dr. Keiji Kikkawa, a key pioneer of string theory of elementary particle physics and the author’s best mentor. The author (KE) would like to greatly thank Dr. Dan Graur at the University of Houston and Dr. Giddy Landan at HeinrichHeine University Düsseldorf for their indispensable assistance with this study; this study and the previous study [22] were originally born out of the need to score alternative alignments in their project (described in Funding), and they coauthored with KE some unpublished predecessors of these manusctipts (including [43, 48, 49]). The author is grateful to Dr. R. A. Cartwright at Arizona State University for his useful information and discussions that inspired this study. The author appreciates the logistic support and the feedback of Dr. Tetsushi Yada at Kyushu Institute of Technology, as well as the encouragement and the feedback of Dr. Toshiaki Takitani, a professor emeritus at Nagoya University. The author would also like to thank Dr. Ian Holmes at the University of California, Berkeley, for his critical comments on the previous version of the manuscript and his information on some relevant previous studies, both of which helped substantially improve the manuscript. The author is also grateful to all of the reviewers and handling editors of this manuscript and its predecessors, as their feedback definitely helped improve the manuscript.
Funding
This work was a part of the project, “Error Correction in Multiple Sequence Alignments,” which was funded by US National Library of Medicine (grant number: LM01000901 to Dan Graur and Giddy Landan, then at the University of Houston). The later stage of this work was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan (grant numbers: MEXT KAKENHI Grant Numbers 221S0002, 15H01358, both to Tetsushi Yada).
Availability of data and materials
The data sets supporting the conclusions of this article are either included within the article and its additional files, or available in [43] (bioRxiv DOI: http://dx.doi.org/10.1101/023606) or [48] (bioRxiv DOI: http://dx.doi.org/10.1101/023614). (Numerical and simulation analyses described in this article should be reproducible by using the programs provided in Additional file 2, following the instructions contained therein).
Author’s contributions
Not applicable because the paper was written by a single author (KE).
Author’s information
The author (KE) used to be a mathematical physicist, who studied theoretical elementary particle physics and quantum gravitational theories from 1991 till 1999. Then, since 2002, after having studied theoretical biophysics from 1999 till 2002, he has studied molecular evolution (including population genetics), mainly focusing on homologybased computational analyses of DNA and protein sequences. For more detailed information, including the list of his publications, refer to his ORCID record [62].
Competing interests
The author declares that he has no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Graur D, Li WH. Fundamentals of molecular evolution. 2nd ed. Sunderland: Sinauer Associates; 2000.Google Scholar
 Gascuel O, editor. Mathematics of evolution and phylogeny. New York: Oxford University Press; 2005.Google Scholar
 Lynch M. The origins of genome architecture. Sunderland: Sinauer Associates; 2007.Google Scholar
 Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–76.View ArticlePubMedGoogle Scholar
 Felsenstein J. Inferring phylogenies. Sunderland: Sinauer Associates; 2004.Google Scholar
 Yang Z. Computational molecular evolution. New York: Oxford University Press; 2006.View ArticleGoogle Scholar
 Notredame C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007;3:e123.View ArticlePubMedPubMed CentralGoogle Scholar
 Britten RJ. Divergence between samples of chimpanzee and human DNA sequences is 5 %, counting indels. Proc Natl Acad Sci U S A. 2002;99:13633–5.View ArticlePubMedPubMed CentralGoogle Scholar
 Britten RJ, Rowen L, Willians J, Cameron RA. Majority of divergence between closely related DNA samples is due to indels. Proc Natl Acad Sci U S A. 2003;100:4661–5.View ArticlePubMedPubMed CentralGoogle Scholar
 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003;100:11484–9.View ArticlePubMedPubMed CentralGoogle Scholar
 The International Chimpanzee Chromosome 22 Consotrium. DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature. 2004;429:382–8.View ArticleGoogle Scholar
 The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87.View ArticleGoogle Scholar
 Bishop MJ, Thompson EA. Maximum likelihood alignment of DNA sequences. J Mol Biol. 1986;190:159–65.View ArticlePubMedGoogle Scholar
 Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991;33:114–24.View ArticlePubMedGoogle Scholar
 Rivas E. Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics. 2005;6:63.View ArticlePubMedPubMed CentralGoogle Scholar
 Bradley RK, Holmes I. Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics. 2007;23:3258–62.View ArticlePubMedGoogle Scholar
 Miklós I, Novák Á, Satija R, Lyngsø R, Hein J. Stochastic models of sequence evolution including insertiondeletion events. Stat Methods Med Res. 2009;18:453–85.View ArticlePubMedGoogle Scholar
 Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple sequence alignment. Bioinformatics. 2001;17:803–20.View ArticlePubMedGoogle Scholar
 Holmes I. Using guide trees to construct multiplesequence evolutionary HMMs. Bioinformatics. 2003;19:i147–57.View ArticlePubMedGoogle Scholar
 BouchardCôté A. A note on probabilistic models over strings: the linear algebra approach. Bull Math Biol. 2013;75:2529–50.View ArticlePubMedGoogle Scholar
 Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics. 2015;16:108.View ArticlePubMedPubMed CentralGoogle Scholar
 Ezawa K. General continuoustime Markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable? BMC Bioinformatics. 2016;17:304.Google Scholar
 Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res. 2008;18:298–309.View ArticlePubMedPubMed CentralGoogle Scholar
 Cartwright RA. Problems and solutions for estimating indel rates and length distribution. Mol Biol Evol. 2009;26:473–80.View ArticlePubMedGoogle Scholar
 Rivas E, Eddy SR. Parameterizing sequence alignment with an explicit evolutionary model. BMC Bioinformatics. 2015;16:406.View ArticlePubMedPubMed CentralGoogle Scholar
 Miklós I, Lunter GA, Holmes I. A “long indel” model for evolutionary sequence alignment. Mol Biol Evol. 2004;21:529–40.View ArticlePubMedGoogle Scholar
 Dirac PAM. The principles of quantum mechanics. 4th ed. London: Oxford University Press; 1958.Google Scholar
 Messiah A. Quantum Mechanics, Volume 1. (Translated from French to English by Temmer GM). Amsterdam: NorthHolland; 1961.Google Scholar
 Messiah A. Quantum Mechanics, Volume II. (Translated from French to English by Potter J). Amsterdam: NorthHolland; 1961.Google Scholar
 Feller W. On the integrodifferential equations of purely discontinuous markov processes. T Am Math Soc. 1940;48:488–515.View ArticleGoogle Scholar
 Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977;81:2340–61.View ArticleGoogle Scholar
 Cartwright RA. DNA assembly with gap (Dawg): simulating sequence evolution. Bioinformatics. 2005;21:iii31–8.View ArticlePubMedGoogle Scholar
 Fletcher W, Yang Z. INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol. 2009;26:1879–88.View ArticlePubMedPubMed CentralGoogle Scholar
 Strope CL, Abel K, Scott SD, Moriyama EN. Biological sequence simulation for testing complex evolutionary hypothesis: indelSeqGen version 2.0. Mol Biol Evol. 2009;26:2581–93.View ArticlePubMedPubMed CentralGoogle Scholar
 Lunter G. Probabilistic wholegenome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007;23:i289–96.View ArticlePubMedGoogle Scholar
 Kim J, Sinha S. Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics. 2007;23:289–97.View ArticlePubMedGoogle Scholar
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part IV: incorporation of substitutions and other mutations. bioRxiv. 2015. doi:10.1101/023622. Accessed 4 Aug 2015.
 Ezawa K. Characterization of multiple sequence alignment errors using completelikelihood score and positionshift map. BMC Bioinformatics. 2016;17:133.View ArticlePubMedPubMed CentralGoogle Scholar
 Lunter GA, Miklós I, Drummond A, Jensen JL, Hein J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005;6:83.View ArticlePubMedPubMed CentralGoogle Scholar
 Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 2005;54:401–18.View ArticlePubMedGoogle Scholar
 Westesson O, Lunter G, Paten B, Holmes I. Accurate reconstruction of insertiondeletion histories by statistical phylogenetics. PLoS One. 2012;7:e34572.View ArticlePubMedPubMed CentralGoogle Scholar
 Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–21.View ArticlePubMedPubMed CentralGoogle Scholar
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part II: perturbation analyses. bioRxiv. 2015. doi: 10.1101/023606. Accessed 4 Aug 2015.
 Morgante M, De Paoli E, Radovic S. Transposable elements and the plant pangenomics. Curr Opin Plant Biol. 2007;10:149–55.View ArticlePubMedGoogle Scholar
 Chalopin D, Naville M, Plard F, Galiana D, Volff JN. Comparative analysis of transposable elements highlights mobilome diversity and evolution in vertebrates. Genome Biol Evol. 2015;7:567–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Chindelevitch L, Li Z, Blais E, Blanchette M. On the inference of parsimonious evolutionary scenarios. J Bioinform Comput Biol. 2006;4:721–44.View ArticlePubMedGoogle Scholar
 Diallo AB, Makarenkov V, Blanchette M. Exact and heuristic algorithms for the indel maximum likelihood problem. J Comput Biol. 2007;14:446–61.View ArticlePubMedGoogle Scholar
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part III: algorithm for first approximation. bioRxiv. 2015. doi:10.1101/023614. Accessed 4 Aug 2015.
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part I: theoretical basis. bioRxiv. 2015. doi: 10.1101/023598. Accessed 4 Feb 2016.
 Knudsen B, Miyamoto MM. Sequence alignments and pair hidden Markov models using evolutionary history. J Mol Biol. 2003;333:453–60.View ArticlePubMedGoogle Scholar
 Metzler D. Statistical alignment based on fragment insertion and deletion models. Bioinformatics. 2003;19:490–9.View ArticlePubMedGoogle Scholar
 Löytynoja A, Goldman N. Phylogenyaware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–5.View ArticlePubMedGoogle Scholar
 Landan G, Graur D. Characterization of pairwise and multiple sequence alignment errors. Gene. 2009;441:141–7.View ArticlePubMedGoogle Scholar
 Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E. Genomewide nucleotidelevel mammalian ancestor reconstruction. Genome Res. 2008;18:1829–43.View ArticlePubMedPubMed CentralGoogle Scholar
 Suchard MA, Redelings BD. BAliPhy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–8.View ArticlePubMedGoogle Scholar
 Novák Á, Miklós I, Lyngsø R, Hein J. StatAlign: an extendable software package for join Bayesian estimation of alignments and evolutionary trees. Bioinformatics. 2008;24:2403–4.View ArticlePubMedGoogle Scholar
 Rivas E, Eddy SR. Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol. 2008;4:e1000172.View ArticlePubMedPubMed CentralGoogle Scholar
 Ezawa K. LOLIPOG: Loglikelihood for the pattern of gaps in MSA. 2013. http://www.bioinformatics.org/ftp/pub/lolipog/. Accessed 31 Jul 2016.Google Scholar
 Fang Y, Wang W, Ma G, Liang L, Shi Q, Tao S. Patterns of insertion and deletion in mammalian genomes. Current Genomics. 2007;8:370–8.View ArticleGoogle Scholar
 Blackburne BP, Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol. 2013;30:642–53.View ArticlePubMedGoogle Scholar
 Fredslund J, Hein J, Scharling T. A large version of the small parsimony problem. In: Benson G, Page R, editors. WABI 2003, LNBI 2812. Heidelberg: Springer; 2003. p. 417–32.Google Scholar
 The ORCID register of Kiyoshi Ezawa. http://orcid.org/0000000349068578. Accessed 19 May 2016.