General continuous-time Markov model of sequence evolution via insertions/deletions: local alignment probability computation

Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a method to reliably calculate the occurrence probabilities of sequence alignments via evolutionary processes on an entire sequence. Previously, we presented a perturbative formulation that facilitates the ab initio calculation of alignment probabilities under a continuous-time Markov model, which describes the stochastic evolution of an entire sequence via indels with quite general rate parameters. And we demonstrated that, under some conditions, the ab initio probability of an alignment can be factorized into the product of an overall factor and contributions from regions (or local alignments) delimited by gapless columns. Results Here, using our formulation, we attempt to approximately calculate the probabilities of local alignments under space-homogeneous cases. First, for each of all types of local pairwise alignments (PWAs) and some typical types of local multiple sequence alignments (MSAs), we numerically computed the total contribution from all parsimonious indel histories and that from all next-parsimonious histories, and compared them. Second, for some common types of local PWAs, we derived two integral equation systems that can be numerically solved to give practically exact solutions. We compared the total parsimonious contribution with the practically exact solution for each such local PWA. Third, we developed an algorithm that calculates the first-approximate MSA probability by multiplying total parsimonious contributions from all local MSAs. Then we compared the first-approximate probability of each local MSA with its absolute frequency in the MSAs created via a genuine sequence evolution simulator, Dawg. In all these analyses, the total parsimonious contributions approximated the multiplication factors fairly well, as long as gap sizes and branch lengths are at most moderate. Examination of the accuracy of another indel probabilistic model in the light of our formulation indicated some modifications necessary for the model’s accuracy improvement. Conclusions At least under moderate conditions, the approximate methods can quite accurately calculate ab initio alignment probabilities under biologically more realistic models than before. Thus, our formulation will provide other indel probabilistic models with a sound reference point. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1167-6) contains supplementary material, which is available to authorized users.


Background
The evolution of DNA, RNA, and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination, and other genomic rearrangements (e.g., [1][2][3]). Thus far, analyses on substitutions have predominated in the field of molecular evolutionary study, in particular using the probabilistic (or likelihood) theory of substitutions that is now widely accepted (e.g., [4][5][6]). This is probably because evolutionary models describing residue substitutions are relatively easier to handle. However, it must be remembered that the study of indels is at least as crucial as the study of substitutions. There are two major reasons for this. First, it is indels but not substitutions that yield the skeletons (or the gap configurations) of the sequence alignments (reviewed, e.g., in [7]), which provide essential inputs to most homology-based analyses in computational biology. And second, some recent comparative genomic analyses have revealed that indels account for more base differences between the genomes of closely related species than substitutions (e.g., [8][9][10][11][12]). These circumstances make it imperative to develop a stochastic model that enables us to reliably calculate the probability of sequence evolution via mutations including insertions and deletions. Since the groundbreaking works by Bishop and Thompson [13] and by Thorne, Kishino and Felsenstein [14], many studies have been done to develop and apply methods to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under the probabilistic models aiming to incorporate the effects of indels. Such methods have greatly improved in terms of the computational efficiency and the scope of application. See excellent reviews for details on this topic (e.g., [15][16][17]). A majority of these studies are based on hidden Markov models (HMMs) (e.g., [18]) or transducer theories (e.g., [19]). Both of them calculate the indel component of an alignment probability as a product of inter-column transition probabilities or of block-wise contributions. And the study on these methods is still advancing, strengthening their mathematical and algorithmic bases (e.g., [20,21]). Unfortunately, most of these methods have at least either of two fundamental problems, one regarding the evolutionary consistency and the other regarding the biological realism. (For details, see Background of [22].) Regarding the evolutionary consistency, it is a priori unclear whether or how a HMM or a transducer is related with any genuine stochastic evolutionary model (or "evolutionary model" for short), which describes the evolution of an entire sequence via indels along a time axis. Regarding the biological realism, the standard HMMs or transducers can at best handle mixed geometric distributions of indel lengths (e.g., [23]) (and usually implement simple geometric distributions), whereas many empirical studies showed that the real indel lengths are distributed according to power-laws (e.g., [24] and references therein). Besides, very few studies thus far (e.g., [25]) addressed the issue of indel rate variation across regions.
In a previous study [22], we presented a theoretical formulation that facilitates the ab initio calculation of alignment probabilities under a genuine stochastic evolutionary model, specifically, a general continuous-time Markov model of sequence evolution via indels. Our evolutionary model was created as a result of incorporating the idea of position-specific evolutionary rates [25] into the most general "substitution/insertion/deletion model" [26]. Thus, the model is naturally devoid of the aforementioned two problems. Aided by some techniques of time-dependent perturbation theory in quantum mechanics [27][28][29], we formally expanded the ab initio probability of an alignment into a series of terms with different numbers of indels. This expansion gave an intuitively clearer representation of Feller's theorems [30]. And it theoretically underpinned the stochastic evolutionary simulation method of Gillespie [31], which provides the foundation for genuine sequence evolution simulators (e.g., [32][33][34]). And we also showed that, if the indel model parameters satisfy a certain set of conditions, the ab initio probability of an alignment is indeed factorable into the product of an overall factor and contributions from local alignments delimited by preserved ancestral sites (PASs), i.e., gapless columns. This suggested that the evolutionary models satisfying such conditions could provide a sort of generalized HMMs, which extend the spaceand time-homogeneous "long indel" model [26] to some space-and time-heterogeneous situations.
In this paper, we focus on how to concretely calculate the contribution from each local alignment, assuming that the indel model satisfies the conditions for factorable alignment probabilities. (To clearly illustrate the concrete computations, we deal with space-homogeneous models in the bulk of this manuscript (i.e., in sections R2-R6), and briefly discuss extensions to more general cases near the end (i.e., in subsection R7.1).) As noted in [26] and section R1 of Results and discussion of this manuscript, the contribution from each local alignment is a summation over an infinite number of local indel histories. Thus it cannot be computed literally exactly within a finite amount of time. This makes it necessary to devise some approximation methods, each of which sums contributions from a finite number of indel histories (as first proposed in [26]). An auspice is that indel rates (say, λ ID indels per site per unit time) are known to be at most around 1/10 of the substitution rates (say, λ S substitutions per site per unit time) (e.g., [24,35]). And the probability of an indel history involving N ID indels is roughly O λ ID t ð Þ N ID times the probability of a history with no indel, where t is a time scale characteristic of the system under consideration. In conjunction, these suggest that taking account only of histories with minimum and near-minimum required numbers of indels may provide a good approximation to each local alignment probability, as long as the sequence divergences (e.g., λ S t) are within the scope of phylogenetic analyses (i.e., at most O(1) substitutions per site).
In section R1 of Results and discussion, we briefly review the relevant portion of the theoretical basis that was established in our previous study [22]. We introduce simplified notation so that we can focus on a single local alignment. In sections R2-R4, we demonstrate how our perturbative formulation can be concretely used to approximately calculate the contributions to the ab initio alignment probabilities from local alignments, i.e., alignment regions separated by gapless columns. We examine all types of local pairwise alignments (PWAs) in section R2, and some typical types of local multiple sequence alignments (MSAs) in section R4. For each local alignment type, we calculate the total parsimonious contribution and the total next-parsimonious contribution to its probability (more precisely, its multiplication factor). In section R3, we discuss two systems of integral equations that can be numerically solved to give practically exact solutions (or "exact" solutions, for short) for some common types of local PWAs. There, we also study the behaviors of the "exact" solutions. Then, by comparing the total parsimonious contribution with the total nextparsimonious contribution, or with the "exact" solution, we investigate the parameter regions in which the total parsimonious contribution can approximate the alignment probability quite accurately (in sections R2-R4). In section R5, we perform simulation analyses with a genuine evolutionary simulator, Dawg [32], to examine whether or not the conclusions from sections R2-R4 also apply to local MSAs of more general types. For this purpose, we developed an algorithm to calculate the "first-approximate" probability of a given MSA under a given parameter setting (including a given tree) by multiplying the overall factor and the total parsimonious contributions from all local MSAs. And we examine the accuracy of the firstapproximate multiplication factors calculated by the algorithm. In section R6, we use our ab initio formulation as a "yardstick" to measure the accuracy of other indel probabilistic models. As a representative model, we chose the generalized HMM of [36], which aims for the biological realism but not fully for the evolutionary consistency. In section R7, we discuss some outstanding issues and possible improvements, extensions and applications of the presented algorithm and methods. The topics include the risks associated with the naïve application of our algorithm or methods to reconstructed alignments. The sections in Methods describe the settings for numerical analyses (M1) and simulation analyses (M2). And the sections in Supplementary methods in Additional file 1 explain methodological details on concrete perturbation calculations and the first-approximate algorithm.
This paper basically uses the same conventions and notations as used in [22]. Briefly, a sequence state s (∈ S II ) is represented as an array of sites, each of which is equipped with an ancestry index (υ x ∈ ϒ). 1 (In this study, we focus on indels. Hence, we do not consider the residue states of sequences. For the incorporation of residue states and substitutions, see, e.g., [37,38].) And each indel event is represented as an operator acting on the bra-vector, 〈s|, representing a sequence state. More specifically, the oper-atorM I x; l ð Þ denotes the insertion of l sites between the x-th and (x + 1) th sites, and the operatorM D x B ; x E ð Þdenotes the deletion of a sub-array between (and including) the x B -th and the x E -th sites. Readers unfamiliar with the bra-ket notation (as adapted from theoretical physics (e.g., [27,28])) can simply regard a bra-vector (〈s|), a ket-vector (|s 0 〉) and an operatorM À Á as convenient reminders of a row vector, a column vector and a matrix, respectively, just as in the standard representation of a continuous-time Markov model. (See section SA-1 in Additional file 2 of [22] for the equivalence between them.) And, also as in [22], the following terminology is used. The term "an indel process" means a series of successive indel events with both the order and the timing specified. And the term "an indel history" means a series of successive indel events with only the order specified.
As a last note, an "alignment," a "PWA" and a "MSA" in this paper will mean their homology structures [39]. Briefly, the homology structure of an alignment is a set of alignment columns (i.e., sets of homologous sites in the aligned sequences) that are spatially arranged in a looser way than in a usual alignment, i.e., constrained only by the spatial relationships between the sites within each aligned sequence. 2

Results and discussion
[Descriptions given here are somewhat sketchy. For methodological details, as well as the relationship with the results of [22], see the relevant sections of Supplementary methods in Additional file 1.]

R1. Perturbation expansion of multiplication factor for local alignment
In this section, we briefly explain some results in [22] that are essential for this paper. Similarly to that of the probabilistic alignment methods in general, one of the main goals of our theoretical formulation (presented in [22]) is to calculate the absolute occurrence probabilities of the alignments and to compare the calculated alignment probabilities. Therefore, unless stated otherwise, the probabilities considered in this paper are not conditioned on a particular alignment (or even on extant sequences). (Once the absolute probabilities are calculated, such conditional probabilities (e.g., of indel histories) could be obtained by dividing the absolute probabilities of the outcomes (e.g., the indel histories) by the absolute probability of the condition (e.g., the resulting alignment), similarly to Eq. (SM-5.3.6a) in Additional file 1.) be the probability that a PWA (α(s A , s D )) between an ancestral sequence state (s A ) and a descendant (s D ) result from the evolution of a sequence during a time interval ([t I , t F ]), given s A at t I . In [22], we formally showed that is given as a series: Here, N min [α(s A , s D )] is the minimum number of indels required to create α(s A , s D ). And contributed from all N-event indel histories that can yield α(s A , s D ). A "preserved ancestral site" (PAS) is a site of s A that was hit by no indel and thus was preserved all through [t I , t F ]. Now, using some (but not necessarily all) PASs, we partition α(s A , s D ) into "local regions" (i.e., inter-PAS regions), γ 1 ; γ 2 ; …; γ κ max , in which all potentially causative indels are confined. In [22], we derived the two conditions. Condition (i): Each indel rate parameter is independent of the portion of the sequence state outside of the local region where the indel occurred. Condition (ii): The increment of the exit rate due to each indel event is independent of the portion of the sequence state outside of the local region where the indel occurred. (The "exit rate" of a state is the rate at which the system "exits" the state, that is, the total rate at which the state changes to any of other states.) Under these conditions, the PWA probability, Eq. (1), can be factorized as: Here P[([ ], [t I , t F ]) | (s A , t I )] is the probability that the sequence underwent no indels during [t I , t F ], given s A at t I . Andμ P γ κ ; α s A ; s D À Á ; t I ; t F ½ À Á s A ; t I À Á Â Ã is the multiplication factor contributed from the local region, γ κ . Because the multiplication factor is a summation of contributions over all local indel histories that can yield the local PWA confined in γ κ , it can also be expressed as a series similar to Eq. (1): Here, N min [α(s A , s D ); γ κ ] is the minimum number of indels required for the portion of α(s A , s D ) in γ κ . And the term μ P It should be noted that the multiplication factor,μ P … ½ (e.g., in Eq. (2)), is not a probability; actually, it is not even a conditional probability, and it can exceed 1 (unity) in some cases (in such manners that the entire right hand side of Eq. (2) will always be less than 1 (unity)). In this sense, the "generalized HMM" given by Eq. (2) differs from normal HMMs. The contribution of each local indel history to a multiplication factor is the ratio of the probability of the history (given an ancestral state) to the probability that the ancestral state underwent no indel. (See Eq. (SM-1.7) in Additional file 1 for the mathematical definition.) When comparing the contributions from two different sets of histories (potentially giving rise to the same local alignment), the denominator (i.e., the probability of no indel) is usually identical. Therefore, in general, the comparison of two multiplication factor contributions gives the same result as the comparison of the corresponding probabilities. (Similar notes apply also to the analyses of local MSAs below.) Similar arguments hold also for the probability, P α s 1 ; , that a MSA (α s 1 ; s 2 ; …; s N X ½ ) of N X sequences, s 1 ; s 2 ; …; s N X , results from the evolution along a given phylogenetic tree (T) [22]. Basically in line with the idea in [18,19,40], we can build up the probability of a MSA, first by multiplying the root state probability and the probabilities of ancestor-descendant PWAs along branches, and second by summing such products over all MSA-consistent ancestral states. The ab initio MSA probability thus composed can be expressed as a series: Here, N min is the minimum number of indels required for creating the MSA. (For simplicity, we omitted the obvious dependence of N min on the MSA and the tree.) is the portion of the probability contributed from all MSA-consistent N-event indel histories. A MSA-counterpart of a PAS is a gapless column, which indicates that the corresponding site was hit by no indel throughout the evolution along T. (Hereafter, a gapless column in a MSA is also called a "PAS.") Using some PASs, we partition the MSA into local regions, C 1 ; C 2 ; …; C Κ max . Meanwhile, there are infinitely many possible root sequence states (s Root ' s) consistent with the MSA. Among them, we choose one as the "reference" root state (s 0 Root ). Then, in addition to the aforementioned conditions (i) and (ii), we impose the following condition. Under these conditions, the MSA probability is factorized as: Here, P 0 [s 0 Root | T] is the probability that the root sequence state is s 0 Root and that it was hit by no indel all across T. And e Μ P α s 1 ; s 2 ; …; s N X ½ ; s Root 0 ; C Κ T j Â Ã is the multiplication factor contributed from the local region, C K . As in Eq. (4), the multiplication factor also can be expressed as a series: Here, N min [C Κ ] is the minimum number of indels required for the portion of the MSA in C K . And the term Μ is the fraction of the multiplication factor contributed from all local-MSA-consistent N-indel local histories in C K . (For more details, see the second half of SM-1 of Supplementary methods in Additional file 1.) As agued above, under conditions (i) and (ii) (and, in addition, (iii) for a MSA), the probability of a given alignment is factorized into the product of an overall factor and local contributions (as in Eq. (2) and Eq. (5)). This factorization could drastically speed up the computation of the probability. However, each local contribution is still a summation over an infinite number of indel histories (Eq. (3) and Eq. (6)), and its literally exact calculation would take infinitely long. This study examines two kinds of approximation methods. One is the "first approximation," which approximates each multiplication factor with the "total parsimonious contribution," i.e., the summation of contributions over all possible parsimonious indel histories. (It corresponds to the term, ) And the other is calculating a practically exact solution (or an "exact" solution, for short) of each local contribution from a local PWA of a certain type (see section R3). Especially, we examine the accuracy of the first approximation by comparing the total parsimonious contribution either with the total next-parsimonious contribution (sections R2 and R4), with the "exact solution" (section R3) or with the results of simulations (section R5). Here, the "total next-parsimonous contribution" is the summation of contributions over all next-parsimonious indel histories. (It usually corresponds to the above term with In the following sections, we will work with a model that satisfies the conditions (i), (ii) and (iii), and we will focus on calculating the multiplication factor that comes from a single local region (i.e., a "local alignment") flanked by a pair of PASs. As in [22], we will work in the state space S II . This means that we will calculate the probability of the homology structure of each local alignment (e.g., [39]). Let ΔL(s) be the number of sites that a sequence s ∈ S II has between the pair of PASs. We will re-assign the site numbers so that the left-and right-flanking PASs are numbered 1 and ΔL(s) + 2, respectively, and the sites in between them are numbered 2, …, ΔL(s) + 1. This will make it easy to apply our theoretical formulation [22] to the current situation. We will re-assign the ancestries υ(1) = L and υ(ΔL(s) + 2) = R to the left-and right-flanking PASs, respectively. (See endnote (1) for a brief description of the ancestry.) And we will usually (but not always) reassign the ancestries υ(x) = x − 1 to the sites in between the PASs, x = 2, …, ΔL(s) + 1, of the ancestral sequence, s = s A (for a PWA), or the root sequence, s = s Root (for a MSA). See Fig. 1 for an illustration.
Hereafter, we will often employ shorthand notations for the aforementioned (fractions of ) multiplication factors, e.g., for a MSA, either by omitting the arguments (like "μ P (N) " and "Μ ⌣ P N ð Þ ") or by replacing the arguments with simpler ones representing more concrete situations (like "μ P (N) [case (ii); . Unless stated otherwise, we consider the sequence evolution during time interval [t I , t F ] (for a PWA) or along a given tree, T (for a MSA). For illustration, we will use the indel evolutionary model of Dawg [32], though the analyses could be extended with due modifications to more general models (discussed in subsection R7.1). 4 Its indel rates are spacehomogeneous and time-homogeneous, and they are parametrized as follows. Let L(s) be the length of the sequence with state s. The rate of insertionM I x; l ð Þis: Here, λ I is the total insertion rate (per site), f I (l) is the insertion length distribution, and L I CO is the cut-off insertion length. The rate of deletionM D x B ; x E ð Þis: Here, λ D is the total deletion rate (per site), f D (l) is the deletion length distribution, and L D CO is the cut-off deletion length. Consequently, the exit rate from state s is: a b Fig. 1 Notation and re-numbering of sites typical in this study. a An example MSA. The C κ 's (with κ = 1, 2, …, 10) label the regions that actually or potentially accommodate local indel histories (marked by bottom curly brackets and yellow wedges, respectively). As an illustration, we choose the local MSA confined in the region C 4 (the portion in the green dashed box), and re-assign the ancestry indices (in the cells) as shown in panel (b). Ancestries L and R were newly assigned to the flanking PASs. The ancestry indices in between the PASs are just examples and not always assigned in this way. (The indices in panel (a) could be regarded as hexadecimal numerals, if preferred.) c Subsequences extracted from the local MSA. Shown above each site (i.e., each cell) is the site number (i.e., spatial coordinate) re-assigned to it. And the ΔL(s) on the right of each sequence (s = s 1 , …, s 6 or s Root ) is the count of sites in between the PASs. In panels a and b, the boldface characters in the leftmost columns stand for the sequence IDs. (More precisely, the number 'i' stands for sequence s i , and the 'R' stands for the root sequence (s Root ).) A dash ('-') in a cell represents a gap. In panel b, triple dots ('…') in a cell indicate that the alignment continues outwards. Panel a was adapted from Figure S2 b of [22]. Panels b and c were adapted from Figure 1 Here, versal" constant factor, and l D ≡ X L D CO l¼1 l g D l; t ð Þ is the average deletion length [32]. In this study, we use the power-law indel length distribution: , which is among the typical ones empirically observed (e.g., [24] and references therein). We also set λ I = λ D according to a genome-wide data analysis [41], unless otherwise stated. As for the sequence state probabilities at the root, we assume a uniform sequence length distribution hereafter. 5 See sections M1 and M2 of Methods for more specific settings.

R2. Numerical comparison between parsimonious and next-parsimonious contributions (1): for local PWAs
Here, we examine how accurately the first approximation will estimate the multiplication factor from each local PWA by comparing the total parsimonious contribution with the total next-parsimonious contribution, both calculated via numerical computations of their analytical expressions (given in SM-2 of Supplementary methods in Additional file 1). In this study, we are concerned only with the homology structures [39] of alignments. Hence, local PWAs flanked by a pair of conserved ancestral sites (PASs) can be broadly classified into four cases, according to the sites between the PASs: (i) no ancestral or descendant sites (panel a of Additional file 1: Figure  S1); (ii) some (ΔL A > 0) ancestral sites but no descendant sites (panel b); (iii) some (ΔL D > 0) descendant sites but no ancestral sites (panel c); and (iv) some (ΔL A > 0) ancestral sites and some (ΔL D > 0) descendant sites, but with no ancestor-descendant homology (panel d). 6 (See Additional file 1: Figure S2 for parsimonious and next-parsimonious indel histories in case (ii), and Figure S3 for parsimonious histories in case (iv).) Our numerical analyses indicated the following. In case (i), the total next-parsimonious contribution (μ P(2) [case (i)]) was negligibly smaller than the total parsimonious contribution (μ P(0) [case (i)] (=1)) for any realistic situation we likely encounter, as far as a single inter-PAS region is concerned. In case (ii) and case (iii), the total next-parsimonious contribution (μ P (2) [case (ii); ΔL A ] or μ P (2) [case (iii); ΔL D ]) amounted to 1/2 of the total parsimonious contribution (μ P(1) [case (ii); ΔL A ] or μ P(1) [case (iii); ΔL D ]), when the size of the local PWA (i.e., ΔL A or ΔL D ) is equal to a threshold value, (ΔL) 0.5 (NP) ≈ 1.2/ E[n ID ] (Additional file 1: Table S1 and Figure S4). Here is the expected number of indels per site during the sequence evolution. For example, in typical analyses of neutral genomic sequences from eutherian mammals, the branch length is around 0.2 expected substitutions per site (e.g., [42]). And the total indel rate was estimated as 1/8 of the total substitution rate [35].
Using these values, E[n ID ] is approximately 0.2/8 = 0.025, which gives the threshold (ΔL) 0.5 (NP) roughly equal to 50 sites. For the analyses of more closely related sequences, the threshold becomes longer. For example, in a comparison between primate sequences, a typical branch length would be 0.05 expected substitutions per site (e.g., [42]). Then, (ΔL) 0.5 (NP) would be roughly equal to 200 sites. In case (iv), the total next-parsimonious contribution (μ P(3) [case (iv); ΔL A , ΔL D ]) did not substantially exceed 1/2 of the total parsimonious contribution (μ P (2) [case (iv); ΔL A , ΔL D ]) until the local PWA or the time interval became quite long (Table 1). For more details on these analyses, see SM-2 of Supplementary methods in Additional file 1. (Further details on the calculations for cases (iii) and (iv) are given in sections A1.1 and A1.2, respectively, in [43].)

R3. Numerical comparison between parsimonious contribution and "exact solution" for local PWAs
It is difficult to calculate the summed contributions from local histories involving more indels, especially in case (iv). We could exactly calculate the contribution from a single local history involving any number of indels if we use the algorithm for a "trajectory likelihood" given by Miklós et al. [26]. As we exemplified in Appendix A1.2 of [43], however, it is already quite hard to enumerate even all the possible next-parsimonious local indel histories for case (iv). Nevertheless, if we consider only cases (i), (ii), and (iii) under a (locally) space-homogeneous model, we can work out systems of exact integral equations that could in principle provide the numerical solutions for the total sum of contributions up to a desired level of accuracy, i.e., , at the expense of some time and memory.
Applying the fundamental defining integral equations of our evolutionary model (Eqs.(R4.4, R4.5) in [22]) to the local MSAs of cases (i), (ii) and (iii), two systems of integral equations can be derived. One system is for cases (i) and (ii) (see SM-3 of Supplementary methods in Additional file 1), and the other is for cases (i) and (iii) (described in Appendix A1.3 of [43]). 7 These systems of integral equations can be numerically solved by iteration, and the results after N ID iterations give the aforemen- . Typically, the computation finished in the order of an hour (using a single thread) in a Macintosh computer with two quad-core 2.26 GHz Intel Xeon processors and 8GB memory. One round of the computation provides the multiplication factors for all local PWA sizes (ΔL = 0, 1, …, L CO ) and for time-interval sizes of k (t F − t I )/N P with k = 1, 2, …, N P . Our numerical analyses confirmed that N ID = 200 would be enough to give the practically exact (or "exact") solution for the local PWAs of 300 sites or less in likely situations of phylogenetic-level sequence analyses (data not shown). Thus, we used the results of N ID = 200 iterations as the "exact" multiplication factors, and compared the parsimonious contributions with them (Fig. 2). We define another threshold value, (ΔL) 0.5 (1) , at which the parsimonious contribution becomes 1/2 of the "exact" solution. The results indicated (ΔL) 0.5 (1) ≈ 1.6/E[n ID ] (Additional file 1: Table S1). Thus, (ΔL) 0. 5 (1) is approximately 4/3 of (ΔL) 0.5 (NP) in the previous section, implying that (ΔL) 0.5 (NP) actually gives a somewhat conservative criterion for the goodness of the first approximation. A fringe benefit of these iteration analyses is that we can also assess the "n-th approximation," which is given byμ n h i P (Fig. 2). We define (ΔL) 0.5 (n) as the local PWA size at which the n-th approximation becomes 1/2 of the "exact" solution. It seemed that (ΔL) 0.5 (n) > n × (ΔL) 0.5 (1) in general (Additional file 1: Table S1). This suggests the benefit of incorporating non-parsimonious local indel histories, especially when we deal with long local PWAs resulting from a long time evolution. Now that we have "exact" probabilities for local PWAs of cases (i), (ii), and (iii), it would be interesting to examine their behaviors. Panel a of Fig. 3 shows the log-log plots of exact solutions for different time intervals (in units of the expected number of indels per site). We see that even finite-time transition probabilities are well approximated by the power-law, with very high correlation coefficients for the log-log plots (0.9998 or more in the absolute value, see Additional file 1: Table S2). And panel b of Fig. 3 indicates that, as the time interval increases, the power-law exponent deviates gradually (yet only slightly) from its value for the instantaneous indel rates (1.6 here). Meanwhile, the coefficient seems almost proportional to the time interval (panel c). The slopes for these quantities differed for different values of the ratio, λ I : λ D . (Additional file 1: Table S2 gives also their numerical values for some representative cases.) These results may be useful for future data analyses on indels, e.g., when inferring the power-law exponents for the indel rate parameters from the comparison of relatively divergent homologous sequences.
Similarly to our ab initio formulation itself, these systems of integral equations can accommodate any practical indel length distributions. Therefore, we could even examine cases where insertions and deletions follow different length distributions and/or models that incorporate transposon insertions (e.g., [44,45]) as well. Such analyses should be interesting and important.

R4. Numerical comparison between parsimonious and next-parsimonious contributions (2): for local MSAs
We next studied some typical cases of local MSAs. (The analytical calculations of the multiplication factors are detailed in SM-4 of Supplementary methods in Additional file 1.) We only examined MSAs resulting from the evolution along a 3-OTU tree (Fig. 4a). This is because a next-parsimonious indel history typically differs from its parsimonious counterpart in the sequence state at the internal node that phylogenetically delimits an indel event. We examined the following four cases, which differ in the sets of homologous sites in between the PASs: (I) none of the three sequences has any site (Fig. 4b); (II) two sequences share a homologous run of sites, but the third sequence has no site (Fig. 4c); (III) one sequence has a run of sites, but the other two sequences have no site (Fig. 4d); and (IV) one sequence (s 1 ) has a run of sites, another sequence (s 3 ) has no site, and yet another sequence (s 2 ) shares the homologous sites   This table is identical to Table 2 of [43] with s 1 except a contiguous subset of sites it lacks (Fig. 4e).
In case (I), similarly to case (i) local PWAs, the total next-parsimonious contribution Μ ⌣ P 2 ð Þ case I ð Þ ½ À Á was negligibly smaller than the total parsimonious contribution Μ The comparison in case (II) reduces to that in case (ii) local PWAs, thanks to the phylogenetic consistency condition that the ancestral sequence states must satisfy (e.g., [46,47] Our numerical analyses showed that this contribution is much smaller than the parsimonious contribution (Fig. 5), even if the branch or the local MSA is quite long. Actually, the relative contribution decreased as the local MSA got longer (Fig. 5). In case (IV), the total next-parsimonious contribution Μ  (Table 2). 8 Taken together, the results in sections R2-R4 suggest that the first approximation by the parsimonious indel histories alone will estimate the multiplication factor for each local alignment fairly well, as long as the local alignment size and the branch lengths (or the time interval) are at most moderate.

R5. Simulation analyses to see goodness of first approximation for local MSAs
Thus far, we examined all cases of local PWAs and some typical cases of local MSAs. To study a much wider variety of local MSAs, we developed an algorithm that calculates the first approximation of the probability of a given MSA under a given parameter setting including a phylogenetic tree. Briefly, the algorithm first chops the input MSA into gapped and gapless segments. Second, it attempts to enumerate all parsimonious indel histories that can give rise to each gapped segment (i.e. local MSA) via what we call a "local multi-path downhill search" algorithm. Third, it computes their contributions to the multiplication factor for each gapped segment. And finally, it computes the first-approximate MSA probability as the product of an overall factor and the total parsimonious contributions to the multiplication factors from all gapped segments. (For details on the algorithm, see SM-5 of Supplementary methods in Additional file 1, as well as Additional file 1: Figures S5-S8.) After manually validating the sub-algorithm to enumerate all parsimonious indel histories (described in [48]), we conducted simulation analyses using Dawg [32]. (See section M2 of Methods for the settings of the simulations.) We created five homogeneous sets of simulated MSAs, namely, sets 1A, 1B, 3P, 3M and 3F. Each of sets 1A and 1B consists of 100,000 MSAs simulated along a three-OTU tree that has equally long branches and is rooted at its sole internal node. The expected number of indels per site along each branch (E[n ID ]) is 0.01 (small) for set 1A and 0.04 (medium) for set 1B. Sets 3P, 3M and 3F consist of 10,000 MSAs each, which are simulated along the trees of 12 primates, 15 mammals and 9 fast-evolving mammals, respectively (Additional file 1: Figure S9). These sets were designed to mimic typically encountered MSAs of selectively neutral genome sequences whose sequence divergences are small, moderate and large, respectively. 9 Every simulation started with a random DNA sequence that is 1000 bases long. For reasons of computational time, we excluded local MSAs containing gaps longer than 100 bases. The  For each of sets 1A and 1B, we compared the absolute occurrence frequency of each of the homology structures of local MSAs with its first approximate prediction (Fig. 6). (See SM-7 in Additional file 1 for how the analysis was performed.) The approximation was pretty good for both set 1A (Fig. 6a) and set 1B (Fig. 6b), with correlation coefficients 0.9996 and 0.9975, respectively (Additional file 1: Table S3). The scatter plot for set 1B (Fig. 6b) Table S2). Panels b and c show the power-law exponent (γ) and the coefficient (A), respectively, as functions of the distance ((λ I + λ D )(t − t I ) indels/site, abscissa) and the rate ratio (λ I : λ D , different curves). Here, we assumed the approximate power-law (See Additional file 1: Table S2 also for the results of correlation and regression analyses.) Note that the results apply also to case (ii) local PWAs with due modifications one or more unobservable indels are expected (Fig. 6c). This extends the conclusions in sections R2 and R3 that the first approximation estimates the probability of a local PWA fairly well as long as its size is within a threshold ((ΔL) 0.5 (NP) or (ΔL) 0.5 (1) ). Then, for each of simulated sets 1A, 1B, 3P, 3M and 3F, we first examined the relative frequencies of actual occurrences among different sets of parsimonious ancestral states consistent with each local MSA. Then we put the relative frequencies into 20 bins of width 0.05 each. And, finally, we compared the average frequencies in the bins with their first-approximate predictions ( Fig. 7 and Additional file 1: Figure S10). The predictions were shown to estimate the actual relative frequencies quite well, with the correlation coefficients ranging from 0.997 to 0.99999 (Additional file 1: Table S4). (See SM-8 in Additional file 1 for how this analysis was done. ) The results in this section suggest that the first approximation would work fairly well also for a majority of local MSAs we are likely to encounter, as long as the local MSA and the branches are at most moderately long. 10

R6. Examining other indel models and methods in light of our formulation
One of the major merits of our ab initio perturbative formulation is that it can be applied to considerably realistic evolutionary models of indels [22]. Therefore, it will enable us to examine, e.g., the parameter ranges where other indel probabilistic models can well approximate the ab initio alignment probability under a fairly realistic evolutionary model.
First we briefly study the goodness of approximation by the geometric indel length distribution, which most of commonly used indel models are based on. As in the previous sections, we use the power-law distribution of the indel length (l), f 1.6 , as a reference. Then, we fitted the scaled geometric distribution, f SG (l; A, q) = A (1 − q) q l − 1 , to the power-law under a least-square criterion. The best-fit parameters were A LS = 0.7125 and q LS = 0.3957. Calculating the ratio, R LS (l) ≡ f SG (l; A LS , q LS )/f 1.6 PL (l), for different indel lengths, we find, e.g., R LS (5) = 0.3168, R LS (7) = 0.08495 and R LS (13) = 8.775 × 10 − 3 . The ratio decreases rapidly as the indel length increases. If, for example, we allow the ratio as small as 1/3, the geometric distribution is regarded as a decent approximation only for l ≤ 4. With f 1.6 PL (l), the indels with l ≥ 5 account for about 30 % of all indels. These mean that, for 30 % of actually occurring indels, the geometric distribution substantially underestimates their frequencies even according to the above lenient criterion. This reconfirms the importance of using biologically realistic indel length distributions, as pointed out, e.g., in [23,24]. 11 Next, as an example of indel models that incorporate some biological realism, we investigate the HMM of Kim and Sinha [36]. Their HMM can accommodate power-law indel length distributions. However, similarly to most other HMMs and transducers, it cannot correctly handle overlapping indels along the same branch, although it can handle overlapping indels along different branches. Another characteristic of their method is that it applies the same indel length distributions to all branches. The behaviors of the "exact" solutions (Fig. 3, Additional file 1: Table S2) indicate that their HMM could approximate the probabilities of local PWAs fairly well in cases (i), (ii) and (iii), as long as branch lengths are reasonable for phylogenetic analyses. Almost the same conclusions were drawn also from the analyses using up to next-parsimonious contributions in the perturbation expansion, Eq. (3). (See SM-9 of Supplementary methods in Additional file 1.) Regarding case (iv) local PWAs, however, our analysis indicated that their HMM could substantially underestimate the ab initio probabilities, especially when long indels are involved (Table 3). (How we performed this analysis is also described in SM-9.) This is because their HMM, like most HMMs and transducers, neglects 1/3 of the effects of non-overlapping indels (panels a, b and c of Additional file 1: Figure S3), as well as most effects of overlapping ones (panels d and e), that can yield each case (iv) local PWA. (Briefly, most HMMs neglect one of panels b and c. See SM-9 for details.) These results suggest that a potentially effective way to improve the accuracy of the HMM of Kim and Sinha [36] would be to modify the transition probabilities between a deletion-type block and an insertion-type block. This measure will enable to incorporate the effects of overlapping indels in case (iv).
As exemplified above, our ab initio perturbative formulation provides other indel probabilistic models with a sound reference point, under which the models can be examined to improve their accuracy and evolutionary consistency. A related topic is the Chapman-Kolmogorov (CK) equation, which must be satisfied by genuine stochastic evolutionary models. Unfortunately, most of the currently common indel probabilistic models violate the CK equation (as argued, e.g., briefly in [41]). Because our perturbative formulation satisfies the CK equation up to any desired order of the perturbation expansion (Appendix A3 of [49]), our formulation could also examine the effects of the violation of the CK equation. For example, under a geometric indel length distribution, the effects become conspicuous only when the indel lengths exceed a "critical value" of O(10), where the geometric distribution substantially underestimates  [43].) This seems to explain the results of past studies (e.g., [50,51]), which did not detect remarkable effects of the violation of the CK equation (or the effects of its cause, i.e., inadequately incorporating overlapping indels).

R7. Outstanding issues and possible improvements, extensions and applications
Here, we briefly discuss some outstanding issues and their possible solutions in the forms of methodological improvements and extensions, and also possible applications of the (improved/extended) methods to practical problems. (For more details on most of these topics, see Discussion of [48].)

R7.1. Possible improvements and extensions of our computational methods and algorithms
In this study, we successfully constructed two integral equation systems to calculate "exact" multiplication factors for case (i), (ii) and (iii) local PWAs. For case (iv) local PWAs (Additional file 1: Figure S1d), we only provided methods to analytically calculate the total parsimonious (i.e., 2-indel) and total next-parsimonious (i.e., 3-indel) contributions. Although in principle we could calculate contributions from indel histories with more than 3 indels each, the question should be how we can do this within a reasonable amount of time. Even if we can construct an integral equation system for case (iv), it is expected to contain terms with complex gap configurations, and thus it would be difficult to solve it "exactly." Therefore, a key for this case should be how we can reasonably quickly obtain approximate multiplication factors each of which estimates the exact factor more accurately than the summation over all parsimonious and next-parsimonious contributions. The main purpose for the current algorithm to calculate the first-approximate probability of a given MSA (SM-5 in Additional file 1) was to see whether or not the first approximation works also for local MSAs of general types. This algorithm merely constitutes the first step toward an automatic application of our ab initio perturbative formulation. Consequently, the algorithm still has some rooms for improvements. For example, the algorithm could be a b c Fig. 6 Simulation analyses on absolute alignment frequencies. Each panel compares the predicted absolute frequency of each local homology structure (ordinate) against the number of times that it actually occurred in a simulated dataset (abscissa). The predicted absolute frequency was calculated using Eq. (SM-7.1) in Additional file 1. Note the logarithmic scaling for both axes, which tends to exaggerate sampling errors on the lower-left region in each panel. Panels a, b and c, respectively, show the results with the simulated sets 1A, 1B and set 1B after removing long gapped segments. The panels a-c are reformatted versions of panels A-C of Figure 29 of [48] very slow when applied to a local MSA containing a long gap. Roughly speaking, the length of time consumed by the algorithm applied to an input MSA is the summation of the lengths of the consumed time over all local MSAs in it. We estimated that the algorithm applied to each local MSA has the time complexity roughly greater than O B2 N ShG   7 Simulation analyses on relative frequencies among local indel histories. Each panel compares the predicted relative frequencies (ordinate) against the actual relative frequencies in simulations (abscissa). The relative frequencies are among parsimonious local indel histories that potentially yield the same local MSA. A blue diamond, a red 'X' and a black cross represent a bin of all parsimonious local indel histories, that of most likely (ML) parsimonious histories, and that of least likely (LL) parsimonious histories, respectively. Panels a, b and c show the results with sets 1A, 1B and 3M, respectively. See section SM-8 in Additional file 1 for how the analysis was performed. Panels a and b are reformatted versions of panels D and E, respectively, of Figure 29 of [48]  Each row gives values for a local PWA with ΔL A ancestral sites and ΔL D descendant sites in between a pair of PASs. See section M1 of Methods for the parameter setting. When λ I = λ D , the ratio for (ΔL A , ΔL D ) = (L 1 , L 2 ) is identical to that for (ΔL A , ΔL D ) = (L 2 , L 1 ). Thus, we only showed the results for ΔL A ≥ ΔL D . The ratios less than 2 are in boldface. This table is a modified version of Table 4 of [43] a The ratio of the ab initio probability to the corresponding probability given by the HMM of Kim and Sinha [36], in the limit where the time interval (or branch length) approaches zero b The portion of the ratio contributed by overlapping indels (such as those in panels d and e of Additional file 1: Figure S3) branch where the indel resulted in the long gap. (These three branches share the node that accommodates the subsequence aligned with the long gap.) And ΔL is the size of the local MSA.  12 One way to quickly process a local MSA containing long gaps should be to treat the gaps hierarchically, first long gaps alone and second the remaining short and medium gaps (Additional file 1: Figure S11; see Discussion D1.1 of [48] for more details). If this strategy indeed works, the O 2 N ShG À Á component of the time complexity would reduce to O(2N ShG ), because the short gaps that are phylogenetically neighboring the long gap can be handled independently of one another (panel d of Figure  S11). Another very similar strategy should be to narrow down the ancestral sequence states to be searched for, regardless of the presence/absence of gapless columns, by exploiting the "phylogenetic correctness" condition (e.g., [46,47]). The condition must always be satisfied by the ancestral sequence states consistent with MSAs, and thus it should be very powerful.
Another possible improvement should be to incorporate non-parsimonious indel histories so that we can enhance the accuracy of the probability estimation. As in section R4, we can classify non-parsimonious histories into two broad categories: (A) those each of which shares the set of all ancestral sequence states with a parsimonious history, and (B) those that share the set with no parsimonious history. Each non-parsimonious history in category (A) yields the same ancestor-descendant PWAs along all branches as a parsimonious history does (section R7 of [22]). Hence, we could easily incorporate the effects of category (A) histories by using local PWA multiplication factors that take account of nonparsimonious contributions, as we calculated in sections R2 and R3. As the result in section R4 (Fig. 5) indicates, this could considerably improve the accuracy relatively easily. Incorporating histories in category (B) should require an algorithm to systematically enumerate such histories. Some hints may come from the examples in section R4 and SM-4 of Supplementary methods in Additional file 1, and the "branch-and-merge" operation (SM-5.2 of Supplementary methods). The real challenge, though, should be to devise a method to enumerate such histories efficiently. 13 In this paper, we presented the results of computing local alignment probabilities (or multiplication factors) under a space-homogeneous model implemented in Dawg [32], mainly in order to avoid excessive presentational complications. At least theoretically, however, the computational methods (in Additional file 1) could be extended to space-heterogeneous situations relatively easily. All we have to do is substitute space-heterogeneous counterparts for the space-homogeneous indel rates (and exit rates) in the final formulas in SM-2, SM-3 and SM-4 (in Additional file 1), and replace multiplication by some integer factors (such as (ΔL A + 1) in Eq. (SM-3.2) and Eq. (SM-3.4b) in Additional file 1) with summation over possible positions (of a position-flexible indel event). The time integration can be performed analytically (except in SM-3) as long as the model is time-homogeneous. And the computation could be automated as long as the indel rates are specified according to some programmable rules. In some cases, tricks or approximations may be necessary so that the computation (involving the aforementioned summations) can be finished quickly enough. It should be kept in mind, however, that such computation will make sense only if the probabilities of the entire alignments are factorable. This means that the indel rates (and the exit rates) must satisfy the conditions (i) and (ii) (and (iii) for MSAs) explained in section R1, which may bring in some complications. For example, in the most general indel model we currently know to have factorable alignment probabilities (described in subsection R8-3 of [22]), each locally heterogeneous set of indel rates is confined in a region that does not necessarily coincide perfectly with a gapped segment (i.e., a local alignment). When the region accommodates only one gapped segment (and some gapless columns), there should be no serious problem; although each position between contiguous gapless columns may experience some indels, the effects of such indels should be negligible (as shown in sections R2 and R3), allowing us to focus on the single gapped segment. On the other hand, a serious problem may arise when the region accommodates two or more gapped segments. In this case, the contributions from the gapped segments (overlapping the region) can no longer be factorized, and thus all possible relative orders will have to be considered among indels in different gapped segments overlapping the region (while retaining the order in each segment). This could substantially slow down the computation, especially regarding nonparsimonious indel histories (including practically exact solutions), and some new measures may be necessary for reasonably fast computation. In addition, it should be remembered that, in order to pursue further biological realism, one must also overcome some other hurdles, such as more realistic boundary conditions and mutations other than indels and substitutions (discussed, e.g., in section R9 of [22] and [37]).

R7.2. Risks associated with naive applications to reconstructed alignments
Some readers may consider conducting some evolutionary analyses by applying the algorithm presented here to a MSA reconstructed by one of the state-ofthe-art aligners (reviewed, e.g., in [7]). We strongly caution the readers that it would be premature to conduct such analyses at this point. What we demonstrated here is that the algorithm estimates the probabilities quite accurately, provided that it is fed a correct MSA. Unfortunately, however, recent analyses (e.g., [38,52,53]) showed that reconstructed MSAs are considerably error-prone, even if they were reconstructed via stateof-the-art aligners. Thus, a naive application of the algorithm to a reconstructed MSA would likely lead to incorrect predictions. Therefore, the readers should avoid such analyses whenever possible. Even if they need to perform such analyses, the possibility of MSA errors must be fully taken into account when interpreting the results.

R7.3. Possible applications
Originally, we developed our theoretical formulation [22] and the algorithm presented here for the purpose of comparing candidate MSAs in terms of their occurrence probabilities, i.e., their likelihoods. This purpose should be adequately fulfilled considering the accuracy of the predicted probabilities under moderate conditions, as demonstrated in this paper. If the algorithm can be coupled with a sampler that can preferentially explore quite likely regions of the MSA space, we could obtain an approximate probability distribution of MSAs. Such a distribution should be very useful, because a substantial fraction of alignment errors turned out to be due to the stochastic nature of evolutionary processes [38]. In the previous study [22], we showed that the "long indel" model [26] is virtually equivalent to our ab initio formulation under space-homogeneous indel rates. Hence, their dynamic programming (DP) could be applicable to the problem, possibly with some modifications. Although the full version of their DP is quite slow, a device similar to those used recently (e.g., [21,41,54]) might speed up the MSA space exploration. It remains to be seen if such a device could be successfully adapted to our formulation or not. Most of currently available MSA aligners, whether they implement probabilistic or single-optimum-search algorithms, are based on geometric distributions. Because biologically realistic indel length distributions were shown to improve the accuracy of pairwise sequence comparisons (e.g., [23,24]), we expect that this will be the case also with multiple sequence comparisons. (This expectation was partially confirmed in [38].) Up to here, we assumed that the phylogenetic tree is given. In many cases, however, the phylogenetic trees must also be inferred from the input sequence data. A theoretically ideal way would be to infer the joint distribution of MSAs and phylogenetic trees, as it is expected to minimize possible prediction biases (e.g., [18,39,55,56]). A major problem is that such an analysis would be very time-consuming in general. In this sense, the traditional method of inferring the phylogenetic tree from an input MSA (e.g., [5]) and the incorporation of indel information into the method (e.g., [57]) would still be useful. When trying to adapt our algorithm or formulation to any of these methods, we will have to further speed up the calculation of approximate alignment probabilities, especially under the moves in the tree space exploration, such as the nearest-neighbor interchange (NNI) and the sub-tree pruning and re-grafting (SPR). At present, it is a totally open question whether we can really do this without compromising the accuracy of the predicted probabilities. This should be a challenging, formidable yet crucial problem of phylogenetic-level molecular evolution.

Conclusions
In the previous study [22], we proposed a theoretical formulation that facilitates the ab initio calculation of the probabilities of given PWAs and MSAs under the general continuous-time Markov model, which describes the evolution of an entire sequence along a time axis via indels. And we explicitly demonstrated that, under a certain set of conditions, each ab initio alignment probability is factorable into the product of an overall factor and multiplication factors originated from local alignments delimited by preserved ancestral sites, thus providing a sort of generalized HMMs (Eq. (2) and Eq. (5)).
In this study (especially in Supplementary methods in Additional file 1), we provided some methods and an algorithm to concretely calculate the total parsimonious contribution and the total next-parsimonious contribution to the multiplication factor, Eq. (3) or Eq. (6), originated from each local alignment, under spacehomogeneous situations for illustration purposes. Our analyses indicated that even the total parsimonious contribution approximates the multiplication factor fairly well as long as (λ I + λ D )(t F − t I )ΔL is within an O(1) threshold. Here, (λ I + λ D )(t F − t I ) is the expected number of indels per site and ΔL is the local alignment size. Moreover, again under space-homogeneous situations, we deduced two systems of integral equations that can be numerically solved to give practically exact multiplication factors for local PWAs of cases (i), (ii) and (iii). An inspection of the practically exact factors indicated that the finite-time transition probabilities in these local PWAs keep following the power-law, and that the exponent only slightly deviates from the original exponent for the instantaneous indel length distribution. Equipped with these results and new methods, the theoretical formulation we proposed in [22] is expected to provide other indel probabilistic models with a sound reference point, which could suggest necessary modifications to improve the accuracy of the models (as exemplified in section R6).
However, considering that the commonly used aligners are considerably error-prone (as shown e.g., in [38,52,53]), it would be very risky to naively apply the presented algorithm or methods to reconstructed MSAs. Thus, it should be preferable to first develop programs that exploit the fruits of the previous study [22] and this study to accurately estimate the uncertainties in, and rectify the errors of, reconstructed alignments under a genuine stochastic model of sequence evolution via indels that is biologically more realistic than almost all models studied in the past.

M1. Parameter settings for numerical analyses
We performed all numerical analyses in this paper using the space-homogeneous indel model implemented in Dawg [32] (see Eqs. (7)(8)(9)). Unless otherwise stated, the total insertion rate was set equal to the total deletion rate (that is, λ I = λ D ), according to a genome-wide data analysis [41]. We used the power-law indel length distribution for both insertions and deletions: The power-law exponent of 1.6 is among the typical values observed empirically (e.g., [24] and references therein). The cut-off indel length, L CO , was set at 500 sites for the perturbation analyses to assess the goodness of the first approximation (Tables 1 and 2, Figs. 2, 3 and 5; Additional file 1: Tables S1, S2 and Figure S4), whose results were almost independent of L CO . It was set at 5000 sites when assessing the goodness of approximation by the HMM of Kim and Sinha [36] (Table 3), because the result stabilized around this value of L CO . In the perturbation analyses on local MSAs (e.g., Table 2 and Fig. 5), we used a 3-OTU tree with equally long branches (Fig. 4a). The tree was rooted at its sole internal node. For the iterative perturbation analysis (Figs. 2 and 3; Additional file 1: Tables S1 and S2), the sub-time-interval for the numerical time integration was set at E[n ID ]/N P = 0.001 (indels per site). 14 For local MSA analyses (both perturbative and simulation-based), the uniform sequence length distribution was employed as the prior probability of the root sequence state.

M2. Simulations to prepare input MSA sets
To validate the entire algorithm described in section SM-5 of Supplementary methods in Additional file 1, we prepared five sets of MSAs using a genuine sequence evolution simulator, Dawg [32]. We performed all simulations using the same Zipf power-law distribu- with the exponent γ = 1.6 and the cut-off indel length of L CO (= L I CO = L D CO ) = 100 bases. The exponent γ = 1.6 is typical among empirically observed values (e.g., [24] and references therein). The cut-off length was chosen in order to prevent it from taking extremely long to search for parsimonious local indel histories. This is because, with our current implementation, the search could be very time-consuming when a gapped segment contains at least one long gap (see subsection R7.1 of Results and discussion for a possible solution). Each of the simulations started with a random ancestral DNA sequence that is l000 bases long. In each simulation, we labeled all the internal nodes of the input tree, in order to keep the ancestral sequences aligned with the "extant" sequences (at the external nodes). Other parameters and options were set at default values unless otherwise stated. We created five input MSA sets, namely, 1A, 1B, 3P, 3M and 3F.
Set 1A consists of 100,000 MSAs, each of which was simulated along a 3-taxon tree starting at a root with three child branches. The lengths of the three branches were all set at 0.05 (substitutions per base). The total rates of insertions and deletions were set at λ I = λ D = 0.1 (per expected substitution), which are close to the upper-bounds for neutrally evolving mammalian DNA sequences [24,35].
Set 1B is nearly the same as Set 1A, expect that all branch lengths were set at 0.2 (substitutions per base).
We prepared 1A and 1B, because validating the theoretically predicted occurrence probabilities of local homology structures necessitated a large number of MSAs simulated under identical parameter settings.
Each of Sets 3P, 3M and 3F consists of 10,000 MSAs. The settings for these three sets differ only in the phylogenetic tree used for the simulations. The MSAs in these sets were simulated along the tree of 12 primates (panel a of Additional file 1: Figure S9), the tree of 15 mammals (panel b) and the tree of 9 fast-evolving mammals (panel c), respectively. These three sets, 3P, 3M and 3F, were intended to mimic typically encountered MSAs among selectively neutral DNA sequences with small, moderate and large sequence divergences, respectively. The total indel rates for these three sets were set at λ I = λ D = 1/ 16 = 0.0625 (per expected substitution), according to genome-wide data analyses [35,41]. For more details on these three sets, see [38].
The Dawg control files used to generate these simulated datasets, including the phylogenetic trees and indel model parameters, are available as a part of Additional file 2.
Before the analyses, all simulated MSAs were preprocessed so that the MSAs with an identical homology structure will be replaced with a unique representative MSA. See Methods of [38] for details. long indels, more geometric components would be necessary. 12 In SM-8 of Supplementary methods of [38], we roughly estimated the expected mean length L loc T ð Þ À Á of a correct (i.e., not reconstructed) local MSA as: . Here, |T| is the total branch length across the tree (T) (in units of the expected number of substitutions per site). λ D and λ I , respectively, denote the deletion rate and the insertion rate, both per expected substitution. And l D and l I are the average lengths of a deleted subsequence and an inserted subsequence, respectively. Under the parameter setting for the simulation analyses of this study, we have l D ¼ l I ¼ 6:35 . If we set λ D = λ I = 1/16 = 0.0625 according to [35,41], then, L loc T ð Þ exceeds 100 when |T| is greater than 9, which is quite large. Thus, as long as we are dealing with correct MSAs, our firstapproximate algorithm is expected to work on a majority of local MSAs until |T| becomes this large. It should be noted, however, that reconstructed MSAs could be seriously erroneous even if |T| is, e.g., less than 3 [38]. 13 Strictly speaking, the current "local-multi-path downhill search" algorithm is not perfect, in the sense that it misses some deletion-dominated parsimonious local histories that can yield gap configurations belonging to the class of "intersection between cousins" described in [61]. Fortunately, this drawback is not expected to be so serious, because such local histories require at least four indels each and thus should be very rare. 14 We calculated the multiplication factors,μ N ID ¼200 h i P case iii ð Þ; ΔL D ; t I ; t ½ Â Ã with t ∈ [t I , t F ], using E[n ID ]/ N P = 0.001, and compared them to those calculated using E[n ID ]/N P = 0.0005. In each pair of values of ΔL D and t, the difference was within 0.2 % of the multiplication factor itself. Thus we concluded that they are virtually exact.

Additional files
Additional file 1: Supplementary methods (sections SM-1 through SM-9), Tables S1 through S4, and Figures S1 through S11. The sections of Supplementary methods describe methodological details on the (analytical and algorithmic) computations of local alignment probabilities (SM-1,2,3,4,5,9) and on the simulation analyses for validating the algorithm (SM- 6,7,8). Tables S1-S4 show the results of some analyses. Figures S1, S2, S3 and S11 illustrate some important concepts. Figures S4 and S10 show the results of some analyses. Figures S5, S6, S7 and S8 explain the algorithm implemented in LOLIPOG. Figure S9 shows the phylogenetic trees used for the simulations. (PDF 12355 kb) Additional file 2: A package (the version used for the analyses in this paper), which contains Perl modules that implement the key algorithms and formulas, as well as some main Perl scripts that we used for the actual data analyses. (The package is available under the GNU General Public License. The modules and scripts will run on a Mac OS X terminal, and were confirmed to run also on Red Had Enterprise Linux (6.4). And they will probably run on other UNIX platforms as well, although we have not tested whether they indeed do.) The package also contains Dawg control files necessary for creating the simulated MSA sets 1A, 1B, 2, 3P, 3M and 3F. The latest version of the package ("FA_LOLIPOG_P.verxxx") will be available from the "LOLIPOG" (LOg-LIkelihood for the Pattern Of Gaps) project at the FTP repository of the Bioinformatics Organization [58].

Acknowledgements
This study is dedicated to the late Dr. Keiji Kikkawa, a key pioneer of string theory of elementary particle physics and the author's best mentor. The author (KE) would like to greatly thank Dr. Dan Graur at the University of Houston and Dr. Giddy Landan at Heinrich-Heine University Düsseldorf for their indispensable assistance with this study; this study and the previous study [22] were originally born out of the need to score alternative alignments in their project (described in Funding), and they co-authored with KE some unpublished predecessors of these manusctipts (including [43,48,49]). The author is grateful to Dr. R. A. Cartwright at Arizona State University for his useful information and discussions that inspired this study. The author appreciates the logistic support and the feedback of Dr. Tetsushi Yada at Kyushu Institute of Technology, as well as the encouragement and the feedback of Dr. Toshiaki Takitani, a professor emeritus at Nagoya University. The author would also like to thank Dr. Ian Holmes at the University of California, Berkeley, for his critical comments on the previous version of the manuscript and his information on some relevant previous studies, both of which helped substantially improve the manuscript. The author is also grateful to all of the reviewers and handling editors of this manuscript and its predecessors, as their feedback definitely helped improve the manuscript.

Funding
This work was a part of the project, "Error Correction in Multiple Sequence Alignments," which was funded by US National Library of Medicine (grant number: LM010009-01 to Dan Graur and Giddy Landan, then at the University of Houston). The later stage of this work was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan (grant numbers: MEXT KAKENHI Grant Numbers 221S0002, 15H01358, both to Tetsushi Yada).

Availability of data and materials
The data sets supporting the conclusions of this article are either included within the article and its additional files, or available in [43] (bioRxiv DOI: http://dx.doi.org/10.1101/023606) or [48] (bioRxiv DOI: http://dx.doi.org/ 10.1101/023614). (Numerical and simulation analyses described in this article should be reproducible by using the programs provided in Additional file 2, following the instructions contained therein).
Author's contributions Not applicable because the paper was written by a single author (KE).

Author's information
The author (KE) used to be a mathematical physicist, who studied theoretical elementary particle physics and quantum gravitational theories from 1991 till 1999. Then, since 2002, after having studied theoretical biophysics from 1999 till 2002, he has studied molecular evolution (including population genetics), mainly focusing on homology-based computational analyses of DNA and protein sequences. For more detailed information, including the list of his publications, refer to his ORCID record [62].

Competing interests
The author declares that he has no competing interests.

Consent for publication
Not applicable.