General continuoustime Markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable?
 Kiyoshi Ezawa^{1, 2}Email author
https://doi.org/10.1186/s1285901611057
© Ezawa. 2016
Received: 2 March 2016
Accepted: 26 May 2016
Published: 17 September 2016
The Erratum to this article has been published in BMC Bioinformatics 2016 17:457
Abstract
Background
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of columntocolumn transitions or blockwise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the timeaxis. Moreover, currently none of these models can fully accommodate biologically realistic features, such as overlapping indels, powerlaw indellength distributions, and indel rate variation across regions.
Results
Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuoustime Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general “substitution/insertion/deletion (SID) model”. Using the operator representation of indels and the technique of timedependent perturbation theory, we express the ab initio probability as a summation over all alignmentconsistent indel histories. Exploiting the equivalence relations between different indel histories, we find a “sufficient and nearly necessary” set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with factorable alignment probabilities from those without ones. The former category includes the “long indel” model (a spacehomogeneous SID model) and the model used by Dawg, a genuine sequence evolution simulator.
Conclusions
With intuitive clarity and mathematical preciseness, our theoretical formulation will help further advance the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels.
Keywords
Stochastic evolutionary model Insertion/deletion (indel) Sequence alignment probability Factorability Biological realism Powerlaw distribution Rate variation Nonequilibrium evolutionBackground
The evolution of DNA, RNA and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination and other genomic rearrangements (e.g., [1–3]). Some recent comparative genomic analyses revealed that indels account for more base differences between closely related genomes than base substitutions do (e.g., [4–7]). It is therefore imperative to develop a method to reliably calculate the probability of sequence evolution via mutations including indels. Since the groundbreaking works by Bishop and Thompson [8] and by Thorne, Kishino and Felsenstein [9], many studies have been made to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under probabilistic models aiming to incorporate the effects of indels. And the methods have greatly improved in terms of the computational efficiency and the scope of application (reviewed, e.g., in [10–12]). Most of these studies are based on hidden Markov models (HMMs) (e.g., [13]) or transducer theories (e.g., [14]). Even today, the studies on these methods are steadily advancing (e.g., [15, 16]), and it seems that their mathematical and algorithmic bases are about to be established.
However, it is important to remember that these desirable properties alone are not sufficient for a model in natural science to be satisfactory. In addition to having the mathematical (or algorithmic) soundness, a satisfactory model must also approximate well, or at least decently, the real phenomena it is intended to describe. In the case of an indel probabilistic model, there are two key elements for this requisite: one is the evolutionary consistency of the model, and the other is the model’s flexibility to accommodate various biologically realistic features, i.e., the biological realism, of indels.
Regarding the biological realism, past empirical studies revealed some properties of real indels, in addition to their possibilities to affect multiple contiguous residues at a time and to overlap others. Among the most important would be the studies that showed powerlaw distributions of indel lengths (see, e.g., [19] and references therein). On the contrary, standard HMMs and transducers can usually implement geometric distributions of indel lengths, or at best mixed geometric distributions (e.g., [20]), but cannot implement the powerlaw distributions themselves. But some generalized HMMs (or transducers) (e.g., [21, 22]) can incorporate powerlaw indel length distributions. For example, the HMM of Kim and Sinha [22] is quite flexible, and it can incorporate the powerlaw distributions and also do away with the commonly imposed timereversibility. As discussed e.g., in [21] and [23], there is no biological reason for imposing the time reversibility, and they were usually imposed to reduce the computational time. In this sense, the HMM of Kim and Sinha is two steps closer to the biological reality than the standard HMMs (and transducers). Unfortunately, similarly to the standard HMMs and transducers, their HMM is not evolutionarily consistent and thus cannot correctly handle overlapping indels along the same branch, though they can handle overlapping indels along different branches. Another possibly important biologically realistic feature is the indel rate variation across sites (or regions) (e.g., [24]), due to selection and the mutational predispositions (caused, e.g., by the sequence or epigenetic contexts). Thus far, attempts to incorporate this feature have been rare (e.g., [25]), and most studies have handled spacehomogeneous models, whose indel rates are homogeneous along the sequence.
As far as we know, except the models implemented in some genuine sequence evolution simulators (e.g., [26–28]), there is only one class of genuine stochastic evolutionary models discussed thus far that is also considerably biologically realistic, which are the “substitution/insertion/deletion (SID) models” proposed by Miklós et al. [21]. The SID models in general do not impose the aforementioned unnatural restrictions on indels. Moreover, the general SID model can accommodate any indel length distributions, and also some indel rate variations across sites (albeit through the residue state context alone). Unfortunately, however, we have not seen any further theoretical development of the general SID model since it was proposed. Instead, Miklós et al. developed the “long indel” model [21], which is a spacehomogeneous, timereversible SID model. (More precisely, the insertion rate depends on the inserted sequence only through the product of the frequencies of its constituent residues. As mentioned above, the timereversibility was introduced just for computational convenience, and it could be dispensed with if desired.) And they gave a verbal justification that the probability of a PWA under the “long indel” model can be calculated via a generalized HMM, as a product of contributions from “chopzones” delimited by gapless columns. In the present viewpoint, this is the most satisfactory HMM that we know was used for actual sequence analyses, because it satisfies both the evolutionary consistency and the biological realism to some degree. Nevertheless, their justification, although plausible, has two problems. First, it is unclear exactly how their HMM is related to the ab initio probabilities of evolutionary (especially indel) processes under their evolutionary model. And second, their justification takes advantage of the spacehomogeneity of the model, which makes it unclear how their HMM can be extended to be spaceheterogeneous while keeping the evolutionary consistency. To solve these two problems, we need at least to get back to their origin, i.e., the general SID model. It seems, however, that this model has never been theoretically dissected thus far, possibly due to the lack of mathematical or conceptual tools to handle it easily.
In this study, we examine a general continuoustime Markov model of the evolution of an entire sequence via insertions, deletions and substitutions. This model could be regarded as an extension of the general SID model, in the sense that it allows explicit (or inherent) rate variation across sites, not only due to residue state contexts. Such rate variation could be regarded as the effect of, e.g., the epigenetic context and/or the context within the 3D structure of the protein product (e.g., [25]).^{1} To theoretically dissect the ab initio calculation of alignment probabilities under this model, we introduce some useful tools. Among the most important would be the operator representation of mutations, namely, insertions, deletions and substitutions. This enabled us to shift our focus from the trajectory of sequence states, which played a central role in [21], to the history of mutations (especially indels). Moreover, the operator representation enabled to algebraically define the equivalence relationships between two different series of indels. They, in cooperation with the focus shift, enabled to define the local history set (LHS) equivalence classes. These equivalence classes play an essential role when deriving the “sufficient and nearly necessary” set of conditions under which alignment probabilities are indeed factorable, thus providing a sort of generalized HMM. We also adapt techniques from the timedependent perturbation expansion in quantum mechanics [29, 30], expanding each alignment probability into a summation over contributing mutational histories with different numbers of indels. It should be noted, however, that we formally deal with all terms in the expansion.^{2} Thus, at least formally, the probabilities we deal with are exact solutions of the model’s defining equation. For clarity, we will focus on insertions/deletions in the bulk of the manuscript. However, we can also incorporate substitutions; see, e.g., [31] for more details.
Key concepts and results in this paper
Concept/result  Description  Main location 

Ancestry index  An ancestry index is assigned to each site. Sharing of an ancestry index among sites indicates the sites’ mutual homology. As a fringe benefit, the indices enable the mutation rates to vary across regions (or sites) beyond the mere dependence on the residue state of the sequence.  Section R2 (1st and 2nd paragraphs), Fig. 2 
Operator representation of mutations  This enables the intuitively clear and yet mathematically precise description of mutations, especially insertions/deletions, on sequence states. This is a core tool in our ab initio theoretical formulation of the genuine stochastic evolutionary model.  Section R2 (3rd paragraph), Fig. 3 
Rate operator  An operator version of the rate matrix, which specifies the rates of the instantaneous transitions between the states in our evolutionary model. In other words, the rate operator describes the instantaneous stochastic effects of single mutations on a given sequence state.  Section R3, 
Finitetime transition operator  An operator version of the finitetime transition matrix, each element of which gives the probability of transition from a state to another after a finite timelapse. This results from the cumulative effects of the rate operator during a finite timeinterval.  Section R3, 
Defining equations (differential)  1storder time differential equations (forward and backward) that define our indel evolutionary model. They are operator versions of the standard defining equations of a continuoustime Markov model.  Section R3, 
Defining equations (integral)  Two integral equations (forward and backward) that are equivalent to the aforementioned differential equations defining our indel evolutionary model. They play an essential role when deriving the perturbation expansion of the finitetime transition operator.  Section R4, Eq. ( R4.4 ) (forward), Eq. ( R4.5 ) (backward) 
Perturbation expansion (transition operator)  The perturbation expansion of the finitetime transition operator. It was derived in an intuitively clear yet mathematically precise manner, by using the aforementioned defining integral equations.  
Perturbation expansion (ab initio PWA probability)  The perturbation expansion of the ab initio probability of a given PWA, conditioned on the ancestral sequence state, under a given model setting.  
Binary equivalence relation  An equivalence relation between the products of two indel operators each. The relations play key roles when defining LHS equivalence classes.  
Localhistoryset (LHS) equivalence class  An equivalence class consisting of global indel histories that share all local history components. The classes play an essential role when proving the factorability of a given PWA probability.  Section R5, below Eq. ( R5.4 ), (e.g., Fig. 5) 
Factorability ( ab initio PWA probability)  We proved that, under conditions (i) and (ii) (below Eq. (R6.4)), the ab initio probability of a given PWA is factorable into the product of an overall factor and contributions from local PWAs.  
Perturbation expansion (ab initio MSA probability)  The “perturbation expansion” of the ab initio probability of a given MSA, under a given model setting including a given phylogenetic tree.  
Factorability ( ab initio MSA probability)  We proved that, under conditions (i), (ii) (below Eq. (R6.4) and (iii) (Eq. ( R7.8 )), the ab initio probability of a given MSA is factorable into the product of an overall factor and contributions from local MSAs.  Section R7, Eq. ( R7.9 ) 
Totally spacehomogeneous model  Such a model gives factorable PWA probabilities, because the exit rate is an affine function of the sequence length (regardless of whether indel rates are timedependent or not). The indel model of Dawg [26] and the “long indel” model [21] belong to this class.  Subsection R81, Eqs. (R81.1,R81.2), Eqs. ( R81.3 , R81.4 ) 
Equivalence (with caveat) of the “chopzone” method and our ab initio method  We showed that the “chopzone” method in [21], adapted to calculate the probability of a given LHS equivalence class, is equivalent to our ab initio method, at least if the indel model is spatiotemporally homogeneous.  Subsection R81, Supplementary appendix SA3 
Model with simple insertion rate variation  If the deletion rates are spacehomogeneous and the insertion rates depend only on the insertions’ flanking sites, the PWA probabilities are still factorable.  
Spacehomogenous model flanked by essential sites  This kind of model is a simplest example of the indel model whose ab initio PWA probabilities are nonfactorable.  Subsection R82, 
Degree of nonfactorability  The “difference of exitrate differences” (Eq. (R82.4)) could measure the “degree of nonfactorability.”  
Spaceheterogeneous model with factorable PWA probability  We found that a class of indel models with rateheterogeneity across regions (Eqs. (R83.1,R83.2)) have partially factorable PWA probabilities.  Subsection R83, Eqs. ( R83.1 , R83.2 ), Eqs. (R83.3,R83.4,R83.5), Figure S3 
We end this section with two notes. First, in this paper, the term “an evolutionary (or indel) process” means a series of successive mutation (or indel) events with both the order and the specific timing specified, and the term “an evolutionary (or indel) history” means a series of successive events with only the order specified. This usage should conform to the common practice in this field. Second, we will describe the results in the braket notation, similar to that in quantum mechanics [29, 33]. However, those who are unfamiliar with the notation need not worry about it. Our formulation via the braket notation can be proven to be equivalent to the standard formulation of the continuoustime Markov model via the vectormatrix notation. (We refer the interested readers to Supplementary appendix SA1 in Additional file 2.) Therefore, if desired, the symbols of a bra (〈x), a ket (y〉), and an operator (Ô) could be regarded simply as convenient reminders of a row vector, a column vector, and a matrix, respectively.^{4}
Results and discussion
The key concepts and results proposed/obtained in this paper are summarized in Table 1. Readers can use the table to quickly grasp an overview of this paper, as well as to easily locate what they look for. Also, most mathematical symbols are briefly explained in Table S1 in Additional file 1.
R1. Brief review of general SID model

Substitution: \( s={s}_L\omega {s}_R\overset{\rho_S\left({s}_L,\omega, {\omega}^{\prime },{s}_R\right)}{\to }{s}^{\prime }={s}_L{\omega}^{\prime }{s}_R; \) (R1.1)

Insertion: \( s={s}_L{s}_R\overset{\rho_I\left({s}_L,{s}_I,{s}_R\right)}{\to }{s}^{\prime }={s}_L{s}_I{s}_R; \) (R1.2)

Deletion: \( s={s}_L{s}_D{s}_R\overset{\rho_D\left({s}_L,{s}_D,{s}_R\right)}{\to }{s}^{\prime }={s}_L{s}_R. \) (R1.3)
Here, ρ _{ m }(…) with m = S, I and D denote the rates of the substitution, the insertion, and the deletion, respectively, possibly depending on the arguments in the parentheses. In each of the above rules, s and s′, respectively, denote the sequence states before and after the mutation. The symbols s _{ L } and s _{ R } denote the subsequences flanking the mutated portion from the left and from the right, respectively.^{5} These SID models equipped with this “rate grammar” are genuine stochastic evolutionary models, and thus do not usually impose unnatural restrictions on the mutations (except possibly through restrictions on mutation rates). And the most general SID model can accommodate quite general mutation rates, including indel length distributions, by allowing their dependence on the sequence states before and after the mutation.^{6} As far as we know, however, this most general SID model was not theoretically examined further (at least thus far), maybe because adequate mathematical or conceptual tools were not devised and because of some other reasons (mentioned below). In the following sections, we will provide such tools, which in turn will help theoretically dissect an extended version of the most general SID model.
R2. Ancestry indices and operator representation of mutations
First, we slightly extend the framework of the SID models, by assigning an ancestry to each site, which is a unit position in the sequence that accommodates a single residue. Hereafter, we will consider that it is the sites, instead of the residues, that are inserted/deleted. For example, the above example sequence state, s = [ω _{1}, ω _{2}, …, ω _{ L }](∈Ω^{ L }), can be extended as:\( \breve{s}=\left[\left({\upsilon}_1,{\omega}_1\right),\left({\upsilon}_2,{\omega}_2\right),\dots, \left({\upsilon}_L,{\omega}_L\right)\right]\left(\in {\left(\varUpsilon \times \Omega \right)}^L\right) \) (Fig. 2b). Here, υ _{ x }(∈ϒ) is the ancestry index assigned to the xth site of the sequence (with x = 1, 2, …, L). Alternatively, the extended sequence state could also be represented as: \( \breve{s}=\left(\overrightarrow{\upsilon},\overrightarrow{\omega}\right), \) ^{7} where \( \overrightarrow{\upsilon}=\left[{\upsilon}_1,{\upsilon}_2,\dots, {\upsilon}_L\right]\left(\in {\varUpsilon}^L\right) \) is an array of ancestry indices assigned to the sites, and \( \overrightarrow{\omega}=\left[{\omega}_1,{\omega}_2,\dots, {\omega}_L\right]\left(\in {\Omega}^L\right) \) is an array of residue states that fill in the sites. (Note that \( \overrightarrow{\omega} \) corresponds to the sequence state (s) in the SID models (in section R1)). The ancestry indices follow a number of rules: (i) different sites in the same sequence always have different ancestry indices; (ii) the ancestry index of a site remain unchanged as long as the site exists; and (iii) every time when an insertion takes place, new ancestry indices are assigned to the newly inserted sites. Other than these rules, the assignment of the indices is arbitrary. Especially, their values themselves are not so important. The most essential thing is whether two sites of different sequences share the same ancestry index or not; if so, the sites are mutually homologous (actually orthologous unless duplications are considered). Another important thing is the spatial relationship of the site having each ancestry with other sites, especially preserved ancestral sites (PASs, explained shortly). For the space of ancestry indices, ϒ, we will tentatively use the set of all positive integers (Ν _{1}≡{1, 2, 3, …}), although there should be a more appropriate mathematical entity.^{8} Because of the rules imposed above, the space of the extended sequence states (denoted as \( {\breve{s}}^{II} \) in [31]) is included in but never equal to {ϒ × Ω}^{*} = ∪ _{ L = 0} ^{∞} {ϒ × Ω}^{ L }.
This alignment tells that the sites with ancestries 2, 3 and 4 were deleted and that the sites with ancestries 8, 9 and A were inserted. We can also see that the sites with 1, 5, 6 and 7 were preserved during the evolution. We will henceforth refer to such sites as “preserved ancestral sites (PASs)”. The PASs indicate that no indels occurred at or through the sites during the evolution under consideration. Thus, they can be used to narrow down the possible indel histories that might have resulted in the pairwise alignment (PWA) (as argued, e.g., in [21]). A fringe benefit of the ancestry indices is that they enable the mutation rates to vary beyond the dependence on the residue states (section R3). Hereafter, we refer to an array of ancestry indices (like \( \overrightarrow{\upsilon} \)) as a “basic sequence state” (abbreviated as a “sequence state” or a “basic state”), which is a backbone to be fleshed out by the residue states (like \( \overrightarrow{\omega} \)) to give the extended sequence state (like \( \breve{s} \)). Hereafter, the basic sequence state will often be denoted, e.g., as s (Fig. 2c). (Henceforth, symbols like s will never denote a residue state (like \( \overrightarrow{\omega} \)), which was a sequence state in section R1). And S ^{ II }(⊂ϒ * = ∪ _{ L = 0} ^{∞} ϒ ^{ L }) denotes the space of the basic states.
These new tools, the ancestry indices and the operator representation of mutations, will play essential roles in our theoretical development described below. In the SID models [21], each evolutionary process was expressed as a (timerecorded) trajectory of sequence states, each of which was represented by an array of residues (without ancestry assignments). In consequence, an instantaneous transition from a state to the next state was often expressed as a summation of multiple possible mutations. (For example, the transition from \( \overrightarrow{\omega}=\left[A,A\right] \) to \( {\overrightarrow{\omega}}^{\prime }=\left[A\right] \) could result from either \( {\widehat{M}}_D\left(1,1\right) \) or \( {\widehat{M}}_D\left(2,2\right) \)). Ancestry indices help avoid such ambiguous channels by uniquely defining each instantaneous statetostate transition as an action of a single mutation. (In the above example, the former causes the transition from \( \overrightarrow{\upsilon}=\left[1,2\right] \) to \( {\overrightarrow{\upsilon}}_{(1)}^{\prime }=\left[2\right] \), and the latter yields the transition from \( \overrightarrow{\upsilon} \) to \( {\overrightarrow{\upsilon}}_{(2)}^{\prime }=\left[1\right] \)). This, in conjunction with the operator representation of mutations, enables us to shift the focus from the trajectory of sequence states to the history of mutations, especially indels. This shift of focus, as well as the “equivalence relations” between the products of operators (section R5), facilitates the examination of the factorability of alignment probabilities, as we will see in sections R4R6.
R3. Instantaneous transition (or rate) operator and finitetime transition operator
Here \( {\widehat{Q}}_M^{SID}(t)\equiv {\widehat{Q}}_M^S(t)+{\widehat{Q}}_M^I(t)+{\widehat{Q}}_M^D(t) \) is the collection of all mutational transition terms, and \( {\widehat{Q}}_X^{SID}(t)\equiv {\widehat{Q}}_X^S(t)+{\widehat{Q}}_X^I(t)+{\widehat{Q}}_X^D(t) \) is the entire exit rate part, which attenuates the state retention probability at the total exit rate, \( {R}_X^{SID}\left(\breve{s},t\right)\equiv {R}_X^S\left(\breve{s},t\right)+{R}_X^I\left(\breve{s},t\right)+{R}_X^D\left(\breve{s},t\right). \) Actually, the total mutational part, \( {\widehat{Q}}_M^{SID}(t) \), is equivalent to (the extended version of) the “instantaneous rate matrix” of the general SID model (defined by Eq. (1) in [21]), although the two expressions appear quite different from each other. The difference mainly stems from the parametrization and the state representation. We use our parametrization because we believe it to clarify the meaning of each term in the rate operator. And, in this section, the sequence state after the mutation (say, \( \left\langle {\breve{s}}^{\prime}\right \)) was represented as the result of the mutation operator (say, \( {\widehat{M}}_m\left(\dots \right) \)) acting on the state before the mutation (say, \( \left\langle \breve{s}\right \)), like \( \left\langle {\breve{s}}^{\prime}\right=\left\langle \breve{s}\right{\widehat{M}}_m\left(\dots \right) \). This is legitimate because a mutation transfers a subject state uniquely to another state. This state representation will facilitate the unfolding of our theory below.
R4. Perturbation expansion of finitetime transition operator and pairwise alignment probability: brief description
Equation (R4.6) states that the finitetime transition operator (acting on 〈s _{0}) is the collection of the effects of all possible indel histories starting with s _{0}, each weighted by its probability (Eq. (R4.7)). Thus, it mathematically underpins Gillespie’s [34] famous stochastic simulation algorithm, which provides the basis of genuine molecular evolution simulators (e.g., [26–28]). Our derivation of Eq. (R4.6) and Eq. (R4.7) through the integral equation (Eq. (R4.4) or Eq. (R4.5)) bridges Gillespie’s own intuitive reasoning and Feller’s [35] mathematically rigorous proof of the solution.
Here, Η ^{ ID }[N; α(s ^{ A }, s ^{ D })] denotes the set of all indel histories with N indels each that can result in α(s ^{ A }, s ^{ D }), and N _{ min }[α(s ^{ A }, s ^{ D })] is the minimum number of indels required for creating the PWA.
Equation (R4.8) and Eq. (R4.9) are the formal expressions of the occurrence probability of α(s ^{ A }, s ^{ D }) derived in effect from the defining equations, Eqs. (R3.19, R3.20, R3.21), of our evolutionary model. Thus, they are the “ab initio probability” of the PWA. In the following, we will examine its factorability.
R5. Local historyset equivalence class of indel histories
Here, l ^{′}≡min{x _{ E } − x _{ B } + 1, x _{ E }} in Eq. (R5.2c), and l _{2} ^{′} ≡min{x _{ E2} − x _{ B2} + 1, x _{ E2}} in Eq. (R5.2d). If you will, these equivalence relations could be phrased as follows. “The operator representing the event on the left along the sequence will not change whether it comes first or second. The operator representing the event on the right will shift its operational position to the left/right by the number of sites deleted/inserted before its operation, when it comes second”.^{11}
Here \( \widehat{M}\left[k,{i}_k\right] \) is an operator that was obtained from \( {\widehat{M}}^{\prime}\left[k,{i}_k\right] \) through the series of equivalence relations Eqs. (R5.2aR5.2d) that brought Eq. (R5.3) into Eq. (R5.4). As in Eq. (R5.3), the operators in each pair of large square parentheses in Eq. (R5.4) are arranged in temporal order, so that the earliest event in each local subset will come leftmost. But it should be noted that the order among the pairs of large square parentheses is the opposite of the actual spatial order among the local subsets, so that the rightmost one along the sequence (the K th one here) will come leftmost. In this way, the operators in each local subset, e.g., \( \left\{\widehat{M}\left[k,1\right],\dots, \widehat{M}\left[k,{N}_k\right]\right\} \), are exactly the same as those when the events in the subset alone struck 〈s ^{ A }. Thus the series of operators, \( \left[\widehat{M}\left[k,1\right],\dots, \widehat{M}\left[k,{N}_k\right]\right] \), for the k th local subset defines the k th local indel history that was isolated from the global indel history, \( \left[{\widehat{M}}_1,{\widehat{M}}_2,\dots, {\widehat{M}}_N\right] \), on s _{ I } ∈ S.
Now, let us consider a history of N indel events other than \( \left[{\widehat{M}}_1,{\widehat{M}}_2,\dots, {\widehat{M}}_N\right] \). If the temporal operator product of the history is shown to be equivalent to Eq. (R5.4) through a series of Eqs. (R5.2aR5.2d), then it should also be connected to Eq. (R5.3) though another series of Eqs. (R5.2aR5.2d). Therefore, it should be equivalent to \( \left[{\widehat{M}}_1,{\widehat{M}}_2,\dots, {\widehat{M}}_N\right] \) in this sense. Hence, we can define a particular equivalence class, which is the set of all global indel histories that can be “decomposed” into the identical set of local indel histories, such as Eq. (R5.4), only through a series of Eqs. (R5.2aR5.2d), between indel operators separated by at least a PAS. We will call it the “localhistoryset (LHS) equivalence class”. In the equivalence class defined by a local history set (LHS), \( {\left\{\left[\widehat{M}\left[k,1\right],\dots, \widehat{M}\left[k,{N}_k\right]\right]\right\}}_{k=1,\dots, K} \) (with ∑ _{ k = 1} ^{ K } N _{ k } = N), on an initial sequence state s ^{ A } ∈ S ^{ II }, there are \( \frac{N!}{{\displaystyle {\prod}_{k=1}^K{N}_k}} \) LHSequivalent global indel histories beginning with s ^{ A }. Each of the global histories corresponds to a way of reordering N indel events while retaining the relative temporal order among N _{ k } events within the k th local indel history (for every k = 1, …, K).
In the simplest example at the beginning of this section (Eq. (R5.1) and above), the corresponding LHS is: \( \left\{\left[{\widehat{M}}_D\left(2,4\right)\right],\left[{\widehat{M}}_I\left(6,3\right)\right]\right\} \). The LHS consists of two local histories, each of which is a singleindel history (Fig. 5c). As a slightly more complex example, consider the history, \( \left[{\widehat{M}}_D\left(3,3\right),{\widehat{M}}_I\left(5,2\right),{\widehat{M}}_D\left(2,3\right),{\widehat{M}}_I\left(5,1\right)\right] \), illustrated in Fig. 4. This history belongs to the LHS equivalence class represented by the LHS: \( \left\{\left[{\widehat{M}}_D\left(3,3\right),{\widehat{M}}_D\left(2,3\right)\right],\left[{\widehat{M}}_I\left(6,2\right),{\widehat{M}}_I\left(8,1\right)\right]\right\} \), which consists of two 2indel local histories. If this LHS is recast into the form in Eq. (R5.4), we have: \( \widehat{M}\left[1,1\right]={\widehat{M}}_D\left(3,3\right) \), \( \widehat{M}\left[1,2\right]={\widehat{M}}_D\left(2,3\right) \), \( \widehat{M}\left[2,1\right]={\widehat{M}}_I\left(6,2\right) \), and \( \widehat{M}\left[2,2\right]={\widehat{M}}_I\left(8,1\right) \).
R6. Factorability of pairwise alignment probability: brief description
Now we are ready to examine the factorability of the ab initio probability of PWA α(s ^{ A }, s ^{ D }), P[(α(s ^{ A }, s ^{ D }), [t _{ I }, t _{ F }])(s ^{ A }, t _{ I })] in Eq. (R4.9), given the ancestral state (s ^{ A }) at the initial time (t _{ I }). Here the “factorability” means that the PWA probability can be reexpressed as the product of an overall factor and contributions from local regions. Natural candidates for the local regions would be those in between the PASs, because we know that indels never hit or pierced PASs (if the alignment is correct). We are not interested in trivial factorability. Thus, we only consider PWAs (or global histories) each of which requires at least two local indel histories. In the following, we will only briefly describe our demonstration of the PWA probability factorization. Its more detailed yet rather intuitive description is given in Supplementary methods SM2 in Additional file 1. (It is complemented by mathematically rigorous arguments in Supplementary appendix SA2 in Additional file 2).
Condition (i): The rate of an indel event (\( r\left({\widehat{M}}_{\nu };{s}_{\nu 1},{\tau}_{\nu}\right) \)) is independent of the portion of the sequence state (s _{ ν − 1}) outside of the region of the local history the event (\( {\widehat{M}}_{\nu } \)) belongs to.
Condition (ii): The increment of the exit rate due to an indel event (δR _{ X } ^{ ID } (s _{ ν }, s _{ ν − 1}, τ)≡R _{ X } ^{ ID } (s _{ ν }, τ) − R _{ X } ^{ ID } (s _{ ν − 1}, τ), with \( \left\langle {s}_{\nu}\right=\left\langle {s}_{\nu 1}\right{\widehat{M}}_{\nu } \)) is independent of the portion of the sequence state (s _{ ν − 1}) outside of the region of the local history the event (\( {\widehat{M}}_{\nu } \)) belongs to.
See Supplementary appendix SA2 in Additional file 2 for the derivation of the mathematically rigorous version of this set of conditions. (For illustration, in Supplementary methods SM3 in Additional file 1, the factorability of the probability was examined for the simplest concrete LHS equivalence class (in Fig. 5)). Condition (i) is somewhat similar to the “contextindependence” condition imposed on the “long indel” model [21], though our condition is slightly less restrictive. Condition (ii) has never been found or even discussed thus far. In fact, the “long indel” model trivially satisfies this condition (see subsection R81), thus [21] did not need to pay attention to it. However, this condition is not always satisfied. Indeed, as exemplified in subsection R82, some models have nonfactorable alignment probabilities due to the violation of this condition even though condition (i) is satisfied.
Equation (R6.7) states that the PWA probability is factorized into the product of an overall factor (P[([], [t _{ I }, t _{ F }])(s ^{ A }, t _{ I })]) and contributions from regions accommodating local indel histories (\( {\tilde{\mu}}_P\left[\left({\tilde{\Lambda}}^{ID}\left[{\gamma}_{\kappa };\alpha \left({s}^A,{s}^D\right)\right],\left[{t}_I,{t}_F\right]\right)\left\left({s}^A,{t}_I\right)\right.\right] \)’s). Therefore, the set of conditions, (i) and (ii), is sufficient for the factorability of the PWA probability. At present, we are not sure whether the set of conditions is also necessary or not. This may not be the case in the rigorous sense, and there may be some instances with factorable PWA probabilities despite the violation of condition (i) or (ii). Nevertheless, even if there are, we suspect that such cases should be isolated, requiring intricate cancellations of the terms. Thus, we will refer to the conditions (i) and (ii) as the “sufficient and nearly necessary set of conditions” for factorable PWA probabilities.
R7. Factorability of multiple sequence alignment probability: brief description
Thus far, we only examined the probability of a given PWA, conditioned on an ancestral state at initial time. Actually, once we know how to calculate such conditional PWA probabilities, we can build them up along the phylogenetic tree to calculate the probability of a given MSA, as described in the introductions of [13] and [14]. (See also [36] for an essentially equivalent method that appears different.) Here, we basically follow their procedures. However, it should be stressed that the MSA probability here will be calculated ab initio under a genuine evolutionary stochastic model and not under a HMM or a transducer, which is not necessarily evolutionarily consistent. This section briefly explains the derivation of the factorization of an ab initio MSA probability. For details on the derivation, see Supplementary methods SM4 in Additional file 1.
In this section, we formally calculate the ab initio probability of a MSA given a rooted phylogenetic tree, T = ({n}_{ T }, {b}_{ T }), where {n}_{ T } is the set of all nodes of the tree, and {b}_{ T } is the set of all branches of the tree. We decompose the set of all nodes as: {n}_{ T } = Ν ^{ IN }(T) + Ν ^{ X }(T), where Ν ^{ IN }(T) is the set of all internal nodes and \( {N}^X(T)=\left\{{n}_1,\dots, {n}_{N^X}\right\} \) is the set of all external nodes. (The N ^{ X }≡Ν ^{ X }(T) is the number of external nodes.) The root node plays an important role and will be denoted as n ^{ Root }. Because the tree is rooted, each branch b is directed. Thus, let n ^{ A }(b) denote the “ancestral node” on the upstream end of b, and let n ^{ D }(b) denote the “descendant node” on the downstream end of b. Let s(n) ∈ S ^{ II } be a sequence state at the node n ∈ {n}_{ T }. Especially, we use abbreviations: s ^{ A }(b)≡s(n ^{ A }(b)) ∈ S ^{ II } and s ^{ D }(b)≡s(n ^{ D }(b)) ∈ S ^{ II }. Finally, as mentioned in Background, we suppose that the branch lengths, {bb ∈ {b}_{ T }}, and the indel model parameters, {Θ _{ ID }(b)}_{ T }≡{Θ _{ ID }(b)b ∈ {b}_{ T }}, are all given. Note that the model parameters Θ _{ ID }(b) could depend on the branch, at least theoretically.
Here we explicitly showed the branchdependence of the model parameters.
Now, we can show that, if the “condition (iii)” given below in addition to conditions (i) and (ii) is satisfied, we can factorize the MSA probability into a form somewhat similar to Eq. (R6.7) for the PWA probability. In subsection 4.2 of [32], we demonstrated it using the historybased expansion of the MSA probability (i.e., Eq. (R7.4) supplemented with Eq. (R7.2)). In Supplementary methods SM4, we will use the ancestralstatebased expansion (i.e., Eq. (R7.5) supplemented with Eq. (R7.6)), as was only briefly sketched at the bottom of subsection 4.2 of (ibid.). In a MSA, gapless columns play almost the same role as PASs in a PWA. Because of the aforementioned “phylogenetic correctness” condition, a gapless column indicates that no indel event hit or pierced the site. Therefore, gapless columns will partition a MSA into regions each of which accommodates a local subset of every global history. Analogously to the argument above Eq. (R6.7), let \( {C}_1,{C}_2,\dots, {C}_{K_{max}} \) be the maximum possible set of such regions determined by a given MSA and a model setting (including the tree) (Figure S2 in Additional file 1). (As argued in subsection R83, all gapless columns are not necessarily needed to delimit the regions.) Because the summation in Eq. (R7.5) involves the summation over all MSAconsistent root states, it would be convenient to specify a “reference” root state, s _{0} ^{ Root } . It can be anything, as long as it is the state at the root consistent with \( \alpha \left[{s}_1,{s}_2,\dots, {s}_{N^X}\right] \). Technically, one good candidate for s _{0} ^{ Root } would be a root state obtained by applying the Dollo parsimony principle [39] to each column of the MSA, because it is arguably the most readily available state that satisfies the phylogenetic correctness condition along the entire MSA. Then, we will impose the following condition.
Here the multiplication factor, μ _{ P }[s ^{ Root }, s _{0} ^{ Root } , n ^{ Root }; C _{ Κ }], represents the change in the state probability at the root due to the difference between s ^{ Root } and s _{0} ^{ Root } within C _{ Κ }. This equation holds, e.g., when P[(s ^{ Root }, n ^{ Root })] is a geometric distribution or a uniform distribution of the root sequence length.^{14}
R8. Examples: Models with factorable/nonfactorable alignment probabilities
A merit of conditions (i) and (ii) given in section R6 is that they can draw the line between evolutionary models with factorable PWA probabilities and those with nonfactorable ones. To illustrate the use of these conditions, we here give three examples: (1) a simple model with factorable probabilities, (2) a simple model with nonfactorable probabilities, and (3) a nontrivial model with factorable probabilities. (For more examples, see section 5 of [32]).
R81. Totally spacehomogeneous model
Here \( \delta L\left({\widehat{M}}_{\nu}\right) \) is the length change caused by the event, \( {\widehat{M}}_{\nu } \). The rightmost hand side of this equation depends only on \( {\widehat{M}}_{\nu } \) and the time it occurred, but not on the sequence state (s(ν − 1)). Thus, condition (ii) is always satisfied under fully spacehomogenous models, which means that alignment probabilities calculated ab initio (as in section R4) under such models are factorable, as shown in section R6.
An important special case of the spacehomogeneous model is the model used by Dawg [26], whose indel rate parameters are given as: g _{ I }(l, t) = g _{ I;L }(l, t) = g _{ I;R }(l, t) = λ _{ I } f _{ I }(l) and g _{ D }(l, t) = λ _{ D } f _{ D }(l). Because this is a special case of Eqs. (R81.1,R81.2), it naturally provides factorable alignment probabilities. This model is probably among the most flexible indel evolutionary models used thus far. The model accommodates any distributions of indel lengths (f _{ I }(l) and f _{ D }(l)) that are independent of each other, and independent total rates for insertions and deletions (λ _{ I } and λ _{ D }). Some of our studies [40, 41] are mostly based on this model.
Another important special case is the “long indel” model [21], whose (timeindependent) rate parameters are given by: g _{ I }(l, t) = λ _{ l }, \( {g}_{I;L}\left(l,t\right)={g}_{I;R}\left(l,t\right)={\tilde{\lambda}}_l^{(end)} \) (if L(s) > 0), \( {g}_{I;L}\left(l,t\right)={\tilde{\tilde{\lambda}}}_l^{(whole)} \) (if L(s) = 0), and g _{ D }(l, t) = μ _{ l }. This model is less flexible than Dawg’s model, because its indel rates are subject to the detailedbalance conditions: λ _{ l } = (λ _{1}/μ _{1})^{ l } μ _{ l }, \( {\tilde{\lambda}}_l^{(end)}={\left({\lambda}_1/{\mu}_1\right)}^l{\displaystyle {\sum}_{l^{\prime }=l}^{L_D^{CO}}{\mu}_{l^{\prime }}} \), and \( {\tilde{\tilde{\lambda}}}_l^{(whole)}={\left({\lambda}_1/{\mu}_1\right)}^l{\displaystyle {\sum}_{l^{\prime }=l}^{L_D^{CO}}\left({l}^{\prime }1+1\right){\mu}_{l^{\prime }}} \). Like Dawg’s model, this model is a special case of the model defined by Eqs. (R81.1,R81.2). Thus, the alignment probabilities calculated under it are indeed factorable, as verbally justified in [21]. Indeed, we can explicitly show that, as far as each LHS equivalence class is concerned, the indel component of its probability calculated according to the recipe of [21] equals the product of P[([], [t _{ I }, t _{ F }])(s ^{ A }, t _{ I })] and Eq. (R6.2), i.e., the “total probability” of the LHS equivalence class via our ab initio formulation, calculated with the aforementioned indel rate parameters. The proof is given in Supplementary appendix SA3 in Additional file 2. It should be stressed that, although [21] ignored condition (ii), this caused no problem thanks to Eq. (R81.4) satisfied by any fully spacehomogeneous models. Actually, it is this condition (ii) that guarantees the equivalence of the probabilities calculated via the two methods, because it equates each increment of the exit rate of a chopzone with that of an entire sequence. The equivalence can be extended to between PWA probabilities, provided that the contributing local indel histories are correctly enumerated. (We are uncertain about whether this extended equivalence indeed holds, because [21] did not explicitly describe how the local indel histories were enumerated).
These rates satisfy condition (i). Eq. (R81.5) combined with the spacehomogeneous deletion rates, Eq. (R81.2), still gives an exit rate whose increment due to an indel event depends only on the inserted/deleted subsequence (and flanking sites) but not on the regions separated from it by at least a PAS. Hence the model also satisfies condition (ii), thus providing factorable alignment probabilities. Relaxing the spacehomogeneity of deletion rates, however, is somewhat difficult, particularly because of condition (ii). In subsection R83, we will attempt it.
R82. Spacehomogeneous model flanked by biologically essential sites/regions
R83. Model with rateheterogeneity across regions
In Supplementary appendix SA3 in Additional file 2, we explicitly showed that the probability of a LHS equivalence class via the recipe of [21] is identical to that calculated via our ab initio formulation. Although we assumed the spacehomogeneity there, the proof can probably be extended to the model in this subsection as well, by slightly modifying the definition of the “chopzone”.
R9. Merits, possible extensions & applications, and outstanding issues
In this paper, we presented a theoretical formulation built up by tools that help mathematically precise dissection of the ab initio calculation of alignment probabilities under genuine stochastic evolutionary models. Another merit of this formulation is that it gives intuitively clear pictures. For example, the insertion and deletion operators simply mathematically represent the intuitive pictures of the indels naturally acting on sequences (Fig. 3). Thus, the action of the rate operator, given by Eqs. (R3.6R3.10) (or Eqs. (R3.11R3.15)), is intuitively understandable as merely the collection of all possible singlemutational channels from a given sequence state (and some compensating terms). Then, the expansion formula for the action of the finitetime transition operator, Eqs. (R4.6,R4.7), can also be intuitively interpreted as the collection of contributions from all possible mutational processes starting with an initial sequence state. Importantly, this expansion was not posed via a handwaving argument but rigorously derived as the solution of the defining equations of the model (Eqs. (R3.19R3.21)), which justifies its ab initio status. And the integral equations, Eqs. (R4.4,R4.5), bridged the expansion’s mathematically rigorous and intuitive aspects. Finally, the binary equivalence relations, Eqs. (R5.2aR5.2d) (e.g., Fig. 5), and their resulting LHS equivalence classes also allow intuitive interpretations as the invariance of the local effects of indels under their relative orders with spatially separate events. Therefore, the conditions for the factorability of PWA probabilities are also intuitively understandable. Although their mathematically rigorous derivations (in Supplementary appendix SA2 in Additional file 2) might appear somewhat formidable, they are actually not so difficult once the aforementioned intuitive pictures are understood. Hence, by coupling the mathematical preciseness with the intuitive clarity, our theoretical formulation is expected to facilitate further advances of the study of ab initio alignment probabilities under genuine stochastic evolutionary models with some biological realism.
For clarity, this study focused only on indels among various mutational types, because indels are essential for creating a sequence alignment. If desired, however, our theoretical formulation could also incorporate substitutions ([31]; see also [42]). Moreover, the formulation could also be extended to incorporate other genome rearrangements, such as duplications and inversions. (See [31] for an initial, rudimentary attempt.) Such an extended formulation will provide tools to enable concrete analyses of “rate grammars” extended to incorporate genome rearrangements (briefly mentioned in [21]).
The practical use of our formulation depends on how efficiently it can calculate quite accurate alignment probabilities. Although the factorability of alignment probabilities will help greatly speed up the computation, the contribution from each local region (e.g., Eq. (R6.8) or Eq. (SM4.22) in Additional file 1) is still composed of infinitely many terms. Good news is that the first approximation of each local contribution, which is the summation of the terms from parsimonious local indel histories alone, is quite accurate, as long as the gap lengths and the branch lengths are at most moderate (Ezawa, unpublished; draft manuscripts: [40, 41]). Thus, considerably efficient computation of considerably accurate alignment probabilities may be possible based on our formulation. Especially, at least when the model is spatiotemporally homogeneous, our ab initio calculation was shown to be equivalent to the calculation of [21], with a caveat (see subsection R81). Thus, their dynamic programming (DP) may be applicable, possibly with some modifications, to a wider class of models with factorable probabilities.
Despite these favorable aspects, our theoretical formulation still has some limitations and outstanding issues. First, we did not examine whether our “sufficient and nearly necessary” set of conditions for the alignment probability factorization is exactly necessary and sufficient or not. Nor did we provide any counterexamples that violate our set of conditions and still have factorable alignment probabilities. Solving these problems may be interesting at least mathematically.
Second, although in this study we tentatively used the set of positive integers to represent the space of ancestry indices (ϒ), it is obviously not the best choice. Finding, or establishing, a mathematical entity (either a set or a space) that is more suitable for representing ϒ should be another mathematically interesting issue.
Third, in section R8 (and in section 5 of [32]), we only considered simple boundary conditions. Each sequence end was either freely mutable or flanked by a biologically essential region that allows no indels. These boundary conditions may remain good approximations if the subject sequences were extracted from wellcharacterized genomic regions. In real sequence analyses, however, the ends of the aligned sequences are often determined by artificial factors, such as the methods to sequence the genome, detect sequence homology, and annotate the sequences. Moreover, the constant cutoff lengths (L _{ I } ^{ CO } and L _{ D } ^{ CO } ) were introduced just for the sake of simplicity, to broadly take account of the effects of various factors that suppress very long indels (such as selection, chromosome size, genome stability, etc.). In reality, it is much more likely that the cutoff lengths will vary across regions. Then, the alignment probabilities would be only approximately factorable, as in subsection R82. In order to pursue further biological realism and to enable more accurate sequence analyses, it should be inevitable to address these issues seriously. Eq. (R82.4) may be useful for such studies.
Fourth, we strongly caution the readers that, at this point, a naïve application of our formulation or its algorithmic implementation [41] to a reconstructed MSA is fraught with high risks of incorrect predictions of indel histories, etc. This is because reconstructed MSAs, even if they were reconstructed via stateoftheart aligners (reviewed, e.g., in [43]), are known to be considerably erroneous (e.g., [42, 44, 45]). Thus, it should be preferable to first develop a method or a program that accurately assesses and rectifies alignment errors, preferably by estimating the distribution of fairly likely alternative MSAs (as, e.g., in [16, 46, 47]), before using our formulation to make some evolutionary or biological predictions.
Fifth, in this study, the phylogenetic tree was regarded as given. In many cases, however, the phylogenetic trees must also be inferred from the input sequence data. A theoretically ideal way would be to infer the joint distribution of MSAs and phylogenetic trees, as it is expected to minimize possible prediction biases (e.g., [13, 48–50]). A major problem is that such an analysis would be tremendously timeconsuming in general. At present, it is a totally open question whether our formulation can be adapted to infer a quite accurate joint distribution efficiently enough.
Conclusions
To the best of our knowledge, this is the first study to theoretically dissect the ab initio calculation of alignment probabilities under a genuine stochastic evolutionary model, which describes the evolution of an entire sequence via insertions and deletions (indels) along the time axis. The model handled here extends the previously most general evolutionary model, i.e., the general form of the “substitution/inserton/deletion models” [21]. It should be noted that we did not impose any unnatural restrictions such as the prohibition of overlapping indels. Nor did we make the preproof assumption that the probability is factorable into the product of columntocolumn transition probabilities or blockwise contributions. The essential tool introduced in this study was the operator representation of indels. This enabled us to shift the focus from the trajectory of sequence states (as in [21]) to the series of indel events, and to define localhistoryset (LHS) equivalence classes of indel histories. Moreover, the operator representation also facilitated the adaptation of the timedependent perturbation expansion (e.g., [29, 30, 33]), which enabled us to express the probability of an alignment as a summation of probabilities over all alignmentconsistent indel histories. Then, under a most general set of indel rate parameters, we exploited the LHS equivalence classes and found a “sufficient and nearly necessary” set of conditions on the indel rate parameters and exit rates under which the ab initio alignment probabilities can be factorized to provide a sort of generalized HMMs. We also showed that quite a wide variety of indel models could satisfy this set of conditions. Such models include not only the “long indel” model [21] and the indel model of a genuine molecular evolution simulator, Dawg [26], but also some sorts of models with rate variation across regions. Moreover, we explicitly showed (in Supplementary Appendix SA3 in Additional file 2) that, as far as each LHS equivalence class is concerned, the probability calculated via the method of [21] is equivalent to that calculated via our ab initio formulation, at least under their spatiotemporally homogeneous indel model.
To summarize, by depending purely on the first principle and by providing intuitively clear pictures, this study established firm theoretical grounds that will help further advance the ab initio calculation of alignment probabilities under genuine stochastic evolutionary models with some biological realism. And our theoretical formulation will also provide other indel probabilistic models with a sound reference point, provided that there exist approximate methods that can quite accurately estimate the ab initio alignment probabilities fairly efficiently. Such approximate methods will be the subject of a related study (Ezawa, unpublished; draft manuscripts available [40, 41]).
Methods
Abbreviations
HMM, hidden Markov model; indel(s), insertion(s)/deletion(s); LHS, local history set; MSA, multiple sequence alignment; PAS, preserved ancestral site; PWA, pairwise (sequence) alignment; SID model, substitution/insertion/deletion model
In a sense, the models implemented in the genuine sequence evolution simulators (e.g., [26–28]) could also be considered as special cases of this evolutionary model.
This poses no problem here because the summation is always bounded from above (by a number less than or equal to unity) and all terms in the summation are nonnegative, which guarantees the convergence of the expansion.
However, the two manuscripts differ in some aspects, e.g., their starting points. In [32], we described our model from scratch, as a continuoustime Markov model, and only briefly discussed its relationship with the general SID model [21]. In contrast, this manuscript explicitly presents our model as an extension of the general SID model, because that would put our model into the historical context of studies on indel evolutionary models.
It may be worth mentioning, though, that the operator notation allows flexible representations of general state changes (by mutations). In the vectormatrix notation, this is possible only via abstract mutation matrices (and state vectors); concrete matrices (and vectors) can at most describe some fixed specific indel histories.
In Eq.(R1.1), ω(∈Ω) and ω ^{′}(∈Ω), respectively, are the residues before and after the substitution. s _{ I } in Eq.(R1.2) denotes the inserted subsequence, and s _{ D } in Eq.(R1.3) denotes the deleted subsequence. And the equation, s = s _{ L } ωs _{ R }, for example, states that the sequence s is formed by concatenating the three factors, s _{ L }, ω and s _{ R }.
Besides, as claimed in [21], the rate grammar could be extended to formally describe a wide variety of evolutionary processes, including duplication, inversion and translocation.
This should be understood as a brief representation. The precise representation is: \( \breve{s}={\left({\overrightarrow{\upsilon}}^t,{\overrightarrow{\omega}}^t\right)}^t \), where A ^{ t } denotes the transposition (i.e., swapping the roles of rows and columns) of matrix A. (See Figure 2b.)
Inspired by profile HMMs (e.g., [51]) and the idea of positionspecific evolutionary rates [25], we originally devised ancestry indices in order to distinguish columns of a given alignment from one another. Their values themselves are not considered so important. Thus, it would be convenient to reassign the ancestry indices as follows, after an indel process (or history) created an alignment, α, whether it is among only extant sequences or among sequence states at all nodes of the phylogenetic tree. (1) Reassign 1, 2, …, α (= the number of columns in α) to the sites corresponding to the columns of α, from left to right. (2) Reassign α + 1, α + 2, …. to the “evanescent” sites with no corresponding alignment columns, again from left to right. This reassignment can be considered as replacing a set of aligned sequence states created by an indel process (or history) with a “representative” set “equivalent” to the former set in the sense that both give the same homology structure [48]. The reassignment may facilitate the understanding of the arguments in sections R5 through R8. (The figures in this paper do not undergo this reassignment, though.)
As far as the sequence (with state \( \breve{s} \)) is concerned, \( {\widehat{M}}_D\left({x}_B,{x}_E\right) \) with x _{ B } < 1, is indistinguishable from \( {\widehat{M}}_D\left(1,{x}_E\right) \), and \( {\widehat{M}}_D\left({x}_B,{x}_E\right) \) with \( {x}_E>L\left(\breve{s}\right) \) is indistinguishable from \( {\widehat{M}}_D\left({x}_B,L\left(\breve{s}\right)\right) \).
In this paper, α(…) (or α[…]) is intended to denote the homology structure [48] of an alignment, and thus the symbol doesn’t care about details on the ancestry indices or residue states filling in the alignment other than the distinction of different columns in the alignment. And, hereafter, the terms “alignment”, “PWA”, and “MSA” mean their homology structures.
Actually, we could also define the equivalence relationships between products of nonseparated indel operators (nonexhaustively listed in Appendix A1 of [32]), and they may assist further theoretical developments. However, discussing these extra equivalences is beyond the scope of this manuscript.
Such a maximum set does not necessarily consist of all PASs in the PWA. An example is given in subsection R83.
The “phylogenetic correctness” condition guarantees that the sites aligned in a MSA column should share an ancestry. The condition could be rephrased as: “if a site corresponding to the column is present at two points in the phylogenetic tree, the site must also be present all along the shortest path connecting the two points”.
HMMs commonly use geometric distributions of sequence lengths. The uniform distribution may be a good approximation if we can assume that the ancestral sequence was sampled randomly from a chromosome of length L _{ C }. In this case, the distribution of the sequence length, L(s)(< < L _{ C }), would be proportional to (1 − (L(s) − 1)/L _{ C }) ≈ 1.
\( {\tilde{\overset{\smile }{\mathrm{M}}}}_P\left[\alpha \left[{s}_1,{s}_2,\dots, {s}_{N^X}\right];{s}_0^{Root};{C}_K\leftT\right.\right] \) should be equivalent to \( {\tilde{\overset{\smile }{\mathrm{M}}}}_P\left[{\tilde{\Lambda}}_{\Psi}^{ID}\left[{C}_{\mathrm{K}};\alpha \left[{s}_1,{s}_2,\dots, {s}_{N^X}\right];T\right]\leftT\right.\right] \) given in Eq.(4.2.9c) of [32], although the two expressions may appear quite different at first glance.
Notes
Declarations
Acknowledgements
This study is dedicated to the late Dr. Keiji Kikkawa, a key pioneer of string theory of elementary particle physics and the author’s best mentor. The author would like to greatly thank Dr. Dan Graur at the University of Houston and Dr. Giddy Landan at HeinrichHeine University Düsseldorf for their indispensable assistance with this study. The author is grateful to Dr. Reed A. Cartwright at Arizona State University for his useful information and discussions that inspired this study. The author appreciates the logistic support and the feedback of Dr. Tetsushi Yada at Kyushu Institute of Technology, as well as the encouragement and the feedback of Dr. Toshiaki Takitani, a professor emeritus at Nagoya University. The author would also like to thank Dr. Ian Holmes at the University of California, Berkeley, for his critical comments on the previous version of the manuscript and his information on some relevant previous studies, both of which helped substantially improve the manuscript. The author is also grateful to all of the reviewers and the handling editors of this manuscript and its predecessors, as their feedback definitely helped improve the manuscript.
Funding
This work was a part of the project, “Error Correction in Multiple Sequence Alignments”, which was funded by US National Library of Medicine (grant number: LM01000901 to Dan Graur and Giddy Landan, then at the University of Houston). The later stage of this work was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan (grant numbers: MEXT KAKENHI Grant numbers 221S0002, 15H01358, both to Tetsushi Yada).
Availability of data and material
The data sets supporting the conclusions of this article are included within the article and its additional files.
Author’s contributions
Not applicable because the paper was written by a single author (KE).
Author’s information
The author (KE) used to be a mathematical physicist, who studied theoretical elementary particle physics and quantum gravitational theories from 1991 till 1999. Then, since 2002, after having studied theoretical biophysics from 1999 till 2002, he has studied molecular evolution (including population genetics), mainly focusing on homologybased computational analyses of DNA and protein sequences. For more detailed information, including the list of his publications, refer to his ORCID record [52].
Competing interests
The author declares that he has no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Graur D, Li WH. Fundamentals of Molecular Evolution. 2nd ed. Sunderland: Sinauer Associates; 2000.Google Scholar
 Gascuel O, editor. Mathematics of Evolution and Phylogeny. New York: Oxford University Press; 2005.Google Scholar
 Lynch M. The Origins of Genome Architecture. Sunderland: Sinauer Associates; 2007.Google Scholar
 Britten RJ. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. P Natl Acad Sci USA. 2002;99:13633–5.View ArticleGoogle Scholar
 Britten RJ, Rowen L, Willians J, Cameron RA. Majority of divergence between closely related DNA samples is due to indels. P Natl Acad Sci USA. 2003;100:4661–5.View ArticleGoogle Scholar
 The International Chimpanzee Chromosome 22 Consotrium. DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature. 2004;429:382–8.View ArticleGoogle Scholar
 The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87.View ArticleGoogle Scholar
 Bishop MJ, Thompson EA. Maximum likelihood alignment of DNA sequences. J Mol Biol. 1986;190:159–65.View ArticlePubMedGoogle Scholar
 Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991;33:114–24.View ArticlePubMedGoogle Scholar
 Rivas E. Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics. 2005;6:63.View ArticlePubMedPubMed CentralGoogle Scholar
 Bradley RK, Holmes I. Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics. 2007;23:3258–62.View ArticlePubMedGoogle Scholar
 Miklós I, Novák Á, Satija R, Lyngsø R, Hein J. Stochastic models of sequence evolution including insertiondeletion events. Stat Methods Med Res. 2009;18:453–85.View ArticlePubMedGoogle Scholar
 Holmes I, Bruno WJ. Evolutionary HMMs: a Bayesian approach to multiple sequence alignment. Bioinformatics. 2001;17:803–20.View ArticlePubMedGoogle Scholar
 Holmes I. Using guide trees to construct multiplesequence evolutionary HMMs. Bioinformatics. 2003;19:i147–57.View ArticlePubMedGoogle Scholar
 BouchardCôté A. A note on probabilistic models over strings: The linear algebra approach. Bull Math Biol. 2013;75:2529–50.View ArticlePubMedGoogle Scholar
 Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics. 2015;16:108.View ArticlePubMedPubMed CentralGoogle Scholar
 Thorne JL, Kishino H, Felsenstein J. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 1992;34:3–16.View ArticlePubMedGoogle Scholar
 Miklós I, Toroczkai Z. An improved model for statistical alignment. In: Gascuel O, Moret BME, editors. WABI 2001, LNCS 2249. Heidelberg: SplingerVerlag; 2001.Google Scholar
 Cartwright RA. Problems and solutions for estimating indel rates and length distribution. Mol Biol Evol. 2009;26:473–80.View ArticlePubMedGoogle Scholar
 Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res. 2008;18:298–309.View ArticlePubMedPubMed CentralGoogle Scholar
 Miklós I, Lunter GA, Holmes I. A “long indel” model for evolutionary sequence alignment. Mol Biol Evol. 2004;21:529–40.View ArticlePubMedGoogle Scholar
 Kim J, Sinha S. Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics. 2007;23:289–97.View ArticlePubMedGoogle Scholar
 Rivas E, Eddy SR. Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol. 2008;4:e1000172.View ArticlePubMedPubMed CentralGoogle Scholar
 Gu W, Zhang F, Lupski JR. Mechanisms for human genomic rearrangements. PathoGenetics. 2008;1:4.View ArticlePubMedPubMed CentralGoogle Scholar
 Rivas E, Eddy SR. Parameterizing sequence alignment with an explicit evolutionary model. BMC Bioinformatics. 2015;16:406.Google Scholar
 Cartwright RA. DNA assembly with gap (Dawg): simulating sequence evolution. Bioinformatics. 2005;21:iii31–8.View ArticlePubMedGoogle Scholar
 Fletcher W, Yang Z. INDELible: A flexible simulator of biological sequence evolution. Mol Biol Evol. 2009;26:1879–88.View ArticlePubMedPubMed CentralGoogle Scholar
 Strope CL, Abel K, Scott SD, Moriyama EN. Biological sequence simulation for testing complex evolutionary hypothesis: indelSeqGen version 2.0. Mol Biol Evol. 2009;26:2581–93.View ArticlePubMedPubMed CentralGoogle Scholar
 Dirac PAM. The Principles of Quantum Mechanics. 4th ed. London: Oxford University Press; 1958.Google Scholar
 Messiah A. Quantum Mechanics, Volume II. (Translated from French to English by Potter J). Amsterdam: NorthHolland; 1961.Google Scholar
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part IV: Incorporation of substitutions and other mutations. bioRxiv. 2015. doi:10.1101/023622. Accessed 4 Aug 2015.
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part I: Theoretical basis. bioRxiv. 2015. doi:10.1101/023598. Accessed 4 Feb 2016.
 Messiah A. Quantum Mechanics, Volume 1. (Translated from French to English by Temmer GM). Amsterdam: NorthHolland; 1961.Google Scholar
 Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977;81:2340–61.View ArticleGoogle Scholar
 Feller W. On the integrodifferential equations of purely discontinuous markov processes. T Am Math Soc. 1940;48:488–515.View ArticleGoogle Scholar
 Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 2005;54:401–18.View ArticlePubMedGoogle Scholar
 Chindelevitch L, Li Z, Blais E, Blanchette M. On the inference of parsimonious evolutionary scenarios. J Bioinform Comput Biol. 2006;4:721–44.View ArticlePubMedGoogle Scholar
 Diallo AB, Makarenkov V, Blanchette M. Exact and heuristic algorithms for the indel maximum likelihood problem. J Comput Biol. 2007;14:446–61.View ArticlePubMedGoogle Scholar
 Farris JS. Phylogenetic analysis under Dollo’s law. Syst Zool. 1977;26:77–88.View ArticleGoogle Scholar
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part II: Perturbation analyses. bioRxiv. 2015. doi:10.1101/023606. Accessed 4 Aug 2015.
 Ezawa K, Graur D, Landan G. Perturbative formulation of general continuoustime Markov model of sequence evolution via insertions/deletions, Part III: Algorithm for first approximation. bioRxiv. 2015. doi:10.1101/023614. Accessed 4 Aug 2015.
 Ezawa K. Characterization of multiple sequence alignment errors using completelikelihood score and positionshift map. BMC Bioinformatics. 2016;17:133.View ArticlePubMedPubMed CentralGoogle Scholar
 Notredame C. Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007;3:e123.View ArticlePubMedPubMed CentralGoogle Scholar
 Löytynoja A, Goldman N. Phylogenyaware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008;320:1632–5.View ArticlePubMedGoogle Scholar
 Landan G, Graur D. Characterization of pairwise and multiple sequence alignment errors. Gene. 2009;441:141–7.View ArticlePubMedGoogle Scholar
 Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E. Genomewide nucleotidelevel mammalian ancestor reconstruction. Genome Res. 2008;18:1829–43.View ArticlePubMedPubMed CentralGoogle Scholar
 Westesson O, Lunter G, Paten B, Holmes I. Accurate reconstruction of insertiondeletion histories by statistical phylogenetics. PLoS One. 2012;7:e34572.View ArticlePubMedPubMed CentralGoogle Scholar
 Lunter GA, Miklós I, Drummond A, Jensen JL, Hein J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005;6:83.View ArticlePubMedPubMed CentralGoogle Scholar
 Suchard MA, Redelings BD. BAliPhy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–8.View ArticlePubMedGoogle Scholar
 Novák Á, Miklós I, Lyngsø R, Hein J. StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics. 2008;24:2403–4.View ArticlePubMedGoogle Scholar
 Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.View ArticleGoogle Scholar
 The ORCID register of Kiyoshi Ezawa. http://orcid.org/0000000349068578. Accessed May 19, 2016.