General continuous-time Markov model of sequence evolution via insertions/deletions: are alignment probabilities factorable?

Background Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. Moreover, currently none of these models can fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions. Results Here, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general “substitution/insertion/deletion (SID) model”. Using the operator representation of indels and the technique of time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent indel histories. Exploiting the equivalence relations between different indel histories, we find a “sufficient and nearly necessary” set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with factorable alignment probabilities from those without ones. The former category includes the “long indel” model (a space-homogeneous SID model) and the model used by Dawg, a genuine sequence evolution simulator. Conclusions With intuitive clarity and mathematical preciseness, our theoretical formulation will help further advance the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1105-7) contains supplementary material, which is available to authorized users.


Background
The evolution of DNA, RNA and protein sequences is driven by mutations such as base substitutions, insertions and deletions (indels), recombination and other genomic rearrangements (e.g., [1][2][3]). Some recent comparative genomic analyses revealed that indels account for more base differences between closely related genomes than base substitutions do (e.g., [4][5][6][7]). It is therefore imperative to develop a method to reliably calculate the probability of sequence evolution via mutations including indels. Since the groundbreaking works by Bishop and Thompson [8] and by Thorne, Kishino and Felsenstein [9], many studies have been made to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under probabilistic models aiming to incorporate the effects of indels. And the methods have greatly improved in terms of the computational efficiency and the scope of application (reviewed, e.g., in [10][11][12]). Most of these studies are based on hidden Markov models (HMMs) (e.g., [13]) or transducer theories (e.g., [14]). Even today, the studies on these methods are steadily advancing (e.g., [15,16]), and it seems that their mathematical and algorithmic bases are about to be established.
However, it is important to remember that these desirable properties alone are not sufficient for a model in natural science to be satisfactory. In addition to having the mathematical (or algorithmic) soundness, a satisfactory model must also approximate well, or at least decently, the real phenomena it is intended to describe. In the case of an indel probabilistic model, there are two key elements for this requisite: one is the evolutionary consistency of the model, and the other is the model's flexibility to accommodate various biologically realistic features, i.e., the biological realism, of indels.
Let us first explain the evolutionary consistency. In natural sequence evolution, the probability (density) of an indel process must be given vertically, as a multiplicative accumulation of the probabilities of transitions between states of an entire sequence, each from one time point to the next one, along the time axis (or along a phylogenetic tree when dealing with MSAs) (Fig. 1a). If a probabilistic model gives the probability density of each evolutionary process according to this natural design, we call it a "genuine stochastic evolutionary model", or simply an "evolutionary model" for short. And we will consider an indel probabilistic model as "evolutionarily consistent", if its alignment probabilities can be derived directly from an evolutionary model, even if it does not appear to follow the aforementioned natural design. By definition, each of HMMs and transducers calculates (the indel component of ) an alignment probability horizontally, as a product of either inter-column transition probabilities or block-wise contributions (Fig. 1c). Therefore, it is a priori unclear whether each HMM or transducer is evolutionarily consistent or not, or, if it is, how (Fig. 1). It would be worth mentioning that some models were indeed derived explicitly from some sorts of evolutionary models (e.g., [9,17,18]). Unfortunately, all such studies in the past imposed some unnatural assumptions, such as the prohibition of overlapping indels and the restriction of deletions to single-base ones. Such assumptions were necessary for an alignment probability to be trivially factorable, at least as a product of blockwise contributions. Thus, they are unsatisfactory from the viewpoint of the second key element, i.e., the biological realism.
Regarding the biological realism, past empirical studies revealed some properties of real indels, in addition to their possibilities to affect multiple contiguous residues at a time and to overlap others. Among the most important would be the studies that showed power-law distributions of indel lengths (see, e.g., [19] and references therein). On the contrary, standard HMMs and transducers can usually implement geometric distributions of indel lengths, or at best mixed geometric distributions (e.g., [20]), but cannot implement the power-law distributions themselves. But some generalized HMMs (or transducers) (e.g., [21,22]) can incorporate power-law indel length distributions. For example, the HMM of Kim and Sinha [22] is quite flexible, and it can incorporate the power-law distributions and also do away with the commonly imposed time-reversibility. As discussed e.g., in [21] and [23], there is no biological reason for imposing the time reversibility, and they were usually imposed to reduce the computational time. In this sense, the HMM of Kim and Sinha is two steps closer to the biological reality than the standard HMMs (and transducers). Unfortunately, similarly to the standard HMMs and transducers, their HMM is not evolutionarily consistent and thus cannot correctly handle overlapping indels along the same branch, though they can handle overlapping indels along different branches. Another possibly important biologically realistic feature is the indel rate variation across sites (or regions) (e.g., [24]), due to selection and the mutational predispositions (caused, e.g., by the sequence or epigenetic contexts). Thus far, attempts to incorporate this feature have been rare (e.g., [25]), and most studies have handled spacehomogeneous models, whose indel rates are homogeneous along the sequence.
As far as we know, except the models implemented in some genuine sequence evolution simulators (e.g., [26][27][28]), there is only one class of genuine stochastic evolutionary models discussed thus far that is also considerably biologically realistic, which are the "substitution/insertion/deletion (SID) models" proposed by Miklós et al. [21]. The SID models in general do not impose the aforementioned unnatural restrictions on indels. Moreover, the general SID model can accommodate any indel length distributions, and also some indel rate variations across sites (albeit through the residue state context alone). Unfortunately, however, we have not seen any further theoretical development of the general SID model since it was proposed. Instead, Miklós et al. developed the "long indel" model [21], which is a space-homogeneous, time-reversible SID model. (More precisely, the insertion rate depends on the inserted sequence only through the product of the frequencies of its constituent residues. As mentioned above, the timereversibility was introduced just for computational convenience, and it could be dispensed with if desired.) And they gave a verbal justification that the probability of a PWA under the "long indel" model can be calculated via a generalized HMM, as a product of contributions from "chop-zones" delimited by gapless columns. In the present viewpoint, this is the most satisfactory HMM that we know was used for actual sequence analyses, because it satisfies both the evolutionary consistency and the biological realism to some degree. Nevertheless, their justification, although plausible, has two problems. First, it is unclear exactly how their HMM is related to the ab initio probabilities of evolutionary (especially indel) processes under their evolutionary model. And second, their justification takes advantage of the space-homogeneity of the model, which makes it unclear how their HMM can be extended to be space-heterogeneous while keeping the evolutionary consistency. To solve these two problems, we need at least to get back to their origin, i.e., the general SID model. It seems, however, that this model has never been theoretically dissected thus far, possibly due to the lack of mathematical or conceptual tools to handle it easily.
In this study, we examine a general continuous-time Markov model of the evolution of an entire sequence via insertions, deletions and substitutions. This model could be regarded as an extension of the general SID model, in the sense that it allows explicit (or inherent) rate variation across sites, not only due to residue state contexts. Such rate variation could be regarded as the effect of, e.g., the epigenetic context and/or the context within the 3D structure of the protein product (e.g., [25]). 1 To theoretically dissect the ab initio calculation of alignment probabilities under this model, we introduce some useful tools. Among the most important would be the operator representation of mutations, namely, insertions, deletions and substitutions. This enabled us to shift our focus from the trajectory of sequence states, which played a central role in [21], to the history of mutations (especially indels). Moreover, the operator representation enabled to algebraically define the equivalence relationships between two . a Probability density calculation via a genuine stochastic evolutionary model. Each sequence state is represented as an array of sites (boxes). Sites to be deleted are shaded in red or magenta. Inserted sites are shaded in blue or cyan. The s I and s F , respectively, denote the initial and final states. The s ν (ν = 1, 2, 3) is an intermediate state. (The "P[…]" denotes a probability, the "p[…]" denotes a probability density.) b A pairwise alignment (PWA) between the initial (I) and final (F) states resulted from the indel process in panel a. The C i (i = 1, …, 10) labels the alignment column below. c Probability calculation via a HMM (or a transducer). It is a priori unclear whether or how the methods in panels a and c are related with each other. For clarity, residue states and substitutions were omitted. (Note that the equation in panel a is merely a rough expression to give a broad idea on the issue. Rigorous expressions will be given in Results and discussion.) Panels a and b of this figure were adapted from panels B and F of Fig. 1 of [32] different series of indels. They, in cooperation with the focus shift, enabled to define the local history set (LHS) equivalence classes. These equivalence classes play an essential role when deriving the "sufficient and nearly necessary" set of conditions under which alignment probabilities are indeed factorable, thus providing a sort of generalized HMM. We also adapt techniques from the time-dependent perturbation expansion in quantum mechanics [29,30], expanding each alignment probability into a summation over contributing mutational histories with different numbers of indels. It should be noted, however, that we formally deal with all terms in the expansion. 2 Thus, at least formally, the probabilities we deal with are exact solutions of the model's defining equation. For clarity, we will focus on insertions/deletions in the bulk of the manuscript. However, we can also incorporate substitutions; see, e.g., [31] for more details.
This paper describes the backbone of our study (more extensively recorded in an unpublished paper [32]) to give the theoretical basis of our ab initio probability calculation under the general continuous-time Markov model of indels. Peripheral topics surrounding the study can be found in [32]. 3 Throughout the paper, we suppose that each probability is calculated under a given evolutionary model setting, including the phylogenetic tree of the sequences. In section R1 of Results and discussion, we briefly review the most general form of the SID model [21]. Then, in section R2, we introduce two important tools, namely, the ancestry index and the operator representation of mutations including indels. Using the results of sections R1 and R2, we define our general continuous-time Markov model in section R3, and formally give the general solution to its defining equation in terms of the operator representation. In section R4, we formally express the ab initio probability of a given PWA in a perturbation expansion. Then, using the concept of the LHS equivalence classes defined in section R5, we derive in section R6 the conditions under which the PWA probability is factorable. In section R7, the derivation is extended to the probability of a given MSA. In section S8, some examples are given to illustrate models with factorable and non-factorable alignment probabilities. The former category includes the indel evolutionary model of Dawg [26] and the "long indel" model [21], among others. In section R9, we discuss the merits, possible uses and extensions, as well as some outstanding issues, of the results in this study. In Table 1, we summarize the key concepts and results of this paper, mainly for those who want its gist quickly. Likewise, Table S1 (in Additional file 1) summarizes mathematical symbols used commonly in this paper, to facilitate the readers' cruise through the equations. Supplementary methods (in Additional file 1) and Supplementary appendix (in Additional file 2) give detailed derivations of some important results. The former is more essential and accessible to a wider audience; the latter is for those who are interested in further mathematical details.
We end this section with two notes. First, in this paper, the term "an evolutionary (or indel) process" means a series of successive mutation (or indel) events with both the order and the specific timing specified, and the term "an evolutionary (or indel) history" means a series of successive events with only the order specified. This usage should conform to the common practice in this field. Second, we will describe the results in the bra-ket notation, similar to that in quantum mechanics [29,33]. However, those who are unfamiliar with the notation need not worry about it. Our formulation via the bra-ket notation can be proven to be equivalent to the standard formulation of the continuous-time Markov model via the vector-matrix notation. (We refer the interested readers to Supplementary appendix SA-1 in Additional file 2.) Therefore, if desired, the symbols of a bra (〈x|), a ket (|y〉), and an operator (Ô) could be regarded simply as convenient reminders of a row vector, a column vector, and a matrix, respectively. 4

Results and discussion
The key concepts and results proposed/obtained in this paper are summarized in Table 1. Readers can use the table to quickly grasp an overview of this paper, as well as to easily locate what they look for. Also, most mathematical symbols are briefly explained in Table S1 in Additional file 1.

R1. Brief review of general SID model
Miklós et al. [21] proposed a class of evolutionary models, which they called the "substitution/insertion/ deletion (SID) models". They are continuous-time Markov models defined on the space of strings (i.e., sequences) of any lengths, each of which consists of letters (i.e., residues, such as bases or amino acids) from a given alphabet (denoted as Ω here). Following [21], their state space will be denoted as: Ω*≡ ∪ L = 0 ∞ Ω L , whose component, Ω L , is the space of all sequences of length L. If desired, a sequence state, s ∈ Ω L , could be represented as: s = [ω 1 , ω 2 , …, ω L ] (with ω x ∈ Ω for x = 1, 2, …, L) (see Fig. 2a). In this model, mutations are defined as transitions from a sequence state to another, and their instantaneous rates can be given via the following "rate grammar" they proposed:

Ancestry index
An ancestry index is assigned to each site. Sharing of an ancestry index among sites indicates the sites' mutual homology. As a fringe benefit, the indices enable the mutation rates to vary across regions (or sites) beyond the mere dependence on the residue state of the sequence.
Section R2 (1st and 2nd paragraphs), Fig. 2 Operator representation of mutations This enables the intuitively clear and yet mathematically precise description of mutations, especially insertions/ deletions, on sequence states. This is a core tool in our ab initio theoretical formulation of the genuine stochastic evolutionary model.
Section R2 (3rd paragraph), Fig. 3 Rate operator An operator version of the rate matrix, which specifies the rates of the instantaneous transitions between the states in our evolutionary model. In other words, the rate operator describes the instantaneous stochastic effects of single mutations on a given sequence state. Totally space-homogeneous model Such a model gives factorable PWA probabilities, because the exit rate is an affine function of the sequence length (regardless of whether indel rates are time-dependent or not). The indel model of Dawg [26] and the "long indel" model [21] belong to this class. Here, ρ m (…) with m = S, I and D denote the rates of the substitution, the insertion, and the deletion, respectively, possibly depending on the arguments in the parentheses. In each of the above rules, s and s′, respectively, denote the sequence states before and after the mutation. The symbols s L and s R denote the subsequences flanking the mutated portion from the left and from the right, respectively. 5 These SID models equipped with this "rate grammar" are genuine stochastic evolutionary models, and thus do not usually impose unnatural restrictions on the mutations (except possibly through restrictions on mutation rates). And the most general SID model can accommodate quite general mutation rates, including indel length distributions, by allowing their dependence on the sequence states before and after the mutation. 6 As far as we know, however, this most general SID model was not theoretically examined further (at least thus far), maybe because adequate mathematical or conceptual tools were not devised and because of some other reasons (mentioned below). In the following sections, we will provide such tools, which in turn will help theoretically dissect an extended version of the most general SID model.

R2. Ancestry indices and operator representation of mutations
First, we slightly extend the framework of the SID models, by assigning an ancestry to each site, which is a unit position in the sequence that accommodates a single residue. Hereafter, we will consider that it is the sites, instead of the residues, that are inserted/deleted. For example, the above example sequence state, s = [ω 1 , ω 2 , …, ω L ](∈Ω L ), can be extended as: (Fig. 2b).
Here, υ x (∈ϒ) is the ancestry index assigned to the x-th site of the sequence (with x = 1, 2, …, L). Alternatively, the extended sequence state could also be represented is an array of ancestry indices assigned to the sites, and ω → ¼ is an array of residue states that fill in the sites. (Note that ω → corresponds to the sequence state (s) in the SID models (in section R1)).
The ancestry indices follow a number of rules: (i) different sites in the same sequence always have different ancestry indices; (ii) the ancestry index of a site remain unchanged as long as the site exists; and (iii) Equivalence (with caveat) of the "chop-zone" method and our ab initio method We showed that the "chop-zone" method in [21], adapted to calculate the probability of a given LHS equivalence class, is equivalent to our ab initio method, at least if the indel model is spatiotemporally homogeneous.
Subsection R8-1, Supplementary appendix SA-3 Model with simple insertion rate variation If the deletion rates are space-homogeneous and the insertion rates depend only on the insertions' flanking sites, the PWA probabilities are still factorable.

Space-homogenous model flanked by essential sites
This kind of model is a simplest example of the indel model whose ab initio PWA probabilities are non-factorable.
Degree of non-factorability The "difference of exit-rate differences" (Eq. (R8-2.4)) could measure the "degree of non-factorability." Space-heterogeneous model with factorable PWA probability We found that a class of indel models with rate-heterogeneity across regions (Eqs. (R8-3.1,R8-3.2)) have partially factorable PWA probabilities. Figure S3 NOTE: Especially important things are in boldface Fig. 2 Sequence states. a A sequence state in the SID models [21]. Each site (cell) is assigned a residue state (A, T, G or C). b The corresponding extended sequence state in our evolutionary model. Each site is assigned an ancestry index (number) in addition to a residue state. c The corresponding basic sequence state. Each site is assigned an ancestry index alone. The ω → represents the set of residue states assigned to all sites.
The υ → represents the set of ancestry indices assigned to all sites. (Note that the identical symbols (s's) used in panels a and c represent different types of states (of the same sequence)) every time when an insertion takes place, new ancestry indices are assigned to the newly inserted sites.
Other than these rules, the assignment of the indices is arbitrary. Especially, their values themselves are not so important. The most essential thing is whether two sites of different sequences share the same ancestry index or not; if so, the sites are mutually homologous (actually orthologous unless duplications are considered). Another important thing is the spatial relationship of the site having each ancestry with other sites, especially preserved ancestral sites (PASs, explained shortly). For the space of ancestry indices, ϒ, we will tentatively use the set of all positive integers (Ν 1 ≡{1, 2, 3, …}), although there should be a more appropriate mathematical entity. 8 Because of the rules imposed above, the space of the extended sequence states (denoted asS II in [31]) is included in but never equal to {ϒ × Ω} * = ∪ L = 0 ∞ {ϒ × Ω} L . We devised the ancestry indices to facilitate the description of indel histories by keeping track of the evolutionary course of each site. For example, consider that a sequence whose initial state had the ancestry indices υ This alignment tells that the sites with ancestries 2, 3 and 4 were deleted and that the sites with ancestries 8, 9 and A were inserted. We can also see that the sites with 1, 5, 6 and 7 were preserved during the evolution. We will henceforth refer to such sites as "preserved ancestral sites (PASs)". The PASs indicate that no indels occurred at or through the sites during the evolution under consideration. Thus, they can be used to narrow down the possible indel histories that might have resulted in the pairwise alignment (PWA) (as argued, e.g., in [21]). A fringe benefit of the ancestry indices is that they enable the mutation rates to vary beyond the dependence on the residue states (section R3). Hereafter, we refer to an array of ancestry indices (like υ → ) as a "basic sequence state" (abbreviated as a "sequence state" or a "basic state"), which is a backbone to be fleshed out by the residue states (like ω → ) to give the extended sequence state (likes ).
Hereafter, the basic sequence state will often be denoted, e.g., as s (Fig. 2c). (Henceforth, symbols like s will never denote a residue state (like ω → ), which was a sequence state in section R1). And S II (⊂ϒ * = ∪ L = 0 ∞ ϒ L ) denotes the space of the basic states.
Another, probably more important, tool we introduce here is the operator representation of mutations. When considering the evolutionary processes (or histories), we symbolically represent each (extended) sequence state as a bra-vector, likes I j h , which could be regarded as an abstract extension of a row vector in the normal representation of the continuous-time Markov model. Then, each mutation of a sequence can be represented as a linear operator (regarded as an abstract extension of a matrix) acting on a bra-vector (Fig. 3). OperatorM S x; ω↦ω 0 ð Þ denotes the substitution of the residue to ω 0 (≠ω) if it was ω at the x th site (or the null action otherwise) (Fig. 3a). OperatorM I x; l ð Þ denotes the insertion of l sites between the x-th and (x + 1)-th sites (Fig. 3b). The insertion operator alone does not determine the residue states of the inserted sites. It is the job of the "fill-in" operator,F x; δω → 0 l ½ , which fills in the l inserted sites with a new array of residues, δω → 0 l ½ ∈Ω l À Á . (This "division of labor" facilitates the decoupling of the substitution component and the indel component of an alignment probability. See Appendix A1 of [31] for more details). Finally, operatorM D x B ; x E ð Þ denotes the deletion of the subsequence between (and including) the x B -th and x E -th sites in the sequence immediately before the event (Fig. 3c). The action of multiple successive mutation events can be expressed as a product of mutation operators on an initial (extended) sequence state. For example, the indel history illustrated in panels a and b of Fig. 4 can be represented as a series of indel events,M D 3; 3 ð Þ;M I 5; 2 ð Þ;M D 2; 3 ð Þ;M I 5; 1 ð Þ Â Ã , on the initial basic state s I (given above). Then, the final result of this indel history is expressed as: Figure 4c shows the MSA among the initial, intermediate and final sequence states. Figure 4d shows the resulting PWA between the initial and final sequence states.
These new tools, the ancestry indices and the operator representation of mutations, will play essential roles in our theoretical development described below. In the SID models [21], each evolutionary process was expressed as a (time-recorded) trajectory of sequence states, each of which was represented by an array of residues (without ancestry assignments). In consequence, an instantaneous transition from a state to the next state was often expressed as a summation of multiple possible mutations. (For example, the transition from ω Þ). Ancestry indices help avoid such ambiguous channels by uniquely defining each instantaneous state-to-state transition as an action of a single mutation. (In the above example, the former causes the transition from υ  ð Þ. The sites to be deleted are shaded in magenta. In this figure, the extended sequence states were used for illustration. The bra-vector below each array denotes the state. The extended state,s, is identical to that in Fig. 2b. Each vertical arrow indicates the action of the mutation operator beside it. Note that the first arguments of all operators and the second argument of the deletion operator specify positions along the sequence, and not ancestries (specified at the top of the sites) as the "equivalence relations" between the products of operators (section R5), facilitates the examination of the factorability of alignment probabilities, as we will see in sections R4-R6.

R3. Instantaneous transition (or rate) operator and finite-time transition operator
Now we are ready to define our evolutionary model, i.e., the general continuous-time Markov model to describe the evolution of an entire sequence along a time axis. If desired, this could be done by extending the rate grammar, Eqs. (R1.1,2,3), so that each mutation rate depend on the entire extended sequence states (i.e., not only on the residue states) immediately before and after the mutation. (We will also introduce the explicit time dependence. This may allow us to incorporate the effects of changes in the physiology, genomic contexts, external environments, etc. (See also section 2.4 of [32].)) However, to define the model more neatly, we will parametrize the mutation rates in coordination with the mutation operators defined in section R2. Although the parametrization is different from (the extended version of ) Eqs. (R1.1,2,3), the two sets of mutation rates are equivalent. (To remember these differences in the parametrization, we will use the symbol r m (…) (m = S, I and D) instead of ρ m (…).) Lets be the extended sequence state immediately before the mutation. And let t be the time at which the mutation occurred. Then, r S x; ω↦ω 0 ;s; tÞ ð is the rate of the substitution, M S x; ω↦ω 0 ð Þ. It must be zero unless ω is at the x-th site ofs . r I x; l; δω → 0 l ½ ;s; tÞ is the rate of the insertion,M I x; l ð Þ (accompanied byF x; δω → 0 l ½ ). And r D À x B; x E;s ; tÞ is the rate of the deletion,M D À x B; x E Þ . Using these mutation rates that accompany the mutation operators, we can define our evolutionary model in a manner closer to the standard, by defining the instantaneous transition rate operator (or the "rate operator" for short), which is an analog of the instantaneous transition rate matrix in a continuoustime Markov model. Because the state space we are working in,S II , is essentially infinite, we cannot give the explicit matrix expression of the rate operator on the entire state space. Nevertheless, the rate operator can be defined if we give its action on every state inS II . LetQ SID t ð Þ denote our rate operator (at time t). It is convenient to decompose it as follows: whereQ m t ð Þ with m = S, I and D, respectively, are the substitution, insertion, and deletion components of the rate operator. Each of the three components can be further decomposed as: Here, the "mutation part",Q m M t ð Þ , describes the transitions to different states via mutations of type m (=S, I or D). And the "exit rate part",Q m X t ð Þ, attenuates the state retention probability at the exit rate, R m Xs ; tÞ ð , which is determined by the type m mutations. It guarantees that the state probabilities sum up to unity at any time. Specifically, the mutation parts are defined by the following actions on every state: ðR3:5Þ Here, Ls ð Þ denotes the length (i.e., the number of sites) ofs, and ω xs ð Þ denotes the residue at the x-th site ofs. (R3.4) represent insertions at the left and right ends, respectively, of the sequence. How to deal with such insertions varies depending on various factors including the model setting, and can be implemented by adjusting the insertion rates accordingly (see, e.g., [21,26]). The terms with x B < 1 or x E > Ls ð Þ in Eq. (R3.5) represent the deletions of the subsequences sticking out of the subject sequence. These terms were included because the subject sequence is regarded as embedded in a "chromosome" with a virtually infinite length (e.g., [21,26]). 9 The exit rate parts are defined in nearly the same form: They differ only in the exit rates: x B; x E;s ; tÞ: ðR3:9Þ These equations, Eqs. (R3.1-R3.9), cooperatively define the instantaneous transition rate operator,Q SID t ð Þ, and thus define our evolutionary model. If desired,Q SID t ð Þ could be decomposed as: is the collection of all mutational transition terms, and Q SID X t ð Þ≡Q S X t ð Þ þQ I X t ð Þ þQ D X t ð Þ is the entire exit rate part, which attenuates the state retention probability at the total exit rate, R SID Actually, the total mutational part,Q SID M t ð Þ, is equivalent to (the extended version of) the "instantaneous rate matrix" of the general SID model (defined by Eq. (1) in [21]), although the two expressions appear quite different from each other. The difference mainly stems from the parametrization and the state representation. We use our parametrization because we believe it to clarify the meaning of each term in the rate operator. And, in this section, the sequence state after the mutation (say, 〈s 0 j) was represented as the result of the mutation operator (say,M m … ð Þ) acting on the state before the mutation (say,s h j), like 〈s 0 j ¼s h jM m … ð Þ. This is legitimate because a mutation transfers a subject state uniquely to another state. This state representation will facilitate the unfolding of our theory below.
Thus far, we included substitutions (and residue states) mainly in order to discuss the relationship between our evolutionary model and the general SID model. However, our main interest here is in calculating the probability of the skeleton of a sequence alignment, composed only of the sites and gaps, which will then be filled in with residues to form a full alignment. Because such a skeleton can be created only through an evolutionary process of indels, we will hereafter omit the description of substitutions and the resulting changes in the residue state (ω → ). (This includes omitting the "fill-in" operators, F x; δω → 0 l ½ 's.) (Henceforth, the alignment skeleton will be called the "alignment" for short.) But we will retain the basic sequence state consisting of ancestry indices ( s ¼υ → ). The rate heterogeneity will be realized only through the ancestry dependence. This would be almost sufficient for representing the dependence of the rates on, e.g., the epigenetic or 3D structural context (e.g., [25]). And this may also be able to approximate the dependence on some residue state contexts, especially in highly conserved regions. Now, the rate operator defined by Eqs. (R3.1-R3.9) is reduced as follows. (The reduced total rate operator will be denoted asQ ID t ð Þ).  where the ω → -dependence of the right hand side was omitted. Now we can calculate the operator that gives the finite-time transition probabilities between states. We will call it the "finite-time transition operator". LetP ID t; t 0 À Á be such an operator describing the state transition via indels alone from an initial time, t, to a final time, t 0 (> t). OperatorP ID t; t 0 À Á is defined to give the following equations: On the right hand side, P[(s 0 , t 0 )|(s, t)] is the probability that the sequence is in state s 0 at time t 0 conditioned on that it was in state s at time t. On the left hand side, |s 0 〉 denotes a ket-vector (an abstract extension of a column vector), whose exclusive role here is to give "innerproducts" with bra-vectors: 〈s| s 0 〉 = 1 (if s = s 0 ), = 0 (otherwise). From the evolutionary principle that our model must satisfy, or equivalently, from the fundamental properties (such as the Chapman-Kolmogorov equation) of the continuous-time Markov model, the finite-time transition operator must be given by the multiplicative accumulation of the effects of the rate operators along the time axis. Thus,P ID t; t 0 À Á is formally calculated as: ðR3:18Þ In the middle of the equation, Î is the identity operator (i.e., 〈s|Î = 〈s| for every s ∈ S II ), and t On the rightmost side, T{…} denotes the meta-operator that rearranges the operators in each operator product term in the temporal order so that the earliest operator comes leftmost. Another way to giveP ID t; t 0 À Á is through the first-order time-differential equation. Again, from the fundamental properties of the continuous-time Markov model, or equivalently, from Eq. (R3.18), we can show that the operator satisfies the following differential equations:

R4. Perturbation expansion of finite-time transition operator and pairwise alignment probability: brief description
In time-dependent perturbation theory of quantum mechanics (e.g., [29,30]), the instantaneous time evolution operator (Hamiltonian Ĥ(t)) is considered as a sum of two operators,Ĥ t ð Þ ¼Ĥ 0 t ð Þ þV t ð Þ, and the time evolution of the system is described as if the system mostly evolves according to the well-solvable instantaneous time-evolution operator (Ĥ 0 (t)) and is occasionally perturbed by the "interaction" operator (V t ð Þ ). We adapt such a technique of time-dependent perturbation expansion to our evolutionary model. Here, we briefly describe the results. For their detailed derivations, see Supplementary methods SM-1 in Additional file 1. We first re-express our rate operator as: HereQ ID 0 t ð Þ≡Q I X t ð Þ þQ D X t ð Þ describes the mutationfree evolution, andQ ID M t ð Þ≡Q I M t ð Þ þQ D M t ð Þ describes the single-mutation transition between states. From the reduced form of Eq. (R3.6), we get: . Similarly, the backward equation (Eq. (R3.20)) accompanied by Eq. (R3.21) is equivalent to another crucial integral equation: ðR4:5Þ Now, to formally solve Eq. (R4.4), we assume that the solution can be expanded as: À Á is the collection of terms containing N indel operators each. Substituting this expansion into Eq. (R4.4) and performing some formal calculations, we get the final form of the ab initio solution we desire: Here, Η ID (N; s 0 ) denotes the space of all possible histories of N indels each beginning with the sequence state, s 0 . And is the probability that an N -event indel history, ½M 1 ; M 2 ; ⋯;M N (withM ν (ν = 1, 2, …, N) being the ν-th event), occurred during time interval [t I , t F ], given an initial sequence state (s 0 ) at time t I . The rate, rðM ν ; s ν−1 ; τ ν Þ , is r I (x, l; s ν − 1 , τ ν ) ifM ν ¼M I x; l ð Þ, and it is It should be noted that Η ID (N = 0; s 0 )≡{(s 0 , [])} consists only of the history with zero indel, [ ], whose conditional probability is: supplemented with Eq. (R4.7) is also the solution of Eq. (R4.5). (Mathematically, Eq. (R4.7) is a multipletime integral over all possible timing, whose integrand is the probability density of an evolutionary process of N indels with particular timing, (τ 1 , τ 2 , …, τ N )). Equation (R4.6) states that the finite-time transition operator (acting on 〈s 0 |) is the collection of the effects of all possible indel histories starting with s 0 , each weighted by its probability (Eq. (R4.7)). Thus, it mathematically underpins Gillespie's [34] famous stochastic simulation algorithm, which provides the basis of genuine molecular evolution simulators (e.g., [26][27][28]). Our derivation of Eq. (R4.6) and Eq. (R4.7) through the integral equation (Eq. (R4.4) or Eq. (R4.5)) bridges Gillespie's own intuitive reasoning and Feller's [35] mathematically rigorous proof of the solution. Now, substitute an "ancestral" sequence state, s A (∈S II ), for s 0 in Eq. (R4.6), and take the inner product between it and the ket-vector, |s D 〉, of a "descendant" sequence state, s D (∈S II ). Comparing the two sequence states in S II naturally gives a PWA, α(s A , s D ) (e.g., Eq. (R2.1)). 10 Hence, the summation of s A P ID ðt I ; t F Þ s D j i's over all "equivalent" s D 's providing the same α(s A , s D ) must be P[(α(s A , s D ), [t I , t F ])|(s A , t I )], which is the probability that α(s A , s D ) results from sequence evolution during [t I , t F ], given s A at t I . Similarly to the derivation of Eq. (R4.6), we obtain its formal expression as: , of our evolutionary model. Thus, they are the "ab initio probability" of the PWA. In the following, we will examine its factorability.

R5. Local history-set equivalence class of indel histories
Before advancing to the factorability of general PWA probabilities, we will introduce an essential concept here. For this purpose, we first consider the very simple PWA, Eq. (R2.1), as an example. (Here we make the substitutions, s A ¼ υ  (Fig. 5b). Thus, these two indel histories result in the same final state: s F h j ¼ s I h jM D 2; 4 ð ÞM I 3; 3 ð Þ ¼ s I h jM I 6; 3 ð ÞM D 2; 4 ð Þ (Fig. 5, panels a and b). In other words, the two different successive actions of two indel operators have the same effect on the sequence states (in space S II ). This fact will be phrased as "the two products of the operators are equivalent", and represented by the relationship: This "binary equivalence" can be generalized to the following relationships between two indel events separated at least by a PAS: ðR5:2bÞ

ðR5:2dÞ
Here, l 0 ≡min{x E − x B + 1, x E } in Eq. (R5.2c), and l 2 0 ≡min{x E2 − x B2 + 1, x E2 } in Eq. (R5.2d). If you will, these equivalence relations could be phrased as follows. "The operator representing the event on the left along the sequence will not change whether it comes first or second. The operator representing the event on the right will shift its operational position to the left/right by the number of sites deleted/inserted before its operation, when it comes second". 11 Now, we will extend the binary equivalence relations, Eqs. (R5.2a-R5.2d), to the equivalence relations among more general complex indel histories, each consisting of more than two indel events. Let us consider a global history of N indel events, ½M 1 ;M 2 ; …;M N , which begins with an "ancestral state", s A (∈S II ), and ends with a "descendant state", s D (∈S II ). Obviously, the two states must satisfy the equation: Given an indel history, we can identify PASs unambiguously. Suppose that such PASs separate the indel events, M ν (ν = 1, 2, …, N) in ½M 1 ;M 2 ; …;M N , into K local subsets of indels, each of which is confined either between a pair of PASs or between a PAS and an end of the resulting PWA. Number the K local subsets as k = 1, 2, …, K from left to right, and let N k be the number of indel events in the k th local subset. Naturally, ∑ k = 1 K N k = N. And let M 0 k; i k ½ be the element ofM ν È É ν¼1;2;…;N representing the i k th event (in the temporal order) in the k th local subset (i k = 1, 2, …, N k ; k = 1, 2, …, K). (The prime here indicates that the operator is equivalent to the prime-less version). Then, repeatedly applying the binary equivalence relations, Eqs. (R5.2a-R5.2d), between the indel operators belonging to different local subsets, we can move the operators around in the product in Eq. (R5.3) and get the following equation: ). Therefore, it should be equivalent to ½M 1 ;M 2 ; …;M N in this sense. Hence, we can define a particular equivalence class, which is the set of all global indel histories that can be "decomposed" into the identical set of local indel histories, such as Eq. (R5.4), only through a series of Eqs. (R5.2a-R5.2d), between indel operators separated by at least a PAS. We will call it the "local-history-set (LHS) equivalence class". In the equivalence class defined by a local history set

LHS-equivalent global indel histories begin-
ning with s A . Each of the global histories corresponds to a way of reordering N indel events while retaining the relative temporal order among N k events within the k th local indel history (for every k = 1, …, K).
In the simplest example at the beginning of this section (Eq. (R5.1) and above), the corresponding LHS is: The LHS consists of two local histories, each of which is a single-indel history (Fig. 5c). As a slightly more complex example, consider the history, Here the "factorability" means that the PWA probability can be re-expressed as the product of an overall factor and contributions from local regions. Natural candidates for the local regions would be those in between the PASs, because we know that indels never hit or pierced PASs (if the alignment is correct). We are not interested in trivial factorability. Thus, we only consider PWAs (or global histories) each of which requires at least two local indel histories. In the following, we will only briefly describe our demonstration of the PWA probability factorization. Its more detailed yet rather intuitive description is given in Supplementary methods SM-2 in Additional file 1. (It is complemented by mathematically rigorous arguments in Supplementary appendix SA-2 in Additional file 2). We first notice that each component probability, given by Eq. (R4.7), will not be factorable. This is because its domain of multiple-time integration is not a direct product. So, lets consider the total probability of a LHS equivalence class, ðR6:1Þ We can show that this probability can be factorized as: if the following two conditions are satisfied. Condition (i): The rate of an indel event ( rðM ν ; s ν−1 ; τ ν Þ ) is independent of the portion of the sequence state (s ν − 1 ) outside of the region of the local history the event (M ν ) belongs to.
Condition (ii): The increment of the exit rate due to an indel event (δR X ID (s ν , is independent of the portion of the sequence state (s ν − 1 ) outside of the region of the local history the event (M ν ) belongs to.
See Supplementary appendix SA-2 in Additional file 2 for the derivation of the mathematically rigorous version of this set of conditions. (For illustration, in Supplementary methods SM-3 in Additional file 1, the factorability of the probability was examined for the simplest concrete LHS equivalence class (in Fig. 5)). Condition (i) is somewhat similar to the "context-independence" condition imposed on the "long indel" model [21], though our condition is slightly less restrictive. Condition (ii) has never been found or even discussed thus far. In fact, the "long indel" model trivially satisfies this condition (see subsection R8-1), thus [21] did not need to pay attention to it. However, this condition is not always satisfied. Indeed, as exemplified in subsection R8-2, some models have non-factorable alignment probabilities due to the violation of this condition even though condition (i) is satisfied. Each ðR6:5Þ whereΛ ID αðs A ; s D Þ Â Ã is the set of all LHSs consistent with α(s A , s D ). Hence, the PWA probability, Eq. (R4.9), can be rewritten as: We are considering all indel histories, including nonparsimonious ones, that can yield α(s A , s D ). Thus, the LHSs belonging toΛ ID αðs A ; s D Þ Â Ã may consist of different numbers of local histories. We will choose the maximum possible set of PASs in the given PWA, which separates the PWA into the finest potentially localhistory-accommodating regions, denoted as γ 1 ; γ 2 ; …; γ κ max . (κ max is uniquely determined by the PWA and the evolutionary model). 12 Then, we can represent Figure S1 in Additional file 1). Substituting Eqs. (R6.2,R6.4) into Eq. (R6.6), and exploiting the vector representation of the LHSs, we can reach the desired expression:

ðR6:7Þ
Here,Λ ID γ κ ; αðs A ; s D Þ Â Ã denotes the set of local indel histories that can give rise to the sub-PWA of α(s A , s D ) confined in γ κ , and the multiplication factor, ) and contributions from regions accommodating local indel histories (μ PΛ ID γ κ ; Â À Â αðs A ; s D Þ; ½t I ; t F Þ ðs A ; t I Þ Ã 's). Therefore, the set of conditions, (i) and (ii), is sufficient for the factorability of the PWA probability. At present, we are not sure whether the set of conditions is also necessary or not. This may not be the case in the rigorous sense, and there may be some instances with factorable PWA probabilities despite the violation of condition (i) or (ii). Nevertheless, even if there are, we suspect that such cases should be isolated, requiring intricate cancellations of the terms. Thus, we will refer to the conditions (i) and (ii) as the "sufficient and nearly necessary set of conditions" for factorable PWA probabilities.

R7. Factorability of multiple sequence alignment probability: brief description
Thus far, we only examined the probability of a given PWA, conditioned on an ancestral state at initial time. Actually, once we know how to calculate such conditional PWA probabilities, we can build them up along the phylogenetic tree to calculate the probability of a given MSA, as described in the introductions of [13] and [14]. (See also [36] for an essentially equivalent method that appears different.) Here, we basically follow their procedures. However, it should be stressed that the MSA probability here will be calculated ab initio under a genuine evolutionary stochastic model and not under a HMM or a transducer, which is not necessarily evolutionarily consistent. This section briefly explains the derivation of the factorization of an ab initio MSA probability. For details on the derivation, see Supplementary methods SM-4 in Additional file 1.
In this section, we formally calculate the ab initio probability of a MSA given a rooted phylogenetic tree, T = ({n} T , {b} T ), where {n} T is the set of all nodes of the tree, and {b} T is the set of all branches of the tree. We decompose the set of all nodes as: {n} T = Ν IN (T) + Ν X (T), where Ν IN (T) is the set of all internal nodes and Ν X T ð Þ ¼ n 1 ; …; ; n N X f g is the set of all external nodes. (The N X ≡|Ν X (T)| is the number of external nodes.) The root node plays an important role and will be denoted as n Root . Because the tree is rooted, each branch b is directed. Thus, let n A (b) denote the "ancestral node" on the upstream end of b, and let n D (b) denote the "descendant node" on the downstream end of b. Let s(n) ∈ S II be a sequence state at the node n ∈ {n} T . Especially, we use abbreviations: s A (b)≡s(n A (b)) ∈ S II and s D (b)≡s(n D (b)) ∈ S II . Finally, as mentioned in Background, we suppose that the branch lengths, {|b||b ∈ {b} T }, and the indel model parameters, {Θ ID (b)} T ≡{Θ ID (b)|b ∈ {b} T }, are all given. Note that the model parameters Θ ID (b) could depend on the branch, at least theoretically.
First, we extend the ideas proposed in [13,14,36] to each indel history along a tree, by regarding the indel history along a branch as a map (or a transformation) from the ancestral state to the descendant state, as follows. An indel history along a tree consists of indel histories along all branches of the tree that are interdependent, in the sense that the indel process of a branch b determines a sequence state s D (b) at its descendant node n D (b), on which the indel processes along its downstream branches depend. Thus, an indel history on a given root sequence state s Root = s(n Root ) ∈ S II automatically determines the sequence states at all nodes, {s(n) ∈ S II for ∀ n ∈ {n} T }. LetΗ ID s 0 ð Þ≡∪ ∞ N¼0 Η ID N; s 0 ð Þ (with Η ID (N; s 0 ) defined below Eq. (R4.6)) be the set of all indel histories along a time axis (or a branch) starting with state s 0 . Then, each indel history, , along tree T and starting with s Root can be specifically expressed as:

ðR7:1Þ
Here, the symbol,M ν b ð Þ, denotes the ν th event in the indel history along branch b ∈ {b} T . The probability of the indel history, Eq. (R7.1), can be calculated as: Here, the probability of an indel history,M ⇀ b ð Þ 1 M 1 b ð Þ; :::;M NðbÞ b ð Þ Â Ã ∈Η ID s A b ð Þ À Á , along branch b ∈ {b} T is given by the probability during the corresponding time interval, [t(n A (b)), t(n D (b))]: ðR7:3Þ Here we explicitly showed the branch-dependence of the model parameters.
Now, consider a MSA, α s 1 ; s 2 ; …; s N X ½ , among the sequences at the external nodes, s i = s(n i ) ∈ S II (n i ∈ Ν X (T)). (Remember that the term "MSA" here means its homology structure, as noted in endnote (10)). Let be a pair of a root state and an indel history along T starting with the state. And letΨ ID α½s 1 ; s 2 ; …; s N X ½ ; T be the set of all such pairs defined on T consistent with α½s 1 ; s 2 ; …; s N X . Then, analogously to Eq. (R4.9) supplemented with Eq. (R4.7) for a PWA, the probability of a given MSA under a given model setting (including T) should be expressed as: denote a set of ancestral states at all internal nodes (, or, more precisely, its equivalence class in the sense of endnote (8)). To be consistent with a given MSA, the ancestral states must satisfy the "phylogenetic correctness" condition in each MSA column (e.g., [37,38]). 13  Here, P α½s 1 ; s 2 ; …; s N X ½ ; s n ð Þ f g Ν IN T j is the probability of simultaneously getting α½s 1 ; s 2 ; …; s N X and s n ð Þ f g Ν IN . This probability is the sum of contributions from all indel histories sharing the same homology structure among sequence states at all nodes. Especially, the sequence states at internal nodes have homology structures (with the states at other nodes) fixed for respective nodes. Thus, through some reasoning (explained in SM-4), we get:

ðR7:6Þ
Here, is the probability of the ancestor-descendant PWA along branch b. This Eq. (R7.6) is basically the expression proposed in [13,14], and we demonstrated in effect that their proposal also holds even with a genuine stochastic evolutionary model.  (R7.2)). In Supplementary methods SM-4, we will use the ancestral-state-based expansion (i.e., Eq. (R7.5) supplemented with Eq. (R7.6)), as was only briefly sketched at the bottom of subsection 4.2 of (ibid.). In a MSA, gapless columns play almost the same role as PASs in a PWA. Because of the aforementioned "phylogenetic correctness" condition, a gapless column indicates that no indel event hit or pierced the site. Therefore, gapless columns will partition a MSA into regions each of which accommodates a local subset of every global history. Analogously to the argument above Eq. (R6.7), let C 1 ; C 2 ; …; C Κ max be the maximum possible set of such regions determined by a given MSA and a model setting (including the tree) ( Figure S2 in Additional file 1). (As argued in subsection R8-3, all gapless columns are not necessarily needed to delimit the regions.) Because the summation in Eq. (R7.5) involves the summation over all MSA-consistent root states, it would be convenient to specify a "reference" root state, s 0

Root
. It can be anything, as long as it is the state at the root consistent with α½s 1 ; s 2 ; …; s N X . Technically, one good candidate for s 0 Root would be a root state obtained by applying the Dollo parsimony principle [39] to each column of the MSA, because it is arguably the most readily available state that satisfies the phylogenetic correctness condition along the entire MSA. Then, we will impose the following condition.

Condition (iii):
μ P ½s Root ; s Root 0 ; n Root ; C Κ : ðR7:8Þ Here the multiplication factor, μ P [s Root , s 0 Root , n Root ; C Κ ], represents the change in the state probability at the root due to the difference between s Root and s 0 Root within C Κ . This equation holds, e.g., when P[(s Root , n Root )] is a geometric distribution or a uniform distribution of the root sequence length. 14 Under the conditions (i), (ii) and (iii), through a series of formal calculations and reasoning, Eq. (R7.5) supplemented with Eq. (R7.6) can be re-expressed into the final factorized form: Here, is the probability of having state s 0 Root that has been intact all across tree T, and⌣ Μ P α½s 1 ; s 2 ; …; s N X ½ ; s Root 0 ; C Κ T j is the multiplication factor contributed from all local indel histories (along T) confined in C Κ . 15 Briefly, the multiplication factor is the summation of terms over all possible sets of the MSA-consistent ancestral states in C Κ . And each of the terms is the product of the local PWA multiplication factors (Eq. R6.8) confined in C Κ ( Figure S2 in Additional file 1), the exponential of minus the summation over all b's of the time-integrated exit rate differences between s A (b) and s 0 Root coming from C Κ , and μ P [s Root , s 0 Root , n Root ; C Κ ] for the root state probability. (For the factor's exact expression and the detailed derivation of Eq. (R7.9), see Supplementary methods SM-4 in Additional file 1).

R8. Examples: Models with factorable/non-factorable alignment probabilities
A merit of conditions (i) and (ii) given in section R6 is that they can draw the line between evolutionary models with factorable PWA probabilities and those with nonfactorable ones. To illustrate the use of these conditions, we here give three examples: (1) a simple model with factorable probabilities, (2) a simple model with non-

R8-1. Totally space-homogeneous model
The simplest conceivable indel models would be those whose indel rate parameters are space-homogeneous, i.e., independent of the positions where the indels occur: r I x; l 1 ; s; t ð Þ¼g I l 1 ; t ð Þ; ðR8 À 1:1Þ In fully space-homogeneous models, these equations hold for 1 ≤ x ≤ L(s) − 1, 1 ≤ l 1 ≤ L I CO , 1 ≤ l 2 ≤ L D CO , and 2 − l 2 ≤ x B ≤ L(s), where L I CO and L D CO are the "cut-off lengths" for insertions and deletions, respectively. (Depending on the model, r I (0, l; s, t) = g I;L (l, t) and r I (L(s), l; s, t) = g I;R (l, t) could differ from g I (l, t) in Eq. (R8-1.1)). In fact, these conditions were imposed in nearly all continuoustime Markov models of indels that were studied in the past. Note that the rate parameters in Eqs. (R8-1.1,R8-1.2) could depend on time, although most studies thus far assumed that the rates are time-independent as well. Eqs. (R8-1.1,R8-1.2) automatically guarantees condition (i). Thus, all we have to do is check whether or not condition (ii) is also satisfied. Indeed, we can show it is. The exit rate of this model is calculated by substituting Eqs. (R8-1.1,R8-1.2) into Eq. (R4.3) (supplemented with Eqs. (R3.14,R3.15)), and we find that it is an affine function of the sequence length (L(s)): ðR8 À 1:4Þ Here δLM ν À Á is the length change caused by the event, M ν . The rightmost hand side of this equation depends only onM ν and the time it occurred, but not on the sequence state (s(ν − 1)). Thus, condition (ii) is always satisfied under fully space-homogenous models, which means that alignment probabilities calculated ab initio (as in section R4) under such models are factorable, as shown in section R6.
An important special case of the space-homogeneous model is the model used by Dawg [26], whose indel rate Thus, in this case, condition (ii) will not be satisfied in general, whereas condition (i) is satisfied. This case gives the simplest example of a model with non-factorable PWA probabilities despite space-homogeneous rates of indels (as long as they are allowed). In a model with dissection of the ab initio calculation of alignment probabilities under genuine stochastic evolutionary models. Another merit of this formulation is that it gives intuitively clear pictures. For example, the insertion and deletion operators simply mathematically represent the intuitive pictures of the indels naturally acting on sequences (Fig. 3). Thus, the action of the rate operator, given by Eqs. (R3.6-R3.10) (or Eqs. (R3.11-R3.15)), is intuitively understandable as merely the collection of all possible single-mutational channels from a given sequence state (and some compensating terms). Then, the expansion formula for the action of the finite-time transition operator, Eqs. (R4.6,R4.7), can also be intuitively interpreted as the collection of contributions from all possible mutational processes starting with an initial sequence state. Importantly, this expansion was not posed via a hand-waving argument but rigorously derived as the solution of the defining equations of the model (Eqs. (R3.19-R3.21)), which justifies its ab initio status. And the integral equations, Eqs. (R4.4,R4.5), bridged the expansion's mathematically rigorous and intuitive aspects. Finally, the binary equivalence relations, Eqs. (R5.2a-R5.2d) (e.g., Fig. 5), and their resulting LHS equivalence classes also allow intuitive interpretations as the invariance of the local effects of indels under their relative orders with spatially separate events. Therefore, the conditions for the factorability of PWA probabilities are also intuitively understandable. Although their mathematically rigorous derivations (in Supplementary appendix SA-2 in Additional file 2) might appear somewhat formidable, they are actually not so difficult once the aforementioned intuitive pictures are understood. Hence, by coupling the mathematical preciseness with the intuitive clarity, our theoretical formulation is expected to facilitate further advances of the study of ab initio alignment probabilities under genuine stochastic evolutionary models with some biological realism.
For clarity, this study focused only on indels among various mutational types, because indels are essential for creating a sequence alignment. If desired, however, our theoretical formulation could also incorporate substitutions ( [31]; see also [42]). Moreover, the formulation could also be extended to incorporate other genome rearrangements, such as duplications and inversions. (See [31] for an initial, rudimentary attempt.) Such an extended formulation will provide tools to enable concrete analyses of "rate grammars" extended to incorporate genome rearrangements (briefly mentioned in [21]).
The practical use of our formulation depends on how efficiently it can calculate quite accurate alignment probabilities. Although the factorability of alignment probabilities will help greatly speed up the computation, the contribution from each local region (e.g., Eq. (R6.8) or Eq. (SM-4.22) in Additional file 1) is still composed of infinitely many terms. Good news is that the first approximation of each local contribution, which is the summation of the terms from parsimonious local indel histories alone, is quite accurate, as long as the gap lengths and the branch lengths are at most moderate (Ezawa, unpublished; draft manuscripts: [40,41]). Thus, considerably efficient computation of considerably accurate alignment probabilities may be possible based on our formulation. Especially, at least when the model is spatiotemporally homogeneous, our ab initio calculation was shown to be equivalent to the calculation of [21], with a caveat (see subsection R8-1). Thus, their dynamic programming (DP) may be applicable, possibly with some modifications, to a wider class of models with factorable probabilities.
Despite these favorable aspects, our theoretical formulation still has some limitations and outstanding issues. First, we did not examine whether our "sufficient and nearly necessary" set of conditions for the alignment probability factorization is exactly necessary and sufficient or not. Nor did we provide any counterexamples that violate our set of conditions and still have factorable alignment probabilities. Solving these problems may be interesting at least mathematically.
Second, although in this study we tentatively used the set of positive integers to represent the space of ancestry indices (ϒ), it is obviously not the best choice. Finding, or establishing, a mathematical entity (either a set or a space) that is more suitable for representing ϒ should be another mathematically interesting issue.
Third, in section R8 (and in section 5 of [32]), we only considered simple boundary conditions. Each sequence end was either freely mutable or flanked by a biologically essential region that allows no indels. These boundary conditions may remain good approximations if the subject sequences were extracted from well-characterized genomic regions. In real sequence analyses, however, the ends of the aligned sequences are often determined by artificial factors, such as the methods to sequence the genome, detect sequence homology, and annotate the sequences. Moreover, the constant cutoff lengths (L I CO and L D CO ) were introduced just for the sake of simplicity, to broadly take account of the effects of various factors that suppress very long indels (such as selection, chromosome size, genome stability, etc.). In reality, it is much more likely that the cutoff lengths will vary across regions. Then, the alignment probabilities would be only approximately factorable, as in subsection R8-2. In order to pursue further biological realism and to enable more accurate sequence analyses, it should be inevitable to address these issues seriously. Eq. (R8-2.4) may be useful for such studies.
Fourth, we strongly caution the readers that, at this point, a naïve application of our formulation or its algorithmic implementation [41] to a reconstructed MSA is fraught with high risks of incorrect predictions of indel histories, etc. This is because reconstructed MSAs, even if they were reconstructed via state-of-the-art aligners (reviewed, e.g., in [43]), are known to be considerably erroneous (e.g., [42,44,45]). Thus, it should be preferable to first develop a method or a program that accurately assesses and rectifies alignment errors, preferably by estimating the distribution of fairly likely alternative MSAs (as, e.g., in [16,46,47]), before using our formulation to make some evolutionary or biological predictions.
Fifth, in this study, the phylogenetic tree was regarded as given. In many cases, however, the phylogenetic trees must also be inferred from the input sequence data. A theoretically ideal way would be to infer the joint distribution of MSAs and phylogenetic trees, as it is expected to minimize possible prediction biases (e.g., [13,[48][49][50]). A major problem is that such an analysis would be tremendously time-consuming in general. At present, it is a totally open question whether our formulation can be adapted to infer a quite accurate joint distribution efficiently enough.

Conclusions
To the best of our knowledge, this is the first study to theoretically dissect the ab initio calculation of alignment probabilities under a genuine stochastic evolutionary model, which describes the evolution of an entire sequence via insertions and deletions (indels) along the time axis. The model handled here extends the previously most general evolutionary model, i.e., the general form of the "substitution/inserton/deletion models" [21]. It should be noted that we did not impose any unnatural restrictions such as the prohibition of overlapping indels. Nor did we make the pre-proof assumption that the probability is factorable into the product of column-tocolumn transition probabilities or block-wise contributions. The essential tool introduced in this study was the operator representation of indels. This enabled us to shift the focus from the trajectory of sequence states (as in [21]) to the series of indel events, and to define local-history-set (LHS) equivalence classes of indel histories. Moreover, the operator representation also facilitated the adaptation of the time-dependent perturbation expansion (e.g., [29,30,33]), which enabled us to express the probability of an alignment as a summation of probabilities over all alignment-consistent indel histories. Then, under a most general set of indel rate parameters, we exploited the LHS equivalence classes and found a "sufficient and nearly necessary" set of conditions on the indel rate parameters and exit rates under which the ab initio alignment probabilities can be factorized to provide a sort of generalized HMMs. We also showed that quite a wide variety of indel models could satisfy this set of conditions. Such models include not only the "long indel" model [21] and the indel model of a genuine molecular evolution simulator, Dawg [26], but also some sorts of models with rate variation across regions. Moreover, we explicitly showed (in Supplementary Appendix SA-3 in Additional file 2) that, as far as each LHS equivalence class is concerned, the probability calculated via the method of [21] is equivalent to that calculated via our ab initio formulation, at least under their spatiotemporally homogeneous indel model.
To summarize, by depending purely on the first principle and by providing intuitively clear pictures, this study established firm theoretical grounds that will help further advance the ab initio calculation of alignment probabilities under genuine stochastic evolutionary models with some biological realism. And our theoretical formulation will also provide other indel probabilistic models with a sound reference point, provided that there exist approximate methods that can quite accurately estimate the ab initio alignment probabilities fairly efficiently. Such approximate methods will be the subject of a related study (Ezawa, unpublished; draft manuscripts available [40,41]).

Methods
Methodological details in this study are described in Supplementary methods in Additional file 1, or in Supplementary appendix in Additional file 2.

Endnotes
1 In a sense, the models implemented in the genuine sequence evolution simulators (e.g., [26][27][28]) could also be considered as special cases of this evolutionary model.