Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

Background Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this crucial step, it is essential to better characterize errors that state-of-the-art aligners typically make. For this purpose, we here introduce two tools: the complete-likelihood score and the position-shift map. Results The logarithm of the total probability of a MSA under a stochastic model of sequence evolution along a time axis via substitutions, insertions and deletions (called the “complete-likelihood score” here) can serve as an ideal score of the MSA. A position-shift map, which maps the difference in each residue’s position between two MSAs onto one of them, can clearly visualize where and how MSA errors occurred and help disentangle composite errors. To characterize MSA errors using these tools, we constructed three sets of simulated MSAs of selectively neutral mammalian DNA sequences, with small, moderate and large divergences, under a stochastic evolutionary model with an empirically common power-law insertion/deletion length distribution. Then, we reconstructed MSAs using MAFFT and Prank as representative state-of-the-art single-optimum-search aligners. About 40–99% of the hundreds of thousands of gapped segments were involved in alignment errors. In a substantial fraction, from about 1/4 to over 3/4, of erroneously reconstructed segments, reconstructed MSAs by each aligner showed complete-likelihood scores not lower than those of the true MSAs. Out of the remaining errors, a majority by an iterative option of MAFFT showed discrepancies between the aligner-specific score and the complete-likelihood score, and a majority by Prank seemed due to inadequate exploration of the MSA space. Analyses by position-shift maps indicated that true MSAs are in considerable neighborhoods of reconstructed MSAs in about 80–99% of the erroneous segments for small and moderate divergences, but in only a minority for large divergences. Conclusions The results of this study suggest that measures to further improve the accuracy of reconstructed MSAs would substantially differ depending on the types of aligners. They also re-emphasize the importance of obtaining a probability distribution of fairly likely MSAs, instead of just searching for a single optimum MSA. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0945-5) contains supplementary material, which is available to authorized users.

Additional file 1 of "Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map," by Kiyoshi Ezawa Supplementary methods

SM-1. Complete-likelihood score: definition and "indel-substitution factorization"
This section provides one of the key results in this study, that is, the formal proof that the complete-likelihood score can be calculated as the summation of the indel component and the substitution component, under genuine stochastic evolutionary models. Briefly, the proof proceeds as follows: (I) Express the occurrence probability of a given MSA as the summation of probabilities over all possible evolutionary processes that can result in the MSA; (II) divide the set of all evolutionary processes into subsets, each of which consists of processes sharing the same indel process; (III) show that the total probability of each subset over all consistent substitution processes is independent of the shared indel process, as long as it is evolutionarily consistent with the MSA; (IV) by factoring out the indel-independent total probability over substitution processes, express the probability of the MSA as the product of the total probability over substitution processes (under a fixed indel process) and the total probability over indel processes; and (V) take the logarithm of the total probability. In the following, we give details on the proof. As you will see shortly, part (III) is the main part of the proof, and it is rather long. So, it was divided into 10 paragraphs, (III-1) through (III-10). ] .) Consider also that the phylogenetic tree, T = {n} T , {b} T ( ) , of the aligned sequences is also given. (Here n and b represent a node and a branch (an edge), respectively. And n R denotes the root node hereafter.) Then, consider a given genuine stochastic evolutionary model, Θ = (Θ S , Θ ID ) .
Here Θ S and Θ ID denote an evolutionary model of substitutions and that of indels, respectively, equipped with specific parameters. The occurrence probability, P  α Θ, T " # $ % , of MSA  α conditioned on this model setting, Θ , (and the tree, T ,) is interpreted as the "complete-likelihood," L Θ, T ( )  α , that is, the likelihood of the given "complete" evolutionary model, Θ (, and the given tree, T ), given the MSA,  α : (1) In this paper, we deal with a general continuous-time Markov model as the specific genuine stochastic evolutionary model, Θ .
(Preparation 2) Previously (in Appendix A3 of [73]), we proved that, if some conditions are satisfied, the above "complete-likelihood" is factorable into the product of the "basic" component and the "residue" component. Here the "basic" component is the total probability of the possible indel processes, and the "residue" component is the total probability of the possible substitution processes (including the initial residue states). Thus, we will call this factorization the "indel-substitution factorization." The proof given there was considerably long, because of two major reasons.
(1) It started with a quite general model and gradually narrowed down its scope, in order to find the conditions for the factorization. And "history" is an untimed record of a series of fixed mutation events. Here, however, we mainly consider evolutionary processes, each of which is a timed record of the series of fixed mutation events. When we refer to a "summation" over processes, we tacitly assume that it includes the multiple-time integration over the timings of the events as well. (For brevity, an evolutionary process is represented by adding a dot over the symbol of its evolutionary history counterpart. Similarly, a set of evolutionary processes is represented by adding a dot over the symbol of the corresponding set of evolutionary histories.) (I) Generalizing the proof given in sections 3.
[ ] is the set of all evolutionary processes (including the root sequence state ) along tree T that can result in MSA  α . And each evolutionary process, denoted as   ψ here, is made up of four components: Here, s R is the basic state of the root sequence,  ω R is its residue state,  ψ ID is an indel process, and  ψ S is a substitution process. These four components are not completely independent of each other.
First, because of the causality,  ψ ID and  ψ S depend on  s R , but not vice versa. Second, the substitution process depends on the indel process, because an indel process determines the presence/absence of the sites at which substitution events take place. Explicitly taking account of these dependences, the above evolutionary process could be represented as: ) . We also notice that the above set,   can be expressed as: is the "skeleton," i.e., the "basic" component, of MSA # $ is the set of all "basic" sequence states at the root ( n R ) that are consistent with the tree ( T ) and α .   Η ID α; s R ; T " # $ % is the set of all indel processes along T that start with s R at the root and that can give rise to α . And $ % is the set of all substitution processes along T that start with at the root, that are consistent with  ψ ID , and that can "flesh out" α to give rise to  α .
(II) To go further, we first formally decompose Eq.(3) as follows: Here, is the set of all pairs, each of a "basic" state at the root and an indel process along T , that can give rise to α . And is the set of all pairs, this time each of a "residue" state at the root and a substitution process along T , that are compatible with   ψ ID and that can give rise to  α by filling out α with the extant residues. Second, let us recall a basic identity in the probability theory: By using this, we decompose the summand on the right hand side of Eq.
(2), we get: This is the general equation that holds true regardless of details of the evolutionary model , and it provides the sound starting point of the following argument.

Condition (i)
: "The indel rates are independent of the residue state and the substitution process before each indel event." Under this condition, the probability (density) of an indel process becomes independent of the substitution model setting, that is: Next, we assume the following two conditions.

Condition (ii)
: "The substitution rates at each site are independent of the states (both "basic" and residue) and the evolutionary processes (both indel and substitution) at other sites." Condition (iii): "The probability of the residue state of an inserted subsequence, conditioned on the insertion, can be factorized into the product of residue probabilities over the inserted sites. And the probabilities are independent of sequence states before insertion." In an equation, the condition (iii) could be expressed as: Here p I δ  ! ω = ( ! ω x+1 ,..., ! ω x+l ); x, l; s, t I ( ) denotes the probability, conditioned on the insertion of l sites between the x th and the x +1 th sites of s at time t I , that the inserted conditioned on the insertion of the site with ancestry υ x+i ( ! s ) (, which is the x + i th site of the new sequence ! s ,) at t I , that the inserted residue is ! ω x+i . Under the conditions (ii) and (iii), we can perform the following factorization: ---Eq.(10)  (III-2) Now, consider the site-wise probability on the right hand side of Eq.(10), i.e., The conditions (ii) and (iii) guarantee that, under a fixed indel process component involving each site, the substitution processes at the site can be described by a continuous-time Markov ---Eq.(12a) Here and in Eq.(12b) below, we should consider that the summation over residue states at each node actually involves all the probabilities whose conditions or outcomes concern the states at the node. (We used this notation because there was no neater way to precisely represent the multiple summations here.) In Eq.(12a), in the tree ( T ) along which the site (with ancestry υ ) persists under the indel process and whose lower-ends are internal nodes. Β P X   ψ ID [υ], T " # $ % is the set of site-persisting branches whose lower-ends are external nodes (with extant sequences).
is the set of branches along which the site is deleted. P (ω R , n R ) υ, Θ S " # $ % is the probability that the residue state is ω R at n R , given Θ S and υ .
is the probability that the residue is ω D (b) at the lower-end (i.e., descendant node) n D (b) of branch b , given that the residue is ω A (b) at the upper-end (i.e., ancestral node) . We symbolically omitted but tacitly assumed the dependence of these conditional probabilities on Θ S and υ . It should also be kept in mind that each node has a single residue state under each substitution process, as long as the subject site existed at the node.
Here, as in Eq.(9), p I ω I ; υ, ) is the probability that the site (with ancestry υ ) has residue ω I when inserted, given that it was inserted at t I (b I ) along b I . And ! " # $ is the probability that the residue is ω D (b I ) at the lower-end n D (b I ) of b I , given that the residue was ω I when the site was inserted along b I . Again, we tacitly assumed the dependence of the conditional probabilities on Θ S and υ .
(III-5) Now, we will show that, under some additional conditions, Eqs(12a,b) are identical, and independent of the details on the indel processes (   ψ ID [υ] 's), as long as the processes are consistent with a given MSA column,  α[υ] (or, more precisely, a given "skeleton," α[υ] ). For this purpose, we recall the "phylogenetic correctness" condition (e.g., [34,74]), which has to be satisfied by any indel processes (or their resulting ancestral "basic" states) that could result in a given MSA column. In the present context, the "phylogenetic correctness" condition could be rephrased as follows.
"When a site exists at two points on the tree ( T ), it must also exist all along the path on T that connects these points." [ ] hereafter) by making the site exist along all paths connecting the external nodes holding the site, and by making the site absent from all remaining parts of the tree. The other indel histories (and processes) that satisfy the "phylogenetic correctness" condition could be constructed by continuously extending one or more paths of "site existence" from the point(s) on the "web" of paths of [ ] , while keeping the site existence/absence at the external nodes intact.
(III-6) Thus, all we have to prove is that we can retract such extended paths back to the Dollo parsimonious history while retaining the site-wise probability (Eq.(12a) or Eq.(12b)) unchanged. Broadly speaking, there are two types of extended paths: one upward, extending toward the root ( n R ); and the other downward, extending toward (but short of) external nodes where the site is absent. We first retract the downward paths, and then we retract the upward one.
(III-7) Each downward extension ends either at a point along a branch (like or at an internal node (like n D (b) ). In any case, the end of each downward extension is associated with a summation of the single conditional probabilities, like: . By the definition of the conditional probabilities, this summation is always 1 (unity). Therefore, we can indeed retract each downward extension until it becomes a point (node). The indel process resulting from all such retractions is the Dollo parsimonious process, maybe bearing an upward extension.
(III-8) The upward extension also ends either at a point along a branch (like t I (b I ) ) or at an ancestral node (like n A (b) , especially n R ). We need to separately consider the cases where the site was already present at the root (Eq.(12a)) and those where the site was inserted (Eq.(12b)). In the latter case, the extension could be retracted without any effect on the probability if the following condition is satisfied: Condition (iv): "For the inserted residue probabilities at the site with ancestry υ , } ω∈Ω, τ ∈T , we have: for all upward paths on T (, whose upper-end and lower-end are denoted as ! τ and τ , respectively)." This is because the summation over residues at an internal node ( ! n ) on an un-branched path ) could be trivially performed as: as a result of the Chapman-Kolmogorov equation. When the upward extension reaches the root ( n R ), we could retract the extension if the following equation is satisfied: Here b I 0 is the branch where the site is inserted in the Dollo parsimonious history.
Comparing Eq.(15) to Eq.(13) with the substitutions ! τ = n R and τ = n D (b I 0 ) , we see that, along with Eq.(13), the following equation is necessary and sufficient.

Condition (v):
The conditions (iv) and (v) could be rephrased in the following words.
"The inserted residue probabilities at the root, And the inserted residue probabilities along the tree, p I ω; υ, τ These conditions were automatically satisfied in most of the past evolutionary models with indels (e.g., [88][89][90][91]), because these models used time-reversible substitution models and assumed that the inserted residue probabilities are given by the equilibrium frequencies of the substitution model.
Usually, they do not appear in an input MSA. Let us consider Eq.(12a) or Eq.(12b) compatible with an indel history resulting in such a null column. After retracting all the downward extensions in such a case, we are left with a point, which is either the root ( n R ) or an insertion point ( t I (b I ) ). In the former case the site-wise probability reduces to , and in the latter case it reduces to Either way, the probability becomes 1 (unity). Thus, the "null" MSA columns give trivial probabilities due to substitutions.
(III-10) Thus, we proved that, under the conditions (iv) and (v) (i.e., under Eq.(13) and Eq.(15')), the site-wise probabilities, Eq.(12a) and Eq.(12b), do not depend on the details [ ] is the Dollo parsimonious indel history for the column α[υ] . Substituting this result into Eq.(10), we have: The right hand side of Eq.(10') depends only on  α but not on other details of   ψ ID , because the set of Dollo parsimonious histories, , is uniquely determined by α .
Recalling the conditions (ii) and (iii), we could re-interpret the right-hand side of Eq.(10') as the probability of  α , given α , Θ S and T . This also conforms to the standard calculation of the probability of a MSA via substitutions (e.g., [1,2,62]). Thus, we have: On the right hand side, we omitted α from the argument of the likelihood, because α trivially follows from  α .
(IV) Substituting Eq.(8) and Eq.(10") into Eq. (7), the complete-likelihood can be rewritten as: [ ] is the set of all indel processes (including the "basic" root states) that can give rise to α , the summation on the right hand side of Eq.(7') is actually the probability of α under given Θ ID and T : Substituting Eq.(16) into Eq.(7'), we obtain the final result: ---Eq. (V) Taking the logarithms of both sides of Eq.(7"), we see that the logarithm of the complete likelihood (called the "complete-likelihood score" here) is the summation of the logarithms of the two likelihoods.

SM-2. Pre-processing MSAs: details
In principle, MSA aligners can only reconstruct the "homology structure" of a MSA, which describes the mutual homology relationships among the residues (or sites) in the homologous sequences (e.g., [83]). They cannot predict further details concerning the indel processes that produced the sequences. Therefore, we pre-processed each of the true and reconstructed MSAs so that MSAs with the same homology structure will be represented identically.
Specifically, we first removed MSA columns consisting only of gaps. Then, we swapped two adjacent blocks, each consisting only of the columns with the same gap pattern, when they satisfied the following two conditions. (1) A block contains residues only in sequences that contain exclusively gaps in the other block. (In other words, the set of sequences with residues in one block does not overlap that in the other.) (2) Viewed in the MSA, the highest sequence with a residue in the left block is higher than that in the right block. The swapping was re-iterated until no adjacent pair of blocks was found swappable.

SM-3. Partitioning pair of MSAs into correct and erroneous segments: details
We partitioned a pair of true and reconstructed MSAs into correctly and erroneously reconstructed segments (or "correct" and "erroneous" segments for short) in a manner similar in philosophy to but slightly different from that by Here we detail how we partitioned a MSA pair. First, in each of the true and reconstructed MSAs, each residue of each sequence was assigned a number, which is the count of residues on its left along the sequence. Then, each gap character ("-") was also assigned a number, which is the average of the numbers assigned to the residues on its immediate left and right. (For example, we assign 10.5 to all gaps in a run of gaps sandwiched by the residues numbered 10 and 11.) This happens to be effectively equivalent to the "pos" recoding of gap characters [92], and to the "evol" recoding (ibid.) when dealing with When the pair turned out to be equivalent, the tentative "erroneous" segment was re-classified as "correct," and it was merged with the flanking "correct" segments.

SM-4. Partitioning position-shift map into position-shift blocks: details
As can be seen from panel C of Figures 1, and S1-S3, a position-shift map usually has a clear block structure, where each block with a position-shift is delimited by two positions along the MSA and one or more branches in the tree. There could be a number of strategies to identify the blocks. In this study, we employed a "bottom-up" strategy, in which small blocks were constructed first, and then they were merged to form larger blocks. An important thing is how to handle some common exceptions, such as gaps in a block, multiple independent yet exactly aligned blocks sharing a position-shift, nested blocks, and a block interrupted by insertions along remote branches. The detailed procedure is as follows.
Given a position-shift map, we first chopped each sequence into segments, each of which consists of contiguous residues with the same position-shift. When two segments with the same position-shift were separated by a run of gaps alone, they were merged. Second, looking across the sequences, the segments with the same position-shift and sharing the same start and end coordinates (in the MSA) were merged into a tentative block, and its boundaries on the phylogenetic tree were examined. If the set of sequences were delimited by the boundaries fewer than the sequences, the tentative block was established. Otherwise, the sequence set was split into two sub-sets at their most recent common ancestor, and the boundaries were examined again. This "split-and-examine" process continued until the boundaries become fewer than the sequences in every sub-set. Third, if three blocks with the same position-shift enclosed another block with a different position-shift, the three blocks were merged into a larger block. And, finally, if two or more neighboring blocks with different position-shifts were separated only by insertions along non-delimiting branches, and if the blocks had consistent position-shifts, they were merged into a "composite block" (like the blue and red blocks in Figure S2 D).

SM-5. MSA-specific types of errors
In addition to the error types in pairwise alignments (PWAs), namely, "shift," "merge," "split," "purge," and "ex-nihilo" (1) A "vertical merge," which has two types. In one type, two neighboring deletion-derived gaps along two sibling branches are erroneously aligned together to form a wrong gap derived from a spurious deletion along the parent branch ( Figure S6 A,B). In the other type, a correct gap was derived from an insertion along a branch, and another correct gap was derived from a neighboring deletion along one of the branch's children. Then, they are erroneously aligned together to form a wrong gap derived from a spurious insertion along the other child of the branch.
(2) A "vertical split," which is the reverse of a vertical merge. In other words, a "vertical split" becomes a "vertical merge" if the true and reconstructed MSAs swap their roles.
(3) A "collapse of independent insertions (CII)," which has two types. In one type, (4) A "creation of spurious independent insertions (CSII)," which is the reverse of a CII.
(5) An "incomplete collapse of independent insertions (iCII)," which is similar to CII, but occurs between gaps of different lengths ( Figure S6 E,F), whereas a CII occurs between equally long gaps.
(6) An "incomplete creation of spurious independent insertions (iCSII)," which is the reverse of an iCII.
Errors that do not fall into these categories were tentatively classified as "complex." (See SM-7 for more details on the "complex" errors.)

SM-6. Associating MSA error with single position-shift block
The idea underlying the procedures in this and the next sections is to compare two indel histories, one inferred from the true MSA and the other from the reconstructed MSA, and to attribute the history differences to the moves of position-shift blocks (see, e.g., Figure 5). In general, the move of a block will change the predictions on indels along its delimiting branches and/or their neighboring branches. Thus, the key is to identify such indels from the inferred histories. The following is the detailed procedure. position-shift block, we also attempted to determine a "main" branch that delimits the block.
If the block had only one delimiting branch, it became the one. If the block had two or more delimiting branches, we first excluded those that were already the "main" delimiting branches of other blocks. If only one branch remained, it was used for the "main" delimiting branch; otherwise, we suspended the examination of the block. Then, for each block in each MSA, we listed "involved" indel events, each of which is defined as an event satisfying two conditions: (1) it either overlaps or flanks the block (along the MSA); and (2) it occurred either along the main delimiting branch or along its parent, child, or sibling. An "involved" indel event in a MSA was removed from the list if it had its equivalent in the other MSA. Finally, we judged what type of MSA error is associated with the block by comparing the two lists of indel events "involved" with the block, one in the true MSA and the other in the reconstructed MSA. In the following, we will explain some specific example cases. (iii) It was regarded as a "split" in an "opposite" case from either of (ii), that is, if we observed nearly the same situation as either of the cases in (ii), with the only difference that the roles of true and reconstructed MSAs were swapped. (v) It was regarded as an "ex-nihilo" in an opposite case from (iv). (xi) It was regarded as an "incomplete creation of spurious independent insertions (iCSII)" in an opposite case from (x).
We actually defined more cases in which an error associated with a block was considered to be of a definite type. [The precise conditions for the error-type vs. block associations are implemented in the subroutine, "classify_error_assoc_w_sgl_cblk," of the prototype Perl script, "classify_msa_errors_via_mblks.alpha2.pl," in the "ComplLiMment" package (in Additional file 2).]

SM-7. Associating pair of MSA errors with pair of position-shift blocks
After the first round of analysis described in SM-6, we re-examined those position-shift blocks that could not be associated with any errors of definite types. Here we tentatively refer to such a block as a "complex block." And we refer to a block whose examination was suspended in the first round as a "suspended block." Broadly speaking, after identifying pairs of seemingly interacting blocks, we first considered two series (or "paths") of moves of two First, we noticed some cases in each of which a complex or suspended block (like the purple block in Figure 1, panel D) could be absorbed into a neighboring block (like the blue one) to form a larger block. Thus, we absorbed a complex or suspended block (called "block A" here) into another block (called "block B" here), which is either complex or associated with a definite error, if the following four conditions were fulfilled. : (a) The two blocks have the same shift. Finally, we attempted to associate each block-pair with a pair of errors of definite types (as in Figure S1) in the following manner. (Here, the components of the subject block-pair will be referred to as "block C" and "block D.") For the block-pair, we obtained two "intermediate MSAs" from the reconstructed MSA: one (denoted as "rec-C") by undoing the move of block C, and the other (denoted as "rec-D") by undoing the move of block D.
The two intermediate MSAs correspond to two different paths of block-wise moves, both starting from the reference MSA and leading to the reconstructed MSA. For example, "rec-C" could occur in a path where block D moves before the move of block C. Thus, we attempted to associate block D with a definite error by comparing the indel events inferred from the reference MSA with those from "rec-C," and to associate block C with a second definite error by comparing the events inferred from "rec-C" with those from the reconstructed MSA. Each association was attempted just as in SM-6. If successful, this associates the block-pair with a pair of definite errors. Also, if successful, the use of "rec-D" associates the block-pair with another pair of definite errors. If both attempts were successful, the simpler pair of errors was chosen. If only one attempt was successful, the resulting pair of errors was associated to the block-pair. If neither attempt was successful, the results of the first round were retained.
After these analyses were completed, some complex blocks could neither be associated with definite errors nor be incorporated into any other block. We regarded these blocks as involved in "complex" errors.

SM-8. Theoretical mean lengths of gapless and gapped segments
The the probability that a sequence of length L has experienced no indel throughout a phylogenetic tree ( T , with total branch length T ) is: From this, the probability that a particular ancestral site results in a gapless column is: And the probability that a column flanking a gapless column is also gapless is: becomes negligible. Thus, the mean length of a gapless segment is approximately given by: Among the L A sites of the ancestral sequence, L A P GLC sites are expected to give gapless columns. Thus, on average, there should be N GLS = L A P GLC L GLS (T ) ≈ L A P GLC 1− P FGLC ( ) gapless segments along the ancestral sequence. Using these quantities, we can roughly calculate the average number of ancestor-derived columns in a gapped segment as: ---Eq.(SM-8.6) In general, however, gapped segments contain insertion-derived columns as well. At present, we don't know exactly how many such columns should be. Nevertheless, by resorting to a time-reversed counterpart of the evolutionary model, the number may be approximated as follows: Here l I denotes the mean insertion length. Combining Eqs.(SM-8.6,7), we get a rough estimation of the average size of a gapped segment: ---Eq.(SM-8.8) In the limit T  0 , Eq.(SM-8.8) reduces to L GS (T )  λ D l D + λ I l I ( ) λ D + λ I ( ) , as expected.    The same note and footnotes apply as those for Table 3. The same note and footnotes apply as those for Table 4.  Table 4.  CTTT------mamm002

C. Position-shift map
- that this erroneous segment contains complex errors. Manually, however, the errors can be interpreted as "vertical split + shift + merge + shift," which were caused by the green block (with shift -1), the blue one (with shift 1), the red one (with shift -1) and the yellow one (with shift -2), respectively. See the legend of Figure 1 for details on the notation.  nodes in the tree. The most recent common ancestor of the aligned sequences is in the black dashed box at the root. In red dashed boxes, "+X," "-Y," and "Z->W" represent the insertion of subsequence X, deletion of subsequence Y, and the substitution from residue Z to residue W, respectively.