Simprot simulated sequences
Simprot's simulation parameters provide flexibility for generating alignments so that the effects of distinct factors can be examined together and/or separately under multiple evolutionary scenarios. Simulated sequences were used to investigate the influence of sequence length, indel frequency, indel length, evolutionary distance, terminal gaps length, gamma density function and tree topology on the accuracy of alignments inferred by different programs. More than 30000 alignments were created independently by Simprot using five phylogenetic trees (Figure 2) with variable lengths and different number of variable size indels, in order to cover different topological evolutionary patterns. Simprot generates a known alignment and another file containing the sequences with no indels. One hundred simulated alignments with different random seed values were created for each combination of tested parameters. All corresponding sequences were also aligned with the nine programs described above and the resulting alignments were compared to the "true" alignment generated by Simprot. The average accuracy values for the 100 alignments of each set are reported here and in some cases a Wilcoxon signed ranks test was employed in order to determine the statistical significance of the difference on average accuracy. The protein substitution matrix used in all simulations was PMB [32], which is also the program's default.
As reported previously [9], sequence length does not affect alignment accuracy of the different programs. In order to confirm this, five different root sequence lengths were employed in the analysis: 50, 100, 150, 200, 250 and 300 amino acids. These values were selected in order to get the resulting alignments in a feasible amount of time, while maintaining a significant difference in root sequence lengths. To determine the effect of amino acid substitutions and sequence length on alignment accuracy, we first kept the indel frequency and indel length very low and considered different trees with various overall evolutionary distances and with increasing root sequence lengths. The majority of the packages tested generated results ranging from good to excellent with increasing sequence length. POA presented the lowest accuracies on the performed tests and at the same time was positively affected by sequence length increase (Figure 3). POA's lower accuracy appears to be due to a tendency of the program to place large internal gaps close to the sequence terminals, while the accuracy increase in larger sequence lengths might be explained by the proportionally small influence of these terminal gaps in the alignment scoring. Apparently, the alignment of sequences with a large number of substitutions but a low number of gaps did not present a problem for any of the nine algorithms used, no matter the size of the sequence. Noticeably, Clustal W showed the steepest decline in accuracy when sequence length was increased even at very low gap frequency values (Figure 3 and 4). The program occasionally had an alignment accuracy decrease two to three times larger than the average for all other seven programs (Figure 4B), especially when indels were added to the reference alignments.
Different indel frequencies were also used in the simulations in order to test the effect of indel occurrences in alignment accuracy. Simprot's process for insertions and deletions assumes a Poisson model, where the expected frequency of indels between two sequences separated by an expected 100 PAM distance is
p = 1 - e-z/c
where z is the indel probability that is scaled by the evolutionary scale factor c. The smallest frequency p employed was the program's default value 3% and increased up to 30%. As expected, when indels were added to the simulated sequences and evolutionary distance was increased, there was an evident loss in accuracy for all programs. This corroborates results obtained by Lassmann and Sonnhammer [9], who showed that programs tended to have poorer performance as the evolutionary distance increased when indels were present. The best results were generated by ProbCons and Mafft L-INS-i. ProbCons presented better results for trees with longer evolutionary distances when intermediate to large indel frequencies were applied. Conversely, Mafft L-INS-i performed better for smaller evolutionary distances and with intermediate indel frequency values (Figure 4).
In Simprot the evolutionary distance set by the branch lengths in the input tree affects the expected number of substitutions and also affects the expected number of insertions and deletions. In order to further analyze the influence of branch lengths on alignment accuracy, we considered a single tree topology and scaled the branches (Figure 2, tree A) so that the overall tree shape was not changed (Figure 5). As shown above, all programs were negatively affected by increased evolutionary distances, particularly when the employed indel frequency parameter was high. POA had the steepest decline in accuracy, at small indel frequencies (Figure 5A).
Due to the fact that indel frequency appeared to have a large effect on MSA accuracy, we analyzed the effect of increasing the indel frequency independently of other factors (Figure 6). Our results showed that accuracy and its rate of decline with indel frequency depended on the input tree used. Trees with longer branch lengths (Figure 6A) had sharp decreases in accuracy with increasing indel frequencies. Input trees with shorter branch lengths showed smaller declines in accuracy (Figure 6B–C). For most programs, when a topology with varied branch length was used, the accuracy decrease was almost linear with increasing indel frequency. ProbCons and Mafft L-INS-i were the least affected by the increase in evolutionary distance and resulted in the best performances.
Another element that can influence the occurrence and number of indels in protein sequences is the tree topology. We considered this independently of evolutionary distance and Simprot's indel frequency by considering two trees with identical maximum evolutionary distance (Figure 2, Trees D and E) but with different topologies. Although evolution had occurred at different locations in the two tree topologies tested (tips opposed to internal nodes), this did not seem to have large influence on overall alignment accuracy for the majority of the algorithms analyzed (Figure 7), as in both cases programs obtained similar alignment scores.
Real protein sequence data often contains non-homologous terminal ends and/or incomplete sequences. We investigated the effect of large terminal gaps on alignment accuracy. A small modification in Simprot's code was necessary to include an additional probability of terminal gaps. Since no reasonable biological model exist for external gaps, we introduced an ad hoc parameter t which determines the probability and length of external gaps by scaling the probabilities for internal gaps. Five different values of the terminal gaps insertion parameter were used while keeping the internal gap frequency constant (5%) for these simulations. It was observed that the presence of terminal gaps, regardless of their length, had a minimum effect on alignment accuracy for most of the programs (Figure 8). Again in this case, Mafft L-INS-i and ProbCons were the top performers.
The analysis of influence of the indel frequency on alignment accuracy did not take into account the overall size of the simulated insertions and deletions. To test a possible effect of indel size on the programs' performances three different values (2, 3 and 4) of Simprot's c parameter were tested. This value is used by the generalized Qian and Goldstein distribution [11, 12] in indel length determination. Larger c values yield shorter indels while smaller values result in longer ones [11]. The indel frequency was kept constant. We found that the larger the indels the lower the alignment accuracy, although with a moderate difference in the final average score (Figure 9). This accuracy loss could be seen for all phylogenetic trees analyzed and for the majority of the programs. Again, Mafft L-INS-i and ProbCons seemed to fare better and were least affected by the variation in gap length.
According to Rosenberg [7], changes in the shape of the gamma distribution, of Yang's [13] distribution of evolutionary rates, influences alignment accuracy. The gamma shape models the proportion of slow to fast evolving positions and accounts for variable substitution rates among sites; the lower the α the larger the number of sites with low substitution rate. Decreasing gamma's α shape parameter positively affected the alignment accuracy (Figure 10), due to an increasing number of identical sites between sequences. Simprot allows the modification of the gamma's α shape parameter and in our study it was set at 0.1, 0.7, 1 (Simprot's default), 5 and 10. We examined changing α using different topologies and indel frequency values. The obtained results show a moderate influence of the value of gamma α at low indel frequencies, resulting in an accuracy loss for most of the programs, especially for POA. When larger indel frequencies were employed, the negative effect of increasing gamma was accentuated up to α = 5, which was reversed by a small gain in accuracy when α was increased from 5 to 10 (Figure 10). This is expected, since with α values the gamma distribution of evolutionary rates tends to be less extreme (exponential) and to have a curve shape similar to the normal distribution.
In order to deduce why there was a large influence of insertions and deletions on the programs' performance, we analyzed the average number of gaps per sequence present in the "true" alignment and in all resulting alignments (Figure 11). As mentioned above, POA had the tendency of inserting long internal gaps at the sequence terminals; this inflates the average number of gaps in the alignments that are constructed by the program. Clustal W, under default parameters, was the most conservative of the programs tested and in every case had a smaller gap number average for its alignments than other packages and the known alignment. Kalign and ProbCons were the programs with a final number of inserted gaps closest to the real alignment in the majority of the simulations. Overall, programs that use a progressive alignment with tree determination showed a smaller gap number average per sequence than the programs that do not use a guide tree (POA, Dialign2.2 and Dialign-T).
In summary our results show that it is the total number of indels independently of where in the tree they occur, and to some degree independently of the number substitutions, that had the greatest effect on alignment accuracy. Also, indel size plays a role in alignment accuracy, but to a lesser extent than indel number. Additionally, the gamma distribution of evolutionary rates generally had a negative effect on the final accuracy. Regarding program performance, ProbCons and Mafft L-INS-i achieved the best results in the majority of the simulated alignments sets. An intermediary group consisted of T-Coffee, Muscle, Mafft FFT-NS-2 and Kalign, while Clustal W, POA, Dialign-T and Dialign2.2 often produced the poorest alignment accuracy. An overall summary of alignment accuracy for each program is shown in Figure 1. With the exception of Clustal W, in scenarios of large sequence lengths and indel frequency, programs that have a tree-guided multiple alignment procedure showed better results than those that do not rely on tree determination to align protein sequences. As pointed above, programs with a tree-determination step were more conservative in inserting gaps than programs that lack this step, generally achieving better final accuracies.
BAliBASE
It was important to determine if the results obtained from the Simprot-generated sequences were applicable to alignments from actual proteins. We considered the accuracy of the nine programs on the latest version of BAliBASE alignments (Figure 1). Overall we found results similar to those obtained on the simulated sequences in that ProbCons and Mafft using strategy L-INS-i appeared to have the best performance. In BAliBASE's reference RV11, containing equidistant sequences sharing less than 20% identity, ProbCons and Mafft L-INS-i were not statistically different according to the Wilcoxon signed ranks test (p > 0.05). The same result with no statistical separation was observed in reference RV20, which is composed of sequences from divergent subfamilies, in reference RV30, comprised of sequences from protein families with some highly diverged sequences, and in reference RV50, made of sequences with large insertions.
ProbCons did perform significantly better than Mafft L-INS-i in the set RV12 that contains equidistant sequences sharing between 20 and 40% identity. Mafft L-INS-i and T-Coffee were not statistically different (Wilcoxon signed ranks test, p > 0.05). Conversely, on reference RV40 composed of protein sequences with large extensions, Mafft L-INS-i outperformed all other packages, with ProbCons and T-Coffee not far behind and not significantly different.
When results for all references are analyzed together, the same pattern observed from the isolated references was also found. In this broader scenario, ProbCons and Mafft L-INS-i achieved the best results and the difference in final alignment accuracy is not statistically significant (Wilcoxon signed ranks test, p > 0.05). An intermediary pack is formed of two distinct groups (defined by Wilcoxon signed ranks test) where Muscle and T-Coffee did slightly better than Mafft FFT-NS-2, Kalign and Clustal W. Showing the poorest performance for the whole database set were Dialign-T, Dialign2.2, which were statistically indistinguishable, and POA. Overall, the results from Simprot and BAliBASE data sets were consistent, with the exception of Mafft FFT-NS-2 which ranked significantly lower on BAliBASE data sets than on Simprot's. These results corroborate in part the findings of Lassmann and Sonnhammer [9], that showed T-Coffee as the best available algorithm at the time for BAliBASE v2 alignments. Their result also indicated POA as the program with the poorest performance.
Speed of execution
Mafft FFT-NS-2 was the fastest program for all tested sequence sizes (Figure 12). T-Coffee, as shown before [9], had the worst speed, with an average alignment time for the smallest sequence set (100 amino acids) longer than for Clustal W, Mafft (FFT-NS-2 and L-INS-i), Kalign and POA when aligning the largest set. ProbCons had the second worst average time for most sequence sizes.