Multiple sequence alignment is a basic task in bioinformatics and has many applications in biological analyses such as phylogenetic inference and protein 3D structure prediction. The progressive alignment method [1] is one of the most commonly used methods for multiple sequence alignment. Roughly speaking, the method first constructs a guide tree that is supposed to capture the phylogenetic relationship of the input sequences, and then aligns the sequences progressively according to the topology of the guide tree, so that more closely related sequences are aligned first and less closely related ones are aligned later.
Recently, we proposed an adaptive approach for progressive multiple sequence alignment [2]. We observed that sequence families with different similarities usually yield alignments with different characteristics and structural properties, and that, given a reliable measure of the similarity of the inputs, we may exploit the corresponding properties to help generate better alignments. To estimate the similarity, we proposed to use the average percent identity, which is defined as follows. For any two sequences, the percent identity (PID) of these two sequences is defined to be
$$\mathrm{PID} = \frac{N_{\mathrm{Identity}}}{L_{\mathrm{Alignment}}} \times 100\%$$
where $N_{\mathrm{Identity}}$ is the number of identities in the optimal pairwise alignment of the two sequences, and $L_{\mathrm{Alignment}}$ is the length of this alignment. The average percent identity $\overline{\mathrm{PID}}$ of the input sequences is the average of the PIDs over every pair of the sequences. In [2], we noted that if $\overline{\mathrm{PID}}$ is greater than 40%, the input sequences are very similar, and we showed how to exploit the properties of similar sequences and align the sequences globally. If $\overline{\mathrm{PID}}$ is between 25% and 40%, the inputs are moderately similar, and we can exploit the corresponding properties to align them locally. For inputs with $\overline{\mathrm{PID}}$ below 25%, we do not know which alignment method is better; hence we suggested trying different methods (e.g., using global alignment methods as well as local alignment methods) and using their consensus to determine the final alignment.
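The definition and thresholds above can be illustrated with a minimal Python sketch. It assumes the optimal pairwise alignments have already been computed and are supplied as pairs of gapped rows with "-" as the gap character. The function names, the strategy labels, and the handling of the exact boundary values 25% and 40% are illustrative assumptions, not taken from GLProbs.

```python
from statistics import mean


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """PID (in percent) of one pairwise alignment given as two gapped rows."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("rows must be non-empty and of equal length")
    # An identity is a column holding the same residue in both rows.
    identities = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identities / len(aligned_a)


def average_pid(pairwise_alignments) -> float:
    """Average PID over an iterable of (row_a, row_b) pairwise alignments."""
    return mean(percent_identity(a, b) for a, b in pairwise_alignments)


def choose_strategy(avg_pid: float) -> str:
    """Map the average PID to the strategy described in the text.

    Assumption: 40% itself is treated as 'moderately similar' and 25%
    itself as the lower end of that range; the text does not say.
    """
    if avg_pid > 40.0:
        return "global"      # very similar inputs
    if avg_pid >= 25.0:
        return "local"       # moderately similar inputs
    return "consensus"       # low similarity: combine global and local
```

For example, the alignment of `ACDEF` with `AC-EF` has four identities over five columns, so `percent_identity("ACDEF", "AC-EF")` evaluates to 80.0 and the corresponding strategy label is `"global"`.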
To test the effectiveness of our idea, we developed a software tool called GLProbs, which implements our adaptive approach for multiple sequence alignment. We carried out extensive tests and empirical comparisons of GLProbs, and the results showed that GLProbs has significantly better accuracy than a dozen other leading MSA tools (see [2] for more details).
In this paper, we study why GLProbs can achieve such high accuracy, and explore ways to further improve the software tool. In particular, we are interested in finding out the impact of the adaptive guide tree construction method used in GLProbs. This also leads us to study the following fundamental question:
Are guide trees really important for obtaining high-quality multiple sequence alignments, and if so, how should the best guide trees be constructed?
We note that there are already studies suggesting that guide trees are important. For example, Penn et al. [3] showed that uncertainty in the guide tree is a major source of alignment uncertainty, and Capella-Gutierrez and Gabaldon [4] showed that most gaps are inserted in patterns that follow the guide tree.
To study the guide trees of GLProbs, we have done the following tests.
First, we modified GLProbs to GLProbs-Random in which the adaptive guide tree construction step of GLProbs was replaced by a step that just generates a random guide tree. Then we compared the performance of GLProbs and GLProbs-Random empirically.
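As an illustration of what a random guide tree step might look like, here is a minimal Python sketch. The text does not say how GLProbs-Random draws its trees, so the procedure below (repeatedly merging two randomly chosen clusters until one remains) is an assumption, and the names `random_guide_tree` and `leaves` are hypothetical. It makes no claim about the resulting distribution over tree topologies.

```python
import random


def random_guide_tree(names, seed=None):
    """Return a random rooted binary tree over `names` as nested 2-tuples.

    Assumed procedure (not from the source): pick two clusters at random,
    join them, and repeat until a single cluster remains.
    """
    rng = random.Random(seed)
    clusters = list(names)
    if not clusters:
        raise ValueError("need at least one sequence name")
    while len(clusters) > 1:
        i, j = rng.sample(range(len(clusters)), 2)
        merged = (clusters[i], clusters[j])
        # Remove the higher index first so the lower index stays valid.
        for k in sorted((i, j), reverse=True):
            clusters.pop(k)
        clusters.append(merged)
    return clusters[0]


def leaves(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]
```

A progressive aligner would then visit the internal nodes of the returned tree bottom-up, aligning the two children of each node; that part is unchanged from the usual progressive scheme and is omitted here.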
Second, we modified GLProbs into a new tool, GLProbs-Reference, and compared the performance of the two tools in aligning families of protein sequences whose correct multiple sequence alignments are generally agreed upon by biologists. In GLProbs-Reference, the guide tree generated by GLProbs is replaced by a phylogenetic tree constructed as follows: based on the known correct alignment of the input sequences, we construct their phylogenetic tree using the maximum-likelihood method [5], and then use this phylogenetic tree as the guide tree. Intuitively, these phylogenetic trees should be the best guide trees for the alignments. The aim of this test is to find out whether the guide trees constructed by the adaptive method are competitive with the best.
Finally, we study whether the adaptive guide tree construction method of GLProbs can bring similar improvements to other MSA tools. We modified five leading multiple sequence alignment tools, namely MSAProbs [6], Probalign [7], ProbCons [8], T-Coffee [9], and ClustalW [10], by replacing their original guide tree construction steps with the adaptive guide tree construction step and keeping the other steps intact. We then compare their performance in aligning protein sequence families obtained from three popular benchmark datasets.
We will detail the results of our tests in Sections 2, 3 and 4. Below, we summarise our conclusions.
• For sequences with high similarity, the guide tree construction method is not critical; many reasonable methods can generate good enough guide trees leading to satisfactory alignments.
• For sequences with moderate similarity, better guide trees are very important for generating good alignments. Our study showed that the guide trees constructed by the adaptive method of GLProbs are usually among the best, and they can be used to improve the performance of other MSA tools.
• For sequences with very low similarity, the adaptive guide tree construction method can also improve the accuracy of other MSA tools; in fact, the improvements are larger than those obtained for more similar sequences. However, the accuracy of these alignments is still very low. We found that for sequences with very low similarity, it is very difficult to generate good guide trees, and using a bad guide tree has a serious detrimental effect on the quality of the resulting alignment. For these sequences, we suggest using other methods that do not rely on guide trees, such as non-progressive alignment methods, to generate better alignments.