Skip to main content
  • Research article
  • Open access
  • Published:

MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts

Abstract

Background

Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It has been used essentially in almost all bioinformatics tasks such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields.

Results

We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method is different from the multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences since the structural information of our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact map to multiple sequence alignment is novel. The rigorous benchmarking of our method to the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves the multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools without using this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. And the performance of the method is comparable to the state-of-the-art method PROMALS of using structural features and additional homologous sequences by slightly lower scores.

Conclusion

MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

Background

Aligning multiple evolutionarily related protein sequences is a fundamental technique for studying protein function, structure, and evolution. Multiple sequence alignment methods are often an essential component for solving challenging bioinformatics problems such as protein function prediction, protein homology identification, protein structure prediction, protein interaction study, mutagenesis analysis, and phylogenetic tree construction. During the last thirty years or so, a number of methods and tools have been developed for multiple sequence alignment, which have made fundamental contributions to the development of the bioinformatics field.

State of the art multiple sequence alignment methods adapt some popular techniques to improve alignment accuracy, such as iterative alignment [1], progressive alignment [2], alignment based on profile hidden Markov models [3], and posterior alignment probability transformation [4, 5]. Some alignment methods, such as 3D-Coffee [6] and PROMALS3D [7], use 3D structure information to improve multiple sequence alignment, which cannot be applied to the majority of protein sequences without tertiary structures. In order to overcome this problem, we have developed a method to incorporate secondary structure, relative solvent accessibility, and contact map information predicted from protein sequences into multiple sequence alignment. Predicted secondary structure information has been used to improve pairwise sequence alignment [8, 9], but few attempts had been made to use predicted secondary structure information in multiple sequence alignment [10–15]. To the best of our knowledge, applying predicted relative solvent accessibility and residue-residue contact map to multiple sequence alignment is novel.

In order to use the predicted structural information to advance the state of the art of multiple sequence alignment, we first compared the existing multiple sequence alignment tools [16–31, 4, 5, 32–37] on the standard benchmark data sets such as BAliBASE [38], SABmark [39] and OXBENCH [40], which showed that MAFFT [30], T-coffee [31], MSAProbs [4], and ProbCons [5] yielded the best performance. Then we developed MSACompro, a new multiple sequence alignment method, which effectively utilizes predicted secondary structure, relative solvent accessibility, and residue-residue contact map together with posterior alignment probabilities produced by both pair hidden Markov models and partition function as in MSAProbs [4]. The assessment results of MSACompro compared to the benchmark data sets from BAliBASE, SABmark and OXBENCH showed that incorporating predicted structural information has improved the accuracy of multiple sequence alignment over most existing tools without using structural features and sometimes the improvement is substantial.

Method

Following the general scheme in MSAProbs [4], MSACompro has five main steps: (1) compute the pairwise posterior alignment probability matrices based on both pair-HMM and partition function, considering the similarity in amino acids, secondary structure, and relative solvent accessibility; (2) generate the pairwise distance matrix from both the pairwise posterior probability matrices constructed in the first step and the pairwise contact map similarity matrices; (3) construct a guide tree based on pairwise distance matrix, and calculate sequence weights; (4) transform all the pairwise posterior matrices by a weighting scheme; (5) perform a progressive alignment by computing the profile-profile alignment from the probability matrices of all sequence pairs, and then an iterative alignment to refine the results from progressive alignment. Our method is different from MSAProbs in that it adds secondary structure and solvent accessibility information to the calculation of the posterior residue-residue alignment probabilities and computes the pairwise distance matrix with the help of predicted residue-residue contact information.

Construction of pairwise posterior probability matrices based on amino acid sequence, secondary structure and solvent accessibility information

For two protein sequences X and Y in a sequence group S to be aligned, we denote X = (x1, x2,......,x n 1), Y = (y1, y2,......,y n 2), where x1, x2,......, x n 1 and y1, y2,......,y n 2 are lists of the residues in X and Y, respectively. n1 is the length of sequence X, and n2 is the length of sequence Y. Suppose x i is the i-th amino acid in sequence X, and y j is the j-th amino acid in sequence Y. We let aln denote a global alignment between X and Y, ALN the set of of all the possible global alignments of X and Y, and aln* ∈ ALN the true pairwise alignment of X and Y. The posterior probability that the i-th residue in X (x i ) is aligned to the j-th residue (y j ) in Y in aln* is defined as:

p ( x i ~ y j ∈ a l n * | X , Y ) = ∑ a l n ∈ A L N P ( a l n | X , Y ) I { x i ~ y j ∈ a l n }
(1)
( 1 ≤ x i ≤ n 1 , 1 ≤ y j ≤ n 2 )
I { x i ~ y j ∈ a l n } = { 1 , i f ( x i ~ y j ∈ a l n ) t r u e o , o t h e r w i s e

P(aln | X, Y) denotes the probability that aln is the true alignment aln*: Thus, the posterior probability n1 × n2 matrix P XY is a collection of all the values p(x i ~ y j ∈ aln* | X, Y) (p(x i ~ y j ) for short) for 1 ≤ x i ≤ n1, 1 ≤ y j ≤ n 2. The calculation process of the pairwise posterior probability matrix is described as follows.

As in MSAProbs, two different methods (a pair hidden Markov model and a partition function) are used to compute the pairwise posterior probability matrices ( P X Y 1 and P X Y 2 ), respectively. The first kind of pairwise probability matrix P X Y 1 is calculated by a partition function (F) of alignments based on dynamic programming. F(i, j) denotes the probability of all partial global alignments of X and Y ending at position (i, j). F M (i, j) is the probability of all partial global alignments with x i aligned to y j , F y (i, j), is the probability of all partial global alignments with y j aligned to a gap, and F X (i, j) is the probability of all partial global alignments with x i aligned to a gap. Accordingly, the partition function can be calculated recursively as follows:

F M ( i , j ) =F ( i - 1 , j - 1 ) e W 1 β s ( x i , y j ) + W 2 S S ( s s ( x i ) , s s ( y j ) ) + W 3 S A ( s a ( x i ) , s a ( y j ) )
F Y ( i , j ) = F M ( i , j - 1 ) e β g a p + F Y ( i , j - 1 ) e β e x t
(2)
F X ( i , j ) = F M ( i - 1 , j ) e β g a p + F X ( i - 1 , j ) e β e x t
F ( i , j ) = F M ( i , j ) + F Y ( i , j ) + F X ( i , j )

Subject to the constraint W1 + W2 + W3 = 1.

In the formula above, s(x i , y j ) is the amino acid similarity score between x i and y j . One element of the substitution matrix s, SS(ss(x i ), ss(yj)) is the similarity score between the secondary structure (ss(x i )) of residue x i in protein X and that of residue y j in protein Y according to the secondary structure similarity matrix SS, SA(sa(x i ), sa(y j )) is the similarity score between the relative solvent accessibility (sa(x i )) of residue x i in protein X and that of residue y j in protein Y according to the solvent accessibility similarity matrix SA. W1, W2, W3 are weights used to control the influence of the amino acid substitution score, secondary structure similarity score, and solvent accessibility similarity score. The secondary structure and solvent accessibility can be automatically predicted by SSpro/ACCpro [41] (http://sysbio.rnet.missouri.edu/multicom_toolbox/) using a multi-threading technique implemented in MSACompro, or alternatively be provided by a user. The values of the three weights are set to 0.4, 0.5, and 0.1 by default, and can be adjusted by users. The ensembles of bidirectional recurrent neural network architectures in ACCpro are used to discriminate between two different states of relative solvent accessibility, higher or lower than the accessibility cutoff - 25% of the total surface area of a residue [42], corresponding to e or b. As in MSAprobs, β is a parameter measuring the deviation between suboptimal and optimal alignments, gap(gap ≤ 0) is the gap open penalty, and ext(ext ≤ 0) is the gap extension penalty.

We used the Gonnet 160 matrix as a substitution matrix to generate the similarity scores between two amino acids in proteins [43]. The 3 × 3 secondary structure similarity matrix SS contains the similarity scores of three kinds of secondary structures (E, H, C) as follows:

SS= 100 010 001

, where two identical secondary structures receive a score of 1 and different ones receive a score of 0.

The 2 × 2 solvent accessibility similarity matrix SA contains the similarity scores of two kinds of relative solvent accessibilities (e, b) as follows:

SA= 10 01

, where two identical solvent accessibilities receive a score of 1 and different ones receive a 0. It is worth noting that we used the simple identity scoring matrix for secondary structure and solvent accessibility here. Employing more advance scoring matrices defined in [44] may lead to further improvement. Each posterior residue-residue alignment probability element in the first kind of posterior probability matrix ( P X Y 1 ) can be calculated from the partition function as:

p 1 ( x i ~ y j ) = F M ( i − 1 , j − 1 ) F M ' ( i + 1 , j + 1 ) F • e W 1 β s ( x i , y j ) + W 2 S S ( s s ( x i ) , s s ( y j ) ) + W 3 S A ( s a ( x i ) , s a ( y j ) )
(3)

, where F M ′ ( i , j ) denotes the partition function of all the reverse alignments starting from the position (n1, n2) till position (i, j) with x i aligned to y j .

As in MSAProbs, the second kind of pairwise probability matrix P X Y 2 is calculated by a pair hidden Markov model (HMM) combining both Forward and Backward algorithm [4, 5, 45]. The pairwise probabilities can be generated under the guidance of pair HMM involving state emissions and transitions. P X Y 2 is only derived from protein sequences without using secondary structure and solvent accessibility, which is different from PROMALS [15] that lets HMM emit both amino acids and secondary structure alphabets.

The final posterior probability matrix P XY is calculated as the root mean square of the corresponding values in P X Y 1 and P X Y 2 as follows.

p ( x i ~ y j ) = p 1 ( x i ~ y j ) 2 + p 2 ( x i ~ y j ) 2 2
(4)

where p1(x i ~ y i ) and p2(x i ~ y i ) denote a posterior probability element in two kinds of posterior probability matrices ( P X Y 1 and P X Y 2 ), respectively.

Construction of pairwise distance matrices based on pairwise posterior probabilities and pairwise contact map scores

The posterior probability matrix P XY is used as a scoring function to generate a pairwise global alignment between sequences X and Y. The optimal global alignment score Opt(X,Y) of the global alignment is computed according to an optimal sub-alignment score matrix AS. The optimal sub-alignment score AS(i, j) denotes the score of the optimal sub-alignment ending at residues i and j in X and Y. The AS matrix is recursively calculated as:

A S ( i , j ) = max { A S ( i − 1 , j − 1 ) + P X Y ( x i ~ y j ) A S ( i − 1 , j ) A S ( i , j − 1 )
(5)

AS (n1, n2) is the optimal score of the full global alignment between X and Y, which is denoted as Optscore(X,Y).

In addition to the optimal alignment score, we introduce a contact map score, CMscore(X, Y), for the optimal pairwise alignment of X and Y, assuming that the spatially neighboring residues of two aligned residues should have a higher tendency to be aligned together. CMscore(X, Y) is calculated from the contact map correlation score matrix CMap XY based on the residue-residue contact map matrices CMap X and CMap Y of X and Y.

Assuming the optimal global alignment of X and Y is represented as,

x 1 x 2 . . . . . . . - x m . . . . . . x p . . . . . . x n 1 y 1 - . . . . . . y k y k + 1 . . . . . - . . . . . . y n 2

we can generate a new alignment after removing the pairs containing gaps:

x 1 . . . . . . . x m . . . . . . . . . . . . x n 1 y 1 . . . . . . y k + 1 . . . . . . . . . . . y n 2

, which can be denoted as

x 1 ′ x 2 ′ . . . . . . . . . . . . x n ′ y 1 ′ y 2 ′ . . . . . . . . . . . y n ′

, where n is the length of the new alignment without gaps

From this alignment, we can construct two contact map matrices, CMap X and CMap Y , shown below:

CMa p X = x 11 ′ x 12 ′ . . . . . . x 1 n ′ x 21 ′ x 22 ′ . . . . . . x 2 n ′ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x n 1 ′ x n 2 ′ . . . . . . x n n ′
(6)
CMa p Y = y 11 ′ y 12 ′ . . . . . . y 1 n ′ y 21 ′ y 22 ′ . . . . . . y 2 n ′ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . y n 1 ′ y n 2 ′ . . . . . . y n n ′
x i j ′

is the contact probability score between amino acid x i ′ and x j ′ in protein sequence X, and y i j ′ is the contact probability score between amino acid y i ' and y j ' in protein sequence Y. The residue-residue contact probabilities are predicted from the sequence by NNcon [46] (http://sysbio.rnet.missouri.edu/multicom_toolbox/). The contact map correlation score matrix CMap XY is calculated as the multiplication of CMap X and CMap Y :

C M a p X Y = C M a p X × C M a p Y = x y 11 ′ x y 12 ′ . . . . x y 1 n ′ x y 21 ′ x y 22 ′ . . . . x y 2 n ′ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x y n 1 ′ x y n 2 ′ . . . . x y n n ′
(7)
x y i i ′

is the contact map score for an aligned residue pair (amino acid x i ′ in protein X and amino acid y i ' in protein Y). The contact map score for the global alignment of two sequences X and Y is calculated as

C M s c o r e ( X , Y ) = 1 n 2 ∑ i = 1 n C M a p X Y ( i , i ) = 1 n 2 ∑ i = 1 n x y i i ′ = 1 n 2 ∑ i = 1 n ∑ j = 1 n x i j ′ y j i ′
(8)

In practice, we only need to calculate the diagonal values in CMap XY .

Finally, we define the pairwise distance between sequences X and Y as

d ( X , Y ) =1- W 4 O p t s c o r e ( X , Y ) min { n 1 , n 2 } - W 5 CMscore ( X , Y )
(9)

, where W4 + W5 = 1. The weights W4 and W5 are used to control the influence of sequences X and Y.

Construction of guide tree and transformation of posterior probability

Akin to MSAProbs [4], a guide tree is constructed by the UPGMA method that uses the linear combinatorial strategy [47]. The distance between a new cluster Z formed by merging clusters X and Y, and another cluster W is calculated as (10):

d ( W , Z ) = d ( W , X ) × N u m ( X ) + d ( W , Y ) × N u m ( Y ) N u m ( X ) + N u m ( Y )
(10)

In which Num(X) is the number of leafs in cluster X.

After the guide tree is constructed, sequences are weighted according to the schemes inferred in [4].

To reduce the bias of sampling similar sequences, we use a weighted scheme to transform the former posterior probability as

P X Y ′ = 1 w N ( ( w X + w Y ) P X Y + ∑ Z ∈ S , Z ≠ X , Y w z P X Z P Z Y )
(11)

w X and w Y are, respectively, the weight of sequences X and Y, w Z is the weight of a sequence Z other than X or Y in the given group of sequences, and wN is the sum of sequence weights in dataset S.

Combination of progressive and iterative alignment

We first use the guide tree to generate a multiple sequence alignment by progressively aligning two clusters of the most similar sequences together. As in MSAProbs [4], we also apply a weighted profile-profile alignment to align two clusters of sequences. The sequence weights are the same as in the previous step. The posterior alignment probability matrix of two clusters/profiles is averaged from the probability matrices of all sequence pairs (X, Y), where x and y are from the two different clusters. Formula (5) used to generate the global profile-profile alignment is based on the posterior alignment probability matrices of the profiles. In order to further improve the alignment accuracy, we then use a randomized iterative alignment to refine the initial alignment. This randomized iterative refinement randomly partitions the given sequence group S into two separate groups, and performs a profile-profile alignment of the two groups. The iterative refinement can be completed after 10 iterations by default, or a fixed number of iterations set by users. Generally speaking, the final progressive alignment orders sequences along the guide tree from closely related to distantly related. To improve the alignment accuracy, a final iterative alignment is applied to refine the results from progressive alignment. In addition, a multi-thread technology based on OpenMP is also used to improve the efficiency of the program [48].

Results and discussion

Evaluation of MSACompro and other tools on the standard benchmarks

We tested MSACompro in comparison to three benchmarks: BAliBASE, SABmark and OXBENCH, and evaluated the alignment results in terms of sum-of-pairs (SP) score and true column (TC) score. The SP score is the number of correctly aligned pairs of residue in the test alignment divided by the total number of aligned pairs of residues in core blocks of the reference alignment [49]. The TC score is the number of correctly aligned columns in the test alignment divided by the total number of aligned columns in core blocks of the reference alignment [49]. We used the application bali_score provided by BAliBASE 3.0 to calculate these scores. We compared MSACompro to 11 other MSA tools which do not have access to the structural information, including ClustalW 2.0.12, DIALIGN-TX 1.0.2 [27], FSA 1.15.5, MAFFT 6.818, MSAProbs 0.9.4, MUSCLE 3.8.31, Opal 0.2.0, POA 2, Probalign 1.3, Probcons and T-coffee 8.93. It is worth noting that a fair comparison between our method with these multiple sequence alignment methods without using structural features is not possible because these methods use less input information. So, the goal of comparison is to present the idea that structural information-based alignment may contain valuable information that is not available in sequence-based multiple sequence alignments and can therefore be a supplement to sequence-based alignments. And to make the evaluation more fair and comprehensive, we also compared MSACompro with four tools which use structural information, including MUMMALS 1.01 [14], PROMALS [15] and PROMALS3D [7].

To understand how various parameters of MSACompro affect alignment accuracy, some experiments were carried out to evaluate these variants based on two algorithm changes: (1) combining amino acids, secondary structure, and relative solvent accessibility information into the partition function calculation using respective weights for each of them; (2) computing the pairwise distance from both the pairwise posterior probability matrices and the pairwise contact map similarity matrices by introducing the weight wc for contact map information. To optimize the parameters, we used BAliBASE 3.0 data sets as training sets, and SABmark 1.65 and OXBENCH data sets as testing sets. Firstly, we focused on the effect of secondary structure and solvent accessibility information by testing different values of weight w1 for amino acid similarity and weight w2 for secondary structure information on BAliBASE 3.0 data sets. MSACompro worked wholly the best if the weight w1 for amino acid similarity and the weight w2 for secondary structure information were 0.4 and 0.5, respectively. Since the sum of w1, w2 and wc is 1, we can deduce that wc is 0.1 if w1 and w2 are 0.4 and 0.5. Then we focused on the effect of residue-residue contact map information under two different scenarios: using secondary structure and relevant solvent accessibility information by keeping the w1, w2, and w3 at their optimum values (0.4, 0.5, 0.1), or excluding that information by setting both w2 and w3 as 0. Evaluation results on BAliBASE 3.0 database were found to improve the most when wc is 0.9 by integrating both secondary structure and relevant solvent accessibility information. Additionally, to avoid over-fitting, we tested MSACompro against SABmark 1.65 and OXBENCH data sets using this set of parameters independently, and found that a significant improvement was also gained in comparison to other leading protein multiple sequence alignment tools. More details can be found in the next section, "A comprehensive study on the effect of predicted structural information on the alignment accuracy". Consequently, the weights w1, w2, w3 and wc are respectively set at 0.4, 0.5, 0.1 and 0.9 in MSACompro by default. All other tools were also evaluated under default parameters.

Firstly, we evaluated these methods on BAliBASE [16] - the most widely used multiple sequence alignment benchmark. The latest version, BAliBASE 3.0, contains 218 reference alignments, which are distributed into five reference sets. Reference set 1 is a set of equal-distant sequences, which are organized into two reference subsets, RV11 and RV12. RV11 contains sequences sharing >20% identity and RV12 contains sequences sharing 20% to 40% identity. Reference set 2 contains families with >40% identity and a significantly divergent orphan sequence that shares <20% identity with the rest of the family members. Reference set 3 contains families with >40% identity that share <20% identity between each two different sub-families. Reference set 4 is a set of sequences with large N/C-terminal extensions. Reference set 5 is a set of sequences with large internal insertions. Tables 1, 2, and 3 report the mean SP scores and TC scores of MSACompro and the tools without using structural information for the six subsets and the whole database. All the scores in the tables are multiplied by 100, and the highest scores in each column are marked in bold. The results show that MSACompro received the highest SP and TC scores on the whole database and all the subsets except for the SP score for the subset RV40. In some cases, MSACompro's improvement was substantial.

Table 1 Total SP scores on the full-length BAliBASE 3.0 subsets.
Table 2 Total TC scores on the full-length BAliBASE 3.0 subsets.
Table 3 Overall mean SP and TC scores on the full-length BAliBASE 3.0 subsets.

Secondly, we evaluated MSACompro and other tools without the help of structural information on the SABmark database [4], which is a very challenging data set for multiple sequence alignment according to a comprehensive study [50]. SABmark is an automatically generated data set consisting of two sets. One set is from SOFI [51] and the other is from the ASTRAL database [52], which contains remote homologous sequences in twilight-zone or superfamily. Since some pairwise reference alignments in SABmark are not generally consistent with multiple alignments, a subset of SABmark, 1.65 called SABRE [53], has been widely used as a multiple sequence alignment benchmark database. SABRE was constructed by identifying mutually consistent columns (MCCs) in the pairwise reference structure alignment. MCCs are considered similar to BAliBASE core blocks. SABRE contains 423 out of 634 SABmark groups that have eight or more MCCs. Table 4 shows the overall mean SP and TC scores of the alignments. The mean SP and TC scores of MSACompro are 8.3 and 9.1 points higher than those of the second best-performer, MSAProbs, demonstrating that incorporating predicted structural features into multiple sequence alignments can substantially improve alignment accuracy for even remotely related homologous sequences. Figure 1 shows an example comparison between the alignments generated by our method, MSACompro, and MSAProbs from the SABRE database. The SP and TC scores significantly improved from 0.307 to 0.853 and 0 to 0.780, respectively. This case demonstrates that taking predicted structural information can help avert aligning unmatched regions, especially when the sequence similarity is unrecognizable.

Table 4 Overall mean SP and TC scores on the SABmark 1.65.
Figure 1
figure 1

an example in SABRE database comparing the alignments generated by our method and MSAProbs. The reference alignment and resulting alignments generated by both methods are respectively shown in the figure. The correct alignment regions significantly improved by our MSACompro after taking structural information are marked in red rectangles. In contrast, the corresponding incorrect alignment regions generated by MSAProbs are represented in green rectangles. The predicted secondary structure and solvent accessibility information for the correctly aligned regions are shown in circles.

Thirdly, we also assessed all the tools without using the structural information on the OXBENCH database [54]. OXBENCH is also a popular benchmark database generated by the AMPS multiple alignment method from the 3Dee database of protein structural domains [55]. The conserved columns in OXBENCH can be considered similar to BAliBASE core blocks. The mean SP and TC scores over the whole database are shown in Table 5. The results show that MSACompro improves the alignment accuracy over all other methods.

Table 5 Overall mean SP and TC scores on the OXBENCH. Bold denotes highest scores.

Finally, we also compared the SP scores and TC scores of MSACompro and other tools which adopt the structural information on the six subsets of BAliBASE database, SABmark database and OXBENCH database. Tables 6 and 7 demonstrate the SP and TC scores across the three databases. The results show that MSACompro gained the highest scores on three out of six subsets of BAliBASE and achieved the third highest scores on other data sets, which are lower than PROMALS3D that used true experimental structures as input and PROMALS that used both predicted secondary structures and additional homologous protein sequences found by PSI-BLAST search's on a large protein sequence database [15]. Overall, MSACompro performed similarly as PROMALS, whereas the latter has an advantage on a remote homologous protein sequence data set SABmark since it directly incorporates additional homologous protein sequences to improve the alignment of remotely related target sequences during the progressive alignment process. Moreover, the accuracy of MSACompro on the BAliBASE 3.0 data sets seems to be higher than the published results of another alignment tool of using secondary structure information - DIALIGN-SEC [12], which was not directly tested in our experiment because it is only available as a web server other than a downloadable software package. Therefore, MSACompro is useful to improve the accuracy of multiple sequence alignment in general and particularly for most cases in reality where experimental structures are not available.

Table 6 Total SP scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.
Table 7 Total TC scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.

In order to check if alignment score differences between MSACompro and the other alignment methods are statistically significant, we carried out the Wilcoxon matched-pair signed-rank test [56] on both SP and TC scores of these methods on the three data sets. The p-values of alignment score differences calculated by the Wilcoxon matched-pair signed-rank test are reported in Table 8. Generally speaking, the alignment scores of MSACompro are significantly higher than all the alignment methods without using structural information and MUMMALS of using structural information in all but one case according to the significance threshold of 0.05. The exception is that MSACompro's TC score is higher than MSAProbs on the BAliBASE, but not statistically significant. However, the alignment scores of MSACompro are mostly statistically lower than the other two alignment methods (PROMALS or PROMALS3D) of using predicted structural features, more homologous sequences, or tertiary structures.

Table 8 The statistical significance (i.e. p-values) of SP and TC alignment score differences between MSACompro and the other tools on three benchmark data sets.

In addition to alignment accuracy, alignment speed is also a factor to consider in time-critical applications. Because it is difficult to rigorously compare the speed of different methods due to the difference in implementation and inputs, we only report the roughly estimated running time of the different methods on BAliBASE based our empirical observations. The fastest methods are ClustalW, MAFFT, MUSCLE, and POA, which used less than one hour. The medium-speed methods that used a few hours to less than one day include FSA, Opal, Probalign, MSAProbs, ProbCons, T-coffee, MUMMALS, and DIALIGN-TX. The more time demanding methods are MSACompro, PROMALS, and PROMALS3D because they need to generate extra information for alignment. We ran both PROMALS and MSACompro on the BAliBASE 3.0 database on an 4 eight-core (i.e. 32 CPU cores) Linux server to calculate their running time. It took about 4 days and 6 hours for PROMALS to run on the whole BAliBASE 3.0 data sets, and about 9 hours and 13 minutes for MSACompro to run on the same data sets. MSACompro was faster because it used a multiple-threading implementation to call SSpro/ACCpro to predict secondary structure and solvent accessibility in parallel. Out of about 9 hours and 13 minutes, about four hours and 17 minutes were used by MSACompro to align sequences if secondary structure and solvent accessibility information was provided. However, if only one CPU core is used, it took around 6 days and 14 hours for SSpro and ACCpro called by MSACompro to predict secondary structure and solvent accessibility information alone, which is time-consuming. Therefore, MSACompro will be slower than PROMALS if it runs a single CPU core, but faster on multiple (> = 3) CPU cores. As for PROMALS3D, it used about 9 days to extract tertiary structure information and make alignments.

A comprehensive study of the effect of predicted structural information on the alignment accuracy

To understand the impact of predicted secondary structure, relative solvent accessibility, and contact map on the accuracy of multiple sequence alignment, we tested their effects on alignments individually or in combination by adjusting the values of their weights used in the partition function (i.e. for secondary structure and solvent accessibility) or in the distance calculation (i.e. for contact map).

I. Effect of secondary structure information

We studied the effect of secondary structure information by adjusting the values of w1 (weight for amino acid sequence information) and w2 (weight for secondary structure information), the sum of which was kept as 1, and setting the values of w3 (weight for relative solvent accessibility) and wc (weight for contact map) to 0. The results for different w2 values on the SABmark data sets are shown in Table 9. The highest score is denoted in bold and by a superscript of star, and the second highest is denoted in bold. The results show that incorporating secondary structure information always improves alignment accuracy over the baseline established without using secondary structure information (w2 = 0). The highest accuracy is achieved when w2 is set to .5, at which point the score is 8 points greater than the baseline. w2 = 1 means that only secondary structure is used to calculate the posterior alignment probability in the partition function (i.e. equation set (2)), but amino acid sequence similarity is still used to calculate the other posterior alignment probability by the pair Hidden Markov Models. Figures 2 and 3 plot the SP and TC scores against weight values in Table 9 and Table 10, respectively.

Table 9 SP scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 2
figure 2

the 2D plot of SP scores against w 2 on the SABmark dataset.

Figure 3
figure 3

the 2D plot of TC scores against w 2 on the SABmark dataset.

Table 10 TC scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.

II. Effect of relative solvent accessibility information

Similarly, we studied the effect of relative solvent accessibility on the SABmark by adjusting the values of w1 and w3 and setting the values of w2 and wc to 0. The SP and TC scores with respect to different weight values are shown in Tables 11 and 12, respectively. The scores are also plotted against the weights in Figures 4 and 5, respectively. The highest SP and TC scores were achieved when w3 was set to 0.5 or 0.6.

Table 11 SP scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Table 12 TC scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 4
figure 4

the 2D plot of SP scores against w 3 on the SABmark dataset.

Figure 5
figure 5

the 2D plot of TC scores against w3 against the SABmark dataset.

III. Effect of residue-residue contact map information

We investigated the effect of contact map information on the BAliBASE 3.0 data set by adjusting wc and setting w2 and w3 to 0. We used NNcon to successfully predict the contact maps for subset RV11, RV30, 42 out of 44 alignments in RV12, 38 out of 40 in RV20, 33 out of 46 in RV40, and 14 out of 16 in RV50. We tested the MSACompro method against this data with contact predictions. Tables 13 and 14 show the SP and TC scores for different wc values on the subsets of the BAliBASE dataset. The results show that using contact information improved the alignment accuracy on some, but not all, subsets.

Table 13 SP scores for different weights for contact map on the BAliBASE3.0 database. Red color highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.
Table 14 TC scores for different weights for contact map on the BAliBASE 3.0 database. Red highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.

IV. Effect of combining secondary structure and solvent accessibility information

We adjusted the values of w1 (weight for amino acid), w2 (weight for secondary structure) and w3 (weight for relative solvent accessibility) simultaneously to investigate the effect of using secondary structure and relative solvent accessibility together. SP and TC scores on different parameter combinations are shown in Tables 15 and 15. The highest score is denoted in bold and by a superscript of 1, the second in bold and by a superscript of 2, and the third in bold and by a superscript of 3. The results show that the highest scores are achieved when w1 ranges from 0.4 to 0.5, w2 from 0.4 to 0.5, and w3 from 0.1 to 0.2. Also, using both secondary structure and solvent accessibility improves alignment accuracy over using either one. The best alignment score, which uses both secondary structure and solvent accessibility, is >8 points higher than the baseline approach, which does not use them. The changes of SP scores and TC scores with respect to the weights are visualized by the 3D plots in Figures 6 and 7. We conducted similar experiments on BAliBASE 3.0 and OXBENCH and got the similar results (data not shown).

Table 15 SP scores for different weight combinations (w1 - amino acid, w2 - secondary structure, w3 - solvent accessibility) on the SABmark 1.65 dataset.
Table 16 TC scores scores for different weight combinations (w1 - amino acid, w2 - secondary structure, w3 - solvent accessibility) on the SABmark 1.65 dataset.
Figure 6
figure 6

3D plot of SP scores against secondary structure weight w 2 and relative solvent accessibility weight w 3 .

Figure 7
figure 7

3D plot of TC scores against secondary structure weight w 2 and relative solvent accessibility weight w 3 .

V. Effect of using contact map information together with secondary structure and solvent accessibility information

In order to study whether or not contact information can be used effectively with secondary structure and solvent accessibility, we adjusted the weight wc for contact information, while keeping the w1, w2, and w3 at their optimum values (0.4, 0.5, and 0.1 respectively). Tables 17 and 18 report the SP and TC scores on the BAliBASE 3.0 data set for different wc values from no contact information (wc = 0) to maximum contact information (wc = 1). The results show that the improvement caused by contact information seems not to be substantial and significant.

Table 17 SP scores for different contact map weight wc on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.
Table 18 TC scores for different contact map weight wc on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.

Conclusion

In this work, we designed a new method to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into multiple protein sequence alignment. Our experiments on three standard benchmarks showed that the method improved multiple sequence alignment accuracy over most existing methods without using secondary structure and solvent accessibility information. However, the performance of the method is comparable to PROMALS and PROMALS3D by slightly lower scores on some subsets and behind it by a large margin on SABMARK probably because these two methods used homologous sequences or tertiary structure information in addition to secondary structure information. Since multiple sequence alignment is often a crucial step for bioinformatics analysis, this new method may help improve the solutions to many bioinformatics problems such as protein sequence analysis, protein structure prediction, protein function prediction, protein interaction analysis, protein mutagenesis and protein engineering.

References

  1. Barton GJ, Sternberg MJ: A strategy for the rapid multiple alignment of protein sequences. confidence levels from tertiary structure comparisons. J Mol Biol 1987, 198: 327–337. 10.1016/0022-2836(87)90316-0

    Article  CAS  PubMed  Google Scholar 

  2. Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–361. 10.1007/BF02603120

    Article  CAS  PubMed  Google Scholar 

  3. Krogh A, et al.: Hidden markov models in computational biology: applications to protein modeling. J Mol Biol 1994, 235: 1503–1531.

    Article  Google Scholar 

  4. Liu YC, Schmidt B, DouglasLM : MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 2010, 26(16):1958–1964. 10.1093/bioinformatics/btq338

    Article  CAS  PubMed  Google Scholar 

  5. Do CB, et al.: ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15: 330–340. 10.1101/gr.2821705

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Poirot O, Suhre K, Abergel C, Eamonn OT, Notredame C: 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Research 2004, 32: 37–40.

    Article  Google Scholar 

  7. Pei J, Kim B, Grishin NV: PROMALS3D: a tool for multiple sequence and structure alignment. Nucleic Acids Res 2008, 36(7):2295–2300. 10.1093/nar/gkn072

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Söding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research 2005, 33: W244-W248. 10.1093/nar/gki408

    Article  PubMed Central  PubMed  Google Scholar 

  9. Söding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21: 951–960. 10.1093/bioinformatics/bti125

    Article  PubMed  Google Scholar 

  10. Heringa J: Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput Chem 1999, 23: 341–364.

    Article  CAS  PubMed  Google Scholar 

  11. Kim NK, Xie J: Protein multiple alignment incorporating primary and secondary structure information. J Comput Biol 2006, 13: 75–88.

    Article  Google Scholar 

  12. Amarendran RS, Suvrat H, Rasmus S, Peter M, Eduardo C, Burkhard M: DIALIGN-TX and multiple protein alignment using secondary structure information at GOBICS. Nucleic Acids Research 2010, 38(suppl 2):W19-W22.

    Google Scholar 

  13. Zhou HY, Zhou YQ: SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Bioinformatics 2005, 21: 3615–3621. 10.1093/bioinformatics/bti582

    Article  CAS  PubMed  Google Scholar 

  14. Pei J, Grishin NV: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 2006, 34(16):4364–4374. 10.1093/nar/gkl514

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Pei J, Grishin NV: PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 2007, 23: 802–808. 10.1093/bioinformatics/btm017

    Article  CAS  PubMed  Google Scholar 

  16. Brudno M, Steinkamp R, Morgenstern B: The CHAOS/DIALIGN www server for multiple alignment of genomic sequences. Nucl Acids Res 32(Supplement 2):W41.

  17. Larkin M, et al.: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23(21):2947–2948. 10.1093/bioinformatics/btm404

    Article  CAS  PubMed  Google Scholar 

  18. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500. 10.1093/nar/gkg500

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ: Multiple sequence alignment with Clustal X. Trends Biochem Sci 1998, 23: 403–405. 10.1016/S0968-0004(98)01285-7

    Article  CAS  PubMed  Google Scholar 

  20. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 25: 4876–4882. 10.1093/nar/25.24.4876

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Higgins DG, Thompson JD, Gibson TJ: Using CLUSTAL for multiple sequence alignments. Methods Enzymol 1996, 266: 383–402.

    Article  CAS  PubMed  Google Scholar 

  22. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Higgins DG: CLUSTAL V: multiple alignment of DNA and protein sequences. Methods Mol Biol 1994, 25: 307–318.

    CAS  PubMed  Google Scholar 

  24. Higgins DG, Bleasby AJ, Fuchs R: CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci 1992, 8: 189–191.

    CAS  PubMed  Google Scholar 

  25. Higgins DG, Sharp PM: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988, 73: 237–244. 10.1016/0378-1119(88)90330-7

    Article  CAS  PubMed  Google Scholar 

  26. Bailey TL, Noble WS: Searching for statistically significant regulatory modules. Bioinformatics 2003, (Suppl. 2):19.

    Google Scholar 

  27. Amarendran RS, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms for Molecular Biology 2008, 3: 6. 10.1186/1748-7188-3-6

    Article  Google Scholar 

  28. Amarendran RS, Jan WM, Kaufmann M, Morgenstern B: DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 2005, 6: 66. 10.1186/1471-2105-6-66

    Article  Google Scholar 

  29. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L: Fast Statistical Alignment. PLoS Computational Biology 2009, 5: e1000392. 10.1371/journal.pcbi.1000392

    Article  PubMed Central  PubMed  Google Scholar 

  30. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059–66. 10.1093/nar/gkf436

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  31. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. JMB 2000, 302: 205–217. 10.1006/jmbi.2000.4042

    Article  CAS  Google Scholar 

  32. Brudno M, Do CB, Cooper G, Michael FK, Davydov E, Eric DG, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 2003.

    Google Scholar 

  33. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 2004, 32(5):1792–97. 10.1093/nar/gkh340

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  34. Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5(1):113. 10.1186/1471-2105-5-113

    Article  PubMed Central  PubMed  Google Scholar 

  35. Chikkagoudar S, Roshan U, Livesay DR: eProbalign: generation and manipulation of multiple sequence alignments using partition function posterior probabilities. Nucleic Acids Research 2007, 35: W675-W677. 10.1093/nar/gkm267

    Article  PubMed Central  PubMed  Google Scholar 

  36. Sze SH, Lu Y, Yang Q: A polynomial time solvable formulation of multiple sequence alignment. Journal of Computational Biology 2006, 13: 309–319. 10.1089/cmb.2006.13.309

    Article  CAS  PubMed  Google Scholar 

  37. Roshan U, Livesay DR: Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 2006, 22(22):2715–21. 10.1093/bioinformatics/btl472

    Article  CAS  PubMed  Google Scholar 

  38. Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics 2005, 61: 127–136. 10.1002/prot.20527

    Article  CAS  Google Scholar 

  39. Walle V, et al.: Align-m-a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 2004, 20: 1428–1435. 10.1093/bioinformatics/bth116

    Article  PubMed  Google Scholar 

  40. Raghava GP, et al.: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 2003, 4: 47. 10.1186/1471-2105-4-47

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  41. Cheng J, Randall A, Sweredoski M, Baldi P: SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research 2005, 33(Web Server):72–76. 10.1093/nar/gki396

    Article  Google Scholar 

  42. Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins 2002, 47: 142–153. 10.1002/prot.10069

    Article  CAS  PubMed  Google Scholar 

  43. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256: 1443–1445. 10.1126/science.1604319

    Article  CAS  PubMed  Google Scholar 

  44. Kawabata T, Nishikawa K: Protein structure comparison using the Markov transition model of evolution. Proteins 2000, 41: 108–122. 10.1002/1097-0134(20001001)41:1<108::AID-PROT130>3.0.CO;2-S

    Article  CAS  PubMed  Google Scholar 

  45. Durbin R, et al.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press Cambridge, UK; 1998.

    Chapter  Google Scholar 

  46. Tegge AN, Wang Z, Eickholt J, Cheng J: NNcon: Improved Protein Contact Map Prediction Using 2D-Recursive Neural Networks. Nucleic Acids Research 2009, 37: w515-w518. 10.1093/nar/gkp305

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  47. Sneath PHA, Sokal RP: Numerical taxonomy. In Freeman. San Francisco,USA; 1973.

    Google Scholar 

  48. OpenMP tutorial[https://computing.llnl.gov/tutorials/openMP]

  49. Thompson JD, Frederic P, Olivier P: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 1999, 27: 2682–2690. 10.1093/nar/27.13.2682

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  50. Walle V, et al.: Align-m-a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 2004, 20: 1428–1435. 10.1093/bioinformatics/bth116

    Article  PubMed  Google Scholar 

  51. Boutonnet NS, et al.: Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng 1995, 8: 647–662. 10.1093/protein/8.7.647

    Article  CAS  PubMed  Google Scholar 

  52. Brenner SE, et al.: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28: 254–256. 10.1093/nar/28.1.254

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  53. Edgar RC[http://www.drive5.com/bench]

  54. Raghava GP, et al.: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 2003, 4: 47. 10.1186/1471-2105-4-47

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  55. Poirot O, Suhre K, Abergel C, Eamonn OT, Notredame C: 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Research 2004, 32: 37–40.

    Article  Google Scholar 

  56. Wilcoxon F: Probability tables for individual comparisons by ranking methods. Biometrics 1947, 3: 119–122. 10.2307/3001946

    Article  Google Scholar 

Download references

Acknowledgements

The work was partially supported by a NIH R01 grant (no. 1R01GM093123) to JC. We thank Angela Zhang for English editing.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jianlin Cheng.

Additional information

Authors' contributions

JC and XD designed the algorithm. XD implemented the algorithm and carried out the experiments. XD and JC analyzed the data. XD and JC wrote the manuscript. XD and JC approved it.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Deng, X., Cheng, J. MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts. BMC Bioinformatics 12, 472 (2011). https://doi.org/10.1186/1471-2105-12-472

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-12-472

Keywords