Enhancing HMM-based protein profile-profile alignment with structural features and evolutionary coupling information
© Deng and Cheng; licensee BioMed Central Ltd. 2014
Received: 7 January 2014
Accepted: 17 July 2014
Published: 25 July 2014
Protein sequence profile-profile alignment is an important approach to recognizing remote homologs and generating accurate pairwise alignments. It plays an important role in protein sequence database search, protein structure prediction, protein function prediction, and phylogenetic analysis.
In this work, we integrate predicted solvent accessibility, torsion angles and evolutionary residue coupling information with the pairwise Hidden Markov Model (HMM) based profile alignment method to improve profile-profile alignments. The evaluation results demonstrate that adding predicted relative solvent accessibility and torsion angle information improves the accuracy of profile-profile alignments. The evolutionary residue coupling information is helpful in some cases, but its contribution to the improvement is not consistent.
Incorporating new structural information, such as predicted solvent accessibility and torsion angles, into profile-profile alignment is a useful way to improve pairwise profile-profile alignment methods.
Pairwise protein sequence alignment methods are essential tools for many important bioinformatics tasks, such as sequence database search, homology recognition, protein structure prediction and protein function prediction [1–5]. Following the development of global and local methods for aligning two single sequences [6–8], profile-sequence and profile-profile alignment methods such as PSI-BLAST, SAM, HMMer, HHsearch and HHsuite [4–6], which enrich two single sequences with their homologous sequences, have substantially improved both the sensitivity of recognizing remote homologs and the accuracy of aligning two protein sequences.
Due to their relatively high sensitivity in recognizing remote protein homologs, profile-profile alignment methods have become the default structural template identification method for many template-based protein structure modeling methods and servers [11–14]. For instance, HHsearch, one of the top profile-profile alignment tools based on comparing the profile hidden Markov models (HMMs) of two proteins, was used by almost all the template-based protein structure prediction methods tested during the last two Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments [15, 16]. The open source package HHsuite contains both the latest implementation of HHsearch, which supports a full HMM-HMM alignment-based search on an HMM profile database, and a very fast search tool, HHblits, which reduces the number of unnecessary full pairwise HMM alignments in order to drastically improve search speed. Moreover, the maximum accuracy (MAC) alignment algorithm is applied in HHsuite, but not in HHsearch. In this work, we aim to introduce new sources of information to improve profile-profile alignments with respect to both the original HHsearch package and the open source HHsuite package.
In order to align the structurally equivalent residues of a target protein and a template protein more accurately, secondary structure information was incorporated into profile-profile sequence alignment methods, yielding better sensitivity and accuracy [4, 17]. To find new sources of information that further improve the sensitivity and accuracy of pairwise profile-profile alignment, we examine the effectiveness of incorporating features that have not been used in profile-profile alignment before, including protein solvent accessibility, torsion angles, and evolutionary residue coupling information [18, 19].
Specifically, we add scoring terms for solvent accessibility, torsion angles, and evolutionary residue coupling information to the scoring function of HHsuite in order to enhance the alignment process. According to our evaluation, adding solvent accessibility and torsion angles improves the alignment accuracy, but incorporating the evolutionary residue coupling information is useful only in some cases.
Adding solvent accessibilities and torsion angles into the Viterbi alignment
where k denotes the index of the columns at which query HMM q is aligned to template HMM t, i(k) and j(k) are the respective columns in q and t, and P_tr is the product of all transition probabilities for the path through q and t. The latest version of HHsuite already includes secondary structure information in the calculation of the score. In this work, we further augment the score by adding terms that account for solvent accessibility and torsion angles.
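The definitions above can be illustrated with a short sketch. It assumes the score is the sum of per-column log-odds scores over the aligned column pairs (i(k), j(k)) plus the logarithm of P_tr, and it takes each column score to be the log of the sum over residues a of q(a) t(a) / f(a), a form commonly used for HMM-HMM comparison. The function names, the natural logarithm and the background frequency vector f are assumptions made for illustration and are not taken from HHsuite.

```python
import math

def column_score(q_col, t_col, background):
    """Log-odds score of two profile columns: log sum_a q(a) * t(a) / f(a).

    q_col, t_col and background are equal-length sequences of probabilities
    (hypothetical inputs for this sketch).
    """
    return math.log(sum(q * t / f for q, t, f in zip(q_col, t_col, background)))

def path_score(query, template, path, log_p_tr, background):
    """Score of a fixed alignment path.

    path is a list of aligned column pairs (i(k), j(k)); log_p_tr is the log of
    the product of all transition probabilities along the path.
    """
    total = sum(column_score(query[i], template[j], background) for i, j in path)
    return total + log_p_tr
```

With a two-letter toy alphabet, a uniform background and two identical, fully conserved columns, the column score is log 2, so a one-column path with log P_tr = 0 scores log 2.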
S_IM(i, j) and S_GD(i, j) are calculated similarly to S_MI(i, j) and S_DG(i, j).
The difference between Equation (3) above and the default one in HHsuite is that two new terms (S_sa and S_tors) were added to utilize the solvent accessibility and torsion angle information. In Equation (3), S_ss(q_i, t_j) is the secondary structure score between column i in the query HMM (q_i) and column j in the template HMM (t_j), which is the same as the one originally used in HHsuite. S_sa(q_i, t_j) is the solvent accessibility score between q_i and t_j, and S_tors(q_i, t_j) is the torsion angle score between q_i and t_j; these are the new terms introduced in this work. w_ss, w_sa, and w_tors are the weights for the secondary structure score, solvent accessibility score and torsion angle score, respectively. S_shift is the score offset for match-match states. The three weights w_ss, w_sa and w_tors and the shift score S_shift are set to 0.11, 0.72, 0.4 and −0.03 by default, and can be adjusted by users. q_{i−1}(M, M) is the transition probability from state M at column i−1 to the next state M in the query HMM, and t_{j−1}(M, M) is the transition probability from state M at column j−1 to the next state M in the template HMM.
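As a rough illustration of how the weighted terms might be combined for one pair of match columns, the sketch below adds the weighted secondary structure, solvent accessibility and torsion angle scores and the shift to an amino acid column score. The additive combination, the class and function names, and the omission of the transition and recursion terms are assumptions of this sketch; only the default weight values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoreWeights:
    """Default values quoted in the text; users may adjust them."""
    w_ss: float = 0.11    # secondary structure weight
    w_sa: float = 0.72    # solvent accessibility weight
    w_tors: float = 0.4   # torsion angle weight
    s_shift: float = -0.03  # score offset for match-match states

def match_column_score(s_aa, s_ss, s_sa, s_tors, weights=ScoreWeights()):
    """Combine the per-column score terms for a match-match state.

    s_aa is the amino acid profile column score; the other inputs are the
    structural feature scores for the same column pair. Transition
    probabilities and the dynamic programming recursion are not modelled here.
    """
    return (s_aa
            + weights.w_ss * s_ss
            + weights.w_sa * s_sa
            + weights.w_tors * s_tors
            + weights.s_shift)
```

Setting a weight to zero in `ScoreWeights` switches the corresponding feature off, which is a convenient way to compare the contribution of each term.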
The score is calculated by the Kronecker delta function δ(a, b), which equals 1 if a = b and 0 otherwise.
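A minimal sketch of a Kronecker delta score follows. It assumes the function is applied to discrete per-residue labels predicted for the query and template columns; the label alphabet used in the example ("e" for exposed, "b" for buried) is hypothetical and chosen only for illustration.

```python
def kronecker_delta(a, b):
    """Return 1 if the two labels are identical and 0 otherwise."""
    return 1 if a == b else 0

def per_column_agreement(query_labels, template_labels):
    """Delta score for each pair of already-aligned columns (equal lengths)."""
    if len(query_labels) != len(template_labels):
        raise ValueError("label sequences must have the same length")
    return [kronecker_delta(a, b) for a, b in zip(query_labels, template_labels)]
```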
Realign the profiles by maximum accuracy alignment combining solvent accessibility and torsion angles
It has been shown that the maximum accuracy (MAC) algorithm generally creates more accurate alignments than the Viterbi algorithm, while the latter generates better alignment scores, e-values and probabilities [5, 24]. Consequently, the Viterbi algorithm is applied to compute e-values and scores, and the MAC algorithm is chosen to generate the final pairwise HMM-HMM alignment in HMMsato by default.
where p_min controls the alignment mode (0: global alignment mode, 1: local alignment mode). F_IM(i, j) and F_GD(i, j) are calculated similarly to F_MI(i, j) and F_DG(i, j). The solvent accessibility score S_sa(q_i, t_j) and torsion angle score S_tors(q_i, t_j) are calculated as in the Viterbi alignment.
B_IM(i, j) and B_GD(i, j) are calculated similarly to B_MI(i, j) and B_DG(i, j).
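Once forward and backward matrices are available, a maximum accuracy alignment is usually obtained from the posterior probability that each pair of columns is aligned. The sketch below is a generic version of that final step, not the HHsuite or HMMsato implementation: it assumes a precomputed posterior matrix, free gaps, and a hypothetical threshold parameter that a posterior must exceed for a pair to be worth aligning.

```python
def mac_alignment(posterior, threshold=0.35):
    """Return aligned pairs (i, j) maximising the sum of (posterior - threshold).

    posterior[i][j] is the probability that query column i is aligned to
    template column j (assumed to be given). The threshold value is
    illustrative only.
    """
    n = len(posterior)
    m = len(posterior[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + posterior[i - 1][j - 1] - threshold, "match"),
                (score[i - 1][j], "skip_query"),
                (score[i][j - 1], "skip_template"),
            ]
            # On ties, the first candidate in the list is kept.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner using the stored moves.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_query":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Because a pair is only added when its posterior exceeds the threshold, raising the threshold yields shorter, more conservative alignments, and lowering it yields longer ones.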
Trace back maximum accuracy alignments with the evolutionary residue coupling information
N is 21, standing for the 20 amino acids plus the gap. The joint probability of two residues X_i and X_j, F_ij(X_i, X_j), and the probability of residue X_i, F_i(X_i), are calculated in the same way as in . However, EC_ij is calculated as the mutual information (MI) instead of the direct information (DI) based on the global probability model, in order to achieve higher time efficiency. A higher EC value corresponds to a stronger correlation between two columns in the given profile.
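The mutual information between two alignment columns can be sketched as follows. This example estimates F_i and F_ij by plain counting over the sequences of a multiple sequence alignment; any sequence reweighting or pseudocounts used in the original calculation are not reproduced, and the function name and the natural logarithm are assumptions of the sketch.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information between two alignment columns.

    col_i and col_j hold one symbol per sequence (20 amino acids or the gap
    character), in the same sequence order. Frequencies are raw counts divided
    by the number of sequences.
    """
    if len(col_i) != len(col_j) or len(col_i) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi
```

Two columns that vary in perfect lockstep between two equally frequent residue types give MI = log 2, whereas two columns that vary independently give MI = 0.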
Results and discussion
Evaluation data set and metric
We evaluated HMMsato along with HHsearch and HHsuite on the alignments between 106 targets (queries) of the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP9) [15, 16] and their homologous template proteins (templates) released at the CASP9 web site. The alignment data set has 2,621 pairs of query and template proteins. The 1,483 pairs associated with 60 CASP9 targets were used as the optimization data set to optimize the parameters of HMMsato, and the 1,138 pairs associated with the remaining 46 CASP9 targets were used to test the methods. The reference (presumably true) pairwise alignment of each query-template protein pair was generated by using TM-align to align the tertiary (3D) structures of the two proteins. The alignments generated by HMMsato and the other tools were evaluated by three metrics: the sum-of-pairs (SP) score, the true column (TC) score, and the quality of the tertiary structural models of the query proteins built from the alignments. The SP and TC scores are the two standard metrics for evaluating sequence alignment quality. The quality of the tertiary structural models indirectly assesses the quality of the sequence alignments according to their effectiveness in guiding the construction of protein structural models.
The SP score is the number of correctly aligned pairs of residues in the predicted alignment divided by the total number of aligned pairs of residues in the core blocks of the true alignment (i.e., the sequence alignment regions precisely determined by structural alignment of structurally equivalent residues in the structures of two proteins). The TC score is the number of correctly aligned columns in the predicted alignment divided by the total number of aligned columns in the core blocks of the true alignment. The 3D model of a query protein was produced by MODELLER based on both the pairwise alignment generated by an alignment method and the known structure of the template protein in the alignment. We used TM-score to align a 3D model of a query protein against its true structure and to generate TM-scores and GDT-TS scores for the model. These scores measure the quality of the alignment used to generate the model, assuming that better alignments lead to better 3D models with higher TM-scores and GDT-TS scores. Both the TM-score and the GDT-TS score are in the range [0, 1].
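For a pairwise alignment, the SP score can be computed by comparing the sets of aligned residue pairs. The sketch below assumes alignments are given as two gapped strings of equal length and that the reference pairs passed in have already been restricted to the core blocks; the function names and the gap character are illustrative choices.

```python
def aligned_pairs(query_row, template_row, gap="-"):
    """List the (query index, template index) pairs aligned in a pairwise alignment."""
    if len(query_row) != len(template_row):
        raise ValueError("alignment rows must have the same length")
    pairs = []
    i = j = 0
    for q, t in zip(query_row, template_row):
        if q != gap and t != gap:
            pairs.append((i, j))
        if q != gap:
            i += 1
        if t != gap:
            j += 1
    return pairs

def sp_score(predicted_pairs, reference_pairs):
    """Fraction of reference residue pairs that the predicted alignment reproduces."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference alignment has no aligned pairs")
    return len(reference & set(predicted_pairs)) / len(reference)
```

For example, a predicted alignment that recovers three of four reference pairs has an SP score of 0.75.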
Optimization of weights for the solvent accessibility, torsion angles and evolutionary coupling information
We estimated the weights of the solvent accessibility, torsion angles and evolutionary residue coupling information on the training alignments step by step. First, we found the best weight value (w_sa = 0.72) for solvent accessibility. Then, we identified the best weight value (w_tors = 0.4) for torsion angles while keeping the weight for solvent accessibility fixed. Finally, we found the best parameter value (w_ec = 0.1) for the evolutionary residue coupling information while keeping w_sa and w_tors at their optimum values. HHsearch and HHsuite were both evaluated with and without secondary structure information. The default parameter values were used with HHsearch and HHsuite.
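The step-by-step procedure amounts to a one-parameter-at-a-time grid search. The sketch below shows the idea under stated assumptions: `evaluate` is a hypothetical callback that would run the aligner on the training pairs with the given weights and return a quality score such as the mean SP score, and the candidate grids are placeholders rather than the values actually scanned.

```python
def stepwise_weight_search(evaluate, grids, order, initial):
    """Optimise one weight at a time, keeping earlier choices fixed.

    evaluate: function mapping a dict of weights to a score (higher is better).
    grids:    dict mapping each weight name to its candidate values.
    order:    sequence of weight names, in the order they are optimised.
    initial:  dict of starting values for all weights.
    """
    weights = dict(initial)
    for name in order:
        best_value, best_score = weights[name], float("-inf")
        for value in grids[name]:
            trial = dict(weights, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        weights[name] = best_value  # fix this weight before tuning the next
    return weights
```

A call such as `stepwise_weight_search(evaluate, grids, ["w_sa", "w_tors", "w_ec"], initial)` mirrors the order described above. The search is cheap but greedy, so it may miss combinations of weights that are only optimal jointly.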
Comparison of HMMsato, HHSearch, and HHsuite on the test data set
The mean SP and TC scores of the pairwise alignments generated by HHsearch1.2, HHsuite and HMMsato on the CASP9 test data set consisting of 1,138 pairs of proteins
Columns: Method; Mean SP score; Mean TC score
Rows: HHsearch (without secondary structure information); HHsearch (with secondary structure information); HHsuite (without secondary structure information); HHsuite (with secondary structure information)
The average TM-scores and GDT-TS scores of the 3D models generated from the 1,127 pairwise test alignments produced by HHsearch1.2, HHsuite and HMMsato
Columns: Method; Average GDT-TS score
Rows: HHsearch (without secondary structure information); HHsearch (with secondary structure information); HHsuite (without secondary structure information); HHsuite (with secondary structure information)
The statistical significance (p-values) of SP and TC score differences between HMMsato and the other two tools on the test data set
Columns: Comparison; p-value of SP scores; p-value of TC scores
HMMsato -- HHsearch (without secondary structure information); 1.078 × 10^−6; 3.414 × 10^−7
HMMsato -- HHsearch (with secondary structure information)
HMMsato -- HHsuite (without secondary structure information); 1.724 × 10^−8; 1.515 × 10^−9
HMMsato -- HHsuite (with secondary structure information)
Impact of solvent accessibility, torsion angles and evolutionary coupling information on the alignment accuracy
The SP scores and TC scores with different values of w_sa using HMMsato on the training data
The SP scores and TC scores with different values of w_tors using HMMsato
The effect of evolutionary residue coupling information on alignment accuracy
We studied the effect of the evolutionary residue coupling information on alignment accuracy in a similar way. HMMsato worked best when w_ec was 0.1. However, the evolutionary coupling information did not improve the overall alignment accuracy on the training data set, probably because in many cases there were not enough diverse sequences for the evolutionary coupling calculation to gain sufficient discriminative power. Specifically, the alignment quality increased in 57 alignments, stayed the same in 1,363 alignments, and decreased in 61 alignments. Similarly, on the test data set, the alignment quality increased in 59 alignments, stayed the same in 1,024 alignments, and decreased in 55 alignments. Overall, the evolutionary coupling information contributed to the improvement of alignment accuracy in some cases, but its effect was rather inconsistent.
Comparison of HMMsato and HHSearch with secondary structure information on the test data set
We designed a method to incorporate relative solvent accessibility, torsion angles and evolutionary residue coupling information into HMM-based pairwise profile-profile protein alignments. Our experiments on the large CASP9 alignment data set showed that utilizing solvent accessibility and torsion angles improved the accuracy of HMM-based pairwise profile-profile alignments. However, the effect of the evolutionary residue coupling information on alignments was less consistent in our current experimental setting, even though it may still be a valuable source of information to explore in the future. In particular, we will use the latest method (i.e., direct information) of calculating evolutionary coupling information to guide the profile alignment process. Furthermore, we will carry out a more extensive search for optimal weights for solvent accessibility, torsion angle, secondary structure, and evolutionary coupling information to improve alignment accuracy.
The work was partially supported by a NIH R01 grant (R01GM093123) to JC.
- Kinch LN, Wrabl JO, Krishna S, Majumdar I, Sadreyev RI, Qi Y, Pei J, Cheng H, Grishin NV: CASP5 assessment of fold recognition target predictions. Proteins: Structure, Function, and Bioinformatics. 2003, 53 (S6): 395-409. 10.1002/prot.10557.
- Bork P, Koonin EV: Predicting functions from protein sequences—where are the bottlenecks?. Nat Genet. 1998, 18 (4): 313-318. 10.1038/ng0498-313.
- Henn-Sax M, Höcker B, Wilmanns M, Sterner R: Divergent evolution of (βα)8-barrel enzymes. Biol Chem. 2001, 382 (9): 1315-1320.
- Söding J: Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005, 21 (7): 951-960. 10.1093/bioinformatics/bti125.
- Remmert M, Biegert A, Hauser A, Söding J: HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011, 9: 173-175. 10.1038/nmeth.1818.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Mott R: Smith–Waterman algorithm. eLS. 2005, http://onlinelibrary.wiley.com/doi/10.1038/npg.els.0005263/abstract
- Holmes I, Durbin R: Dynamic programming alignment accuracy. J Comput Biol. 1998, 5 (3): 493-504. 10.1089/cmb.1998.5.493.
- Hughey R, Karplus K, Krogh A: SAM: Sequence alignment and modeling software system. Technical Report UCSC-CRL-99-11. 2003, Santa Cruz, CA 95604: Baskin Center for Computer Engineering and Science, University of California
- Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39 (suppl 2): W29-W37.
- Ginalski K, Pas J, Wyrwicz LS, Von Grotthuss M, Bujnicki JM, Rychlewski L: ORFeus: detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res. 2003, 31 (13): 3804-3807. 10.1093/nar/gkg504.
- Tang CL, Xie L, Koh IYY, Posy S, Alexov E, Honig B: On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J Mol Biol. 2003, 334 (5): 1043-1062. 10.1016/j.jmb.2003.10.025.
- Tomii K, Akiyama Y: FORTE: a profile–profile comparison tool for protein fold recognition. Bioinformatics. 2004, 20 (4): 594-595. 10.1093/bioinformatics/btg474.
- Söding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 2005, 33 (suppl 2): W244-W248.
- Kryshtafovych A, Fidelis K, Moult J: CASP9 results compared to those of previous CASP experiments. Proteins: Structure, Function, and Bioinformatics. 2011, 79 (S10): 196-207. 10.1002/prot.23182.
- Kryshtafovych A, Fidelis K, Moult J: CASP10 results compared to those of previous CASP experiments. Proteins: Structure, Function, and Bioinformatics. 2013, 82 (S2): 164-174.
- Hildebrand A, Remmert M, Biegert A, Söding J: Fast and accurate automatic structure prediction with HHpred. Proteins: Structure, Function, and Bioinformatics. 2009, 77 (S9): 128-132. 10.1002/prot.22499.
- Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C: Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011, 6 (12): e28766. 10.1371/journal.pone.0028766.
- Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS: Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing. Cell. 2012, 149 (7): 1607-1621. 10.1016/j.cell.2012.04.012.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
- Cheng J, Li J, Wang Z, Eickholt J, Deng X: The MULTICOM toolbox for protein structure prediction. BMC Bioinformatics. 2012, 13 (1): 65. 10.1186/1471-2105-13-65.
- Faraggi E, Yang Y, Zhang S, Zhou Y: Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction. Structure. 2009, 17 (11): 1515-1527. 10.1016/j.str.2009.09.006.
- Zhang W, Liu S, Zhou Y: SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model. PLoS One. 2008, 3 (6): e2325. 10.1371/journal.pone.0002325.
- Biegert A, Söding J: De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics. 2008, 24 (6): 807-814. 10.1093/bioinformatics/btn039.
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005, 33 (7): 2302-2309. 10.1093/nar/gki524.
- Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Structure, Function, and Bioinformatics. 2005, 61 (1): 127-136. 10.1002/prot.20527.
- Deng X, Cheng J: MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts. BMC Bioinformatics. 2011, 12: 472. 10.1186/1471-2105-12-472.
- Eswar N, Webb B, Marti-Renom MA, Madhusudhan M, Eramian D, Shen M-y, Pieper U, Sali A: Comparative Protein Structure Modeling Using Modeller. Curr Protoc Bioinformatics. 2006, 15 (5.6): 5.6.1-5.6.30.
- Xu J, Zhang Y: How significant is a protein structure similarity with TM-score = 0.5?. Bioinformatics. 2010, 26 (7): 889-895. 10.1093/bioinformatics/btq066.
- Zemla A, Venclovas Č, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions. Proteins: Structure, Function, and Bioinformatics. 1999, 37 (S3): 22-29. 10.1002/(SICI)1097-0134(1999)37:3+<22::AID-PROT5>3.0.CO;2-W.
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics. 2004, 57 (4): 702-710. 10.1002/prot.20264.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.