Volume 11 Supplement 2
Linear predictive coding representation of correlated mutation for protein sequence alignment
© Kim and Jeong; licensee BioMed Central Ltd. 2010
Published: 16 April 2010
Although conservation and correlated mutation (CM) both carry important information, reflecting different kinds of context in a multiple sequence alignment, most alignment methods use sequence profiles that represent only conservation. There is as yet no general way to represent correlated mutation and incorporate it into sequence alignment.
We develop a novel method, CM profile, that represents correlated mutation as a spectral feature derived by linear predictive coding, so that the correlated mutations among different positions are represented by a fixed number of values. We combine CM profile with the conventional sequence profile to improve alignment quality.
For distantly related protein pairs, using CM profile improves profile-profile alignment with or without predicted secondary structure. In particular, at the superfamily level, combining CM profile with sequence profile improves profile-profile alignment by 9.5%, whereas predicted secondary structure improves it by 6.0%. Using both together improves profile-profile alignment by 13.9%. We also illustrate the effectiveness of CM profile by showing that the resulting alignments preserve shared coevolution and contacts.
In this work, we introduce a novel method, CM profile, which represents correlated mutation information in a form that is directly comparable between proteins, and apply it to the protein sequence alignment problem. When combined with the conventional sequence profile, CM profile improves alignment quality significantly more than predicted secondary structure information does, which should be beneficial for target-template alignment in protein structure prediction. Because of its generality, CM profile can be used for other bioinformatics applications in the same way as a sequence profile.
Currently, the comparison of multiple sequence alignments (MSAs) is based on aligning sequence profiles that represent conservation at specific positions. However, profile-profile alignment becomes unreliable as the sequence identity of the seed sequences decreases. Even though using predicted secondary structure as additional information slightly improves profile-profile alignment, the result is still unsatisfactory, since most protein secondary structure prediction methods are themselves based on sequence profiles. In this situation, correlated mutation information, which originates from the coevolution of two or more residue positions, would be informative.
Constructing high-quality alignments is important in comparative modeling, in which target-template alignment is a crucial step together with template selection, but sequence alignments based solely on statistical amino acid matches become undependable at low sequence identity. In particular, below 20% sequence identity, a range referred to as the midnight zone, using sequence alignment without structural evidence can be problematic. In practice, many proteins with similar structures have low sequence identity, and in CASP7 about half of the targets have single best templates with <20% sequence identity. This means that the reliability of alignment in the midnight zone is a bottleneck for protein structure prediction, and its improvement is therefore strongly desirable.
Correlated mutation is estimated in various ways. The McBASC algorithm calculates the correlation of amino acid substitutions at individual positions. The SCA algorithm and its variants [5–7] measure the relative amino acid frequencies observed after perturbing the MSA. Mutual information [8, 9] is also used for estimating correlated mutation, and it has recently been found that normalizing mutual information improves the determination of coevolving residues [10, 11]. In spite of these efforts, the application of correlated mutation is restricted mainly to inter-residue contact prediction [12, 13] and functional site prediction. Moreover, it has not been utilized for sequence alignment, the most basic procedure in sequence analysis, and there is no universal method for comparing the correlated mutation patterns of different proteins.
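As a rough illustration of the mutual-information family of estimators mentioned above, the following sketch computes the plain (unnormalized) mutual information between two alignment columns. The function name `column_mutual_information` and the toy columns are illustrative assumptions and do not reproduce any of the cited implementations; gaps, if present, would simply be treated as one more symbol.

```python
import math
from collections import Counter


def column_mutual_information(col_i, col_j):
    """Plain mutual information (in bits) between two MSA columns.

    col_i and col_j are equal-length strings holding the residue that each
    sequence of the alignment has at positions i and j.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    count_i = Counter(col_i)                 # marginal counts at position i
    count_j = Counter(col_j)                 # marginal counts at position j
    count_ij = Counter(zip(col_i, col_j))    # joint counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy columns (hypothetical data): the first pair co-varies perfectly,
# the second pair is statistically independent.
coupled = column_mutual_information("AAVV", "LLII")
independent = column_mutual_information("AAVV", "LILI")
```

In this toy case the perfectly co-varying pair carries one bit of mutual information and the independent pair carries none, which is the kind of signal that the normalized variants [10, 11] then correct for phylogeny and entropy.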
In this article, we introduce a novel method, CM profile, which represents correlated mutation by means of a signal processing technique called linear predictive coding (LPC), and apply it to the protein sequence alignment problem. The results show that employing correlated mutation improves alignment quality consistently at different SCOP levels and sequence identities. An analysis of a few examples shows that the use of CM profile makes alignments preserve correlated mutation, and that residues with common contacts are aligned with high scores.
We prepare protein pairs that are non-redundant and distantly related to each other. The data are derived from SCOP version 1.69 with <35% sequence identity, downloaded from the Astral compendium. The 4253 domains whose MSA is composed of fewer than 100 sequences are omitted, because correlated mutation analysis using an MSA with a small number of sequences can be unreliable and noisy, and 2501 domains remain. To make pairs of distantly related homologs, we select superfamilies with at least 10 domains and pair the domains with each other within each superfamily. The selected set comprises 1105 domains from 50 folds, 60 superfamilies, and 341 families. For parameter selection we use 388 pairs formed from 200 randomly chosen domains, and for testing we use 9118 pairs formed from the remaining 905 domains.
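The selection and pairing steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_domain_pairs`, the tuple layout of the input, and the assumption that the superfamily size threshold is applied after the MSA-size filter are all hypothetical, and the toy call lowers the superfamily threshold to 3 so that the example stays small.

```python
from collections import defaultdict
from itertools import combinations


def build_domain_pairs(domains, min_msa_size=100, min_superfamily_size=10):
    """Pair domains within each sufficiently large superfamily.

    domains: iterable of (domain_id, superfamily_id, n_msa_sequences).
    Domains whose MSA has fewer than min_msa_size sequences are dropped
    first (assumed order of operations), then all domain pairs are formed
    inside every superfamily that still has enough members.
    """
    by_superfamily = defaultdict(list)
    for domain_id, superfamily_id, n_msa_sequences in domains:
        if n_msa_sequences >= min_msa_size:
            by_superfamily[superfamily_id].append(domain_id)

    pairs = []
    for superfamily_id in sorted(by_superfamily):
        members = sorted(by_superfamily[superfamily_id])
        if len(members) >= min_superfamily_size:
            pairs.extend(combinations(members, 2))
    return pairs


# Toy input (hypothetical identifiers and MSA sizes).
toy_domains = [
    ("d1", "sf1", 150), ("d2", "sf1", 120), ("d3", "sf1", 300),
    ("d4", "sf1", 40),   # dropped: MSA too small
    ("d5", "sf2", 500), ("d6", "sf2", 200),  # superfamily too small
]
toy_pairs = build_domain_pairs(toy_domains, min_superfamily_size=3)
```

The split into parameter-selection and test pairs described above would then be made at the level of domains, so that no domain appears in both sets.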
The frequency matrices and the position-specific score matrices (PSSMs) representing sequence profiles are automatically generated by running PSI-BLAST version 2.2.19 against the NCBI nr database with the options "-j 3 -e 0.001 -h 0.001". The MSAs used for constructing CM profiles are also generated by running PSI-BLAST with the same options, and are then thinned by removing the sequences that cover less than 50% of the seed sequence and clustering the remaining sequences at 65% sequence identity.
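The thinning step can be sketched as below. The details are assumptions made for illustration: the MSA is taken to be seed-anchored (one column per seed residue, "-" for gaps), identity is measured over columns where both sequences have a residue, and clustering is done greedily by keeping a sequence only if it is less than 65% identical to every representative kept so far. The helper names `coverage`, `identity`, and `thin_msa` are hypothetical.

```python
def coverage(aligned_seq):
    """Fraction of seed columns in which the aligned sequence has a residue."""
    return sum(ch != "-" for ch in aligned_seq) / len(aligned_seq)


def identity(seq_a, seq_b):
    """Fraction of identical residues over columns where both are non-gap."""
    shared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def thin_msa(rows, min_coverage=0.5, max_identity=0.65):
    """Filter by coverage, then greedily cluster at max_identity.

    rows[0] is the seed sequence and is always kept; every other row is
    kept only if it covers enough of the seed and is not too similar to
    any representative that has already been kept.
    """
    kept = [rows[0]]
    for row in rows[1:]:
        if coverage(row) < min_coverage:
            continue
        if all(identity(row, rep) < max_identity for rep in kept):
            kept.append(row)
    return kept


# Toy seed-anchored alignment (hypothetical sequences).
toy_msa = [
    "ACDEFGHIKL",  # seed
    "ACDEFGHIKV",  # 90% identical to the seed: merged into its cluster
    "AC--------",  # covers only 20% of the seed: removed
    "MCDQFGWIRL",  # 60% identical to the seed: kept
]
thinned = thin_msa(toy_msa)
```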
Representation of correlated mutation
Since the dimension of a correlated mutation vector varies with sequence length, the correlated mutation vectors of distinct sequences are not directly comparable. Therefore, we extract spectral features, known as LPC cepstral coefficients, to represent the correlated mutation vector. LPC cepstral coefficients have been used for comparing DNA and protein sequences.
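In LPC analysis, the signal s(n) is approximated by a linear combination of its p preceding values, and the coefficients a_k of this predictor are chosen to minimize the squared prediction error. The result gives a set of p linear equations

$$\sum_{k=1}^{p} a_k \, r(i-k) = r(i), \qquad i = 1, \ldots, p,$$

where r(i), known as the autocorrelation function of s(n), is defined as

$$r(i) = \sum_{n} s(n)\, s(n+i)$$

and symmetric, i.e. r(−k) = r(k). The linear equations can be expressed in matrix form as

$$\begin{pmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{pmatrix}.$$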
Since the p × p matrix of autocorrelation values is a symmetric Toeplitz matrix, in which all the elements along each diagonal are equal, the linear equations can be solved recursively and very efficiently by the Levinson-Durbin algorithm, without relatively expensive computation such as matrix inversion. Once the linear equations are solved, a more advanced spectral feature, the LPC cepstral coefficients, can be derived from the LPC coefficients by the following recursion.

$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k \, a_{m-k}, \qquad m \ge 1,$$

where a_k denotes the k-th LPC coefficient, with a_k = 0 for k > p.
By using the LPC analysis process described above, we transform a correlated mutation vector into a CM profile consisting of the LPC cepstral coefficients. Since the cepstral coefficients decay with increasing order, we use only the first L coefficients, excluding c_0. Additionally, we normalize the CM profiles of a protein to zero mean and unit variance, so that coefficients of different orders are weighted equally. We consequently obtain an L-dimensional CM profile, c(a_i, b), that represents the correlated mutations between amino acid a at position i and amino acid b at other positions. In other words, all the correlated mutations between position i and the other positions are represented as 400 × L coefficients, one L-dimensional profile for each of the 20 × 20 amino acid pairs (a, b).
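The transformation from a correlated mutation vector to cepstral coefficients can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses the textbook autocorrelation method, in which the predictor coefficients a_1..a_p solve sum_k a_k r(|i − k|) = r(i), and the textbook cepstral recursion c_m = a_m + sum_{k<m} (k/m) c_k a_{m−k}. The function names, the toy input vector, and the choice to standardize each coefficient order across the profiles of a protein are all hypothetical.

```python
def autocorrelation(signal, max_lag):
    """r(k) = sum_n s(n) * s(n + k) for k = 0..max_lag."""
    n = len(signal)
    return [sum(signal[t] * signal[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz system for [a_1, ..., a_order] recursively."""
    a = [0.0] * (order + 1)          # a[0] is unused padding
    error = r[0]                     # prediction error of the order-0 model
    for i in range(1, order + 1):
        if error <= 0.0:             # zero-energy or perfectly predicted signal
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


def lpc_cepstrum(a, n_coeffs):
    """Cepstral coefficients c_1..c_n_coeffs from LPC coefficients a_1..a_p."""
    p = len(a)
    c = [0.0] * (n_coeffs + 1)       # c[0] (gain term) is not computed
    for m in range(1, n_coeffs + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]


def cm_profile(cm_vector, order, n_coeffs):
    """Fixed-length spectral representation of a variable-length vector."""
    r = autocorrelation(cm_vector, order)
    return lpc_cepstrum(levinson_durbin(r, order), n_coeffs)


def standardize_columns(profiles):
    """Scale each coefficient order (column) to zero mean, unit variance."""
    n = len(profiles)
    out = [row[:] for row in profiles]
    for col in range(len(profiles[0])):
        values = [row[col] for row in profiles]
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        for row in out:
            row[col] = (row[col] - mean) / std if std > 0 else 0.0
    return out


# Toy correlated mutation vector (hypothetical values), order p = 4, L = 6.
toy_vector = [0.1, 0.7, 0.2, 0.0, 0.9, 0.3, 0.4, 0.8, 0.05, 0.6]
profile = cm_profile(toy_vector, order=4, n_coeffs=6)
```

Whatever the length of the input vector, the output has exactly `n_coeffs` values, which is the property that makes profiles from proteins of different lengths comparable.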
To compare sequences, we define the alignment score between position i of one protein and position j of another protein as follows,

$$S(i, j) = w_{mut}\, S_{mut}(i, j) + w_{cor}\, S_{cor}(i, j) + w_{sec}\, S_{sec}(i, j),$$

where w_mut, w_cor, and w_sec denote the weights, and S_mut(i, j), S_cor(i, j), and S_sec(i, j) denote the similarity scores of sequence profiles, CM profiles, and secondary structure predictions, respectively, between the positions i and j. S_mut(i, j) is the sequence profile score defined as
where q(a_i), q(a_j), t(a_i), and t(a_j) are the frequencies and the PSSM scores of amino acid a at positions i and j, respectively. S_cor(i, j) is the CM profile score defined as
where d(c(a_i, b), c(a_j, b)) is the Euclidean distance between the CM profile c(a_i, b) at position i and the CM profile c(a_j, b) at position j, d_0 is the threshold, and α is the scaling factor. S_cor(i, j) gives a positive score when the distance between the CM profiles is less than d_0, and a negative score when the distance is greater than d_0. S_sec(i, j) is the secondary structure prediction score, which is 1 if the predicted secondary structures at positions i and j are identical and 0 otherwise. We use PSIPRED to predict secondary structures. Based on the score matrix consisting of S(i, j) for all i and j, we perform the Needleman-Wunsch algorithm with affine gap costs and a baseline to find the optimal alignment.
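The scoring and alignment steps can be sketched as follows. Several parts are assumptions made only for illustration: the linear form `alpha * (d0 - distance)` is one simple function with the sign behaviour described above and is not claimed to be the authors' formula, the baseline is modelled as a constant subtracted from every position score, and the gap model charges `gap_open` for the first gapped residue and `gap_extend` for each further one. All function names are hypothetical.

```python
import math


def profile_distance(profile_i, profile_j):
    """Euclidean distance between two flattened CM profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(profile_i, profile_j)))


def cm_score(distance, d0, alpha):
    """ASSUMED form: positive below the threshold d0, negative above it."""
    return alpha * (d0 - distance)


def combined_score(s_mut, s_cor, s_sec, w_mut, w_cor, w_sec, baseline=0.0):
    """Weighted sum of the three similarity terms minus an assumed baseline."""
    return w_mut * s_mut + w_cor * s_cor + w_sec * s_sec - baseline


def global_affine_score(score, gap_open, gap_extend):
    """Optimal global alignment score with affine gaps (three-state DP).

    score[i][j] is the position score S(i, j). M ends in a match, X ends
    in a gap in the second sequence, Y ends in a gap in the first one.
    """
    n, m = len(score), len(score[0])
    neg = float("-inf")
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score[i - 1][j - 1] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Toy position-score matrix (hypothetical): 2 positions against 3 positions.
toy_scores = [[2.0, -1.0, -1.0],
              [-1.0, -1.0, 2.0]]
toy_best = global_affine_score(toy_scores, gap_open=3.0, gap_extend=1.0)
```

In the toy matrix the best path matches the first positions, opens one gap, and matches the last positions, so the optimal score is 2 − 3 + 2 = 1. A traceback over the same three matrices would recover the alignment itself.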
Assessment and parameter selection
We assess alignment quality by measuring the average MaxSub score of models derived from the sequence alignments. Each model is generated by directly copying the coordinates of the C-alpha atoms according to the sequence alignment, and the MaxSub score of the model is computed with default options. The MaxSub score identifies the largest subset of C-alpha atoms of a model that superimpose well over the experimental structure, and provides a single normalized score in the range of 0 to 1, where 0 indicates a completely wrong model and 1 indicates a perfect model. The parameters of each method are selected by simulated annealing (SA), using the average MaxSub score of the training set as the objective function.
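The parameter search can be sketched with a generic simulated annealing loop. Everything specific here is an assumption for illustration: the Gaussian proposal, the geometric cooling schedule, the step count, and the toy objective, which merely stands in for "average MaxSub score of the training set as a function of the parameters" (computing that score would require aligning every training pair and scoring the resulting models).

```python
import math
import random


def simulated_annealing(objective, start, step=0.1, t_start=1.0, t_end=1e-3,
                        n_steps=2000, seed=0):
    """Maximize objective(params) by simulated annealing.

    Returns the best parameter vector visited and its objective value.
    """
    rng = random.Random(seed)
    current = list(start)
    current_val = objective(current)
    best, best_val = list(current), current_val
    for k in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (k / (n_steps - 1))
        candidate = [x + rng.gauss(0.0, step) for x in current]
        candidate_val = objective(candidate)
        delta = candidate_val - current_val
        # Always accept improvements; accept worse moves with Boltzmann odds.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_val = candidate, candidate_val
            if current_val > best_val:
                best, best_val = list(current), current_val
    return best, best_val


def toy_objective(params):
    """Stand-in for the training-set average MaxSub (hypothetical surface)."""
    w_cor, w_sec = params
    return -((w_cor - 0.6) ** 2) - ((w_sec - 0.3) ** 2)


best_params, best_value = simulated_annealing(toy_objective, [0.0, 0.0])
```

Because the best visited point is tracked separately from the current point, the returned value can never be worse than the starting parameters, even though the walk itself accepts downhill moves at high temperature.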
Selected parameters for the different combinations of scoring terms.
Average MaxSub scores of test set by different methods
This result has important implications in several respects. It is well known that profile-profile alignment is improved most by using secondary structure prediction [1, 23], and numerous state-of-the-art methods hence incorporate secondary structure prediction in their alignment schemes [21, 24–26]. In our results as well, profile-profile alignment is consistently improved by adding secondary structure prediction. However, it is improved more by adding correlated mutation (CMPA_PPA), and the best performance is achieved by combining correlated mutation and secondary structure prediction together (CMPA_PPA_SS). Taking this into account, state-of-the-art methods could be improved significantly by incorporating correlated mutation information.
Another aspect concerns the reliability of alignment in the midnight zone, where an alignment becomes less reliable. Since using correlated mutation is more advantageous for protein pairs with low sequence identity, and CM profile is easily combined with conventional methods, the coverage of current alignment methods in the midnight zone can be increased by using CM profile. The most important implication concerns template-based protein structure prediction. From the two aspects described above, it follows that using correlated mutation markedly improves current alignment methods for sequences with less than 20% sequence identity. This is beneficial to target-template alignment because most promising templates sharing the same structure have relatively low sequence identity. In practice, according to the recent analysis of the template-based modeling targets of CASP7, almost half of the targets, specifically 50 of 108, have best templates with similar structure but sequence identity below 20%, and a virtual predictor based on the best templates outperforms all other groups by far. CM profile should therefore yield more reliable target-template alignments and subsequently provide better models for difficult target-template pairs, thereby increasing the confidence of template-based structure prediction.
To assess the performance of CM profile for domains with fewer than 100 MSA sequences, we build two additional test sets from the omitted domains. The first test set is built from the domains with 50-99 MSA sequences and consists of 527 domains from 30 folds, 31 superfamilies, and 142 families, yielding 5586 pairs. The second test set is built from the domains with 1-49 MSA sequences and consists of 752 domains from 31 folds, 37 superfamilies, and 225 families, yielding 9676 pairs.
Average MaxSub scores of test set with 50-99 MSA sequences by different methods
Average MaxSub scores of test set with 1-49 MSA sequences by different methods
The reason for the effectiveness of CM profile is related to the correlation between coevolution and contact. It has been shown that the residues important for protein function are not only conserved but also coevolve with other interrelated residues [12, 27]. This fact has been exploited to infer structural factors such as inter-residue contacts and to evaluate the correctness of de novo models. Recently, it has also been shown that key residues can be identified by analyzing the residue-residue coevolution network. With respect to alignment, contact-mutation matrices derived from structural information have been used for improving alignment quality. CM profile utilizes correlated mutation information much more globally, covering the correlated mutations of all possible residue pairs. This optimizes the alignment to match multiple contacting residue pairs, whereas the previous studies consider at most two residue pairs, thereby improving alignment quality noticeably.
Protein pairs with the MaxSub scores of various methods. Columns: Protein 1 (SCOP classification); Protein 2 (SCOP classification).
Owing to its generality, CM profile can be exploited for various bioinformatics applications, particularly with machine learning approaches. Our representation is position-specific and consists of a fixed number of values, which allows CM profile to be manipulated in the same way as a sequence profile. Thus, CM profile can be easily adopted into current methodologies without serious modification to complement them. Moreover, it should be noted that our CM profile is not optimally generated, because the sequence profiles are automatically generated by PSI-BLAST. The present method could be improved significantly as the accuracy of correlated mutation estimation is increased through various corrections and noise reductions [10, 11].
We develop a novel method that represents correlated mutation as spectral features derived from LPC analysis, and we apply it to the sequence alignment of distantly related proteins. When combined with the conventional sequence profile, CM profile improves alignment quality significantly more than predicted secondary structure information does. In particular, a dramatic improvement is observed in the midnight zone, which should be beneficial for target-template alignment in protein structure prediction. Finally, because the methodology developed in this work can be generalized to many areas of bioinformatics, we expect CM profile to be applicable to other bioinformatics applications equally well.
This work was supported by the Korea Institute of Science and Technology Information Supercomputing Center.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 2, 2010: Third International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S2.
- Elofsson A: A study on protein sequence alignment quality. Proteins 2002, 46(3):330–339. doi:10.1002/prot.10043
- Yang AS, Honig B: An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J Mol Biol 2000, 301(3):679–689. doi:10.1006/jmbi.2000.3974
- Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T: Assessment of CASP7 predictions for template-based modeling targets. Proteins 2007, 69(Suppl 8):38–56. doi:10.1002/prot.21753
- Olmea O, Rost B, Valencia A: Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 1999, 293(5):1221–1239. doi:10.1006/jmbi.1999.3208
- Lockless SW, Ranganathan R: Evolutionarily conserved pathways of energetic connectivity in protein families. Science 1999, 286(5438):295–299. doi:10.1126/science.286.5438.295
- Süel GM, Lockless SW, Wall MA, Ranganathan R: Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol 2003, 10(1):59–69. doi:10.1038/nsb881
- Dekker JP, Fodor A, Aldrich RW, Yellen G: A perturbation-based method for calculating explicit likelihood of evolutionary co-variance in multiple sequence alignments. Bioinformatics 2004, 20(10):1565–1572. doi:10.1093/bioinformatics/bth128
- Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 2000, 17(1):164–178.
- Tillier ER, Lui TW: Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 2003, 19(6):750–755. doi:10.1093/bioinformatics/btg072
- Buslje CM, Santos J, Delfino JM, Nielsen M: Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics 2009, 25(9):1125–1131. doi:10.1093/bioinformatics/btp135
- Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 2008, 24(3):333–340. doi:10.1093/bioinformatics/btm604
- Göbel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins 1994, 18(4):309–317. doi:10.1002/prot.340180402
- Shackelford G, Karplus K: Contact prediction using mutual information and neural nets. Proteins 2007, 69(Suppl 8):159–164. doi:10.1002/prot.21791
- Rabiner LR, Juang BH: Fundamentals of speech recognition. Englewood Cliffs, N.J.: PTR Prentice Hall; 1993.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32(Database issue):D189–192. doi:10.1093/nar/gkh034
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. doi:10.1093/nar/25.17.3389
- Pham T: Spectral distortion measures for biological sequence comparisons and database searching. Pattern Recognition 2007, 40(2):516–529. doi:10.1016/j.patcog.2006.02.026
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292(2):195–202. doi:10.1006/jmbi.1999.3091
- Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 2000, 16(9):776–785. doi:10.1093/bioinformatics/16.9.776
- Ohlson T, Aggarwal V, Elofsson A, MacCallum RM: Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps. BMC Bioinformatics 2006, 7:357. doi:10.1186/1471-2105-7-357
- Cozzetto D, Tramontano A: Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 2005, 58(1):151–157. doi:10.1002/prot.20284
- Qi Y, Sadreyev RI, Wang Y, Kim BH, Grishin NV: A comprehensive system for evaluation of remote sequence similarity detection. BMC Bioinformatics 2007, 8:314. doi:10.1186/1471-2105-8-314
- Söding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21(7):951–960. doi:10.1093/bioinformatics/bti125
- Lee M, Jeong C, Kim D: Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinformatics 2007, 8:471. doi:10.1186/1471-2105-8-471
- Wu S, Zhang Y: MUSTER: Improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins 2008, 72(2):547–556. doi:10.1002/prot.21945
- Neher E: How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci USA 1994, 91(1):98–102. doi:10.1073/pnas.91.1.98
- Bartlett GJ, Taylor WR: Using scores derived from statistical coupling analysis to distinguish correct and incorrect folds in de-novo protein structure prediction. Proteins 2008, 71(2):950–959. doi:10.1002/prot.21779
- Lee BC, Park K, Kim D: Analysis of the residue-residue coevolution network and the functionally important residues in proteins. Proteins 2008, 72(3):863–872. doi:10.1002/prot.21972
- Kleinjung J, Romein J, Lin K, Heringa J: Contact-based sequence alignment. Nucleic Acids Res 2004, 32(8):2464–2473. doi:10.1093/nar/gkh566
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.