Prediction of disulfide connectivity in proteins with machine-learning methods and correlated mutations
© Savojardo et al.; licensee BioMed Central Ltd. 2013
Published: 14 January 2013
Recently, information derived from correlated mutations in proteins has regained relevance for predicting protein contacts. This is due to new forms of mutual information analysis that have proven more suitable for highlighting direct coupling between pairs of residues in protein structures, and to the large number of protein chains currently available for statistical validation. It was previously discussed that disulfide bond topology in proteins is also constrained by correlated mutations.
In this paper we exploit information derived from a corrected mutual information analysis and from the inverse of the covariance matrix to address the problem of predicting the topology of disulfide bonds in Eukaryotes. Recently, we have shown that Support Vector Regression (SVR) can improve the prediction of disulfide connectivity patterns. Here we show that the inclusion of correlated mutation information increases the SVR performance by 5 percentage points (from 54% to 59%). When this approach is used in combination with DisLocate, a method previously developed by us that scores at the state of the art in predicting both location and topology of disulfide bonds in Eukaryotes, the per-protein accuracy is 38%, 2 percentage points higher than that previously obtained.
In this paper we show that the inclusion of information derived from correlated mutations can improve the performance of the state of the art methods for predicting disulfide connectivity patterns in Eukaryotic proteins. Our analysis also provides support to the notion that improving methods to extract evolutionary information from multiple sequence alignments greatly contributes to the scoring performance of predictors suited to detect relevant features from protein chains.
Disulfide bonds are covalent cross-links between cysteine side chains that play very important roles in the native structures of globular proteins. Folding, stability, and ultimately function of secreted proteins in cells are influenced by the formation of disulfide bonds between cysteine residues. Predicting the topology and the location of disulfide bridges in a protein from its sequence therefore plays a relevant role in protein structural and functional annotation. Several computational methods are presently available for computing cysteine properties in a protein sequence, and they can be grouped into: i) methods that predict the disulfide bonding state [2–4]; ii) methods that predict the topological connectivity patterns by assuming that the cysteine bonding state is known [5–8]; iii) methods that compute both i) and ii) [9–12]. Recently we developed DisLocate, a two-stage method for disulfide bond prediction in Eukaryotes comprising two integrated modules. The first, based on Conditional Random Fields (CRFs), predicts the cysteine bonding state; the second, based on Support Vector Regression (SVR), predicts the topology of the disulfide bridges. DisLocate improved over previous methods by introducing for the first time information on protein subcellular localization in the prediction of the disulfide bonding state.
Here we address the problem of improving the second step of the prediction, namely the prediction of the disulfide connectivity pattern, by exploiting the role of correlated mutations. Correlated mutation analysis aims at elucidating relations between pairs of residues in the protein structure that may influence its folding. Routinely, this is done by identifying the co-evolution of different positions in a multiple sequence alignment. The notion of correlated mutation captures the idea that an unfavourable residue mutation in a structural contact can be compensated by the simultaneous change of the direct partner in such a way that the original interaction is preserved (compensatory mutation). It has been recently observed that with sufficient and correct information about protein residue-residue contacts it is possible to predict some protein structures from the residue chain [13–16].
Correlated mutation analysis was also introduced in the context of disulfide bond connectivity prediction. Simple correlation patterns of cysteines appearing and disappearing in concert in multiple structural alignments were used to predict the topology of disulfide bonds in proteins.
In the present paper we propose the use of information derived from correlated mutations to improve the prediction of disulfide connectivity over a set of 1797 protein chains (PDBCYS). We evaluate two different approaches to computing correlated mutations: corrected mutual information (MIp) and sparse inverse covariance estimation (iCOV). MIp is a corrected version of mutual information specifically designed to remove the background noise due to both phylogenetic and entropic biases. The latter approach (iCOV), which is based on sparse inverse covariance estimation, was recently introduced for the problem of predicting contact maps. Here we combine the information derived with both methods with features that were previously found relevant for predicting disulfide connectivity and that are implemented in our DisLocate. In order to highlight the effect of correlated mutations, we benchmark the newly developed predictors on the same dataset (PDBCYS) previously adopted to evaluate DisLocate. Our results show that correlated mutation analysis adds to the previously introduced features and improves the prediction scores. This indicates that correlated mutations are a significant piece of information also when computing the connectivity pattern of disulfide bridges in protein structures.
Mutual information among cysteines
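The mutual information between positions i and j of a multiple sequence alignment is computed as

$$ MI(i,j) = \sum_{a,b} f_{i,j}(a,b) \, \log \frac{f_{i,j}(a,b)}{f_i(a)\, f_j(b)} $$

where $f_i(a)$ and $f_j(b)$ are the relative frequencies of amino acid types a and b at positions i and j, respectively, and $f_{i,j}(a,b)$ is the relative frequency of the amino acid pair ab at positions ij.

The corrected mutual information (MIp) is then

$$ MIp(i,j) = MI(i,j) - \frac{\overline{MI}(i)\, \overline{MI}(j)}{\overline{MI}} $$

where $\overline{MI}(i)$ is the average mutual information between position i and all other positions (analogously for position j) and $\overline{MI}$ is the average mutual information of all positions.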
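The two measures described above (MI from relative frequencies, then MIp by subtracting the product of the per-position averages divided by the overall average) can be illustrated with a small sketch. It is not the authors' implementation: the function names `mutual_information` and `mip_matrix` are hypothetical, the natural logarithm is assumed, and gaps are treated like any other symbol.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """MI between two alignment columns given as equal-length strings."""
    n = len(col_i)
    f_i = Counter(col_i)               # single-column counts at position i
    f_j = Counter(col_j)               # single-column counts at position j
    f_ij = Counter(zip(col_i, col_j))  # pair counts at positions (i, j)
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


def mip_matrix(alignment):
    """Return {(i, j): MIp} for all column pairs i < j.

    alignment: list of aligned sequences of equal length (at least 2 columns).
    """
    m = len(alignment[0])
    cols = ["".join(seq[k] for seq in alignment) for k in range(m)]
    mi = {(i, j): mutual_information(cols[i], cols[j])
          for i, j in combinations(range(m), 2)}
    # Average MI between each position and all other positions.
    col_mean = [
        sum(mi[(min(i, k), max(i, k))] for k in range(m) if k != i) / (m - 1)
        for i in range(m)
    ]
    overall = sum(mi.values()) / len(mi)  # average MI over all pairs
    if overall == 0.0:
        return {pair: 0.0 for pair in mi}
    return {(i, j): value - col_mean[i] * col_mean[j] / overall
            for (i, j), value in mi.items()}


# Toy alignment: columns 0 and 2 co-vary, column 1 is fully conserved.
toy = ["CAC", "CAC", "SAT", "SAT"]
scores = mip_matrix(toy)
```

In this toy alignment the conserved column carries no mutual information, so the only pair with a positive corrected score is (0, 2).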
Sparse inverse covariance estimation
Recent works have pointed out that co-evolutionary information can be improved by using the inverse of the covariance matrix [19, 20]. In particular, using the information stored in the inverse of the covariance matrix, the performance of contact prediction improves significantly with respect to simple MI or MIp by reducing the so-called indirect coupling effect, i.e. the statistical dependency observed in multiple sequence alignments for residues that are structurally distant [19–21]. One of the proposed approaches (followed here) is based on sparse inverse covariance estimation and is called PSICOV.
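The covariance between amino acid type a at position i and amino acid type b at position j is computed from the alignment as

$$ S_{ij}^{ab} = f_{i,j}(a,b) - f_i(a)\, f_j(b) $$

where $f_i(a)$, $f_j(b)$ and $f_{i,j}(a,b)$ are defined as in the previous section and S is a 21m by 21m covariance matrix for an alignment with m columns (here we also include the gap as a 21st symbol).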
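The raw contact score $S^{c}_{ij}$ between positions i and j is finally corrected with the same average-product term used for MIp:

$$ iCOV(i,j) = S^{c}_{ij} - \frac{\bar{S}^{c}_{i}\, \bar{S}^{c}_{j}}{\bar{S}^{c}} $$

where $\bar{S}^{c}_{i}$ is the mean contact score between position i and all other positions (analogously for position j) and $\bar{S}^{c}$ is the overall mean contact score. We refer to this bonding score as iCOV in the rest of the paper.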
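As a rough illustration of the two ends of this procedure, the sketch below builds the 21m by 21m covariance matrix from relative frequencies and applies the mean-based correction to a matrix of raw contact scores. The sparse inversion itself (performed with glasso in the paper) is deliberately left out, and the names `covariance_matrix` and `apc_correct` are hypothetical. Symbols outside the 21-letter alphabet are simply ignored, which is an assumption of this sketch.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap as 21st symbol


def covariance_matrix(alignment):
    """Return S with S[21*i + a][21*j + b] = f_ij(a, b) - f_i(a) * f_j(b)."""
    n = len(alignment)
    m = len(alignment[0])
    q = len(ALPHABET)
    single = [Counter(seq[i] for seq in alignment) for i in range(m)]
    S = [[0.0] * (q * m) for _ in range(q * m)]
    for i in range(m):
        for j in range(m):
            pair = Counter((seq[i], seq[j]) for seq in alignment)
            for x, a in enumerate(ALPHABET):
                for y, b in enumerate(ALPHABET):
                    f_ab = pair[(a, b)] / n
                    S[q * i + x][q * j + y] = (
                        f_ab - (single[i][a] / n) * (single[j][b] / n)
                    )
    return S


def apc_correct(score):
    """Subtract (row mean * column mean / overall mean) from each raw score.

    score: symmetric m x m list of lists of raw contact scores; the diagonal
    is ignored and set to 0.0 in the output.
    """
    m = len(score)
    row_mean = [
        sum(score[i][k] for k in range(m) if k != i) / (m - 1) for i in range(m)
    ]
    overall = sum(row_mean) / m  # mean over all off-diagonal entries
    if overall == 0.0:
        return [[0.0] * m for _ in range(m)]
    return [
        [score[i][j] - row_mean[i] * row_mean[j] / overall if i != j else 0.0
         for j in range(m)]
        for i in range(m)
    ]


S = covariance_matrix(["CA", "CA", "SG", "SG"])
corrected = apc_correct([[0, 4, 2], [4, 0, 2], [2, 2, 0]])
```

In a full pipeline, the raw scores passed to `apc_correct` would be derived from the estimated sparse inverse of `S`; how the per-position scores are aggregated from that inverse is not shown here.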
Predicting disulfide connectivity patterns
Once the cysteine bonding state is assigned, we predict the connectivity pattern of the proteins that contain at least one pair of cysteines in the bonding state by applying a Support Vector Regression approach. The SVR prediction for each possible pair of cysteines is used as an edge weight, and the Edmonds-Gabow algorithm is adopted to predict the most probable disulfide pattern. In order to evaluate the SVR, we use the same 20-fold cross-validation procedure described before, considering only proteins with at least two disulfide bridges. SVRs were trained using an input encoding based on global and local information. The global information (which does not depend on each particular cysteine pair) is defined by the Normalized Protein Length (one real value), the Protein Molecular Weight (one real value) and the protein amino acid composition (20 real values). The local pairwise encoding (which depends on each particular cysteine pair) consists of the following descriptors:
- two PSSM-based windows centered on the cysteines forming the pair. We used a window of length 13, the best performing among the different-sized windows we tested. With this choice, we ended up with a vector of 13 * 20 * 2 = 520 components;
- the Relative Order of the Cysteines. This feature is encoded with 2 real values that represent the normalized relative order of a cysteine pair. Given a protein with n cysteines (C1,C2,...,Cn), the corresponding normalized ordered list of cysteines is given by (1/n, 2/n,...,n/n). For each pair of cysteines, the corresponding values are then taken from the list (e.g. the pair (C1,C4) is encoded as (1/n,4/n));
- the Cysteine Separation Distance. This feature is encoded with 1 real value that represents the log-cysteine sequence separation computed as SEP(Ci,Cj) = log (|j - i|), where i and j are the sequence positions of cysteines Ci and Cj, respectively;
- the correlated mutation information, based on MIp and/or iCOV.
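The two simplest pairwise descriptors in the list above, the relative order and the log sequence separation, can be sketched as follows. The function name `pair_features` is hypothetical and the natural logarithm is an assumption, since the base is not stated; the PSSM windows and the correlated mutation scores would be appended to the same vector but are not computed here.

```python
import math


def pair_features(cys_positions):
    """Encode every cysteine pair of one protein.

    cys_positions: sorted sequence positions of the cysteines (C1..Cn).
    Returns {(a, b): (a/n, b/n, log|pos_b - pos_a|)} with 1-based ranks a < b.
    """
    n = len(cys_positions)
    features = {}
    for a in range(n):
        for b in range(a + 1, n):
            order = ((a + 1) / n, (b + 1) / n)  # normalized relative order
            sep = math.log(abs(cys_positions[b] - cys_positions[a]))
            features[(a + 1, b + 1)] = order + (sep,)
    return features


# Four cysteines at (hypothetical) sequence positions 5, 12, 30 and 44.
feats = pair_features([5, 12, 30, 44])
```

For the pair (C1, C4) of this four-cysteine example the relative order is (1/4, 4/4) and the separation is log(39).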
In this study we used the PDBCYS dataset introduced before. From the PDB (release May 2010) we extracted 1797 Eukaryotic protein structures with resolution <2.5 Å, at least two cysteine residues and global pairwise sequence similarity <25%. PDBCYS includes 7619 free and 3194 bonded cysteines. Since PDBCYS contains pairs of proteins with detectable local sequence similarity, we clustered all the chains using a local sequence similarity score. First, we ran a BLAST sequence search using all the proteins of the set versus themselves. Then, for each pair of proteins we selected the higher of the two directional sequence identities (p1 vs p2 or p2 vs p1) reported in the BLAST output. We subsequently treated the proteins as nodes of a graph and assigned an edge between two nodes only where the local sequence identity between the corresponding protein sequences was >25%. Next, we computed the connected components of the graph and treated each group of nodes as a protein cluster. Finally, the clusters were grouped into 20 disjoint sets, which we used to evaluate our method and to compare its performance with previous approaches by adopting a 20-fold cross-validation procedure.
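The clustering step described above (edges for local identity above 25%, then connected components) can be sketched with a small union-find. The input format, the function names and the greedy size-balancing used to fill the folds are all assumptions of this sketch; the text does not state how clusters were distributed over the 20 sets.

```python
def cluster_by_identity(proteins, identities, threshold=25.0):
    """Group proteins into connected components of the identity graph.

    identities: {(p1, p2): percent identity}, one entry per BLAST direction.
    An edge is added when either direction exceeds the threshold, which is
    equivalent to thresholding the higher of the two values.
    """
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (p1, p2), identity in identities.items():
        if identity > threshold:
            parent[find(p1)] = find(p2)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def assign_folds(clusters, n_folds=20):
    """Greedy assumption: put each cluster, largest first, in the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


proteins = ["a", "b", "c", "d"]
identities = {("a", "b"): 40.0, ("b", "a"): 20.0,
              ("b", "c"): 30.0, ("c", "d"): 10.0}
clusters = cluster_by_identity(proteins, identities)
folds = assign_folds(clusters, n_folds=2)
```

Keeping whole clusters inside a single fold is the point of the procedure: no test protein shares detectable local similarity with a training protein.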
In the following, $N_c$ is the number of correctly predicted bonds, $N_p$ is the total number of predicted bonds, $N_b$ is the total number of observed bonds, $N_{patt}$ is the number of correctly predicted disulfide connectivity patterns and $N$ is the total number of chains.
To score the disulfide connectivity prediction we computed the following indices:
the precision $P_b$:
$$ P_b = \frac{N_c}{N_p} \qquad (7) $$
the recall $R_b$:
$$ R_b = \frac{N_c}{N_b} \qquad (8) $$
the $Q_p$:
$$ Q_p = \frac{N_{patt}}{N} \qquad (9) $$
For the sake of readability, in the Tables we report the indices as percentages (i.e. the obtained values are multiplied by 100).
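The three indices can be computed directly from the counts defined above. The following is a minimal sketch, assuming each chain's bonds are given as a set of (i, j) tuples with i < j; the function name `score_predictions` is hypothetical.

```python
def score_predictions(predicted, observed):
    """Return Pb, Rb and Qp as percentages.

    predicted, observed: lists with one set of bonds per chain, each bond
    being a tuple (i, j) of cysteine indices with i < j.
    """
    n_c = n_p = n_b = n_patt = 0
    for pred, obs in zip(predicted, observed):
        n_c += len(pred & obs)   # correctly predicted bonds
        n_p += len(pred)         # predicted bonds
        n_b += len(obs)          # observed bonds
        n_patt += pred == obs    # whole pattern correct for this chain
    n_chains = len(observed)
    return {
        "Pb": 100.0 * n_c / n_p if n_p else 0.0,
        "Rb": 100.0 * n_c / n_b if n_b else 0.0,
        "Qp": 100.0 * n_patt / n_chains if n_chains else 0.0,
    }


observed = [{(1, 2), (3, 4)}, {(1, 4), (2, 3)}]
predicted = [{(1, 2), (3, 4)}, {(1, 4)}]
result = score_predictions(predicted, observed)
```

In this example all three predicted bonds are correct (Pb = 100), three of the four observed bonds are recovered (Rb = 75), and one of the two patterns is fully correct (Qp = 50).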
All multiple sequence alignments used to compute both the MIp and the iCOV features were generated by running 3 iterations of the jackhmmer program, which is part of the HMMER 3.0 package (http://hmmer.org), against the UNIREF90 sequence database. The inverse covariance estimation was performed by means of the glasso R package available at the CRAN archive (http://cran.rproject.org/web/packages/glasso/index.html), the same used in PSICOV. All the estimations were performed using the exact algorithm of the glasso code (see the glasso package documentation for details). The glasso algorithm depends on a parameter ρ that controls the sparsity of the reconstructed inverse covariance matrix. This parameter also affects the run time of the algorithm: the smaller ρ, the longer the time required. Below we report the results obtained with ρ set to 1e-8, which was chosen as a trade-off between computational time and method performance (computed on the validation sets). MIp values were computed as described previously. For the SVR implementation we used the libsvm package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with an RBF kernel.
Results and discussion
Prediction of disulfide connectivity with known bonding state
In order to evaluate the effect of correlated mutations on the task of predicting the topology of disulfide bonds, we first assume that the disulfide bonded cysteines are known. We evaluated the performance of the methods on subsets of proteins with different numbers of disulfide bonds (from 2 to 5). The reported accuracy was obtained using the same 20-fold cross-validation procedure previously described.
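When the bonded cysteines are known, choosing a connectivity pattern amounts to finding the perfect matching with the highest total pair score. The sketch below uses plain exhaustive enumeration rather than the Edmonds-Gabow algorithm adopted in the paper; this substitution is only reasonable because proteins with 2 to 5 bonds admit just 3, 15, 105 and 945 possible patterns. The function names and the score values are hypothetical.

```python
def perfect_matchings(items):
    """Yield every way of pairing up an even-length list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_pattern(cysteines, score):
    """Return the pairing with the highest summed pair score.

    cysteines: sorted list of bonded cysteine identifiers.
    score: {(a, b): predicted pair score} for every a < b.
    """
    return max(
        perfect_matchings(list(cysteines)),
        key=lambda pattern: sum(score[pair] for pair in pattern),
    )


# Hypothetical pair scores for a protein with two disulfide bonds.
score = {(1, 2): 0.9, (3, 4): 0.2, (1, 3): 0.6,
         (2, 4): 0.7, (1, 4): 0.1, (2, 3): 0.3}
pattern = best_pattern([1, 2, 3, 4], score)
pattern_counts = [
    sum(1 for _ in perfect_matchings(list(range(2 * b)))) for b in (2, 3, 4, 5)
]
```

The example is chosen so that the single best pair (1, 2) is not part of the best pattern: pairing 1-3 and 2-4 has the highest total, which is why a matching over the whole graph is used instead of greedily picking the top-scoring pair.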
[Table: Performance on disulfide connectivity prediction obtained with correlated mutation measures. Table note: Pb = Rb.]
[Table: Performance on disulfide connectivity prediction obtained with different SVR-based methods. Table note: Pb = Rb.]
Prediction of disulfide connectivity with predicted bonding state
Prediction without a prior knowledge of the cysteine bonding state
Prediction performance as a function of the quality of the multiple sequence alignments
MIp and iCOV are computed over multiple sequence alignments. We therefore evaluate how the number and the type of sequences included in the alignment (used to compute the correlation among cysteine residues) affect the final result.
The prediction of protein structures from their sequences is still an open problem in Structural Bioinformatics, especially considering that the gap between the number of putative protein sequences and the number of known 3D structures is increasing exponentially. The bonding state of cysteines plays a relevant role in stabilizing the tertiary folds of proteins, in defining protein functions and in triggering functionally relevant conformational changes. The knowledge of disulfide bonds is very important for predicting protein structure in ab initio and comparative modelling, since it poses constraints on the possible chain conformations [27, 28]. In this paper we introduce a new method to predict disulfide bonds starting from the protein sequence. We investigate the effect of the information derived from correlated mutations on the problem of predicting the topology of disulfide bonds in proteins. We show that correlated mutations, in the form of corrected mutual information (MIp) and inverse of the covariance matrix (iCOV), carry a significant quantity of information that was not completely exploited before for the task of disulfide bond prediction. We present a new method that, by including information derived from correlated mutations, improves on the performance of the state-of-the-art method DisLocate. Finally, we highlight that the optimal performance of the method is achieved when the multiple alignment from which the correlated mutation information is derived contains in the range of 10000 protein chains and the corresponding NEFF value of the alignment is greater than or equal to 10.
The publication costs of this article were funded by the RFO (Ricerca Fondamentale Orientata) 2011, University of Bologna, awarded to Rita Casadio.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1.
RC thanks the following grants: PRIN 2009 project 009WXT45Y (Italian Ministry for University and Research: MIUR), COST BMBS Action TD1101(European Union RTD Framework Programme), and PON project PON01_02249 (Italian Ministry for University and Research: MIUR). CS is a recipient of a Ph.D. fellowship from the Ministry of the Italian University and Research.
1. Inaba K: Structural basis of protein disulfide bond generation in the cell. Genes Cells. 2010, 15 (9): 935-43. 10.1111/j.1365-2443.2010.01434.x.
2. Martelli PL, Fariselli P, Malaguti L, Casadio R: Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 2002, 15 (12): 951-953. 10.1093/protein/15.12.951.
3. Mucchielli-Giorgi MH, Hazout S, Tufféry P: Predicting the disulfide bonding state of cysteines using protein descriptors. Proteins. 2002, 46 (3): 243-249. 10.1002/prot.10047.
4. Chen YC, Lin YS, Lin CJ, Hwang JK: Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins. 2004, 55 (4): 1036-1042. 10.1002/prot.20079.
5. Fariselli P, Casadio R: Prediction of disulfide connectivity in proteins. Bioinformatics. 2001, 17 (10): 957-964. 10.1093/bioinformatics/17.10.957.
6. Vullo A, Frasconi P: Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics. 2004, 20 (5): 653-659. 10.1093/bioinformatics/btg463.
7. Ferrè F, Clote P: Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics. 2005, 21 (10): 2336-2346. 10.1093/bioinformatics/bti328.
8. Song J, Yuan Z, Tan H, Huber T, Burrage K: Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure. Bioinformatics. 2007, 23 (23): 3147-3154. 10.1093/bioinformatics/btm505.
9. Cheng J, Saigo H, Baldi P: Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins. 2006, 62 (3): 617-629.
10. Taskar B, Chatalbashev V, Koller D, Guestrin C: Learning structured prediction models: a large margin approach. Proceedings of the 22nd International Conference on Machine Learning (ICML '05). 2005, New York: ACM, 896-903. 10.1145/1102351.1102464.
11. Vincent M, Passerini A, Labbé M, Frasconi P: A simplified approach to disulfide connectivity prediction from protein sequences. BMC Bioinformatics. 2008, 9: 20. 10.1186/1471-2105-9-20.
12. Savojardo C, Fariselli P, Martelli PL, Pierleoni A, Casadio R: Improving the prediction of disulfide bonds in Eukaryotes with machine learning methods and protein subcellular localization. Bioinformatics. 2011, 27 (16): 2224-2230. 10.1093/bioinformatics/btr387.
13. Gobel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins. 1994, 18: 309-317. 10.1002/prot.340180402.
14. Olmea O, Valencia A: Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des. 1997, 2: S25-S32.
15. Fariselli P, Olmea O, Valencia A, Casadio R: Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins. 2001, 45 (Suppl 5): 157-162.
16. Di Lena P, Fariselli P, Margara L, Vassura M, Casadio R: Is there an optimal substitution matrix for contact prediction with correlated mutations?. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8: 1017-1028.
17. Rubinstein R, Fiser A: Predicting disulfide bond connectivity in proteins by correlated mutations analysis. Bioinformatics. 2008, 24 (2): 498-504.
18. Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008, 24 (3): 333-340. 10.1093/bioinformatics/btm604.
19. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C: Protein 3D structure computed from evolutionary sequence variation. PLoS ONE. 2011, 6 (12): e28766. 10.1371/journal.pone.0028766.
20. Jones DT, Buchan DWA, Cozzetto D, Pontil M: PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2011, 28 (2): 184-190.
21. Burger L, van Nimwegen E: Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput Biol. 2010, 6 (1): e1000633. 10.1371/journal.pcbi.1000633.
22. Banerjee O, El Ghaoui L, d'Aspremont A: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research. 2008, 9: 485-516. [http://jmlr.csail.mit.edu/papers/v9/banerjee08a.html]
23. Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical Lasso. Biostatistics. 2008, 9: 432-441. 10.1093/biostatistics/kxm045.
24. Fariselli P, Savojardo C, Martelli PL, Casadio R: Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics Applications. Algorithms for Molecular Biology. 2009, 4: 13.
25. Casbon J, Saqi M: Analysis of superfamily specific profile-profile recognition accuracy. BMC Bioinformatics. 2004, 5: 200.
26. Wedemeyer WJ, Welker E, Narayan M, Scheraga HA: Disulfide bonds and protein folding. Biochemistry. 2000, 39 (15): 4207-4216. 10.1021/bi992922o.
27. Das R, Baker D: Macromolecular modeling with rosetta. Annu Rev Biochem. 2008, 77: 363-382. 10.1146/annurev.biochem.77.062906.171838.
28. Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234: 779-815. 10.1006/jmbi.1993.1626.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.