Prediction of disulfide connectivity in proteins with machine-learning methods and correlated mutations
 Castrense Savojardo^{1, 2},
 Piero Fariselli^{1, 2} (corresponding author),
 Pier Luigi Martelli^{2, 3} and
 Rita Casadio^{2, 3}
https://doi.org/10.1186/1471-2105-14-S1-S10
© Savojardo et al.; licensee BioMed Central Ltd. 2013
Published: 14 January 2013
Abstract
Background
Recently, information derived from correlated mutations in proteins has regained relevance for predicting protein contacts. This is due both to new forms of mutual information analysis, which have proven more suitable for highlighting direct coupling between pairs of residues in protein structures, and to the large number of protein chains currently available for statistical validation. It was previously discussed that disulfide bond topology in proteins is also constrained by correlated mutations.
Results
In this paper we exploit information derived from a corrected mutual information analysis and from the inverse of the covariance matrix to address the problem of predicting the topology of disulfide bonds in Eukaryotes. Recently, we have shown that Support Vector Regression (SVR) can improve the prediction of disulfide connectivity patterns. Here we show that the inclusion of correlated mutation information increases the SVR performance by 5 percentage points (from 54% to 59%). When this approach is used in combination with DisLocate, a method previously developed by us that scores at the state of the art in predicting both the location and the topology of disulfide bonds in Eukaryotes, the per-protein accuracy is 38%, 2 percentage points higher than that previously obtained.
Conclusions
In this paper we show that the inclusion of information derived from correlated mutations can improve the performance of state-of-the-art methods for predicting disulfide connectivity patterns in Eukaryotic proteins. Our analysis also supports the notion that improving methods to extract evolutionary information from multiple sequence alignments greatly contributes to the scoring performance of predictors designed to detect relevant features of protein chains.
Background
Disulfide bonds are covalent crosslinks between cysteine side chains that play very important roles in the native structures of globular proteins. Folding, stability, and ultimately function of secreted proteins in cells are influenced by the formation of disulfide bonds between cysteine residues [1]. Predicting the topology and the location of disulfide bridges in a protein from its sequence therefore plays a relevant role in protein structural and functional annotation. Several computational methods are presently available for computing cysteine properties in a protein sequence, and they can be grouped into: i) methods that predict the disulfide bonding state [2–4]; ii) methods that predict the topological connectivity patterns by assuming that the cysteine bonding state is known [5–8]; iii) methods that compute both i) and ii) [9–12]. Recently we developed DisLocate, a two-stage method for disulfide bond prediction in Eukaryotes comprising two integrated modules. The first, based on Conditional Random Fields (CRFs), predicts the cysteine bonding state; the second, based on Support Vector Regression (SVR), predicts the topology of the disulfide bridges [12]. DisLocate improved over previous methods by introducing for the first time information on protein subcellular localization in the prediction of the disulfide bonding state [12].
Here we address the problem of improving the second step of the prediction, namely the prediction of the disulfide connectivity pattern, by exploiting the role of correlated mutations. Correlated mutation analysis aims at elucidating relations between pairs of residues in the protein structure that may influence its folding. Routinely, this is done by identifying the coevolution of different positions in a multiple sequence alignment. The underlying notion is that an unfavourable residue mutation in a structural contact can be compensated by a simultaneous change of the direct partner, in such a way that the original interaction is preserved (compensatory mutation) [13]. It has recently been observed that, with sufficient and correct information about protein residue-residue contacts, it is possible to predict some protein structures from the residue chain [13–16].
Correlated mutation analysis was also introduced in the context of disulfide bond connectivity prediction. Simple correlation patterns of concerted appearing and disappearing cysteines in multiple structural alignments were used to predict the topology of disulfide bonds in proteins [17].
In the present paper we propose the use of information derived from correlated mutations to improve the prediction of disulfide connectivity over a set of 1797 protein chains (PDBCYS). We evaluate two different approaches to computing correlated mutations: corrected mutual information (MIp) and sparse inverse covariance estimation (iCOV). MIp is a corrected version of mutual information specifically designed to remove the background noise due to both phylogenetic and entropic biases [18]. The latter approach (iCOV), which is based on sparse inverse covariance estimation, was recently introduced for the problem of predicting contact maps [20]. Here we combine information derived with both methods for computing correlated mutations with features that were previously found relevant for predicting disulfide connectivity and that are implemented in DisLocate [12]. In order to highlight the effect of correlated mutations, we benchmark the newly developed predictors on the same dataset (PDBCYS) previously adopted to evaluate DisLocate [12]. Our results show that correlated mutation analysis adds to the previously introduced features and improves the prediction scores. This indicates that correlated mutations are a significant piece of information also when computing the connectivity pattern of disulfide bridges in protein structures.
Methods
Mutual information among cysteines
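The mutual information between positions i and j of a multiple sequence alignment is computed as:
$MI(i,j)=\sum_{a,b}{f}_{i,j}(a,b)\,\log \frac{{f}_{i,j}(a,b)}{{f}_{i}(a)\,{f}_{j}(b)}$ (1)
where f_{ i }(a) and f_{ j }(b) are the relative frequencies of amino acid types a and b at positions i and j, respectively, and f_{ i,j }(a,b) is the relative frequency of the amino acid pair ab at positions i and j.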
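The corrected mutual information (MIp) [18] is then obtained as:
$MIp(i,j)=MI(i,j)-\frac{\overline{MI}(i,\cdot)\,\overline{MI}(\cdot,j)}{\overline{MI}}$ (2)
where $\overline{MI}(i,\cdot)$ is the average mutual information between position i and all other positions (analogously $\overline{MI}(\cdot,j)$ for position j) and $\overline{MI}$ is the average mutual information over all pairs of positions.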
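The two quantities defined in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the toy alignment are hypothetical, the natural logarithm is assumed, gaps are treated as an ordinary symbol, and no sequence weighting is applied.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """MI between columns i and j of an alignment given as equal-length strings."""
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


def mi_and_mip(msa):
    """Return (MI, MIp) as dicts keyed by column pairs (i, j) with i < j."""
    m = len(msa[0])
    mi = {(i, j): mutual_information(msa, i, j)
          for i in range(m) for j in range(i + 1, m)}

    def get(i, j):
        return mi[(min(i, j), max(i, j))]

    # Average MI between each column and all other columns.
    col_mean = [sum(get(i, k) for k in range(m) if k != i) / (m - 1)
                for i in range(m)]
    # Average MI over all column pairs.
    overall = sum(mi.values()) / len(mi)
    mip = {}
    for (i, j), value in mi.items():
        correction = col_mean[i] * col_mean[j] / overall if overall > 0 else 0.0
        mip[(i, j)] = value - correction
    return mi, mip


# Toy alignment: columns 0 and 2 co-vary perfectly, column 1 is constant.
toy_msa = ["CAC", "CAC", "SAS", "SAS"]
mi, mip = mi_and_mip(toy_msa)
```

In this toy case the MI of the co-varying pair equals log 2, while every pair involving the constant column scores zero. In practice only the entries for pairs of cysteine columns would be used as features.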
Sparse inverse covariance estimation
Recent work has pointed out that coevolutionary information can be improved by using the inverse of the covariance matrix [19, 20]. In particular, using the information stored in the inverse of the covariance matrix significantly improves contact prediction with respect to simple MI or MIp, because it reduces the so-called indirect coupling effect, i.e. the statistical dependency observed in multiple sequence alignments for residues that are structurally distant [19–21]. One of the proposed approaches (followed here) is based on sparse inverse covariance estimation and is called PSICOV [20].
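The covariance between amino acid a at position i and amino acid b at position j is estimated from the alignment as:
$S_{ij}^{ab}={f}_{i,j}(a,b)-{f}_{i}(a)\,{f}_{j}(b)$ (3)
where f_{ i }(a), f_{ j }(b) and f_{ i,j }(a,b) are defined as in the previous section and S is a 21m by 21m covariance matrix, m being the number of alignment columns (here we also include the gap as a 21st symbol).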
The contact score C(i,j) derived from the estimated inverse covariance matrix is finally corrected as:
$iCOV(i,j)=C(i,j)-\frac{\overline{C}(i,\cdot)\,\overline{C}(\cdot,j)}{\overline{C}}$ (6)
where $\overline{C}(i,\cdot)$ is the mean contact score between position i and all other positions (analogously $\overline{C}(\cdot,j)$ for position j) and $\overline{C}$ is the overall mean contact score. We refer to this bonding score as iCOV in the rest of the paper.
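For reference, the graphical lasso [23] on which sparse inverse covariance estimation relies can be written in its standard form as the penalized likelihood problem below. The symbol $\Theta$ for the estimated inverse covariance matrix is our notation and is not taken from the text; S is the covariance matrix of this section and ρ is the sparsity parameter discussed under Technical details.

```latex
\hat{\Theta} \;=\; \arg\min_{\Theta \succ 0}\;
  \operatorname{tr}(S\,\Theta) \;-\; \log\det\Theta \;+\; \rho\,\lVert \Theta \rVert_{1},
\qquad
\lVert \Theta \rVert_{1} = \sum_{k,l} \lvert \Theta_{kl} \rvert
```

The larger ρ is, the more entries of $\hat{\Theta}$ are driven to zero; as ρ approaches zero the solution approaches the ordinary inverse of S, when that inverse exists.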
Predicting disulfide connectivity patterns
Once the cysteine bonding state is assigned, we predict the connectivity pattern of the proteins that contain at least one pair of cysteines in the bonding state by applying a Support Vector Regression approach [12]. The SVR prediction for each possible pair of cysteines is used as an edge weight, and the Edmonds-Gabow algorithm is adopted to find the most probable disulfide pattern [5]. In order to evaluate the SVR, we use the same 20-fold cross-validation procedure described before [12], considering only proteins with at least two disulfide bridges. SVRs were trained using an input encoding based on global and local information. The global information (which does not depend on the particular cysteine pair) is defined by the Normalized Protein Length (one real value), the Protein Molecular Weight (one real value) and the protein amino acid composition (20 real values). The local pairwise encoding (which depends on the particular cysteine pair) consists of the following descriptors:

two PSSM-based windows centered on the cysteines forming the pair. We used a window of length 13, the best performing among the different window sizes we tested. With this choice, we ended up with a vector of 13 * 20 * 2 = 520 components;

the Relative Order of the Cysteines. This feature is encoded with 2 real values that represent the normalized relative order of a cysteine pair. Given a protein with n cysteines (C1,C2,...,Cn), the corresponding normalized ordered list of cysteines is given by (1/n, 2/n,...,n/n). For each pair of cysteines, the corresponding values are then taken from the list (e.g. the pair (C1,C4) is encoded as (1/n,4/n));

the Cysteine Separation Distance. This feature is encoded with 1 real value that represents the logarithm of the cysteine sequence separation, computed as SEP(Ci,Cj) = log (j − i), where i and j are the sequence positions of cysteines Ci and Cj, respectively;

Correlated mutation information, based on MIp and/or iCOV.
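The pair-specific scalar descriptors listed above, and the resulting input size, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the natural logarithm is assumed for SEP, and one real value per correlated mutation measure (MIp and/or iCOV) is assumed.

```python
import math


def pair_features(cys_positions, a, b):
    """Relative-order and separation features for the cysteine pair (C_a, C_b).

    cys_positions: sorted sequence positions of the n cysteines.
    a, b: 1-based indices into that list, with a < b.
    """
    n = len(cys_positions)
    relative_order = (a / n, b / n)
    i, j = cys_positions[a - 1], cys_positions[b - 1]
    separation = math.log(j - i)  # SEP(Ci, Cj) = log(j - i)
    return relative_order, separation


def input_size(window=13, n_corr=2):
    """Length of the SVR input vector for one cysteine pair."""
    global_part = 1 + 1 + 20     # normalized length, molecular weight, composition
    pssm_part = window * 20 * 2  # two PSSM windows, one per cysteine
    order_part = 2               # relative order of the pair
    separation_part = 1          # log sequence separation
    return global_part + pssm_part + order_part + separation_part + n_corr


# Protein with four cysteines at (hypothetical) positions 5, 12, 30 and 48.
order, sep = pair_features([5, 12, 30, 48], 1, 4)
```

Under these assumptions the baseline SVR input has 545 components, and adding both MIp and iCOV brings it to 547.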
Dataset description
In this study we used the PDBCYS dataset introduced before [12]. From the PDB (release May 2010) we extracted 1797 Eukaryotic protein structures with resolution <2.5 Å, at least two cysteine residues and global pairwise sequence similarity <25%. PDBCYS includes 7619 free and 3194 bonded cysteines. Since PDBCYS contains pairs of proteins with detectable local sequence similarity, we clustered all the chains using a local sequence similarity score. First, we ran a BLAST sequence search of all the proteins of the set against themselves. Then, for each pair of proteins we selected the higher of the two directional sequence identities (p1 vs p2 or p2 vs p1) reported in the BLAST output. We subsequently treated the proteins as nodes of a graph and assigned an edge between two nodes only where the local sequence identity between the corresponding protein sequences was > 25%. We then computed the connected components of the graph and treated each group of nodes as a protein cluster. Finally, the clusters were grouped into 20 disjoint sets used to train and test the method. We used these 20 subsets to evaluate our method and to compare its performance with previous approaches by adopting a 20-fold cross-validation procedure.
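The clustering step described above amounts to finding the connected components of a similarity graph. A minimal sketch using a union-find structure is given below. The function names and the toy identities are hypothetical, and the greedy assignment of clusters to folds is an assumption, since the text does not specify how the 20 sets were balanced.

```python
def cluster_by_identity(proteins, identities, threshold=25.0):
    """Connected components of the graph linking proteins whose local identity
    exceeds the threshold.

    identities: dict mapping (p, q) to the higher of the two directional BLAST
    identities (in percent) for that pair.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the root, compressing the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (p, q), identity in identities.items():
        if identity > threshold:
            parent[find(p)] = find(q)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def assign_folds(clusters, n_folds=20):
    """Greedy balancing (assumed): largest cluster first, into the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


proteins = ["p1", "p2", "p3", "p4"]
identities = {("p1", "p2"): 40.0, ("p2", "p3"): 30.0, ("p3", "p4"): 10.0}
clusters = cluster_by_identity(proteins, identities)
folds = assign_folds(clusters, n_folds=2)
```

Because whole clusters are assigned to a fold, no two proteins with local identity above the threshold can end up on opposite sides of a train/test split.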
Performance measures
In the following, N_{ c } is the number of correctly predicted bonds, N_{ p } is the total number of predicted bonds, N_{ b } is the total number of observed bonds, N_{ patt } is the number of correctly predicted disulfide connectivity patterns and N is the total number of chains.
To score the disulfide connectivity prediction we computed the following indices:

the precision P_{ b }: ${P}_{b}=\frac{{N}_{c}}{{N}_{p}}$ (7)

the recall R_{ b }: ${R}_{b}=\frac{{N}_{c}}{{N}_{b}}$ (8)

the per-protein accuracy Q_{ p }: ${Q}_{p}=\frac{{N}_{patt}}{N}$ (9)
For the sake of readability, in the tables we report the indices as percentages (i.e. the obtained values are multiplied by 100).
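The three indices can be computed directly from the predicted and observed bond sets. The sketch below is illustrative only: the data layout (one set of bonds per chain, each bond a frozenset of two cysteine positions) and the helper names are assumptions.

```python
def bonds(*pairs):
    """Build a set of bonds; each bond is an unordered pair of cysteine positions."""
    return {frozenset(pair) for pair in pairs}


def connectivity_scores(predicted, observed):
    """Return Pb, Rb and Qp as percentages.

    predicted, observed: lists with one set of bonds per chain, in the same order.
    """
    n_c = sum(len(p & o) for p, o in zip(predicted, observed))   # correct bonds
    n_p = sum(len(p) for p in predicted)                         # predicted bonds
    n_b = sum(len(o) for o in observed)                          # observed bonds
    n_patt = sum(1 for p, o in zip(predicted, observed) if p == o)
    n = len(observed)                                            # chains
    return {
        "Pb": 100.0 * n_c / n_p if n_p else 0.0,
        "Rb": 100.0 * n_c / n_b if n_b else 0.0,
        "Qp": 100.0 * n_patt / n if n else 0.0,
    }


observed = [bonds((1, 2), (3, 4)), bonds((1, 6), (2, 5), (3, 4))]
predicted = [bonds((1, 2), (3, 4)), bonds((1, 6), (2, 3), (4, 5))]
scores = connectivity_scores(predicted, observed)
```

When the bonding state is known, every chain has as many predicted as observed bonds, so N_{ p } = N_{ b } and precision equals recall; this is why some tables report a single "Pb = Rb" column.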
Technical details
All multiple sequence alignments used to compute both the MIp and the iCOV features were generated by running 3 iterations of the jackhmmer program, which is part of the HMMER 3.0 package (http://hmmer.org), against the UNIREF90 sequence database. The inverse covariance estimation was performed by means of the glasso R package available at the CRAN archive (http://cran.r-project.org/web/packages/glasso/index.html), the same used in [12]. All the estimations were performed using the exact algorithm of the glasso code (see the glasso package documentation for details). The glasso algorithm depends on a parameter ρ that controls the sparsity of the reconstructed inverse covariance matrix. This parameter also affects the run time of the algorithm: the smaller ρ is, the longer the required time. Below we report the results obtained with ρ set to 1e-8, which was chosen as a trade-off between computational time and method performance (computed on the validation sets). MIp values were computed as described in [18]. For the SVR implementation we used the libsvm package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) with an RBF kernel.
Results and discussion
Prediction of disulfide connectivity with known bonding state
In order to evaluate the effect of correlated mutations on the task of predicting the topology of disulfide bonds, we first assume that the disulfide-bonded cysteines are known. We evaluated the performance of the methods on subsets of proteins with different numbers of disulfide bonds (from 2 to 5). The reported accuracy was obtained using the same 20-fold cross-validation procedure previously described [12].
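With the bonded cysteines known, choosing a connectivity pattern means selecting the perfect matching among those cysteines with the highest total pair score. The Methods use the Edmonds-Gabow algorithm for this step; the sketch below swaps in a plain exhaustive enumeration, which gives the same optimum and is affordable for the 2 to 5 bonds considered here (at most 945 patterns). The function names and the scores are hypothetical.

```python
def perfect_matchings(items):
    """Yield every way of pairing up an even-sized list of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def best_pattern(cysteines, score):
    """Return the pairing with the highest summed score.

    cysteines: sorted list of bonded cysteine identifiers.
    score: dict mapping (ci, cj) with ci < cj to the predicted pair score.
    """
    return max(perfect_matchings(list(cysteines)),
               key=lambda matching: sum(score[pair] for pair in matching))


# Hypothetical pair scores for a chain with two disulfide bonds.
score = {(1, 2): 0.9, (3, 4): 0.2, (1, 3): 0.6, (2, 4): 0.7, (1, 4): 0.1, (2, 3): 0.3}
pattern = best_pattern([1, 2, 3, 4], score)
```

In this toy case the highest-scoring single pair (1, 2) is not part of the best pattern, which is why a matching over the whole graph is used rather than a greedy choice of pairs.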
Performance on disulfide connectivity prediction obtained with correlated mutation measures
# bonds | iCOV Pb=Rb | iCOV Qp | MIp Pb=Rb | MIp Qp | Random Pb=Rb | Random Qp
2 | 62 | 62 | 68 | 68 | 33 | 33
3 | 52.6 | 42.4 | 47.8 | 37.7 | 20 | 7
4 | 51.8 | 26.8 | 49.4 | 29.3 | 14 | 1
5 | 39.5 | 16.2 | 33.5 | 13.5 | 11 | 0.1
All | 51.7 | 43.7 | 49.9 | 44.5 | 23 | 15
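The Random columns of this table are consistent with picking one connectivity pattern uniformly at random. For a chain with B bonds there are (2B)!/(B! 2^B) possible patterns, and a given cysteine is paired with its true partner with probability 1/(2B − 1). The short check below is our own verification under that uniform-choice assumption, not part of the original analysis; the function names are hypothetical.

```python
from math import factorial


def n_patterns(n_bonds):
    """Number of distinct connectivity patterns for a chain with n_bonds disulfides."""
    return factorial(2 * n_bonds) // (factorial(n_bonds) * 2 ** n_bonds)


def random_baseline(n_bonds):
    """Expected (Pb, Qp), in percent, for a uniformly random pattern."""
    pb = 100.0 / (2 * n_bonds - 1)
    qp = 100.0 / n_patterns(n_bonds)
    return pb, qp


for b in range(2, 6):
    pb, qp = random_baseline(b)
    print(f"{b} bonds: {n_patterns(b)} patterns, Pb = {pb:.1f}, Qp = {qp:.1f}")
```

For 2 to 5 bonds this gives 3, 15, 105 and 945 patterns, which matches the per-bond rows of the Random columns after rounding.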
Performance on disulfide connectivity prediction obtained with different SVRbased methods
# bonds | SVR Pb=Rb | SVR Qp | SVR+iCOV Pb=Rb | SVR+iCOV Qp | SVR+MI Pb=Rb | SVR+MI Qp | SVR+MI+iCOV Pb=Rb | SVR+MI+iCOV Qp
2 | 75 | 75 | 76 | 76 | 73 | 73 | 76 | 76
3 | 60 | 48 | 62.8 | 55.3 | 59.6 | 50.6 | 62.8 | 55.3
4 | 57 | 44 | 67.1 | 51.2 | 61 | 46.3 | 67.7 | 51.2
5 | 46 | 19 | 55.1 | 27 | 54.1 | 29.7 | 58.9 | 32.4
All | 60 | 54 | 65.2 | 58.6 | 61.9 | 55.5 | 66.2 | 59.3
Prediction of disulfide connectivity with predicted bonding state
Prediction without a prior knowledge of the cysteine bonding state
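# bonds | DisLocate Rb | DisLocate Pb | DisLocate Qp | SVR+MI+iCOV Rb | SVR+MI+iCOV Pb | SVR+MI+iCOV Qp
1 | 83 | 46 | 76 | 93 | 46 | 76
2 | 67 | 52 | 61 | 71 | 59 | 62
3 | 47 | 41 | 35 | 55 | 49 | 38
4 | 52 | 37 | 35 | 63 | 48 | 38
5 | 39 | 39 | 15 | 50 | 49 | 16
All | 52 | 42 | 36 | 60 | 50 | 38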
Prediction performance as a function of the quality of the multiple sequence alignments
MIp and iCOV are computed over multiple sequence alignments. We therefore evaluate how the number and the type of sequences included in the alignment (used to compute the correlation among cysteine residues) affect the final result.
Conclusions
The prediction of protein structures from their sequences is still an open problem in Structural Bioinformatics, especially considering that the gap between the number of putative protein sequences and the number of known 3D structures is increasing exponentially. The bonding state of cysteines plays a relevant role in stabilizing the tertiary folds of proteins, in defining protein functions and in triggering functionally relevant conformational changes [26]. Knowledge of disulfide bonds is very important for predicting protein structure in ab initio and comparative modelling, since it constrains the possible chain conformations [27, 28]. In this paper we introduce a new method to predict disulfide bonds starting from the protein sequence. We investigate the effect of the information derived from correlated mutations on the problem of predicting the topology of disulfide bonds in proteins. We show that correlated mutations, in the form of corrected mutual information (MIp) and the inverse of the covariance matrix (iCOV), carry a significant quantity of information that was not completely exploited before for the task of disulfide bond prediction. We present a new method that, by incorporating information derived from correlated mutations, improves the performance over the state-of-the-art method DisLocate [12]. Finally, we highlight that the optimal performance of the method is achieved when the number of sequences included in the multiple alignment from which the correlated mutation information is derived is in the range of 10000 protein chains and the corresponding NEFF value of the alignment is greater than or equal to 10.
Declarations
The publication costs of this article were funded by the RFO (Ricerca Fondamentale Orientata) 2011, University of Bologna, awarded to Rita Casadio.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 1, 2013: Computational Intelligence in Bioinformatics and Biostatistics: new trends from the CIBB conference series. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S1.
Acknowledgements
RC thanks the following grants: PRIN 2009 project 009WXT45Y (Italian Ministry for University and Research: MIUR), COST BMBS Action TD1101 (European Union RTD Framework Programme), and PON project PON01_02249 (Italian Ministry for University and Research: MIUR). CS is a recipient of a Ph.D. fellowship from the Ministry of the Italian University and Research.
Authors’ Affiliations
References
1. Inaba K: Structural basis of protein disulfide bond generation in the cell. Genes Cells. 2010, 15 (9): 935-43. 10.1111/j.1365-2443.2010.01434.x.
2. Martelli PL, Fariselli P, Malaguti L, Casadio R: Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng. 2002, 15 (12): 951-953. 10.1093/protein/15.12.951.
3. Mucchielli-Giorgi MH, Hazout S, Tufféry P: Predicting the disulfide bonding state of cysteines using protein descriptors. Proteins. 2002, 46 (3): 243-249. 10.1002/prot.10047.
4. Chen YC, Lin YS, Lin CJ, Hwang JK: Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins. 2004, 55 (4): 1036-1042. 10.1002/prot.20079.
5. Fariselli P, Casadio R: Prediction of disulfide connectivity in proteins. Bioinformatics. 2001, 17 (10): 957-964. 10.1093/bioinformatics/17.10.957.
6. Vullo A, Frasconi P: Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics. 2004, 20 (5): 653-659. 10.1093/bioinformatics/btg463.
7. Ferrè F, Clote P: Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics. 2005, 21 (10): 2336-2346. 10.1093/bioinformatics/bti328.
8. Song J, Yuan Z, Tan H, Huber T, Burrage K: Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure. Bioinformatics. 2007, 23 (23): 3147-3154. 10.1093/bioinformatics/btm505.
9. Cheng J, Saigo H, Baldi P: Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins. 2006, 62 (3): 617-629.
10. Taskar B, Chatalbashev V, Koller D, Guestrin C: Learning structured prediction models: a large margin approach. Proceedings of the 22nd International Conference on Machine Learning (ICML '05). 2005, New York: ACM, 896-903. 10.1145/1102351.1102464.
11. Vincent M, Passerini A, Labbé M, Frasconi P: A simplified approach to disulfide connectivity prediction from protein sequences. BMC Bioinformatics. 2008, 9: 20. 10.1186/1471-2105-9-20.
12. Savojardo C, Fariselli P, Martelli PL, Pierleoni A, Casadio R: Improving the prediction of disulfide bonds in Eukaryotes with machine learning methods and protein subcellular localization. Bioinformatics. 2011, 27 (16): 2224-2230. 10.1093/bioinformatics/btr387.
13. Gobel U, Sander C, Schneider R, Valencia A: Correlated mutations and residue contacts in proteins. Proteins. 1994, 18: 309-317. 10.1002/prot.340180402.
14. Olmea O, Valencia A: Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des. 1997, 2: S25-S32.
15. Fariselli P, Olmea O, Valencia A, Casadio R: Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins. 2001, 45 (Suppl 5): 157-162.
16. Di Lena P, Fariselli P, Margara L, Vassura M, Casadio R: Is there an optimal substitution matrix for contact prediction with correlated mutations?. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8: 1017-1028.
17. Rubinstein R, Fiser A: Predicting disulfide bond connectivity in proteins by correlated mutations analysis. Bioinformatics. 2008, 24 (2): 498-504.
18. Dunn SD, Wahl LM, Gloor GB: Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008, 24 (3): 333-340. 10.1093/bioinformatics/btm604.
19. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C: Protein 3D structure computed from evolutionary sequence variation. PLoS ONE. 2011, 6 (12): e28766. 10.1371/journal.pone.0028766.
20. Jones DT, Buchan DWA, Cozzetto D, Pontil M: PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2011, 28 (2): 184-190.
21. Burger L, van Nimwegen E: Disentangling direct from indirect coevolution of residues in protein alignments. PLoS Comput Biol. 2010, 6 (1): e1000633. 10.1371/journal.pcbi.1000633.
22. Banerjee O, El Ghaoui L, d'Aspremont A: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research. 2008, 9: 485-516. [http://jmlr.csail.mit.edu/papers/v9/banerjee08a.html]
23. Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical Lasso. Biostatistics. 2008, 9: 432-441. 10.1093/biostatistics/kxm045.
24. Fariselli P, Savojardo C, Martelli PL, Casadio R: Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics Applications. Algorithms for Molecular Biology. 2009, 4 (13).
25. Casbon J, Saqi M: Analysis of superfamily specific profile-profile recognition accuracy. BMC Bioinformatics. 2004, 5 (200).
26. Wedemeyer WJ, Welker E, Narayan M, Scheraga HA: Disulfide bonds and protein folding. Biochemistry. 2000, 39 (15): 4207-4216. 10.1021/bi992922o.
27. Das R, Baker D: Macromolecular modeling with rosetta. Annu Rev Biochem. 2008, 77: 363-382. 10.1146/annurev.biochem.77.062906.171838.
28. Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234: 779-815. 10.1006/jmbi.1993.1626.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.