 Methodology article
 Open Access
A simplified approach to disulfide connectivity prediction from protein sequences
 Marc Vincent^{1},
 Andrea Passerini^{1},
 Matthieu Labbé^{1} and
 Paolo Frasconi^{1}
https://doi.org/10.1186/1471-2105-9-20
© Vincent et al; licensee BioMed Central Ltd. 2008
 Received: 04 September 2007
 Accepted: 14 January 2008
 Published: 14 January 2008
Abstract
Background
Prediction of disulfide bridges from protein sequences is useful for characterizing structural and functional properties of proteins. Several methods based on different machine learning algorithms have been applied to solve this problem and public domain prediction services exist. These methods are however still potentially subject to significant improvements both in terms of prediction accuracy and overall architectural complexity.
Results
We introduce new methods for predicting disulfide bridges from protein sequences. The methods take advantage of two new decomposition kernels for measuring the similarity between protein sequences according to the amino acid environments around cysteines. Disulfide connectivity is predicted in two passes. First, a binary classifier is trained to predict whether a given protein chain has at least one intrachain disulfide bridge. Second, a multiclass classifier (implemented by 1-nearest neighbor) is trained to predict connectivity patterns. The two passes can be easily cascaded to obtain connectivity prediction from sequence alone. We report an extensive experimental comparison on several data sets that have been previously employed in the literature to assess the accuracy of cysteine bonding state and disulfide connectivity predictors.
Conclusion
We reach state-of-the-art results on bonding state prediction with a simple method that classifies chains rather than individual residues. The prediction accuracy reached by our connectivity prediction method compares favorably with all but the most complex other approaches. On the other hand, our method does not need any model selection or hyperparameter tuning, a property that makes it less prone to overfitting and prediction accuracy overestimation.
Keywords
 Disulfide Bridge
 Bonding State
 Connectivity Pattern
 Amino Acid Environment
 DIpro
Background
Some interesting structural, functional, and evolutionary properties of proteins can be inferred from knowledge about the existence and the precise location of disulfide bridges. Since most of the proteins inferred from genomic sequencing lack this structural information, the ab initio prediction of disulfide bridges from protein sequences can be very useful in several molecular biology studies. This computational problem has received significant attention during the last few years and a number of prediction servers have been recently developed [1–5].
Typical approaches predict disulfide bridges by solving two separate subproblems. First, cysteines are partitioned into two groups: half-cystine (involved in the formation of a disulfide bridge) and the rest (free and bound to either a metal ion or to a prosthetic group). Membership in one of the two groups is predicted by a binary classifier trained on known cases. Second, given known bonding state, disulfide bridges are assigned by predicting which pairs of half-cystines are linked. The latter subproblem is considerably more difficult from a machine learning perspective as it requires methods capable of predicting structured outputs. A notable exception to the two-stage approach was recently proposed in [6]. The main novelty in that method is the use of a recursive neural network that can predict the bonding probability between any pair of cysteines, so that bridges can be predicted directly from sequence (without previous knowledge of cysteine bonding state).
Methods for solving the bonding state subproblem have been developed after Fiser et al. [7] first noted that amino acid composition around cysteines could predict their disulfide-bonding state. Neural networks were applied later by Fariselli et al. [8] and by Fiser & Simon [9], using multiple alignment profiles in a window centered around the target cysteine. Mucchielli-Giorgi et al. [10] introduced the idea of adding a global descriptor (consisting of amino acid composition and chain length information) to improve prediction accuracy. Ceroni et al. [11] proposed a method based on string kernels (to extract global features), vector kernels for handling local features, and supervised learning by support vector machines (SVM). The subsequent improvement described in [5] uses recurrent neural networks and Viterbi decoding to convert single predictions into a collective bonding state assignment for all the cysteines in a chain. Song et al. [12] applied a linear discriminant using dipeptides as features. Martelli et al. [13] suggested the use of hidden Markov models to refine local predictions obtained via neural networks. Chen et al. [14] used an ensemble of support vector machines trained with different feature vectors refined by a linear-chain Markov model. In essence, state-of-the-art methods start by predicting the bonding state of each cysteine and then use a refinement procedure to improve chain prediction. Concerning bonding state prediction, in this paper we show that (1) a much simpler technique based on binary classification of chains allows us to achieve the same levels of cysteine-level accuracy, and (2) prediction errors obtained in this way do not completely overlap with those of our previous method (DISULFIND [5]), leaving room for accuracy improvement by exploiting a combination of the two classifiers.
The disulfide connectivity subproblem was pioneered in [1] with a method based on weighted graph matching. Vullo & Frasconi [15] introduced the use of multiple alignment profiles by means of recursive neural networks (RNN). Taskar et al. [16] also formulated disulfide connectivity as a structured-output prediction problem and solved it using a generalized large-margin machine. Ferrè & Clote [17] proposed a feed-forward neural-network architecture with hidden units associated to cysteine pairs and inputs encoding secondary structure. They recently extended their method to three-class discrimination (free, half-cystine, or metal bound) [2]. Tsai et al. [3] confirmed that the profile of distances between bonded cysteines is an important feature for prediction of connectivity patterns and devised a prediction method based on SVM and weighted graph matching. Zhao et al. [18] observed that the number of observed connectivity patterns is relatively small compared to the number of possible patterns: while a set of 2B cysteines can be potentially arranged in (2B − 1)!! = (2B − 1)·(2B − 3)···3·1 ways, only a few dozen patterns are actually observed. Based on this observation they suggested a template-matching approach. A multiclass SVM was applied by Chen and Hwang [19] by considering each connectivity pattern as a distinct class. Finally, a few recent works proposed more complex architectures. The method by Chen et al. [20] is based on a two-level strategy where all cysteine pairs are first classified by an SVM and all possible connectivity patterns are subsequently evaluated by a second binary classifier that was trained with the correct connectivity patterns as positive examples. Lu et al. [21] proposed an ensemble of SVMs, using features derived from cysteine-cysteine coupling, and a genetic algorithm for feature selection; their method outputs the pattern maximizing the number of predicted pairwise interactions.
In this paper, we show that a simple 1-nearest neighbor classifier considering both separation and evolutionary profiles is competitive with all previously proposed approaches, including those based on structured output prediction, with the exception of the most complex multiple-stage architectures. The method needs no hyperparameter tuning, an appealing property that makes it more robust to overtraining. Model selection is in fact sometimes difficult to carry out, and different choices of hyperparameters (e.g. a set of regularization coefficients in a multi-stage architecture) may significantly affect the results obtained in the experiments [22]. We hope in this way to provide a method which is less prone to overfitting and to instabilities in the estimation of the generalization error.
Overview of the proposed methods
Table 1. Statistics of data sets

| Data set | # chains | All | None | Mix |
|---|---|---|---|---|
| PDBselect | 1,589 | 488 | 1,051 | 50 |
| SPX^{-} | 2,547 | 1,650 | 757 | 140 |
The new procedure for obtaining bridge predictions can be briefly summarized as follows (see Methods for details). In the first step, a kernel machine is trained to predict if a given chain contains at least one intrachain bridge. For this task, a chain is represented as a bag of cysteines. The resulting decomposition kernel between two chains is the sum of all the similarities between the amino acid environments around all possible pairs formed by taking one cysteine from each chain. The rationale of this kernel is that a new chain should be similar to a positive chain if it contains at least one pair of cysteine environments which is similar to a pair that is known to form an intrachain bridge. This kernel is called the all-pairs decomposition kernel (APDK) in the remainder of the paper. The experiments reported below are based on this kernel in conjunction with SVM. For this purpose, we employed the publicly available software SVM^{light} [23].
In the second step, a set of kernel machines classify chains according to their connectivity pattern. Each of these machines focuses on a given number of cysteines. In this case, a chain is seen as a tuple of amino acid environments around its cysteines. The resulting decomposition kernel between two chains is the sum of the similarities between the environments associated with the two cysteines that have the same ordinal number in the tuple. The rationale of this kernel is that two chains should be more likely to fold according to the same disulfide connectivity pattern if they share a similar sequence of cysteine environments. This kernel is called the tuple decomposition kernel (TDK) in the remainder of the paper. The experiments reported in the next section use this kernel in conjunction with the cysteine separation profiles [3] to compute distances (in feature space) for the 1-nearest neighbor (1-NN) algorithm.
In both the above kernels, the amino acid environments around cysteines are enriched with evolutionary information derived from multiple alignments in order to boost performance.
Results and discussion
Data preparation
We used three representative subsets of the Protein Data Bank to assess the performance of our kernel methods. A fourth data set, extracted from the SWISS-PROT database, was employed in connectivity prediction assuming knowledge of the cysteine bonding state, for the sake of comparing results with previous methods.
PDBselect data set
The July 2005 PDBselect data set [24, 25] used in this paper contains 2,810 non-redundant chains. During the chain selection process, for any group of chains with homologies, only the one with the best quality was kept. The structures of the chains included were determined using both NMR and X-ray crystallography. See [26] for the complete list of chains with explanations. Disulfide bridges were obtained by running the DSSP program [27] with default options. Unresolved residues were labeled as free. In order to reduce noise in the data, we visually inspected protein structures in all cases in which two cysteines were found within a distance of 2.5 Å, but were not labeled by DSSP as being disulfide bonded. In 62 cases we overruled the DSSP assignment from free to disulfide bonding. For the chain classification experiments only proteins with at least 2 cysteines were considered, resulting in a set of 1,589 chains. The final data set with labeling information is available as Additional file 1.
SPX data set
The data set is described in [6] and available from the DIpro website [28]. It consists of two sets of chains: one used for the chain classification problem (SPX^{-}), the other used for the pattern classification problem (SPX^{+}). The former contains 897 chains with at least one intrachain bridge (positive examples) and 1,650 without any intrachain bridge (negative examples), for a total of 2,547 chains. The latter contains 1,018 chains with at least one intrachain bridge. Positive chains in SPX^{-} are less redundant (HSSP cutoff of 5) than those in SPX^{+} (HSSP cutoff of 10). A first difference with respect to the PDBselect data set is that no chain in the SPX data set contains interchain disulfide bonds. A second difference is that disulfide bonds in SPX are extracted from the SSBOND record of the PDB files [6].
PDB_{4136}
The data set is described in [13] and available from the CysPred website [29]. It consists of 4,136 cysteine-containing segments from the crystallographic data of the PDB, with less than 25% sequence identity and no chain breaks. The data set is included for the sake of comparison with the approach by Chen et al. [14].
SP39 data set
The data set is described in [1, 15] and available as Additional file 2. It consists of 446 chains from the SWISS-PROT database release n. 39 (October 2000), having from two to five experimentally verified intrachain disulfide bridges. The data set has been widely used as a benchmark for disulfide connectivity prediction assuming knowledge of the cysteine bonding state.
In order to incorporate evolutionary information, we obtained multiple alignments by running one iteration of the PSI-BLAST [30] program on the non-redundant (nr) NCBI database using an E-value cutoff of 0.005. Depending on the experimental setting, we have used either position-specific scoring matrices (PSSM) or multiple alignment profiles.
Evaluation procedure
Bonding state
Prediction performance was estimated by a 10-fold cross-validation procedure for PDBselect and SPX, while for PDB_{4136} we employed a 20-fold cross-validation procedure with exactly the same folds as in [13]. For each of the folds, we optimized the main hyperparameters (i.e. the kernel Gaussian width γ, and the SVM regularization parameter C) by nesting a 10-fold cross-validation on each training set.
Hyperparameters were found by a variable-resolution grid search algorithm in which we started by optimizing on a coarse log-scale and then refined the best set of hyperparameters on a finer scale. In this setting, a significant computational speedup was obtained by caching the entire kernel matrix in memory.
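The coarse-to-fine search described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `coarse_to_fine_search`, the grid span, the number of refinement rounds, and the stand-in `toy_score` are hypothetical and are not taken from the experiments; in practice `score` would be the accuracy of the inner cross-validation for a given (C, γ).

```python
import itertools
import math

def coarse_to_fine_search(score, center=(0.0, 0.0), half_width=4.0, steps=5, rounds=3):
    """Search (C, gamma) on a log10 grid, halving the grid span each round.

    `score(C, gamma)` is a hypothetical callable returning the inner
    cross-validation accuracy; higher is better.
    """
    best = None  # (score, log10 C, log10 gamma)
    for _ in range(rounds):
        axes = [
            [c - half_width + 2 * half_width * i / (steps - 1) for i in range(steps)]
            for c in center
        ]
        for log_c, log_g in itertools.product(*axes):
            s = score(10 ** log_c, 10 ** log_g)
            if best is None or s > best[0]:
                best = (s, log_c, log_g)
        center = (best[1], best[2])  # re-center on the best point found so far
        half_width /= 2              # and refine on a finer scale
    return 10 ** best[1], 10 ** best[2], best[0]

def toy_score(C, gamma):
    """Stand-in for cross-validated accuracy, peaked at C = 10, gamma = 0.01."""
    return -((math.log10(C) - 1) ** 2 + (math.log10(gamma) + 2) ** 2)

best_C, best_gamma, best_score = coarse_to_fine_search(toy_score)
```

Because every candidate (C, γ) is evaluated on the same training examples, the kernel matrix for a given γ can be computed once and reused across all values of C, which is where caching pays off.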
Connectivity patterns
Prediction performance was estimated by a 10-fold cross-validation procedure for the SPX^{+} and SPX^{-} data sets, while for the SP39 data set we employed a 4-fold cross-validation procedure with exactly the same folds as in [1, 15]. No model selection was carried out to fine-tune kernel parameters, as our aim was to show the predictive power of the plainest approach.
Performance measures
For binary classification problems, let T_{p}, T_{n}, F_{p}, and F_{n} denote the number of true positives, true negatives, false positives, and false negatives, respectively. Also let N denote the total number of cases. We report the following measures:
accuracy Q = (T_{p} + T_{n})/N;
precision P = T_{p}/(T_{p} + F_{p});
recall R = T_{p}/(T_{p} + F_{n}).
In the case of bonding state predictions we can define the above measures at different levels:

Cysteine classification measures: Q_{c}, P_{c}, R_{c}. These are obtained by counting single cysteines as cases.

Sequence classification measures: Q_{1}, P_{1}, R_{1}. These are obtained by counting chains as cases. Positive examples are chains having at least one intrachain bridge and negative examples are all the remaining chains.
Performance measures for the disulfide pattern prediction problems are defined as follows:

Pattern prediction accuracy Q_{p}, defined as the total number of chains for which the correct pattern was predicted, divided by the total number of chains.

Bridge-level precision P_{b}, defined as the number of correctly predicted bridges divided by the number of predicted bridges, and bridge-level recall R_{b}, defined as the number of correctly predicted bridges divided by the true number of bridges.
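The measures defined above can be computed as in the following minimal sketch. The function names and the representation of a connectivity pattern as a set of cysteine-index pairs are assumptions made for illustration; they are not part of the evaluation code used in the experiments.

```python
def binary_measures(tp, tn, fp, fn):
    """Return accuracy Q, precision P and recall R from the four counts."""
    n = tp + tn + fp + fn
    q = (tp + tn) / n
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return q, p, r

def pattern_measures(true_patterns, predicted_patterns):
    """Return (Q_p, P_b, R_b) for lists of per-chain bridge sets.

    Each pattern is a set of bridges; a bridge is a tuple (i, j), i < j,
    of cysteine indices (a hypothetical encoding).
    """
    correct_chains = correct_bridges = predicted_bridges = true_bridges = 0
    for truth, pred in zip(true_patterns, predicted_patterns):
        truth, pred = set(truth), set(pred)
        correct_chains += truth == pred
        correct_bridges += len(truth & pred)
        predicted_bridges += len(pred)
        true_bridges += len(truth)
    q_p = correct_chains / len(true_patterns)
    p_b = correct_bridges / predicted_bridges if predicted_bridges else 0.0
    r_b = correct_bridges / true_bridges if true_bridges else 0.0
    return q_p, p_b, r_b

# Two chains: the first pattern is fully correct, the second is wrong.
truth = [{(1, 4), (2, 3)}, {(1, 2)}]
pred = [{(1, 4), (2, 3)}, {(1, 3)}]
q_p, p_b, r_b = pattern_measures(truth, pred)
```

When the bonding state is known, every predicted pattern has the same number of bridges as the true one, so P_{b} and R_{b} coincide.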
Binary classification of chains and cysteines
Table 2. Binary classification of chains

| Method | PDBselect Q_{1} | PDBselect P_{1} | PDBselect R_{1} | SPX^{-} Q_{1} | SPX^{-} P_{1} | SPX^{-} R_{1} |
|---|---|---|---|---|---|---|
| APTK (PSSM) | 87 | 83 | 77 | 82 | 79 | 67 |
| APTK (profiles) | 86 | 83 | 75 | 82 | 80 | 64 |
| DISULFIND (PSSM) | 86 | 82 | 75 | 81 | 80 | 60 |
| DISULFIND (profiles) | 86 | 82 | 73 | 81 | 80 | 63 |
| Dsimple (PSSM) | 85 | 81 | 74 | 82 | 81 | 64 |
| Dsimple (profiles) | 86 | 84 | 74 | 82 | 82 | 62 |
| DIpro [6] | - | - | - | 74 | 83 | 56 |

The last row contains the best results reported in [6] for the DIpro predictor on the SPX^{-} data set, obtained with a spectrum kernel with mismatches.
Table 3. Binary classification of cysteines

| Method | PDBselect Q_{c} | PDBselect P_{c} | PDBselect R_{c} | SPX^{-} Q_{c} | SPX^{-} P_{c} | SPX^{-} R_{c} | PDB_{4136} Q_{c} | PDB_{4136} P_{c} | PDB_{4136} R_{c} |
|---|---|---|---|---|---|---|---|---|---|
| APTK (PSSM) | 88.8 | 83.8 | 87.7 | 86.1 | 79.0 | 82.8 | 88.2 | 84.0 | 82.5 |
| APTK (profiles) | 87.7 | 83.1 | 85.1 | 85.3 | 78.6 | 80.5 | 89.7 | 81.0 | 88.5 |
| DISULFIND (PSSM) | 88.3 | 85.0 | 84.3 | 85.3 | 82.6 | 74.1 | 88.0 | 79.1 | 85.5 |
| DISULFIND (profiles) | 88.6 | 87.4 | 82.1 | 86.5 | 83.0 | 77.5 | 89.4 | 81.2 | 87.4 |
| Dsimple (PSSM) | 82.2 | 77.0 | 76.4 | 81.3 | 74.5 | 71.5 | 83.0 | 79.5 | 69.3 |
| Dsimple (profiles) | 81.5 | 76.0 | 75.3 | 81.1 | 74.3 | 71.1 | 83.0 | 77.1 | 73.4 |
| APTK + DISULFIND | 89.9 | 87.8 | 85.5 | 87.0 | 82.6 | 80.2 | 90.3 | 82.1 | 89.2 |
| multiple SVM + CSS [14] | - | - | - | - | - | - | 90 | 91 | 77 |
What is perhaps more interesting is that a correlation analysis between the errors of the APTK and DISULFIND reveals that the two predictors disagree on many of the cysteines that are incorrectly classified by one of the two methods (10.6% of such cysteines on PDBselect and 8.6% on SPX^{-}). The relatively low correlation between the two methods may be an advantage because we can hopefully boost performance by combining their predictions. Indeed, as reported in the last row of Table 3, by combining APTK (with PSSM) and DISULFIND (with profiles) we gain about one percentage point of accuracy in both data sets. The last columns of Table 3 report evaluation results on the PDB_{4136} data set, which confirm the overall behavior, with the exception that profiles are always as good as or better than PSSM, and are thus employed in the APTK + DISULFIND combination. The combination achieves basically the same accuracy as that of the best method from Chen et al. [14], which we indicated by multiple SVM + CSS, and shows a more balanced precision/recall ratio.
Connectivity prediction for positive chains
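Table 4. Prediction of bridges and connectivity patterns

| # bridges | 1-NN R_{b} | 1-NN P_{b} | 1-NN Q_{p} | DISULFIND R_{b} | DISULFIND P_{b} | DISULFIND Q_{p} | DISULFIND+1-NN R_{b} | DISULFIND+1-NN P_{b} | DISULFIND+1-NN Q_{p} | DIpro R_{b} | DIpro P_{b} | DIpro Q_{p} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 65 | 61 | 58 | 66 | 62 | 59 | 68 | 63 | 59 | 71 | 47 | 58 |
| 2 | 59 | 61 | 52 | 53 | 54 | 49 | 68 | 69 | 63 | 59 | 59 | 55 |
| 3 | 70 | 71 | 63 | 46 | 46 | 35 | 73 | 73 | 64 | 59 | 65 | 50 |
| 4 | 58 | 59 | 42 | 24 | 24 | 9 | 59 | 59 | 48 | 44 | 49 | 27 |
| all | 60 | 59 | 52 | 49 | 48 | 41 | 64 | 62 | 55 | 71 | 47 | 48 |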
By inspecting the overall results (all numbers of bridges, last row of Table 4) we note that precision and recall levels for bridge prediction are very similar for both 1-NN and DISULFIND. Conversely, DIpro has higher recall but lower precision, which is mainly due to the higher number of false positives in the case of chains with a single bridge.
Note that the bonding state of individual cysteines is unknown in the SPX^{+} data set. Thus the 1-NN algorithm is also implicitly solving a bonding state prediction problem (although it was not expressly tuned for this purpose). Alternatively, the 1-NN classifier can be preceded by an explicit bonding-state predictor. In this case, the tuple-based kernel (see Equation 5 in Methods) as well as the topological features are restricted to those cysteines that are known to be bonded (for training examples) or that are predicted to be bonded (for test examples) when computing the nearest neighbor. We employed the first stage of DISULFIND to this aim, and the results of such a pipeline are reported in the DISULFIND+1-NN column of Table 4. The advantage of this combination is more evident in the case of two and four bridges.
Connectivity prediction assuming knowledge of the bonding state
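Table 5. Prediction of bridges and connectivity patterns

| # bridges | 1-NN P_{b} = R_{b} | 1-NN Q_{p} | DISULFIND P_{b} = R_{b} | DISULFIND Q_{p} | DIpro P_{b} = R_{b} | DIpro Q_{p} | SOSVM P_{b} = R_{b} | SOSVM Q_{p} | CSP P_{b} = R_{b} | CSP Q_{p} | SVMpattern P_{b} = R_{b} | SVMpattern Q_{p} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 76 | 76 | 73 | 73 | 74 | 74 | 77 | 77 | 73 | 73 | 74 | 74 |
| 3 | 66 | 55 | 51 | 41 | 61 | 51 | 62 | 52 | 66 | 55 | 69 | 61 |
| 4 | 53 | 38 | 37 | 24 | 44 | 27 | 51 | 36 | 49 | 33 | 40 | 30 |
| 5 | 39 | 18 | 30 | 13 | 41 | 11 | 43 | 13 | 36 | 17 | 31 | 12 |
| 2–5 | 64 | 55 | 49 | 44 | 56 | 49 | 65 | 53 | 62 | 53 | 57 | 55 |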
Results from 1-NN are compared to those found in the literature for the following single-stage approaches: DISULFIND [15], DIpro [6], Taskar et al.'s structured output large margin algorithm [16] (SOSVM), and the pattern-wise SVM by Chen and Hwang [19] (SVMpattern), while the cysteine separation profile approach (CSP) by Tsai et al. [3] was reimplemented in order to get results on exactly the same folds. In predicting the entire connectivity pattern (Q_{p}), 1-NN outperforms all other methods except SVMpattern, which obtains the same overall results. On the other hand, SVMpattern is worse at predicting single pairwise interactions (P_{b} = R_{b}), where SOSVM is the only approach achieving slightly better results. The latter, however, requires solving a hard convex optimization problem. It is interesting to note that the accuracy of 1-NN is consistently better than that of all other methods for more than three bridges. This is quite reasonable, as increasing the number of bridges, B, implies dramatically increasing the number of alternative patterns, (2B − 1)!!, while lowering the amount of available data as well as the number of observed patterns (see Prediction of Connectivity Patterns in Methods). Such a setting favors 1-NN, which can only predict observed patterns. The small advantage of 1-NN with respect to CSP is due to the contribution of the evolutionary profile to the distance metric. In order to elucidate the cases in which this advantage is apparent, we analysed the differences between chains incorrectly predicted by the two methods. The analysis showed that the increase in performance of 1-NN is mainly due to correctly predicting the pattern of chains from two families: the Alpha-type family in the conotoxin A superfamily, and the Alpha subfamily of the Sodium channel inhibitor family, within the long (4 CC) scorpion toxin superfamily.
Note, however, that in a few cases adding the evolutionary profile actually decreased performance with respect to CSP, as happened for two chains from the glycosyl hydrolase 13 family.
Some recent multi-stage architectures [20, 21] outperform the above-mentioned methods. However, the aim here is to stress the effectiveness of our method in comparison to the other existing single-stage approaches. By requiring no hyperparameter tuning, it can also be seen as a candidate component that might be more easily integrated into complex architectures.
Connectivity prediction from scratch
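Table 6. Prediction of bridges and connectivity patterns from scratch

| # bridges | APTK(PSSM)+1-NN R_{b} | APTK(PSSM)+1-NN P_{b} | APTK(PSSM)+1-NN Q_{p} | DISULFIND R_{b} | DISULFIND P_{b} | DISULFIND Q_{p} | DISULFIND+1-NN R_{b} | DISULFIND+1-NN P_{b} | DISULFIND+1-NN Q_{p} | DIpro R_{b} | DIpro P_{b} | DIpro Q_{p} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | 30 | 27 | 30 | 30 | 30 | 30 | 30 | 30 | - | - | - |
| 2 | 51 | 54 | 47 | 38 | 39 | 36 | 51 | 51 | 49 | - | - | - |
| 3 | 63 | 65 | 58 | 27 | 27 | 15 | 66 | 67 | 61 | - | - | - |
| 4 | 50 | 51 | 40 | 30 | 30 | 10 | 48 | 49 | 37 | - | - | - |
| all | 43 | 44 | 37 | 29 | 29 | 23 | 43 | 44 | 39 | 32 | 48 | - |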
Conclusion
We have presented a new set of kernel-based methods for predicting disulfide bridges and cysteine bonding state from protein sequences. Despite their extreme simplicity, these algorithms compare favorably to most existing techniques proposed in the literature. In the case of cysteine bonding state we have found that the correlation between predictions from the new chain classifier and previous methods (DISULFIND) is low enough to allow us to improve accuracy by combining the two classifiers. The combination achieves results competitive with state-of-the-art approaches. Concerning connectivity pattern prediction, we found that a simple 1-nearest neighbor approach performs surprisingly well, being outperformed only by the most complex multi-stage architectures. It must be remarked that the algorithm does not need any hyperparameter tuning, which makes it less prone to overfitting and prediction accuracy overestimation, and appealing for prediction of other properties of proteins that are inherently structured. The result is also interesting from a machine learning perspective as it shows that, depending on the probability distribution on the output space, it may be an advantage to employ multiclass classification instead of much more complex algorithms for prediction of structured outputs. Different approaches to classification could be used in place of 1-NN. For example, using multiclass support vector machines one could take advantage of a loss function that weights prediction errors differently according to the number of correctly assigned bridges. These variants are currently under investigation.
Methods
Decomposition kernels
A kernel is a real-valued function K : $\mathcal{X}$ × $\mathcal{X}$ ↦ ℝ where $\mathcal{X}$ is the input space and can be any set (e.g., a set of protein chains). If for any finite set of instances {x_{1},...,x_{m}}, x_{i} ∈ $\mathcal{X}$, the matrix with entries K(x_{i}, x_{j}) has all nonnegative eigenvalues, then K is positive semidefinite and by Mercer's theorem there exists a feature space $\mathcal{H}$ and a map φ : $\mathcal{X}$ ↦ $\mathcal{H}$ such that the kernel can be written as the inner product in feature space: K(x, x') = ⟨φ(x), φ(x')⟩.
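Decomposition kernels rely on a relation R(x_{1},...,x_{D}, x) that holds when x_{1},...,x_{D} are the parts of x; R^{-1}(x) = {(x_{1},...,x_{D}) : R(x_{1},...,x_{D}, x)} denotes the set of all possible decompositions of x. In the following, we introduce new decomposition kernels for protein chains.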
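For orientation, the usual way in which a decomposition relation R and the sets R^{-1}(x) enter a kernel is the convolution form below. This is the standard textbook construction written in our own notation (κ_{d} stands for an assumed local kernel on the d-th part); the exact form used in this article may differ in details such as normalization.

$$K(x, x') = \sum_{(x_1,\dots,x_D) \in R^{-1}(x)} \; \sum_{(x'_1,\dots,x'_D) \in R^{-1}(x')} \; \prod_{d=1}^{D} \kappa_d(x_d, x'_d)$$

With D = 1 the product disappears and the kernel reduces to a double sum of one local kernel over all pairs of parts.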
Bonding state prediction
Let us decompose chains using cysteines and their amino acid environments as parts. Specifically, let us assume D = 1 and let R(c, x) hold true if c is a cysteine in x. In this way, R^{-1}(x) = cys(x), the set of cysteines in x.
The kernel function obtained in this way is thus based on pairwise comparisons between all cysteines in two given chains. Since the sum can be interpreted as a kind of soft OR operator, the intuitive meaning of Equation 2 is that two chains are dissimilar if no cysteine in one chain has a conserved amino acid environment that is similar to that of another cysteine in the other chain.
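The all-pairs construction described above can be sketched in a few lines. The sketch assumes that each cysteine environment has already been encoded as a fixed-length numeric vector (for example a flattened profile window) and that the local similarity is a Gaussian kernel of width γ; both the encoding and the names `local_kernel` and `all_pairs_kernel` are illustrative assumptions, not the code used in the experiments.

```python
import math

def local_kernel(u, v, gamma=0.1):
    """Gaussian similarity between two environment vectors (assumed encoding)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def all_pairs_kernel(envs_x, envs_y, gamma=0.1):
    """Sum the local similarity over every pair (one cysteine from each chain)."""
    return sum(local_kernel(u, v, gamma) for u in envs_x for v in envs_y)

# Toy chains with 2 and 3 cysteines, each environment a 3-dimensional vector.
chain_a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
chain_b = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5.0, 5.0, 5.0]]
k_ab = all_pairs_kernel(chain_a, chain_b)
k_ba = all_pairs_kernel(chain_b, chain_a)
```

A single well-matched pair of environments already contributes a value close to 1 to the sum, which is the "soft OR" behavior mentioned in the text. Note that the raw sum grows with the number of cysteines in each chain; whether and how to normalize it is left open here.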
Combination between the all-pairs kernel and DISULFIND
We describe here the strategy used to obtain a weighted majority voting algorithm based on the predictions from DISULFIND [5] and from the APTK. DISULFIND outputs a conditional probability p, while the APTK outputs a real-valued margin in (−∞, +∞). In order to combine the two classifiers, we first convert the DISULFIND output into a pseudo-margin $f_d = \log \frac{p}{1-p}$. This operation can be interpreted as the inverse of converting margins into conditional probabilities by using a sigmoidal function as described in [32]. The margin from the APDK, f, and the DISULFIND pseudo-margin, f_{d}, are then summed and the prediction about the bonding state of a cysteine is obtained by taking the sign of the result. Note that the margin f depends on the entire chain while the pseudo-margin f_{d} depends on the particular cysteine. Thus, combined predictions are cysteine-specific.
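The combination rule just described amounts to the following small computation. The function names are hypothetical, the clipping constant `eps` is an added safeguard against probabilities of exactly 0 or 1, and treating a sum of exactly zero as "free" is an arbitrary choice of this sketch.

```python
import math

def pseudo_margin(p, eps=1e-12):
    """Convert a conditional probability into a log-odds pseudo-margin f_d."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0) and division by zero
    return math.log(p / (1.0 - p))

def combined_bonding_state(chain_margin, cysteine_probs):
    """Sum the chain-level margin f with each cysteine's f_d and take the sign.

    Returns one boolean per cysteine: True means predicted as bonded.
    """
    return [chain_margin + pseudo_margin(p) > 0 for p in cysteine_probs]

# One chain with margin f = 0.5 and three cysteines with different probabilities.
states = combined_bonding_state(0.5, [0.5, 0.2, 0.9])
```

In this toy case the second cysteine is predicted free because its pseudo-margin log(0.2/0.8) ≈ −1.39 outweighs the positive chain-level margin.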
Prediction of connectivity patterns
The above formula can be explained as follows. Assuming the number of bridges is B, there are (2B − 1)!! alternative patterns. The number of bridges in a chain with n cysteines can vary from 1 to $\lfloor n/2 \rfloor$ and, for each case, there are $\binom{n}{2B}$ possible subsets of cysteines that form these bridges.
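Read literally, the counting argument above gives $\sum_{B=1}^{\lfloor n/2 \rfloor} \binom{n}{2B}(2B-1)!!$ candidate patterns for a chain with n cysteines. The short sketch below evaluates this sum; the function names are ours and the expression is our reading of the explanation, so it should be checked against the article's own formula.

```python
from math import comb

def double_factorial_odd(b):
    """(2b - 1)!!: the number of ways to pair up 2b bonded cysteines."""
    result = 1
    for k in range(1, 2 * b, 2):
        result *= k
    return result

def n_patterns(n):
    """Number of connectivity patterns with at least one bridge among n cysteines."""
    return sum(comb(n, 2 * b) * double_factorial_odd(b) for b in range(1, n // 2 + 1))

counts = {n: n_patterns(n) for n in (2, 4, 5, 10)}
```

For n = 4, for instance, there are 6 ways to choose one bonded pair plus 3 ways to pair up all four cysteines, i.e. 9 patterns; with known bonding state and B = 5 bridges the count is (2·5 − 1)!! = 945.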
The above analyses suggest that connectivity patterns may be predicted by defining a set of multiclass classification problems (one for each value of n) where the class corresponds to the pattern. Although the number of classes can be high for many values of n, experimental results show that this approach can be successfully pursued.
and then assign to a test chain the same class as the nearest neighbor in the training set.
Thus we did not try to learn any complex combination of the two distances. Note that definition (8) implies that two chains having exactly the same CSP have zero distance.
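A nearest-neighbor predictor of this kind can be sketched as follows. Several details are assumptions made for illustration only: the cysteine separation profile is taken as the list of sequence separations between consecutive cysteines, profiles are compared by the sum of absolute differences, the tuple kernel uses a Gaussian local similarity, and the two distances are combined by a plain product (which is at least consistent with the remark that identical CSPs give zero distance). None of the function names come from the article.

```python
import math

def csp(positions):
    """Separations between consecutive cysteines (assumed CSP definition)."""
    return [b - a for a, b in zip(positions, positions[1:])]

def csp_distance(pos_x, pos_y):
    return sum(abs(a - b) for a, b in zip(csp(pos_x), csp(pos_y)))

def local_kernel(u, v, gamma=0.1):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def tuple_kernel(envs_x, envs_y):
    """Compare the k-th environment of one chain with the k-th of the other."""
    return sum(local_kernel(u, v) for u, v in zip(envs_x, envs_y))

def kernel_distance(envs_x, envs_y):
    """Distance induced by the tuple kernel in feature space."""
    sq = (tuple_kernel(envs_x, envs_x) - 2 * tuple_kernel(envs_x, envs_y)
          + tuple_kernel(envs_y, envs_y))
    return math.sqrt(max(sq, 0.0))

def predict_pattern(positions, envs, training):
    """Return the pattern of the nearest training chain with as many cysteines.

    `training` holds (positions, envs, pattern) triples; raises ValueError
    if no training chain has the same number of cysteines.
    """
    candidates = [t for t in training if len(t[0]) == len(positions)]
    nearest = min(candidates,
                  key=lambda t: csp_distance(positions, t[0]) * kernel_distance(envs, t[1]))
    return nearest[2]

training = [
    ([10, 20, 30, 40], [[0.0, 0.0]] * 4, "1-2, 3-4"),
    ([5, 50, 55, 100], [[1.0, 1.0]] * 4, "1-4, 2-3"),
]
predicted = predict_pattern([12, 22, 33, 43], [[0.1, 0.0]] * 4, training)
```

Because the predictor only returns patterns seen in training, it cannot output a pattern that was never observed, which matches the behavior discussed in the results.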
Declarations
Acknowledgements
We are grateful to Alessio Ceroni and Alessandro Vullo for useful discussions about the methods developed in this paper. The work of MV and ML was supported by the Marie Curie Early Stage Training programme BIOPTRAIN (contract no. MEST-CT-2004-007597). The work of PF and AP was supported by EU STREP APrIL II (contract no. FP6-508861) and by EU NoE BIOPATTERN (contract no. FP6-508803).
Authors’ Affiliations
References
 Fariselli P, Casadio R: Prediction of disulfide connectivity in proteins. Bioinformatics 2001, 17(10):957–964. 10.1093/bioinformatics/17.10.957View ArticlePubMedGoogle Scholar
 Ferrè F, Clote P: DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification. Nucleic Acids Research 2006, 34: W182W185. 10.1093/nar/gkl189PubMed CentralView ArticlePubMedGoogle Scholar
 Tsai CH, Chen BJ, Chan CH, Liu HL, Kao CY: Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics 2005, 21(24):4416–4419. 10.1093/bioinformatics/bti715View ArticlePubMedGoogle Scholar
 Cheng J, Randall AZ, Sweredoski MJ, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005, (33 Web Server):W72W76. 10.1093/nar/gki396Google Scholar
 Ceroni A, Passerini A, Vullo A, Frasconi P: DISULFIND: a Disulfide Bonding State and Cysteine Connectivity Prediction Server. Nucleic Acids Research 2006, 34(Web Server):W177W181. 10.1093/nar/gkl266PubMed CentralView ArticlePubMedGoogle Scholar
 Cheng J, Saigo H, Baldi P: Largescale prediction of disulphide bridges using kernel methods, twodimensional recursive neural networks, and weighted graph matching. Proteins 2006, 62(3):617–629. 10.1002/prot.20787View ArticlePubMedGoogle Scholar
 Fiser A, Cserzo M, Tudos E, Simon I: Different sequence environments of cysteines and half cystines in proteins. Application to predict disulfide forming residues. FEBS Lett 1992, 302(2):117–20. 10.1016/00145793(92)80419HView ArticlePubMedGoogle Scholar
 Fariselli P, Riccobelli P, Casadio R: Role of evolutionary information in predicting the disulfidebonding state of cysteine in proteins. Proteins 1999, 36(3):340–346. 10.1002/(SICI)10970134(19990815)36:3<340::AIDPROT8>3.0.CO;2DView ArticlePubMedGoogle Scholar
 Fiser A, Simon I: Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics 2000, 16(3):251–256. 10.1093/bioinformatics/16.3.251View ArticlePubMedGoogle Scholar
 MucchielliGiorgi M, Hazout S, Tuffery P: Predicting the Disulfide Bonding State of Cysteines Using Protein Descriptors. Proteins 2002, 46: 243–249. 10.1002/prot.10047View ArticlePubMedGoogle Scholar
 Ceroni A, Frasconi P, Passerini A, Vullo A: Predicting the Disulfide Bonding State of Cysteines with Combinations of Kernel Machines. Journal of VLSI Signal Processing 2003, 35(3):287–295. [ps/jvlsi03cys.pdf] 10.1023/B:VLSI.0000003026.58068.ceView ArticleGoogle Scholar
 Song JN, Wang ML, Li WJ, Xu WB: Prediction of the disulfidebonding state of cysteines in proteins based on dipeptide composition. Biochem Biophys Res Commun 2004, 318: 142–147. 10.1016/j.bbrc.2004.03.189View ArticlePubMedGoogle Scholar
 Martelli PL, Fariselli P, Casadio R: Prediction of disulfide-bonded cysteines in proteomes with a hidden neural network. Proteomics 2004, 4(6):1665–1671. 10.1002/pmic.200300745
 Chen YC, Lin YS, Lin CJ, Hwang JK: Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins 2004, 55(4):1036–1042. 10.1002/prot.20079
 Vullo A, Frasconi P: Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics 2004, 20(5):653–659. 10.1093/bioinformatics/btg463
 Taskar B, Chatalbashev V, Koller D, Guestrin C: Learning Structured Prediction Models: A Large Margin Approach. Proceedings of the Twenty-Second International Conference on Machine Learning (ICML05) 2005.
 Ferrè F, Clote P: Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics 2005, 21(10):2336–2346. 10.1093/bioinformatics/bti328
 Zhao E, Liu HL, Tsai CH, Tsai HK, Chan CH, Kao CY: Cysteine separations profiles on protein sequences infer disulfide connectivity. Bioinformatics 2005, 21(8):1415–1420. 10.1093/bioinformatics/bti179
 Chen YC, Hwang JK: Prediction of disulfide connectivity from protein sequences. Proteins 2005, 61(3):507–512. 10.1002/prot.20627
 Chen BJ, Tsai CH, Chan CH, Kao CY: Disulfide connectivity prediction with 70% accuracy using two-level models. Proteins 2006, 64: 246–252. 10.1002/prot.20972
 Lu CH, Chen YC, Yu CS, Hwang JK: Predicting disulfide connectivity patterns. Proteins 2007, 67(2):262–270. 10.1002/prot.21309
 Gold C, Sollich P: Model Selection for Support Vector Machine Classification. Neurocomputing 2003, 55: 221. 10.1016/S0925-2312(03)00375-8
 Joachims T: Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods – Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press; 1999. [http://svmlight.joachims.org/]
 Hobohm U, Scharf M, Schneider R, Sander C: Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science 1992, 1: 409–417.
 Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Science 1994, 3: 522.
 PDBselect [http://bioinfo.tg.fhgiessen.de/pdbselect/]
 Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211
 DIpro [http://contact.ics.uci.edu/intro.html]
 CysPred [http://www.biocomp.unibo.it/piero/cyspred/cysdataset.tgz]
 Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
 Haussler D: Convolution Kernels on Discrete Structures. In Tech Rep UCSC-CRL-99-10. University of California, Santa Cruz; 1999.
 Platt J: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Edited by: Smola A, Bartlett P, Schölkopf B, Schuurmans D. MIT Press; 1999.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.