Combining classifiers for improved classification of proteins from sequence or structure
© Melvin et al; licensee BioMed Central Ltd. 2008
Received: 01 February 2008
Accepted: 22 September 2008
Published: 22 September 2008
Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage.
In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower accuracy nearest neighbor method with higher accuracy but reduced coverage multiclass SVMs to produce a full coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another using a learned threshold.
In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage.
Code and data sets are available at http://noble.gs.washington.edu/proj/sabretooth
To facilitate the automatic annotation of newly sequenced proteins or newly resolved protein structures, we are interested in developing computational methods to automatically assign proteins to structural and functional categories. Traditional computational methods for comparing protein structures depend on pairwise structural alignment programs such as CE [1], DALI [2] or MAMMOTH [3]. Similarly, sequence-based algorithms such as Smith-Waterman [4], BLAST [5], SAM-T98 [6] and PSI-BLAST [7] assign similarity scores to pairs of protein sequences. Using pairwise comparisons of a query sequence or structure against a curated database, one can use any of these tools to implement a nearest neighbor (NN) strategy to classify the query.
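The nearest neighbor strategy can be illustrated with a minimal sketch. It assumes that the alignment tool has already produced one similarity score per database entry and that larger scores mean greater similarity (for E-values the comparison would be reversed). The function name, the domain identifiers and the class labels are hypothetical and serve only as an illustration.

```python
def nearest_neighbor_class(query_scores, labels):
    """Return (label, score) of the database entry most similar to the query.

    query_scores: dict mapping database entry id -> similarity to the query
                  (assumed: higher means more similar).
    labels:       dict mapping database entry id -> known class label.
    """
    if not query_scores:
        raise ValueError("the query has no scored database entries")
    best_id = max(query_scores, key=query_scores.get)
    return labels[best_id], query_scores[best_id]


# Hypothetical scores of one query against three annotated database entries.
scores = {"d1abc_": 12.4, "d2xyz_": 48.9, "d3foo_": 7.1}
labels = {"d1abc_": "a.1.1", "d2xyz_": "b.40.4", "d3foo_": "c.2.1"}
nn_label, nn_score = nearest_neighbor_class(scores, labels)
```

The returned score doubles as a confidence value for the prediction, which is what a thresholding scheme needs.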
In 1999, Jaakkola et al. [8] first applied the support vector machine (SVM) classifier [9] to the problem of predicting a protein's structural class from its amino acid sequence. They focused on a particular protein structural hierarchy called the Structural Classification of Proteins (SCOP) [10], and they trained SVMs to recognize novel families within a given superfamily. This seminal work led to the development of many SVM-based protein classifiers (reviewed in [11]), and this work continues up to the present [12–15].
Primarily, these classifiers differ in their kernel functions. In this context, a kernel is a function that defines similarities between pairs of proteins. For this task, a good kernel function is one that allows the SVM to separate proteins easily according to their SCOP categories. In the experiments reported here, we train SVMs to classify amino acid sequences into SCOP superfamilies using the profile kernel [16], which is among the best-performing SVM-based methods.
More recently, several groups have extended SVM-based methods to the classification of protein structures, rather than protein sequences [17–19]. In the current work, for prediction of SCOP superfamilies from structures, we train SVMs using a kernel function based on MAMMOTH [3]. Benchmark experiments have shown that SVM-based discrimination with a MAMMOTH kernel outperforms several other SVM-based methods and also outperforms using MAMMOTH in a nearest neighbor fashion.
In this work, we aim to address a fundamental limitation of any SVM-based method, namely, that an SVM can only be trained when a sufficient number of training examples are available. In particular, to train an SVM to recognize a given SCOP category, we must be able to present to the SVM at least a handful of representative proteins. For under-represented SCOP categories, the SVM cannot be trained, and as a result, the classifier has limited coverage. For example, in SCOP version 1.69, 60.2% of the superfamilies contain three or fewer proteins. Failing to make predictions for these small superfamilies significantly decreases the effective accuracy of the SVM-based method, making it impractical for automated classification of the entire SCOP hierarchy.
In this study, we develop a hybrid machine learning approach that we apply to the problems of classifying proteins from sequence or from structure. Our goal is to combine nearest neighbor methods, which in principle have complete coverage over any given data set, with higher accuracy but reduced coverage multiclass SVM approaches to produce a full coverage method with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another. We use held-out data to learn a set of score thresholds. At test time, predictions from the primary method that receive scores below the threshold are "punted" to the secondary method. In addition, we consider different coverage thresholds at which to punt out of the secondary method (i.e., abstain from making a prediction altogether), and we compute error rates of the hybrid method at these different coverage levels.
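The punting rule can be sketched in a few lines. The sketch assumes that each component method returns a (predicted class, confidence score) pair and that a threshold has been learned for every class the primary method can predict. The function name, the threshold values and the class labels are hypothetical.

```python
def hybrid_predict(primary, secondary, thresholds):
    """Keep the primary prediction unless its score falls below the threshold.

    primary, secondary: (label, score) pairs from the two component methods.
    thresholds:         dict mapping each class the primary method can predict
                        to its learned score threshold (assumed to be complete).
    Returns (label, source), where source records which method answered.
    """
    label, score = primary
    if score >= thresholds[label]:
        return label, "primary"
    # The score is below the class-specific threshold: punt to the secondary method.
    return secondary[0], "secondary"


# Hypothetical thresholds learned on held-out data.
thresholds = {"a.1.1": 0.5, "b.40.4": 1.2}
kept = hybrid_predict(("a.1.1", 0.8), ("c.2.1", 30.0), thresholds)
punted = hybrid_predict(("b.40.4", 0.3), ("c.2.1", 30.0), thresholds)
```

Only the primary method's score is compared with a threshold here, so the two methods do not need to produce scores on a common scale.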
We use this punting method to build hybrid predictors of SCOP superfamilies, taking as input either protein sequences or structures. Using punting, we find that the hybrid methods consistently outperform the individual component methods at all levels of coverage.
We compare this method to a few simple variants. First, we can apply punting to a single method, rather than a hybrid method. In this setting, when the punting algorithm decides to punt, there is simply no prediction made at all. Second, for a given method, rather than having a vector of class-specific score thresholds T, we can use a single threshold that applies to all of the classes predicted. This threshold is selected so that the class-specific SVMs collectively achieve the user-specified coverage on the threshold training set. The motivation for this simpler thresholding strategy is to reduce the risk of overfitting on the threshold training set. If the confidence scores are well calibrated, then this single threshold approach should also perform well; conversely, if the scores are not well calibrated then the multi-threshold method should perform better.
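The difference between the two thresholding variants can be made concrete with a small sketch. It assumes that the threshold training set has been reduced to (predicted class, score) pairs and that a threshold is chosen as the score of the last prediction kept when the top fraction `coverage` is retained. Applying the same target coverage within each class is an assumption made for illustration and is not necessarily the selection rule used in our experiments; all names are hypothetical.

```python
import math
from collections import defaultdict


def single_threshold(scores, coverage):
    """Cutoff such that at least `coverage` of the scores lie at or above it."""
    if not scores or not 0.0 < coverage <= 1.0:
        raise ValueError("need scores and a coverage in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = math.ceil(coverage * len(ranked))  # number of predictions retained
    return ranked[keep - 1]


def per_class_thresholds(predictions, coverage):
    """One cutoff per predicted class (assumed: same target coverage per class)."""
    by_class = defaultdict(list)
    for label, score in predictions:
        by_class[label].append(score)
    return {label: single_threshold(s, coverage) for label, s in by_class.items()}


# Hypothetical held-out predictions as (class, score) pairs.
held_out = [("a", 0.9), ("a", 0.1), ("b", 0.5), ("b", 0.7)]
shared = single_threshold([score for _, score in held_out], 0.5)
vector = per_class_thresholds(held_out, 0.5)
```

When the scores of different classes live on different scales, the shared cutoff can retain almost nothing from a low-scoring class, which is the situation in which the vector of thresholds is expected to help.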
We tested two methods for predicting SCOP superfamilies. In the first, we made predictions from amino acid sequences, and in the second we made predictions from protein structures. For prediction from amino acid sequence, we used pairwise alignments based on PSI-BLAST for the nearest neighbor method, and we used the profile kernel [16] to define a kernel representation. For prediction from protein structure, we used structural alignments based on MAMMOTH, both for the nearest neighbor method and to define a kernel representation for training SVMs to recognize SCOP superfamilies. For simplicity, in both cases we used a standard one-vs-all approach for making multiclass predictions from binary SVM classifiers.
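As a reminder of how one-vs-all prediction works, the following sketch assumes that one binary classifier per superfamily has already produced a real-valued output for the query; the multiclass prediction is the class with the largest output. The function name, class labels and output values are hypothetical.

```python
def one_vs_all_predict(binary_outputs):
    """Return (label, output) for the binary classifier with the largest output.

    binary_outputs: dict mapping class label -> real-valued output of that
                    class's binary classifier for a single query.
    """
    if not binary_outputs:
        raise ValueError("need at least one binary classifier output")
    label = max(binary_outputs, key=binary_outputs.get)
    return label, binary_outputs[label]


# Hypothetical outputs of three superfamily-specific binary SVMs for one query.
svm_outputs = {"a.1.1": -0.7, "b.40.4": 0.4, "c.2.1": -1.3}
predicted, margin = one_vs_all_predict(svm_outputs)
```

The winning output is retained because it can serve as the confidence score that is later compared against a punting threshold.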
We divided the data set (all of SCOP version 1.69) into four parts: A_trn, A_tst, B_trn and B_tst. We defined A_trn and A_tst to meet the requirements of training and testing binary SVM superfamily classifiers: A_tst consists of entirely held-out families from superfamilies that have two or more member families of at least three proteins each; A_trn consists of all other families belonging to these superfamilies. Data set B consists of all superfamilies in SCOP that are not covered by data set A. B is then split at random by family into training and test sets such that the ratio of families B_tst/B_trn equals the ratio A_tst/A_trn. The data set for superfamily detection has 74 superfamilies in A and 1458 superfamilies in B (1532 in total).
We considered punting both from SVMs to the nearest neighbor method and vice versa. When using SVMs as the primary method, we used B_trn as additional negative examples on which to calculate punting thresholds. In the reverse case, because the nearest-neighbor method had accrued no bias in "training," we used all of the negative superfamilies in A_trn and B_trn to determine thresholds.
Table 1: Superfamily detection error rates at full coverage, evaluated on A_tst + B_tst.
Results from the classification of protein structures are shown in the right half of Table 1. For this task, moving from MAMMOTH alone to the MAMMOTH → SVM hybrid reduces the error rate by 4.5 percentage points (from 30.8% to 26.3%). Again, McNemar's test shows that both hybrid methods outperform both of the single classifiers at p < 0.01.
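For readers unfamiliar with McNemar's test, the sketch below shows how two classifiers evaluated on the same test examples can be compared. It uses the exact binomial form of the test, which is an assumption: the text does not state which variant was used. The function name and the toy correctness vectors are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-example correctness flags.

    Returns (b, c, p_value), where b counts examples that only classifier A
    gets right and c counts examples that only classifier B gets right.
    """
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy example: B corrects eight of A's errors and introduces none.
a_correct = [False] * 8 + [True] * 5
b_correct = [True] * 8 + [True] * 5
b_only_a, c_only_b, p_value = mcnemar_exact(a_correct, b_correct)
```

Examples on which both classifiers agree do not enter the statistic, so the test remains meaningful even when overall error rates are close.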
Punting once versus punting twice
In practical applications, it may be preferable for the classifier to say "I don't know" rather than return an incorrect classification. To achieve this behavior, we included a second level of punting, based on a second set of thresholds (Figure 1B). This strategy allows the classifier to punt completely and not give a prediction for an example. The target percentages for both the primary and final punting thresholds were varied for both hybrid methods, yielding a range of coverage and error rates.
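Two-level punting and the resulting coverage and error measurements can be sketched as follows. The sketch assumes that each method returns a (class, score) pair, that both threshold sets contain an entry for every class the corresponding method can predict, and that the error rate is computed over the examples that receive a prediction; the last point is an assumption about the bookkeeping. All names and values are hypothetical.

```python
def punt_twice(primary, secondary, primary_thresholds, final_thresholds):
    """Return a class label, or None when both methods punt (abstention)."""
    label, score = primary
    if score >= primary_thresholds[label]:
        return label
    label, score = secondary
    if score >= final_thresholds[label]:
        return label
    return None  # "I don't know"


def coverage_and_error(predictions, truth):
    """Coverage over all examples; error rate over the predicted ones only."""
    made = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    coverage = len(made) / len(truth)
    error = sum(p != t for p, t in made) / len(made) if made else 0.0
    return coverage, error


# Hypothetical thresholds for a primary and a secondary method.
t1 = {"a": 0.5, "b": 0.5}
t2 = {"a": 10.0, "b": 10.0}
examples = [
    (("a", 0.9), ("b", 50.0)),  # primary prediction is kept
    (("a", 0.1), ("b", 2.0)),   # both methods punt
    (("a", 0.2), ("b", 20.0)),  # punted to the secondary method
    (("a", 0.6), ("b", 1.0)),   # primary prediction is kept
]
preds = [punt_twice(p, s, t1, t2) for p, s in examples]
coverage, error = coverage_and_error(preds, ["a", "b", "b", "b"])
```

Raising the final thresholds lowers coverage, so sweeping both sets of thresholds traces out a curve of error rate against coverage.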
Comparing Figure 4A and 4B, we see a different overall trend for the two classification tasks. For the sequence classification problem, as coverage approaches 100%, the two methods end up sharing predictions almost 50/50. In contrast, for the structure classification problem, the SVM method converges to fewer predictions – MAMMOTH makes approximately twice as many predictions as the SVM at full coverage. This observation may explain why the improvement provided by the hybrid classifier is smaller in the structure classification problem (4.5% decrease in error) compared with the sequence classification problem (10.8% decrease). For the structure classification task, the high coverage classifier (MAMMOTH) is already very good, so adding a second, supervised classifier does not yield a large improvement.
Single versus multiple thresholds
Combining low and high coverage methods
As mentioned above, approximately 60% of the SCOP superfamilies in our data set contain fewer than three members. The punting methodology allows us to predict members of these superfamilies, even though an SVM is not trained for small superfamilies. Moreover, if the high coverage (NN) classifier incorrectly places a member of a large superfamily into a small superfamily, then the low coverage classifier (SVM) can correct this error, because it has high accuracy for large superfamilies.
If the effect shown in Figure 6 were not a problem – i.e., if both classifiers worked well enough across all superfamily sizes – then one could use standard methods for combining classifiers, such as a voting scheme. However, even in such a case, one would still not be able to control the accuracy versus coverage of predictions. This flexibility, which is provided by the punting strategy, is one of the main contributions of our work.
Punting as stacked generalization
Stacked generalization [20] is a general scheme for optimizing the generalization error of several classifiers by learning how to combine them accurately. The basic idea is to (i) train each classifier on the same problem and then (ii) use a second set of data to learn a combining scheme when using these classifiers. One example of this approach is that in stage (ii) one could construct a feature space whose inputs are the guesses of the classifiers trained in stage (i), so training a linear classifier in stage (ii) would mean learning a weighted majority vote over the classifiers. However, the stacked generalization approach, as Wolpert describes it, can include any two-stage method of combination. In that sense, our punting method is an instance of stacked generalization where the second stage learns a function that chooses which classifier to apply, depending on the magnitude of the real-valued outputs (i.e., the classifier decides when to punt). Just as in stacked generalization, we divide our data set into two portions: one for training stage (i), the classifiers, and one for training stage (ii), the punting thresholds. However, Wolpert describes neither the use of punting for choosing classifiers nor its use for finding a trade-off between coverage and accuracy of the resulting combined classifier, making our approach a novel instance of his general scheme.
We compared the performance of our hybrid classifiers with that of the webserver AutoSCOP [21].
Table: AutoSCOP compared with the SVM → PSI-BLAST and PSI-BLAST → SVM hybrid classifiers.
We have described a simple method of combining a high coverage, low accuracy classifier with a low coverage, high accuracy classifier, based on learning a collection of class-specific thresholds from held-out data. For SCOP superfamily recognition from structure and sequence, the resulting hybrid classifiers yield consistently lower error rates across a wide range of coverage.
A priori, punting seems most intuitive when the low-coverage/high-accuracy classifier punts to the high-coverage/low-accuracy classifier. However, the results in Figure 3 suggest that, for the combination of SVM and NN classifiers applied to SCOP classification, punting in the opposite direction is slightly more effective. We speculate that the best performance will be obtained when the primary classifier is the one whose confidence scores most reliably indicate whether its predictions are correct, rather than the one with the best generalization performance. In this way, if the primary classifier always punts when it is incorrect, then the combined generalization performance can be optimized. Hence, the NN → SVM hybrid may be slightly better than the SVM → NN hybrid because the NN method punts more accurately.
One of the primary contributions of this work is to make SVM-based classifiers practically applicable. Although they have been shown to provide superior performance for protein classification problems in which the number of examples is large enough, SVMs have not been used in practice because of their limited coverage. On the other hand, the goal of this paper is not to argue that SVMs are better than other methods, but to show how to make an SVM classifier practical, by giving it complete coverage. Our results presumably generalize to other supervised classification algorithms, though we have not tested this hypothesis directly.
For simplicity of exposition, we have used a simple one-vs-all approach to multiclass SVM classification. In practice, it is generally preferable to use a more complex multiclass approach such as code learning [13]. Combining code learning with the punting approach described here yields even lower error rates than are shown in Figure 3 (data not shown). In general it is straightforward to combine any pair of (low and high coverage) classifiers using our approach. The only prerequisite is that they provide a real-valued output for each class, and that these values are correlated with the confidence in their predictions. From these outputs we can learn punting thresholds.
In this work, we use a relatively simple strategy to define the data for learning punting thresholds given the user-specified hyperparameter ρ. More complex internal cross-validation schemes would likely yield slightly better performance, at the cost of increased running time.
Eventually, rather than combining two existing classifiers, we would like to train a single classifier that has the advantages of both systems in one. This approach would obviate the need for the punting strategy described here. We are currently investigating approaches to this problem by training a ranking-based algorithm, rather than a class predictor.
This work was supported by National Institutes of Health award R01 GM74257.
1. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11: 739–747. 10.1093/protein/11.9.739
2. Holm L, Sander C: Protein Structure Comparison by Alignment of Distance Matrices. Journal of Molecular Biology 1993, 233: 123–138. 10.1006/jmbi.1993.1489
3. Ortiz AR, Strauss CEM, Olmea O: MAMMOTH (Matching molecular models obtained from theory): An automated method for model comparison. Protein Science 2002, 11: 2606–2621. 10.1110/ps.0215902
4. Smith T, Waterman M: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5
5. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: A basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410.
6. Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14(10):846–56. 10.1093/bioinformatics/14.10.846
7. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
8. Jaakkola T, Diekhans M, Haussler D: Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press; 1999:149–158.
9. Boser BE, Guyon IM, Vapnik VN: A Training Algorithm for Optimal Margin Classifiers. In 5th Annual ACM Workshop on COLT. Edited by: Haussler D. Pittsburgh, PA: ACM Press; 1992:144–152.
10. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 1995, 247: 536–540.
11. Noble WS: Support vector machine applications in computational biology. In Kernel methods in computational biology. Edited by: Schoelkopf B, Tsuda K, Vert JP. Cambridge, MA: MIT Press; 2004:71–92.
12. Melvin I, Ie E, Kuang R, Weston J, Noble WS, Leslie C: SVM-fold: a tool for discriminative multi-class protein fold and superfamily recognition. BMC Bioinformatics 2007, 8(Suppl 4):S2. 10.1186/1471-2105-8-S4-S2
13. Melvin I, Ie E, Weston J, Noble WS, Leslie C: Multi-class protein classification using adaptive codes. Journal of Machine Learning Research 2007, 8: 1557–1581.
14. Rangwala H, Karypis G: Building multiclass classifiers for remote homology detection and fold recognition. BMC Bioinformatics 2006, 7: 455. 10.1186/1471-2105-7-455
15. Shamim MT, Anwaruddin M, Nagarajaram HA: Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs. Bioinformatics 2007, 23(24):3320–3327. 10.1093/bioinformatics/btm527
16. Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y, Leslie C: Profile-based string kernels for remote homology detection and motif extraction. Journal of Bioinformatics and Computational Biology 2005, 3(3):527–550. 10.1142/S021972000500120X
17. Dobson PD, Doig AJ: Predicting Enzyme Class From Protein Structure Without Alignments. Journal of Molecular Biology 2005, 345: 187–199. 10.1016/j.jmb.2004.10.024
18. Borgwardt K, Ong CS, Schoenauer S, Vishwanathan S, Smola A, Kriegel HP: Protein Function Prediction via Graph Kernels. Bioinformatics 2005, 21(Suppl 1):i47-i56. 10.1093/bioinformatics/bti1007
19. Qiu J, Hue M, Ben-Hur A, Vert JP, Noble WS: A structural alignment kernel for protein structures. Bioinformatics 2007, 23(9):1090–1098. 10.1093/bioinformatics/btl642
20. Wolpert D: Stacked generalization. Neural Networks 1992, 5(2):241–259. 10.1016/S0893-6080(05)80023-1
21. Gewehr JE, Hintermair V, Zimmer R: AutoSCOP: Automated Prediction of SCOP Classifications using Unique Pattern-Class Mappings. Bioinformatics 2007, 23(10):1203–1210. 10.1093/bioinformatics/btm089
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.