The combination approach of SVM and ECOC for powerful identification and classification of transcription factor
© Zheng et al; licensee BioMed Central Ltd. 2008
Received: 09 January 2008
Accepted: 16 June 2008
Published: 16 June 2008
Transcription factors (TFs) are core functional proteins that play important roles in the control of gene expression, and they are key factors in the construction of gene regulation networks. Traditionally, they were identified and classified through experimental approaches. To save time and reduce costs, many computational methods have been developed to identify TFs among new proteins and to classify the resulting TFs. Although these methods have facilitated the screening of TFs to some extent, low accuracy is still a common problem. With the fast-growing number of new proteins, more precise algorithms for identifying TFs among new proteins and classifying the identified TFs are in high demand.
The support vector machine (SVM) algorithm was used to construct an automatic detector for TF identification, with protein domains and functional sites employed as feature vectors. The error-correcting output coding (ECOC) algorithm, which originated in the fields of information and communication engineering, was combined with the SVM methodology for TF classification. The overall success rates of identification and classification were 88.22% and 97.83%, respectively. Finally, a web site was constructed to let users access our tools (see Availability and requirements section for URL).
The SVM method is a valid and stable means of TF identification with protein domains and functional sites as feature vectors. The error-correcting output coding (ECOC) algorithm is a powerful method for multi-class classification problems. When combined with the SVM method, it can remarkably increase the accuracy of TF classification using protein domains and functional sites as feature vectors. In addition, our work implies that the ECOC algorithm may succeed in a broad range of applications in biological data mining.
Transcription factors (TFs) are special DNA-binding proteins that are commonly recognized by RNA polymerases for transcription initiation. Under certain physiological conditions, TFs effectively regulate the expression levels of downstream genes by binding to specific DNA fragments in the promoter regions. This process is closely related to important biological processes such as activation of the cell cycle, regulation of differentiation, and maintenance of immunologic tolerance [1–3]. Generally, according to their structure and function, TFs can be grouped into four classes: (1) TFs with basic domains (basic-TFs), (2) TFs with zinc-coordinating DNA-binding domains (zinc-TFs), (3) TFs with helix-turn-helix domains (helix-TFs), and (4) beta-scaffold TFs (beta-TFs). It is well known that the mechanisms of interaction between TFs and motifs differ among the types of TFs [4–6]. Therefore, identifying and classifying TFs is an important task for protein functional annotation and for the investigation of interaction mechanisms in this post-genome era.
Traditionally, a transcription factor, as a special case of a DNA-binding protein, is identified and classified by biochemical experiments, which can be time-consuming, costly, and difficult to apply on a large scale. To overcome these drawbacks, computational approaches are often used. Kumar et al. developed a support vector machine method to identify DNA-binding proteins. Hwang et al. constructed a web server for predicting DNA-binding residues in DNA-binding proteins, in which three machine learning methods (support vector machine, kernel logistic regression and penalized logistic regression) were implemented. Cho et al. built a hidden Markov model to find possible DNA-binding sites for zinc finger proteins. For transcription factors, BLAST methods were applied in most cases [10–13]. In our previous work, we also constructed a simple model based on the nearest neighbor algorithm (NNA) for TF prediction.
In this paper, the support vector machine (SVM) and error-correcting output coding (ECOC) algorithms were used for TF identification and classification, respectively. SVM is a machine learning method based on structural risk minimization, and it is generally employed for two-class classification. ECOC is a method that originated in the field of information and communication engineering, and it is commonly used to solve multi-class classification problems. Protein domains have been used as prediction signatures for protein-protein interactions, protein structures [16, 17], and protein subcellular locations. In addition, some proteomics studies have indicated that a close correlation exists between functional sites (such as sites of post-transcriptional modification) and protein functions [19–21]. Therefore, we chose protein domains and functional sites as features to represent proteins and constructed a detector to distinguish TFs from non-TFs using an SVM method. Subsequently, a classifier based on the ECOC algorithm was built to categorize TFs into the four classes mentioned above. After building the detector and classifier, jackknife tests were used to assess the performance of the two programs. To further investigate the efficiency of our approach, the BLAST, NNA, and SVM methods were compared for TF identification, and the BLAST, NNA, and ECOC methods were compared for TF classification. A web server was implemented to facilitate the use of these two tools.
Results and discussion
Identification of transcription factors
A detector based on a linear SVM model was constructed to distinguish TFs from non-TFs. We built a training data set that excluded proteins not annotated with any protein domain or functional site. This training set contained 450 TFs and 1727 non-TFs [see additional file 1]. Each item in the data set was represented by a 4758-dimensional feature vector (see the "Methods" section for details).
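As a rough illustration of this kind of representation, the sketch below encodes a protein as a vector with one dimension per known protein domain or functional-site entry. The binary present/absent encoding, the InterPro-style accession strings, and the helper name `encode_protein` are assumptions made for illustration only; the actual encoding is the one described in the "Methods" section.

```python
from typing import Iterable, List, Sequence


def encode_protein(annotations: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with 1 wherever the protein carries a vocabulary entry."""
    present = set(annotations)
    return [1 if entry in present else 0 for entry in vocabulary]


# Hypothetical three-entry vocabulary; the paper's vectors have 4758 dimensions.
vocabulary = ["IPR000001", "IPR000002", "IPR000003"]

# A made-up protein annotated with the first and third entries.
vector = encode_protein(["IPR000003", "IPR000001"], vocabulary)
print(vector)  # [1, 0, 1]
```

A protein with no annotated entry would map to the all-zero vector under this encoding, which is consistent with such proteins being excluded from the training set.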
Jackknife outcomes of TF identification (table columns: data set size; jackknife test results given as true positive rate, true negative rate, and total success rate)
Comparison among BLAST, NNA and SVM algorithms
Comparison among the BLAST, NNA, and SVM algorithms (table columns: success rate (%) for each of the three methods)
Classification of transcription factors
Performance of TF classification
Class size | One-against-all | ECOC
37 | 35/37 = 94.59% | 36/37 = 97.30%
33 | 30/33 = 90.91% | 32/33 = 96.97%
36 | 34/36 = 94.44% | 35/36 = 97.22%
32 | 29/32 = 90.63% | 32/32 = 100.00%
Overall (138) | 128/138 = 92.75% | 135/138 = 97.83%
Comparison among BLAST, NNA, and ECOC algorithms
Comparison among the BLAST, NNA, and ECOC algorithms
Class size | BLAST | NNA | ECOC
37 | 25/37 = 67.57% | 34/37 = 91.89% | 36/37 = 97.30%
33 | 29/33 = 87.88% | 31/33 = 93.94% | 32/33 = 96.97%
36 | 33/36 = 91.67% | 34/36 = 94.44% | 35/36 = 97.22%
32 | 27/32 = 84.38% | 28/32 = 87.50% | 32/32 = 100.00%
Overall (138) | 114/138 = 82.61% | 127/138 = 92.03% | 135/138 = 97.83%
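The overall rates in this comparison can be re-derived from the per-class counts. The short sketch below does that arithmetic, assuming that the three values reported for each class correspond to BLAST, NNA, and ECOC in that order; the dictionary layout and the helper name `overall` are illustrative only.

```python
from typing import Dict, List, Tuple

# (correct, total) per class, in the order the classes are listed above.
# Assumption: the three values per class are BLAST, NNA, and ECOC, in that order.
results: Dict[str, List[Tuple[int, int]]] = {
    "BLAST": [(25, 37), (29, 33), (33, 36), (27, 32)],
    "NNA": [(34, 37), (31, 33), (34, 36), (28, 32)],
    "ECOC": [(36, 37), (32, 33), (35, 36), (32, 32)],
}


def overall(pairs: List[Tuple[int, int]]) -> Tuple[int, int, float]:
    """Sum the per-class counts and return (correct, total, success rate in %)."""
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return correct, total, round(100.0 * correct / total, 2)


for method, pairs in results.items():
    print(method, overall(pairs))
# BLAST (114, 138, 82.61)
# NNA (127, 138, 92.03)
# ECOC (135, 138, 97.83)
```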
A web server for the detector and classifier has been constructed to facilitate the application of the two tools. Currently, the server supports two data types: Swiss-Prot AC numbers and protein sequences in FASTA format. For a protein with a Swiss-Prot AC number, information on its protein domains and functional sites is extracted from the InterPro database. For a new sequence that is not covered by the InterPro database, we use a program named InterProScan to screen its potential protein domains and functional sites. InterProScan is a program developed by EMBL-EBI that combines different protein signature recognition methods into one system. Its input is a protein sequence in FASTA format and its output is a result file that contains the InterPro entries of the sequence. The default parameters of the program were used in our research. For more detailed information on the program, please refer to the InterProScan webpage. We have downloaded the program and combined it with our transcription factor tools. Users are required to provide an email address when submitting a new task. After the task is done, a notification email is sent to the user automatically.
In this paper, an automatic detector was built for TF identification and a multi-class classifier was constructed for TF classification. Our results indicate that protein domains and functional sites are valid features for TF identification and classification. Moreover, our research was carried out on data sets from which sequence-similarity redundancy had been removed, which means that our methods can provide a beneficial supplement to sequence-similarity-based algorithms, such as the BLAST method, for TF identification and classification. We also believe that the ECOC algorithm will find broad application in life science, for example in the classification of protein quaternary structures, the categorization of kinases and the prediction of protein subcellular localization. The detector and classifier implemented in our web server can be used as effective tools for TF discovery and annotation, especially for proteins about which little is known. Although the two tools presented here can identify and classify TFs accurately when some protein domains and/or functional sites are available, they cannot make a prediction for a protein with no annotated protein domain or functional site, since this information is required to represent the protein as a vector. However, we believe that the impact of this limitation will diminish as more protein domains and functional sites are determined by biological experiments and as more programs become able to derive them directly from protein sequences with better accuracy.
For TF identification, the SVM algorithm was employed to build the detector, which achieved an overall success rate of 88.22%. Further investigations on data sets with different sample mixtures showed that the detector was robust and stable. Moreover, with protein domains and functional sites as features, both the NNA and SVM methods perform notably better than the BLAST method. The SVM method is comparable to the NNA method for TF identification.
For TF classification, ECOC, an algorithm new to this task, was introduced. To investigate the power of the ECOC algorithm, comparisons were made on two levels. On the first level, the ECOC algorithm was used as a framework for combining binary classifiers into a multi-class classifier and was compared with a general multi-class combination algorithm named one-against-all, with the SVM method used to build the basic binary classifiers for both algorithms. This comparison showed that ECOC surpassed the general combination algorithm for multi-class classification problems. On the second level, ECOC combined with SVM as the underlying method was compared with the BLAST and NNA methods. This comparison indicated that the ECOC algorithm has strong power in correcting errors and fine-tuning performance in multi-class categorization. Considering the results on both levels, we concluded that ECOC combined with SVM is a powerful tool for TF classification.
Positive and negative datasets
Positive and negative (TF/non-TF) datasets (table rows include TFs with class information (138) and TFs without class information)
Feature vectors of a support vector machine
Support vector machine algorithm
w•X + b = 0
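A hyperplane of this form separates two classes according to the sign of w•X + b. The minimal sketch below shows how such a rule would label a feature vector; the weights, the bias, and the helper name `decide` are made up for illustration and are not the trained detector. Mapping a value of exactly zero to the negative class is an arbitrary choice made for this sketch.

```python
from typing import Sequence


def decide(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return +1 if w.x + b is positive, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


# Hypothetical parameters for a three-dimensional feature space.
w = [0.5, -1.0, 2.0]
b = -1.0

print(decide(w, [1, 0, 1], b))  # score = 0.5 + 2.0 - 1.0 = 1.5, so 1
print(decide(w, [0, 1, 0], b))  # score = -1.0 - 1.0 = -2.0, so -1
```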
Error-correcting output coding algorithm
Machine learning methods such as SVM are most commonly used to handle two-class problems. When such a method is applied to a multi-class problem, the problem must be transformed into several independent two-class tasks [34, 35]. The method is then run on each task and the outputs of these tasks are combined. If the output of any one task is wrong, the whole classifier may make an incorrect classification. The error-correcting output coding (ECOC) algorithm can effectively reduce this kind of error through redundant coding information [35–37].
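To make the transformation concrete, the following sketch relabels a multi-class data set into one binary task per class, in the one-against-all style. The class names y1 to y4, the sample labels, and the helper name `one_against_all_tasks` are placeholders chosen for illustration; training a binary classifier on each task is omitted.

```python
from typing import List, Sequence


def one_against_all_tasks(labels: Sequence[str], classes: Sequence[str]) -> List[List[int]]:
    """For each class, build a binary label list: 1 for that class, 0 for all others."""
    return [[1 if label == cls else 0 for label in labels] for cls in classes]


classes = ["y1", "y2", "y3", "y4"]
labels = ["y1", "y3", "y2", "y1", "y4"]  # made-up class labels of five samples

for cls, task in zip(classes, one_against_all_tasks(labels, classes)):
    print(cls, task)
# y1 [1, 0, 0, 1, 0]
# y2 [0, 0, 1, 0, 0]
# y3 [0, 1, 0, 0, 0]
# y4 [0, 0, 0, 0, 1]
```

Each binary list would then serve as the target of one independently trained two-class classifier.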
For an m-class problem, each class is assigned a code word (one row) of 2^(m-1) - 1 bits, built as follows:
For row 1, all bits are ones;
For row 2, there are 2^(m-2) zeros followed by 2^(m-2) - 1 ones;
For row 3, there are 2^(m-3) zeros, followed by 2^(m-3) ones, followed by 2^(m-3) zeros, followed by 2^(m-3) - 1 ones;
For row i, there are alternating runs of 2^(m-i) zeros and ones.
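The row rules above can be sketched in a few lines of code. The function below reads each run length as a power of two, 2^(m-i), and assumes a code length of 2^(m-1) - 1 bits, which gives the seven bits used for four classes; the name `exhaustive_code` is a label chosen for this sketch.

```python
from typing import List


def exhaustive_code(m: int) -> List[List[int]]:
    """Build an m x (2**(m-1) - 1) code matrix, one row (code word) per class."""
    if m < 2:
        raise ValueError("at least two classes are required")
    length = 2 ** (m - 1) - 1
    rows = [[1] * length]  # row 1: all ones
    for i in range(2, m + 1):
        run = 2 ** (m - i)  # run length for row i
        # Column j falls in run number j // run; even runs are zeros, odd runs are ones.
        rows.append([(j // run) % 2 for j in range(length)])
    return rows


for row in exhaustive_code(4):
    print(row)
# [1, 1, 1, 1, 1, 1, 1]
# [0, 0, 0, 0, 1, 1, 1]
# [0, 0, 1, 1, 0, 0, 1]
# [0, 1, 0, 1, 0, 1, 0]
```

Each column of the matrix defines one binary task, so four classes lead to seven binary classifiers.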
According to the rules mentioned above, the mapping between code words and classes for the one-against-all and ECOC algorithms can be visualized as in Table 6, where yes and no are mapped to 1 and 0, respectively. After encoding, four independent binary classifiers are built and executed for the one-against-all algorithm, and seven binary classifiers are constructed for the ECOC algorithm. With the 4-bit coding of the one-against-all algorithm, a single wrong binary classifier leads to a mistake in the final result. For instance, suppose an item belongs to class y1 and the output of the four binary classifiers is 1, 0, 1, 0. Comparing this output with the 4-bit code list, the algorithm cannot correctly categorize the item because the Hamming distances from the output to y1 and to y3 are equal. With the 7-bit coding of the ECOC algorithm, when an error occurs in one binary classifier, the algorithm can still identify the item properly by using the redundant information. For example, suppose an item belongs to class y1 and the output of the seven binary classifiers is 1, 1, 1, 1, 1, 0, 1. Comparing this output with the 7-bit code list, we can conclude that the item most likely belongs to y1 because the Hamming distance between the output and y1 is the shortest. Through this mechanism, the ECOC algorithm can correct output errors and improve classification performance for multi-class problems. In our work, we established combination classifiers for TF categorization based on the one-against-all and ECOC algorithms, respectively, with SVM used as the basic classifier. Subsequently, the performance of the one-against-all and ECOC algorithms was assessed by the jackknife test.
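The two worked examples can be checked with a small decoding sketch. The code words below are assumptions: one bit per class for one-against-all, and the 7-bit rows produced by the construction rules for ECOC. The helper names `hamming` and `nearest_classes` are illustrative.

```python
from typing import Dict, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_classes(output: Sequence[int], codebook: Dict[str, List[int]]) -> List[str]:
    """Return every class whose code word is at minimal Hamming distance from the output."""
    distances = {cls: hamming(output, word) for cls, word in codebook.items()}
    best = min(distances.values())
    return [cls for cls, d in distances.items() if d == best]


# Assumed code words (not copied from Table 6).
ONE_AGAINST_ALL = {
    "y1": [1, 0, 0, 0],
    "y2": [0, 1, 0, 0],
    "y3": [0, 0, 1, 0],
    "y4": [0, 0, 0, 1],
}
ECOC = {
    "y1": [1, 1, 1, 1, 1, 1, 1],
    "y2": [0, 0, 0, 0, 1, 1, 1],
    "y3": [0, 0, 1, 1, 0, 0, 1],
    "y4": [0, 1, 0, 1, 0, 1, 0],
}

print(nearest_classes([1, 0, 1, 0], ONE_AGAINST_ALL))  # ['y1', 'y3'] (tie, ambiguous)
print(nearest_classes([1, 1, 1, 1, 1, 0, 1], ECOC))    # ['y1'] (distance 1)
```

With these code words, any two ECOC rows differ in four positions, so a single flipped bit always leaves the true class strictly closest, whereas the one-against-all rows differ in only two positions and a single flip can produce a tie.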
Availability and requirements
Table 6: Coding words for multi-class task
We thank Yudong Cai, Guohui Ding, and Tu Kang for their suggestions on the article. This work was supported by grants from the High-Tech Research and Development Program of China (No. 2006AA02Z329), the National Natural Science Foundation of China (No. 60573093), the National Basic Research Program of China (No. 2006CB910700, 2004CB720103, 2004CB518606, 2003CB715901), the Funding of Chinese Academy of Sciences (No. KSCX2-YW-R-112) and the Shanghai Pujiang Program (06PJ14073).
- Duncan SA, Navas MA, Dufort D, Rossant J, Stoffel M: Regulation of a transcription factor network required for differentiation and metabolism. Science 1998, 281: 692–695. 10.1126/science.281.5377.692
- Hori S, Nomura T, Sakaguchi S: Control of regulatory T cell development by the transcription factor Foxp3. Science 2003, 299: 1057–1061. 10.1126/science.1079490
- Vaughan PS, Aziz F, van Wijnen AJ, Wu S, Harada H, Taniguchi T, Soprano KJ, Stein JL, Stein GS: Activation of a cell-cycle-regulated histone gene by the oncogenic transcription factor IRF-2. Nature 1995, 377: 362–365. 10.1038/377362a0
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31: 374–378. 10.1093/nar/gkg108
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34: D108–10. 10.1093/nar/gkj143
- Pabo CO, Sauer RT: Transcription factors: structural families and principles of DNA recognition. Annu Rev Biochem 1992, 61: 1053–1095. 10.1146/annurev.bi.61.070192.005201
- Kumar M, Gromiha MM, Raghava GP: Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics 2007, 8: 463. 10.1186/1471-2105-8-463
- Hwang S, Gou Z, Kuznetsov IB: DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 2007, 23: 634–636. 10.1093/bioinformatics/btl672
- Cho SY, Chung M, Park M, Park S, Lee YS: ZIFIBI: Prediction of DNA binding sites for zinc finger proteins. Biochem Biophys Res Commun 2008, 369: 845–848. 10.1016/j.bbrc.2008.02.106
- Ghosh D: Object-oriented transcription factors database (ooTFD). Nucleic Acids Res 2000, 28: 308–310. 10.1093/nar/28.1.308
- Guo A, He K, Liu D, Bai S, Gu X, Wei L, Luo J: DATF: a database of Arabidopsis transcription factors. Bioinformatics 2005, 21: 2568–2569. 10.1093/bioinformatics/bti334
- Bork P, Doerks T, Springer TA, Snel B: Domains in plexins: links to integrins and transcription factors. Trends Biochem Sci 1999, 24: 261–263. 10.1016/S0968-0004(99)01416-4
- Iida K, Seki M, Sakurai T, Satou M, Akiyama K, Toyoda T, Konagaya A, Shinozaki K: RARTF: database and tools for complete sets of Arabidopsis transcription factors. DNA Res 2005, 12: 247–256. 10.1093/dnares/dsi011
- Qian Z, Cai YD, Li Y: Automatic transcription factor classifier based on functional domain composition. Biochem Biophys Res Commun 2006, 347: 141–144. 10.1016/j.bbrc.2006.06.060
- Wojcik J, Schachter V: Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics 2001, 17 Suppl 1: S296–305.
- Chou KC, Cai YD: Predicting protein structural class by functional domain composition. Biochem Biophys Res Commun 2004, 321: 1007–1009. 10.1016/j.bbrc.2004.07.059
- Yu X, Wang C, Li Y: Classification of protein quaternary structure by functional domain composition. BMC Bioinformatics 2006, 7: 187. 10.1186/1471-2105-7-187
- Jia P, Qian Z, Zeng Z, Cai Y, Li Y: Prediction of subcellular protein localization based on functional domain composition. Biochem Biophys Res Commun 2007, 357: 366–370. 10.1016/j.bbrc.2007.03.139
- Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C, Andersen CA, Knudsen S, Krogh A, Valencia A, Brunak S: Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002, 319: 1257–1265. 10.1016/S0022-2836(02)00379-0
- Bode AM, Dong Z: Post-translational modification of p53 in tumorigenesis. Nat Rev Cancer 2004, 4: 793–805. 10.1038/nrc1455
- Laufs U, Liao JK: Post-transcriptional regulation of endothelial nitric oxide synthase mRNA stability by Rho GTPase. J Biol Chem 1998, 273: 24266–24271. 10.1074/jbc.273.37.24266
- Stone M: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society 1974, 36: 111–147.
- Miller RG: The jackknife-a review. Biometrika 1974, 61: 1–15.
- Efron B, Gong G: A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician 1983, 37: 36–48. 10.2307/2685844
- The InterProScan webpage [http://www.ebi.ac.uk/InterProScan/]
- The Universal Protein Resource (UniProt). Nucleic Acids Res 2007, 35: D193–7. 10.1093/nar/gkl929
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. 10.1093/bioinformatics/btl158
- Wang G, Dunbrack RL Jr.: PISCES: a protein sequence culling server. Bioinformatics 2003, 19: 1589–1591. 10.1093/bioinformatics/btg224
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJ, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: New developments in the InterPro database. Nucleic Acids Res 2007, 35: D224–8. 10.1093/nar/gkl841
- Vapnik V: The Nature of Statistical Learning Theory. New York, Springer Verlag; 1995.
- Vapnik V: Statistical Learning Theory. 2nd edition. New York, John Wiley & Sons; 1998.
- The svmlight webpage [http://svmlight.joachims.org/]
- Joachims T: Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. Edited by: Schölkopf B, Burges CJC, Smola AJ. Cambridge, USA, MIT Press; 1999.
- Tax DMJ, Duin RPW: Using Two-Class Classifiers for Multiclass Classification. Quebec, Canada; 2002:124–127.
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. New York, Diane Cerra; 2005.
- Kong EB, Dietterich TG: Error-Correcting Output Coding Corrects Bias and Variance. Tahoe City, CA; 1995:313–321.
- Dietterich TG, Bakiri G: Solving Multiclass Learning Problems via Error-Correcting Output Codes. Journal of Artificial Intelligence Research 1995, 2: 263–286.
- Ding CH, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 2001, 17: 349–358. 10.1093/bioinformatics/17.4.349
- Nguyen MN, Rajapakse JC: Multi-class support vector machines for protein secondary structure prediction. Genome Inform 2003, 14: 218–227.
- Kuncheva LI: Using diversity measures for generating error-correcting output codes in classifier ensembles. Pattern Recognition Letters 2005, 26: 83–90. 10.1016/j.patrec.2004.08.019
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.