Applying Support Vector Machines for Gene ontology based gene function prediction
© Vinayagam et al. 2004
Received: 11 May 2004
Accepted: 26 August 2004
Published: 26 August 2004
Skip to main content
© Vinayagam et al. 2004
Received: 11 May 2004
Accepted: 26 August 2004
Published: 26 August 2004
The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions.
We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term.
We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.
Ongoing genome sequencing and recent developments in cDNA sequencing projects have led to an exponential rise in the amount of sequence information. This has increased the need for acquiring knowledge from sequences as to their biological function. Annotating a single sequence is the gateway to interpreting its biological relevance. However, the usefulness of these annotations is highly correlated with their quality. Accurate annotation has traditionally been maintained manually with the experience of individual experts and the experimental characterisation of sequences. However, the increasing gap between the amount of sequence data available and the time needed for their experimental characterisation demands computational function prediction in complementing manual curation [1–4]. Commonly, computational functional assignment is based on homologues identified from database searches . Such an automated annotation process provides comparable results due to a uniform analysis of all query sequences across the same databases and the possibility of repeating the annotation to updated sequence data . However, crucial aspects for consideration in automated annotation are i) the problems associated with the databases themselves: sequence errors, erroneous annotation due to spelling ambiguities, incomplete functional annotation, inconsistent functional annotation across databases, consistent but wrong annotation across databases, and ii) the problems associated with the inference, i.e. false positives, where an assignment is made on the basis of a wrongly inferred homology [3, 7, 8]. A number of excellent annotation systems have been developed to tackle these problems, e.g. RiceGAAS , GAIA , Genotator , Magpie , GeneQuiz , GeneAtlas  and PEDANT . However, little has been done to quantify the annotation accuracy by defined benchmarks and establish a method to provide a confidence value for each annotation.
The current annotation, written in a rich, non-formalised language also complicates this automated process. We addressed this problem by applying a controlled vocabulary from Gene Ontology (GO) [15–17]. GO provides consistent descriptions of gene products in a species-independent manner. The GO terms are organised in structured, controlled vocabularies (ontologies) to describe gene products in terms of their associated biological processes, cellular components and molecular functions. An increasing number of GO-mapped sequence databases make it possible to replace traditional database searches with GO-related searches. These include databases such as GenBank , SWISS-PROT , SwissPROT/TrEMBL , the TIGR Gene Index  and several other genome databases. Many annotation approaches have now been developed based on Gene Ontology. The uncharacterised sequences are searched across GO-mapped protein databases and assigned with GO terms of the best hits [21, 22]. Jensen and co-workers used neural networks to predict specific subsets of GO terms . Furthermore, Schung et al predicted GO terms by intersecting domain profiles . The SwissPROT/TrEMBL entries were associated with GO terms by an automated process coupled with manual verification . Text mining and similarity searches were combined to annotate SWISS-PROT and GenBank entries with GO terms . However, these approaches were either applied to specific GO subsets or did not provide defined benchmarks and confidence values for their predictions.
We have developed an automated system for large-scale cDNA function assignment, designed and optimised to achieve a high-level of prediction accuracy without any manual refinement. Our system assigns molecular function GO terms to uncharacterised cDNA sequences and defines a confidence value for each prediction. The cDNA sequences were searched against GO-mapped protein databases and the GO terms were extracted from the homologues. In the training phase, these GO terms were compared to the GO annotation of the query sequences and labelled correspondingly. We applied Support Vector Machines (SVMs) as the machine learning method to classify whether the extracted GO terms were appropriate to the cDNA sequence or not. In order to classify the GO terms we used a broad variety of elaborated features (attributes) including sequence similarity measures, GO term frequency, GO term relationships between homologues, annotation quality of the homologues, and the level of annotation within the GO hierarchy. To enhance the reliability of the prediction, we used multiple SVMs for classification and applied a committee approach to combine the results with a voting scheme . The confidence values for the predicted GO terms were assigned based on the number of votes i.e. number of SVMs predicting particular GO term as correct. The performance of the system was benchmarked with 36,771 GO-annotated cDNA sequences derived from 13 organisms. It achieved 80% precision for 74% of the test sequences. We applied our annotation system to predict the function for Xenopus laevis, a widely studied model organism in developmental biology. Because many researchers are now focussing on the functional genomics of this organism, a demand exists for a quality annotation . Therefore we applied our system to improve the quality and coverage of the existing annotation. We predicted the function for 17,804 Xenopus laevis contig sequences (from TIGR Gene Indices) yielding annotation with good confidence values for more than half of these sequences.
The classifier (SVM) needs to specify attribute values (features) for a broad list of samples and a class label for each of these samples. Through the training samples it learns the feature patterns and tries to group them according to their class labels. After training, the algorithm assigns class labels to new samples according to the class that they best match.
Training and test dataset: Datasets used for training and testing the support vector machines. The columns are: 1. The number of cDNA sequences for training. 2. The number of cDNA sequences with BLAST hits having GO molecular function terms. 3. The average number of GO molecular function terms per cDNA sequence of the BLAST-hits. 4. The classification of GO terms coming from the hits, positive if the GO terms were similar to original annotation, negative otherwise.
Number of cDNAs
cDNA with MF GO
Number of GO/cDNA
The cDNA sequences were searched across the protein databases covering a wide range of organisms from prokaryotes to eukaryotes and SWISSPROT. For 36,771 sequences we got hits with GO terms, contributing to 856,632 sample GO terms and yielding an average of 23.29 GO terms per query sequence (Table 1). These 856,632 samples were used to train our classifier. Generally, the number of GO terms per sequence was less for prokaryotes than for eukaryotes. Rat had the maximum number of GO terms per sequence (36.9), followed by fish (32.1) and worm (27.13). In contrast, Shewanella, Coxiella and Vibrio sequences had the lowest number of GO terms per sequence (10.78, 12.33 and 12.54, respectively).
We set up multiple classifiers by splitting the whole dataset (856,632 samples) into 99 equal subsets. Note that, amongst these 99 subsets, 96 contained data from a single organism and the remaining 3 from two organisms each. Subsequently, we built 99 classifiers with these subsets. Since the training sets were created organism-wise, the classifiers were trained from different ranges of data, based on purely manual annotation (yeast, fly), mostly automated annotation or a mixture of both. For training each of these classifiers, we performed a model selection (parameter optimisation by cross-validation; see Methods), which yielded varying accuracy values ranging from 78.81% to 96.03%, with an average accuracy of 85.11%.
To test the classifiers performance, we prepared 13 test sets (each set corresponding to a single organism) using the same 856,632 sample GO terms. The prediction quality of all 99 classifiers were assessed by an organism-wise cross-validation approach, i.e. for each organism (test set), we used all the classifiers for prediction except those that corresponded to the same organism. With this approach, we were able to simulate the annotation of a new organism. The number of classifiers used for predictions varied highly across organisms (maximum: Plasmodium and Leishmania, 98 classifiers; minimum: Arabidopsis, 74 classifiers). The quality of the predictions was estimated by comparing the predicted terms with the original annotation and the results were expressed in terms of precision and accuracy values (see Methods). The average-accuracy refers to the average of the accuracy values attained by all classifiers used for the prediction. The maximum average-accuracy was achieved for fly (81.51%), followed by yeast (80.50%), and the minimum for mouse (76.0%).
Additionally, we compared the classification efficiency of the classifier derived from automatic annotation (mouse, worm and Arabidopsis) with the manually annotated test sequences (yeast and fly). The prediction of the yeast and fly sequences with the 20 classifiers from the mouse sequences produced an average-accuracy of 79% and 80% respectively. Similar results were acquired with the 25 classifiers from Arabidopsis (79% and 80%). Likewise, the worm classifiers (11 classifiers) yielded the average-accuracy of 82% for yeast and 83% for fly. These values were comparable with the average-accuracy of 81% achieved by both, using yeast as test sequences against fly classifiers (16 classifier) and vice-versa (fly test sequences against yeast classifiers). Likewise, we classified the mouse test sequences against yeast classifiers (5 classifier) and fly classifiers yielding 69% and 71% average-accuracy respectively.
We extracted all Xenopus laevis contig sequences from the TIGR Xenopus laevis Gene Index (XGI)  and got a total of 35,251 contig sequences, excluding singletons. We applied our method to predict functional GO terms for these contig sequences. We predicted the function for 17,804 sequences with an average of 12.16 GO terms per sequence. In total, 23.4% of all the GO terms were predicted with less than 50% confidence value, 51.5% of them were between 50% to 80% confidence and the remaining 25% with a predicted confidence value of above 80%. At 80% stringency (predicted if the GO term possessed a confidence value of 80% or more), we made predictions for 9,510 contig sequences including 55,994 GO terms, yielding on average 5.88 GO terms per sequence.
TIGR provides a GO mapping for Xenopus contigs (TIGR Xenopus laevis gene indices). We compared our annotation with the TIGR GO annotation for molecular function. From 35,251 contig sequences, TIGR annotated 5,444 contigs with a total of 16,432 molecular function GO terms. In contrast, our approach was able to predict function terms for 17,804 contigs, i.e. more than three times that of TIGR sequences. Our procedure did not annotate 295 contigs from the TIGR annotated contigs. For the remaining 5,149 contigs, 85% of all TIGR terms were found to be exact with those using our method; 3.2% of the TIGR terms were at a higher-level of the GO tree than our annotation, so in this case we provided annotation at a deeper level; in 0.9% of the cases our annotation was at a higher-level; 8.3% of the cases were completely different; and 0.6% of the TIGR terms were obsolete. We compared the quality of TIGR and that of our annotations by a raising stringency and found that when we applied a confidence threshold of 80% for our annotation, we lost 46.6% of the sequences. This included 1,492 sequences holding equivalent TIGR annotation or 27.4% of the total TIGR annotation. With this stringency, our system annotated 9,510 contig sequences, i.e. twice the TIGR annotation at this quality.
The novel predictions with high confidence values (100%). The columns correspond to TC (TIGR-id) number, GO term level in GO-structure, the confidence values and the GO number and GO terms.
serine-type peptidase activity
serine-type peptidase activity
serine hydrolase activity
transmembrane receptor activity
protein kinase activity
transmembrane receptor activity
frizzled receptor activity
glutathione peroxidase activity
serine hydrolase activity
prolyl aminopeptidase activity
In this paper, we presented an automatic annotation system that is able to cope with the expanding amount of biological sequence data. Our approach efficiently combines the ongoing efforts of Gene Ontology and the availability of GO-mapped sequences with a profound machine learning system. The GO-mapped databases provide annotation described in a controlled vocabulary and also a measure of reliability, as these GO entries are labelled with their type of origin. Furthermore, GO terms are structured hierarchically, which allow us a twofold use of the information: i) the level within the tree is taken as a classification criterion to distinguish low from high-level annotations during the learning procedure, and, ii) the hierarchical structure allows us to extend hits by slightly moving up and down within a restricted local area of the tree. This may overcome fluctuations of the annotation levels coming from varying annotation experts.
Our annotation system exploits the different combinations of attributes and yields functional transitivity: SVM learning and prediction are organism-independent and comparable to manual annotation, which may be supported by the nature of the attributes we utilise. Subsets and overlaps are counted in a balanced fashion to avoid biases due to the complexity of an organism and a potentially correlated complexity of its sequences. The committee approach allows us to improve the prediction quality as well as to assign confidence values for the new predictions in a straightforward manner. Our classifiers performance is hardly limited by the varying quality of the training data, whether manual or automatic annotated. The prediction results of manually annotated test sets with the classifiers based on automated annotation as well as classifiers based on manual annotation were comparable. Regarding the outcome of the overall classifiers, we achieve consistency with existing annotation from automatic annotations. This is the less complex part of our work and shows a comparable efficiency of our system. Additionally, our system reproduces annotation of purely manually annotated datasets (fly, yeast, etc). However, the performance results for these datasets are low in terms of recall, i.e. 47.4% recall with 80% precision compared to 60.6% recall with the same precision of the complete test set. Note that manual annotation tends to be conservative and sparse, yielding stringent true positive definitions, whereas automatically annotated sequences may accumulate information to a greater extent.
We were interested in annotating Xenopus since it is a familiar model organism. However, the sequences were not very well annotated. Our system was applied to annotate the Xenopus contig sequences from TIGR. Through our approach, we annotated 50.5% of all contig sequences available at present, and associated a confidence value for each prediction, yielding roughly three times more sequences as compared to the currently available GO annotation. However, the coverage of annotation to new organism like Xenopus is crucial. We were able to attain predictions for 50.5% of all Xenopus contig sequences (no singletons). This compares to the applied databases that contained 53% satisfactory annotation for their sequences (not regarding sequences with unknown function terms), and better than the organism specific databases (36%). Obviously, improving the quality and quantity of annotation within the available databases goes along with the coverage exploit of machine learning algorithms for new organisms. In future we want to extend our method with the information from other sources such as domain databases and protein family databases.
We developed an automated annotation system to assign functional GO terms to an unknown sequence. We used the well-established technique of Support Vector Machines (SVM) for the classification of correct and incorrect GO terms. Our approach benefited from the broad variety of potential attributes used for the functional transitivity and a vast amount of data used for training and validating. The committee scheme exploited in our system provided a means to assign confidence values in a straightforward manner. Our system performance was robust, organism-independent and reproduced the high-quality manual annotation. When applying it to Xenopus laevis contig sequences, we obtained a remarkably enhanced annotation coverage compared to the existing annotation.
Accuracy was the rate of correct predictions compared to all predictions,
Accuracy: = (TP + TN) / (TP + FP + TN + FN), (1)
where TP denotes true positives, FP false positives, TN true negatives and FN false negatives. Precision was the portion of true positives with respect to all positives,
Precision: = TP / (TP + FP). (2)
Also used were sensitivity := TP / (TP + FN), specificity := TN / (FP + TN), and false positive rate := 1 - specificity. We defined the term "coverage-of-sequences" as the portion of query sequences for which the classifier delivers a prediction; "Precision-per-sequence" the (average) portion of correct GO terms for a single query sequence, with respect to all GO terms assigned to it. Note that these terms were defined within our model, i.e. a good "accuracy" meant good consistency with respect to our training and test sets.
We focused on the molecular function terms from GO, because the information extracted from the gene products is usually more predictive for determining molecular functions than for biological processes or cellular components. The functional terms and their hierarchy were obtained from the web pages of the Gene Ontology Consortium  (version of June 2003). In our study, relationships "is-a" and "part-of" were not distinguished. Note, that the "part-of" relationship is rare in the molecular function ontology (26 out of 6521 child-parent relationships). The annotation level varies across databases depending on the curator's individual knowledge about the gene product. To consider varying levels of annotation in the databases for similar gene products, we traced the relationships to match GO terms of different granularity for the same function. To find a relationship between two terms, the whole path of a GO term was traced back to the root (the root is the "molecular function" node, GO:0003674). We defined the distance between two GO terms as the distance of the shortest path. GO terms are organised in directed acyclic graphs, i.e. a child (more specialised term) may have multiple parents (less specialised terms). Therefore, we defined single path and multiple path relationships. In the case of single path relationships, GO terms had only one possible path to the root. The relationship of the term GO1 with respect to GO2 was classified as "parent", "child", "sibling" or "different" (Figure 1) according to the following rules:
GO2 is a "parent" of GO1 if their respective paths P2 and P1 intersect in such a manner that
P1 ⊂ P2, (3)
P i denotes the set of nodes from GO i to the root
GO2 is a "child" of GO1 if their paths P2 and P1 intersect such that
P1 ⊃ P2, (4)
GO2 is a "sibling" of GO1 if a common parent exists with a distance of one to GO1 and GO2 (Figure 1E). To avoid ambiguities for less differentiated terms, the sibling relationship was set only, if GO1 and GO2 were at least 5 nodes away from the root.
The relationship "different" was set if none of the previously stated criteria was fulfilled.
The parent relationship was set if at least one of the path-pairs gave a (single path) parent relationship;
The child relationship was set if at least one of the path-pairs gave a child relationship. To avoid a bias due to an overwhelming number of path-pairs that did not match, we set a threshold: we considered this relationship only, if the number of path-pairs with no child relationship was equal or less than four times the number of path-pairs with child relationship;
The sibling relationship was set if at least one of the path-pairs gave a sibling relationship. We again set a threshold: we considered this relationship only, if the number of path-pairs with no sibling relationship was equal or less than twice the number of pairs with sibling relationship;
If none of these criteria could be applied, the relationship "different" was set.
Note that we also implemented the hierarchy of these relations by tuning the stringencies for the fractions of path-pairs that must match (parent: no threshold, child: 1/4, sibling: 1/2).
Since the function transitivity at the protein level is more reliable, we used GO-mapped protein databases for searching homologues. Gene association files were obtained via the Gene Ontology Consortium. By combining the gene association files with corresponding sequence databases we created the unified protein databases. The following organisms were used: yeast, fly, mouse, Arabidopsis, worm, rat, fish, Leishmania, Bacillus, Coxiella, Shewanella, Vibrio, Plasmodium, Oryza sativa, Trypanosoma brucei, and Homo sapiens. Apart from this, the SWISS-PROT database was also included . For SVM training and testing we selected 39,740 cDNA sequences from 13 organisms. These cDNA sequences were collected from the following organisms: yeast, fly, mouse, Arabidopsis, worm, rat, fish, Leishmania, Bacillus, Coxiella, Shewanella, Vibrio and Plasmodium (see Table 1). Out of all the known cDNA sequences we extracted 39,740 with GO molecular function terms, discarding incompletely annotated ones, i.e. sequences assigned with the GO term "molecular function unknown" (GO:0005554).
GO level and path: The GO structure was exploited to derive the first two attributes,
GO level: the distance of the sample GO term to the root (molecular function node);
GO path: the number of paths from the sample GO term to the root.
Alignment quality criteria: These attributes are based on the BLAST alignments. For attributes B.1 - B.4, the best value for the corresponding attribute was taken, if a GO term occurred in more than one hit,
Expectation value: the expectation value ("E-value") from BLASTX;
Bit score: the bit score value provided by BLASTX;
Query coverage score (QCS): Combined measure of alignment size and quality concerning the query sequence,
QCS := (AL / QL) × (I + S), (5)
Subject coverage score (SCS): as in B.3, however only with respect to the corresponding subject sequence (database hit),
SCS := (AL / SL) × (I + S), (6)
where SL denotes the length of the subject sequence;
Query percentage (QPC): Percentage of coverage of the alignment region in the query sequence (with respect to QCS), i.e.
Subject percentage (SPC) Percentage of coverage of the alignment region in the corresponding subject sequence (with respect to SCS), i.e.
Query identity (QI): Percentage of identical residues in the BLASTX alignment (with respect to QCS);
Subject identity (SI): Percentage of identical residues in the BLASTX alignment (with respect to SCS);
Query similarity (QS): Percentage of similar or positively contributing residues in the alignment (with respect to QCS);
Subject similarity (SS): Percentage of similar or positively contributing residues in the alignment (with respect to SCS).
GO frequency related attributes: We extracted information about the frequency of GO terms in the hits by the following attributes:
GO frequency (FG): the number of hits that contained the sample GO term.
Number of hits (TH): the total number of hits for the query.
Frequency score (FS): the number of hits that contained the sample GO term. Unlike C.1, we limited this score to emphasize differences in queries with few hits:
Species frequency: The number of organisms contributing to a sample GO term for a single query sequence;
Total GO (TG): total number of GO terms from all hits.
Unique GO (UG): as C.5, except, that GO terms occurring more than once (in the hits) were counted only once.
GO frequency by considering relationships: For these attributes we applied the structure of the Gene Ontology graph. Not only perfectly matching terms were considered, but also their defined parents, children or siblings:
Relative frequency for all (R A): the relationships for the sample GO term with all GO terms that occurred in the hits were calculated. The sum of non-"different" relationships i.e. parent, child, or sibling was used for this attribute;
Relative frequency for unique (RU): similar to attribute D.1, with the exception that GO terms occurring more than once were counted only once.
Relative frequency for all (limited) (R Alim ): same as attribute D.1, however this score was limited to emphasize differences of queries with few hits:
Relative frequency for unique (limited) (R Ulim ): same as attribute D.2, however this score was limited to emphasize differences of queries with few hits:
Annotation quality related attributes: Quality attributes were selected from the evidence codes provided by the gene association tables of the GO-mapped sequence databases. We selected 9 commonly used evidence codes (TAS, NAS, ISS, IPI, IMP, IGI, IEP, IEA, IDA), resulting in attributes E.1 to E.9. The entries of these attributes for each sample GO term were calculated by summing the occurrences of the corresponding evidence codes of all hits.
Before training, normalisation was performed. We normalised the attributes by taking the logarithm (log) and log of log if necessary. We used log values for 16 attributes (B.3-B.10, C.3, C.4, D.1, D.2, D.4 and E.1) and log of log for 8 attributes (B.2, C.1, E.2, E.4-E.8). Furthermore, we converted the attribute values into mean 0 and standard deviation 1 by applying the Z-transformation. The feature table contained 856,632 samples and 31 attributes. We split the dataset into 99 training subsets. Each subset comprised of approximately 1% of the samples i.e. 8,566 GO terms. This resulted in 96 organism specific subsets and 3 hybrid subsets. We applied the support vector machines in the implementation of LIBSVM , which supports a weighted SVM for unbalanced data. We used a higher penalty (5 instead of 1) for false positives (FP) for the model selection and also the training process to support a high specificity of the resulting classifiers. Also note, that our training set contained a high portion of negative samples (see Table 1) due to our relaxed E-value threshold. We utilised the radial basis function kernel and set the parameter epsilon (tolerance of termination criterion) to 0.01. The parameter C (regularisation term, cost for false classification) and gamma (kernel width) of the SVM were optimised using a grid search. The grid search determined the combination of C (log2-range: 13 to 15, step 1) and gamma (log2-range: 10 to 15, step 1) with the lowest classification error according to a five-fold cross validation such that each of the 99 data subsets was subdivided into a training set (90%) and a validation set (10%). The validation sets were used to estimate the parameters C and gamma for each of the 99 classifiers individually. Finally, the parameters from the classifier selection were applied to train each of the classifiers with 90% of each data set, respectively. The testing was based on the same 13 organisms and 856,632 GO terms corresponding to 36,771 sequences as described above. We performed the testing by an organism-wise cross-validation so that one organism was used as test set and the remaining ones as the training set.
The annotation for Xenopus laevis contig sequences is downloadable at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus. We followed the standard GO annotation style (using Gene ontology guideline). The evidence code is always IEA. The confidence value is included for each GO term.
We thank the Gene Ontology Consortium and all groups that established GO association databases for making their data available through the web. This work was funded by the German Cancer Research Center (DKFZ), the Deutsche Krebshilfe and the Nationales Genom-Forschungs-Netz (NGFN). We also thank Suresh Kumar Swaminathan and Coral del Val for their suggestions during the preparation of the manuscript, and Andrea Mclntosh-suhr and Christopher Previti for proofreading.
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.