Applying Support Vector Machines for Gene ontology based gene function prediction

Background The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions. Results We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term. Conclusions We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at .


Background
Ongoing genome sequencing and recent developments in cDNA sequencing projects have led to an exponential rise in the amount of sequence information. This has increased the need for acquiring knowledge from sequences as to their biological function. Annotating a single sequence is the gateway to interpreting its biological relevance. However, the usefulness of these annotations is highly correlated with their quality. Accurate annotation has traditionally been maintained manually with the experience of individual experts and the experimental characterisation of sequences. However, the increasing gap between the amount of sequence data available and the time needed for their experimental characterisation demands computational function prediction in complementing manual curation [1][2][3][4]. Commonly, computational functional assignment is based on homologues identified from database searches [5]. Such an automated annotation process provides comparable results due to a uniform analysis of all query sequences across the same databases and the possibility of repeating the annotation to updated sequence data [6]. However, crucial aspects for consideration in automated annotation are i) the problems associated with the databases themselves: sequence errors, erroneous annotation due to spelling ambiguities, incomplete functional annotation, inconsistent functional annotation across databases, consistent but wrong annotation across databases, and ii) the problems associated with the inference, i.e. false positives, where an assignment is made on the basis of a wrongly inferred homology [3,7,8]. A number of excellent annotation systems have been developed to tackle these problems, e.g. RiceGAAS [9], GAIA [10], Genotator [11], Magpie [12], GeneQuiz [6], GeneAtlas [13] and PEDANT [14]. However, little has been done to quantify the annotation accuracy by defined benchmarks and establish a method to provide a confidence value for each annotation.
The current annotation, written in a rich, non-formalised language, also complicates this automated process. We addressed this problem by applying a controlled vocabulary from Gene Ontology (GO) [15][16][17]. GO provides consistent descriptions of gene products in a species-independent manner. The GO terms are organised in structured, controlled vocabularies (ontologies) to describe gene products in terms of their associated biological processes, cellular components and molecular functions. An increasing number of GO-mapped sequence databases make it possible to replace traditional database searches with GO-related searches. These include databases such as GenBank [18], SWISS-PROT [18], SWISS-PROT/TrEMBL [19], the TIGR Gene Index [20] and several other genome databases. Many annotation approaches have now been developed based on Gene Ontology. The uncharacterised sequences are searched across GO-mapped protein databases and assigned the GO terms of the best hits [21,22]. Jensen and co-workers used neural networks to predict specific subsets of GO terms [23]. Furthermore, Schung et al. predicted GO terms by intersecting domain profiles [24]. The SWISS-PROT/TrEMBL entries were associated with GO terms by an automated process coupled with manual verification [19]. Text mining and similarity searches were combined to annotate SWISS-PROT and GenBank entries with GO terms [18]. However, these approaches were either applied to specific GO subsets or did not provide defined benchmarks and confidence values for their predictions.
We have developed an automated system for large-scale cDNA function assignment, designed and optimised to achieve a high level of prediction accuracy without any manual refinement. Our system assigns molecular function GO terms to uncharacterised cDNA sequences and defines a confidence value for each prediction. The cDNA sequences were searched against GO-mapped protein databases and the GO terms were extracted from the homologues. In the training phase, these GO terms were compared to the GO annotation of the query sequences and labelled correspondingly. We applied Support Vector Machines (SVMs) as the machine learning method to classify whether the extracted GO terms were appropriate to the cDNA sequence or not. In order to classify the GO terms we used a broad variety of elaborated features (attributes) including sequence similarity measures, GO term frequency, GO term relationships between homologues, annotation quality of the homologues, and the level of annotation within the GO hierarchy. To enhance the reliability of the prediction, we used multiple SVMs for classification and applied a committee approach to combine the results with a voting scheme [25]. The confidence values for the predicted GO terms were assigned based on the number of votes, i.e. the number of SVMs predicting a particular GO term as correct. The performance of the system was benchmarked with 36,771 GO-annotated cDNA sequences derived from 13 organisms. It achieved 80% precision for 74% of the test sequences. We applied our annotation system to predict the function for Xenopus laevis, a widely studied model organism in developmental biology. Because many researchers are now focussing on the functional genomics of this organism, a demand exists for a quality annotation [26]. Therefore we applied our system to improve the quality and coverage of the existing annotation.
We predicted the function for 17,804 Xenopus laevis contig sequences (from TIGR Gene Indices) yielding annotation with good confidence values for more than half of these sequences.

General workflow of training and classification
The classifier (SVM) requires attribute values (features) for a broad list of samples, together with a class label for each of these samples. From the training samples it learns the feature patterns and tries to group them according to their class labels. After training, the algorithm assigns class labels to new samples according to the class that they best match.
We selected GO-annotated cDNA sequences for training the SVM classifier. The nucleotide sequences were searched against GO-mapped protein databases and GO annotations were extracted from the significant hits. Then, each GO term obtained was used as a sample for the feature table. The sample GO terms were then labelled as either correct ("+1") or false ("-1") by comparing them to the original annotation. Note that we applied the relationships of the GO terms based on their graph structure: "correct" was assigned not only if they were exact matches but also if the GO terms were related as either "parent" or "child" (Figure 1). Next, the samples were supplied with their features or attributes, calculated from the BLAST [27] results. With this data, the classifier was trained to distinguish between the attribute patterns that contributed to class +1 (correct prediction of a GO term) and class -1 (false prediction). To predict the function of unknown sequences, the same procedure was applied as for the training sequences in order to obtain their GO terms and corresponding attribute values. According to these attribute values, the classifier assigned a class to every GO term of the BLAST hits (Figure 2).
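The labelling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ancestors` lookup, the function name and the placeholder term identifiers are hypothetical, and "parent"/"child" are assumed here to mean any ancestor or descendant along a path to the root.

```python
def label_samples(hit_terms, query_terms, ancestors):
    """Label each GO term found in the hits as +1 (correct) or -1 (false).

    `ancestors` maps a GO term to the set of its ancestor terms. It is a
    hypothetical precomputed lookup; the paper derives relations from the
    GO graph itself.
    """
    labels = {}
    for term in set(hit_terms):
        correct = any(
            term == q                              # exact match
            or term in ancestors.get(q, set())     # hit term is a "parent" of q
            or q in ancestors.get(term, set())     # hit term is a "child" of q
            for q in query_terms
        )
        labels[term] = 1 if correct else -1
    return labels


# Toy graph with placeholder identifiers: A is the parent of B and D, B of C.
toy_ancestors = {"B": {"A"}, "C": {"A", "B"}, "D": {"A"}}
toy_labels = label_samples(["A", "B", "C", "D"], ["B"], toy_ancestors)
```

In this toy case only D, which sits on a different branch from the annotated term B, receives the label -1.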

Datasets for training and testing SVM
For training and testing the SVM, we selected 39,740 GO-annotated cDNA sequences from the following organisms: Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fly), Mus musculus (mouse), Arabidopsis thaliana (Arabidopsis), Caenorhabditis elegans (worm), Rattus norvegicus (rat), Danio rerio (fish), Leishmania major (Leishmania), Bacillus anthracis Ames (Bacillus), Coxiella burnetii RSA 493 (Coxiella), Shewanella oneidensis MR-1 (Shewanella), Vibrio cholerae (Vibrio) and Plasmodium falciparum (Plasmodium) (Table 1). Of these, 55.3% of the cDNA sequences were contributed by Arabidopsis, mouse and fly (22.1%, 18% and 15.2%, respectively). Prokaryotes (Bacillus, Coxiella, Shewanella and Vibrio) contributed 20.6% and the remaining 24.1% of the sequences came from rat, fish, worm, Plasmodium, Leishmania and yeast. Yeast and fly are purely manually annotated datasets, whereas Bacillus, Coxiella, Vibrio, Shewanella, Leishmania and Plasmodium are mostly manually annotated, and the rest are mostly automatically annotated. Manual annotation tends to be conservative and sparse, since the GO terms are assigned only if the annotator is highly confident. Therefore, a GO term may be missed due to a poor definition of a false negative. To mitigate this critical problem, yeast and fly annotations are accompanied by an "unknown molecular function" term for sequences with questionable further functions. To reduce false negatives, we discarded all sequences with these tags for training and testing (yeast: 2999 discarded out of 6355, fly: 8495 out of 14335).
Figure 1. A schematic representation of possible GO term relationships. A: GO1 is a "parent" of GO2 in a single path relationship. B: GO1 is a "parent" of GO2 in a multiple path relationship. C: GO1 is a "child" of GO2 in a single path relationship. D: GO1 is a "child" of GO2 in a multiple path relationship. E: GO1 and GO2 are "siblings" in a single path relationship. F: GO1 and GO2 are "siblings" in a multiple path relationship. MF denotes the molecular function node (root).
Figure 2. General prediction scheme. The training sequences (S1) with known function (GOx, GOy, GOz) were searched across the protein databases, yielding hits with molecular function GO terms (GO1, GO2, GO3, GO4, GO5, GO6) and their features (see Methods), sketched as dots in a two-dimensional feature space. If the GO terms of the hits matched the GO terms of the query, they were classified as +1 (correct, green), and as -1 otherwise (red). The classifier (SVM) separated the classes by an optimal separating hyperplane (OSH). Unknown sequences (S2) were searched in the same manner and the GO terms (GOn, GOm, GOo) were extracted. Their features were calculated and mapped into the feature space. The corresponding labels were assigned (correct/false).

General prediction scheme
The cDNA sequences were searched across the protein databases, which cover a wide range of organisms from prokaryotes to eukaryotes, and across SWISS-PROT. For 36,771 sequences we obtained hits with GO terms, contributing 856,632 sample GO terms and yielding an average of 23.29 GO terms per query sequence (Table 1). These 856,632 samples were used to train our classifier. Generally, the number of GO terms per sequence was lower for prokaryotes than for eukaryotes. Rat had the maximum number of GO terms per sequence (36.9), followed by fish (32.1) and worm (27.13). In contrast, Shewanella, Coxiella and Vibrio sequences had the lowest number of GO terms per sequence (10.78, 12.33 and 12.54, respectively).

SVM training and testing

SVM training
We set up multiple classifiers by splitting the whole dataset (856,632 samples) into 99 equal subsets. Note that, amongst these 99 subsets, 96 contained data from a single organism and the remaining 3 from two organisms each. Subsequently, we built 99 classifiers with these subsets. Since the training sets were created organism-wise, the classifiers were trained from different ranges of data, based on purely manual annotation (yeast, fly), mostly automated annotation or a mixture of both. For training each of these classifiers, we performed a model selection (parameter optimisation by cross-validation; see Methods), which yielded varying accuracy values ranging from 78.81% to 96.03%, with an average accuracy of 85.11%.

SVM testing
To test the classifiers' performance, we prepared 13 test sets (each set corresponding to a single organism) using the same 856,632 sample GO terms. The prediction quality of all 99 classifiers was assessed by an organism-wise cross-validation approach, i.e. for each organism (test set), we used all the classifiers for prediction except those that corresponded to the same organism. With this approach, we were able to simulate the annotation of a new organism. The number of classifiers used for predictions varied considerably across organisms (maximum: Plasmodium and Leishmania, 98 classifiers; minimum: Arabidopsis, 74 classifiers). The quality of the predictions was estimated by comparing the predicted terms with the original annotation and the results were expressed in terms of precision and accuracy values (see Methods). The average-accuracy refers to the average of the accuracy values attained by all classifiers used for the prediction. The maximum average-accuracy was achieved for fly (81.51%), followed by yeast (80.50%), and the minimum for mouse (76.0%).
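The organism-wise cross-validation can be illustrated with a short sketch. The data layout (one set of organism names per classifier) and the function names are assumptions made for illustration only; the source does not describe how the classifiers were stored.

```python
def classifiers_for_test_organism(classifier_organisms, test_organism):
    """Return the indices of classifiers whose training subset does not
    contain data from the test organism."""
    return [
        index
        for index, organisms in enumerate(classifier_organisms)
        if test_organism not in organisms
    ]


def average_accuracy(accuracies):
    """Mean of the accuracy values of all classifiers used for one test set."""
    return sum(accuracies) / len(accuracies)


# Hypothetical layout: 25 classifiers trained on Arabidopsis data, 74 on others.
layout = [{"Arabidopsis"}] * 25 + [{"other"}] * 74
usable = classifiers_for_test_organism(layout, "Arabidopsis")
```

With this layout, 74 of the 99 classifiers remain for the Arabidopsis test set, which matches the minimum reported above. A classifier trained on data from two organisms is excluded for both of them.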
Additionally, we compared the classification efficiency of the classifiers derived from automatic annotation (mouse, worm and Arabidopsis) on the manually annotated test sequences (yeast and fly). The prediction of the yeast and fly sequences with the 20 classifiers from the mouse sequences produced an average-accuracy of 79% and 80%, respectively. Similar results were obtained with the 25 classifiers from Arabidopsis (79% and 80%). Likewise, the worm classifiers (11 classifiers) yielded an average-accuracy of 82% for yeast and 83% for fly. These values were comparable with the average-accuracy of 81% achieved both when using yeast as test sequences against the fly classifiers (16 classifiers) and vice versa (fly test sequences against the yeast classifiers). Likewise, we classified the mouse test sequences against the yeast classifiers (5 classifiers) and the fly classifiers, yielding 69% and 71% average-accuracy, respectively.

Combining multiple classification results by the committee approach
Though we already achieved a good accuracy with some of the classifiers, our intention was to improve the precision and, furthermore, to obtain confidence values for the predicted GO terms. To this end, we combined the predictions of multiple classifiers by the committee approach. If a classifier predicted a particular GO term as correct, it contributed a vote. Votes were collected from all classifiers and summed up to yield a final score value. If no vote supported a GO term as correct, it was assigned the label "false". Otherwise, the number of votes provided a measure of the reliability. Figure 3 shows precision and accuracy versus the number of votes. If we made predictions with a minimum of one vote, we were able to achieve 43% precision and 59% accuracy. When the stringency was raised so that a minimum of 25 votes was required to classify a GO term as correct, we obtained an accuracy of 84% and a precision of 75%. At a cut-off value of 74 votes, we attained 91% precision and 71% accuracy. A cut-off value of 94 votes gave 100% precision and 67% accuracy. Our accuracy reached a plateau at 20 votes. However, it decreased slightly for stringencies of more than 30 votes. Note that this was due to the increasing number of false negatives. The relation between the precision and the number of votes (Figure 3) was used as a calibration to assign the confidence values to new predictions.
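A minimal sketch of the committee scheme is given below. It assumes that each classifier emits +1 or -1 for a GO term and that the calibration is a simple table of precision per minimum number of votes; the function names and the toy data are hypothetical, and how the authors stored or interpolated the calibration is not stated in the text.

```python
def count_votes(predictions):
    """Number of classifiers that predicted the GO term as correct (+1)."""
    return sum(1 for label in predictions if label == 1)


def committee_label(votes, min_votes=1):
    """Label a GO term +1 if it collected at least `min_votes` votes."""
    return 1 if votes >= min_votes else -1


def precision_by_threshold(votes, truths, thresholds):
    """Calibration table: minimum number of votes -> precision among the
    GO terms accepted at that stringency."""
    table = {}
    for threshold in thresholds:
        accepted = [truth for v, truth in zip(votes, truths) if v >= threshold]
        if accepted:  # skip thresholds at which nothing is predicted
            table[threshold] = sum(1 for t in accepted if t == 1) / len(accepted)
    return table


# Toy test set: votes per GO term and the label from the original annotation.
toy_votes = [0, 1, 3, 3, 5]
toy_truths = [-1, -1, 1, -1, 1]
calibration = precision_by_threshold(toy_votes, toy_truths, [1, 3, 5])
```

On the toy data the precision rises from 0.5 at one vote to 1.0 at five votes, mirroring the trend reported for Figure 3. A new prediction could then be given the precision observed at its vote count as its confidence value.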
For each threshold value of the votes, we calculated the sensitivity and the false positive rate to obtain a Receiver Operating Characteristic plot (ROC; Figure 4). The graph shows that the classification performance was comparable for different classes of organisms such as prokaryotes, single-cell eukaryotes and multi-cellular eukaryotes, which reflects the organism-independent performance of our method. Note that for fish, worm, Plasmodium and Leishmania the classification performance was particularly good due to the low number of, but well characterised, test sequences.
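The ROC construction described above amounts to sweeping the vote threshold and recording one point per threshold. The following sketch assumes the sensitivity and false positive rate definitions given in the Methods; the function name and the toy data are illustrative only.

```python
def roc_points(votes, truths, max_votes):
    """Return one (false positive rate, sensitivity) pair per vote threshold
    from 1 to `max_votes`."""
    positives = sum(1 for t in truths if t == 1)
    negatives = len(truths) - positives
    points = []
    for threshold in range(1, max_votes + 1):
        tp = sum(1 for v, t in zip(votes, truths) if v >= threshold and t == 1)
        fp = sum(1 for v, t in zip(votes, truths) if v >= threshold and t != 1)
        sensitivity = tp / positives if positives else 0.0
        false_positive_rate = fp / negatives if negatives else 0.0
        points.append((false_positive_rate, sensitivity))
    return points


# Toy test set: votes per GO term and the label from the original annotation.
toy_points = roc_points([0, 1, 3, 3, 5], [-1, -1, 1, -1, 1], max_votes=5)
```

Raising the threshold moves the points towards the lower left of the plot: fewer false positives, but also fewer recovered true terms.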
Figure 3. Accuracy and precision against the number of votes.
Figure 4. ROC plots for the classifiers' performance.
We compared the prediction performance for GO terms annotated with the evidence code IEA (automated annotation) and non-IEA (manually verified annotation). All sequences from Bacillus, Coxiella, Vibrio, Shewanella, yeast, Leishmania, and Plasmodium were non-IEA annotated and 99.5% of the fly GO terms were non-IEA annotated. In contrast, all sequences from fish and worm were IEA annotated. The remaining test organisms were mostly IEA annotated (rat: 88%, Arabidopsis: 79.4%, and mouse: 69.5%). The classification performances revealed by the ROC plots were comparable between IEA and non-IEA annotated test organisms (Figure 4). Therefore, the classifier could reproduce the annotation of other automated systems as well as high-quality manual annotation. We were interested in the coverage of sequences with respect to the average precision of the annotations (shown in Figure 5).

Xenopus annotation
We extracted all Xenopus laevis contig sequences from the TIGR Xenopus laevis Gene Index (XGI) [28] and obtained a total of 35,251 contig sequences, excluding singletons. We applied our method to predict functional GO terms for these contig sequences. We predicted the function for 17,804 sequences with an average of 12.16 GO terms per sequence. In total, 23.4% of all the GO terms were predicted with a confidence value of less than 50%, 51.5% with a confidence between 50% and 80%, and the remaining 25% with a predicted confidence value of above 80%. At 80% stringency (predicted if the GO term possessed a confidence value of 80% or more), we made predictions for 9,510 contig sequences including 55,994 GO terms, yielding on average 5.88 GO terms per sequence.
To compare the functional abundance of the expressed genome across the organisms, we mapped the predicted GO terms (with at least one vote) to the high-level, i.e. more generalised, terms of the molecular function ontology ("GO slim" for molecular function) [29]. These molecular function GO slim nodes were taken from the second level of the molecular function ontology. The distribution of higher-level GO terms was compared between Xenopus, fly, yeast and mouse (Figure 6). Note that some of the deeper-level terms had multiple paths. They were mapped to two or more higher-level nodes, so that the total sum of the higher-level nodes exceeded 100%.
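The mapping to second-level terms can be sketched as a walk up the graph along every path. The `parents` dictionary and the placeholder term names below are hypothetical; only the root identifier GO:0003674 is taken from the text.

```python
from collections import Counter

ROOT = "GO:0003674"  # molecular function root node, as given in the Methods


def slim_terms(term, parents):
    """Return all second-level terms (direct children of the root) that are
    reachable from `term` along any path."""
    slims, stack, seen = set(), [term], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for parent in parents.get(node, ()):
            if parent == ROOT:
                slims.add(node)       # node sits on the second level
            else:
                stack.append(parent)  # keep climbing towards the root
    return slims


def slim_distribution(terms, parents):
    """Percentage of annotated terms that map to each second-level term."""
    if not terms:
        return {}
    counts = Counter()
    for term in terms:
        counts.update(slim_terms(term, parents))
    return {slim: 100.0 * n / len(terms) for slim, n in counts.items()}


# Toy graph: term "b" has two paths and therefore maps to both slim nodes.
toy_parents = {"S1": [ROOT], "S2": [ROOT], "a": ["S1"], "b": ["S1", "S2"]}
toy_distribution = slim_distribution(["a", "b"], toy_parents)
```

In the toy graph the percentages add up to 150%, which illustrates why the sum over the higher-level nodes can exceed 100% when terms have multiple paths.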
Comparison to the TIGR Xenopus annotation
TIGR provides a GO mapping for Xenopus contigs (TIGR Xenopus laevis gene indices). We compared our annotation with the TIGR GO annotation for molecular function. From 35,251 contig sequences, TIGR annotated 5,444 contigs with a total of 16,432 molecular function GO terms. In contrast, our approach was able to predict function terms for 17,804 contigs, i.e. more than three times the number of TIGR-annotated sequences. Our procedure did not annotate 295 of the TIGR-annotated contigs. For the remaining 5,149 contigs, 85% of all TIGR terms were identical to those obtained with our method; 3.2% of the TIGR terms were at a higher level of the GO tree than our annotation, so in this case we provided annotation at a deeper level; in 0.9% of the cases our annotation was at a higher level; 8.3% of the cases were completely different; and 0.6% of the TIGR terms were obsolete. We compared the quality of the TIGR annotation with that of ours by raising the stringency and found that when we applied a confidence threshold of 80% to our annotation, we lost 46.6% of the sequences. This included 1,492 sequences holding equivalent TIGR annotation, or 27.4% of the total TIGR annotation. With this stringency, our system annotated 9,510 contig sequences, i.e. twice the TIGR annotation at this quality.
Figure 5. Precision against the sequence coverage: average precision against sequence coverage for all 13 test organisms (circles). The red line denotes a fitting curve.
We were interested in novel annotated sequences with the highest confidence values and found that we could predict GO terms for 557 contigs with a confidence value of 100% (all votes matched). Interestingly, 192 of these lacked any GO annotation by TIGR. Of these, 184 had a descriptive TIGR annotation and the remaining eight had none. Table 2 shows the novel annotation for these eight sequences. Our novel predictions are as follows: 1) TC212171 and TC196381 are predicted to display endopeptidase activity and more specifically serine-type peptidase activity (98% and 97% confidence, respectively). 2) TC209487 and TC190605 are predicted to be aminopeptidases; for the latter, however, the more specific prediction of prolyl aminopeptidase activity is assigned with 86% confidence. 3) TC199713 is predicted as glutathione peroxidase at 100% confidence and TC194305 is annotated as protein kinase with the same confidence. 4) Both TC187949 and TC210151 are transmembrane receptors, but the latter is classified as a frizzled receptor with 82% confidence. In most of these examples the functional assignment and associated confidence were recorded at multiple levels of granularity.

Discussion
In this paper, we presented an automatic annotation system that is able to cope with the expanding amount of biological sequence data. Our approach efficiently combines the ongoing efforts of Gene Ontology and the availability of GO-mapped sequences with a profound machine learning system. The GO-mapped databases provide annotation described in a controlled vocabulary and also a measure of reliability, as these GO entries are labelled with their type of origin. Furthermore, GO terms are structured hierarchically, which allows us a twofold use of the information: i) the level within the tree is taken as a classification criterion to distinguish low from high-level annotations during the learning procedure, and ii) the hierarchical structure allows us to extend hits by slightly moving up and down within a restricted local area of the tree. This may overcome fluctuations of the annotation levels coming from varying annotation experts.
Figure 6. Comparison of GO slims between Xenopus, fly, yeast and mouse: distributions of higher-level GO terms ("GO slim", see text) for Xenopus, fly, yeast and mouse. The sum of all high-level terms may exceed the total number of the annotated terms, since some terms may have more than one high-level "parent" term due to multiple paths.
Our annotation system exploits the different combinations of attributes and yields functional transitivity: SVM learning and prediction are organism-independent and comparable to manual annotation, which may be supported by the nature of the attributes we utilise. Subsets and overlaps are counted in a balanced fashion to avoid biases due to the complexity of an organism and a potentially correlated complexity of its sequences. The committee approach allows us to improve the prediction quality as well as to assign confidence values to new predictions in a straightforward manner. Our classifiers' performance is hardly limited by the varying quality of the training data, whether manually or automatically annotated. The prediction results for manually annotated test sets were comparable whether the classifiers were based on automated or on manual annotation. Regarding the outcome of the overall classifiers, we achieve consistency with existing annotation from automatic annotations. This is the less complex part of our work and shows a comparable efficiency of our system. Additionally, our system reproduces the annotation of purely manually annotated datasets (fly, yeast, etc.). However, the performance results for these datasets are low in terms of recall, i.e. 47.4% recall at 80% precision compared to 60.6% recall at the same precision for the complete test set. Note that manual annotation tends to be conservative and sparse, yielding stringent true positive definitions, whereas automatically annotated sequences may accumulate information to a greater extent.
We were interested in annotating Xenopus since it is a familiar model organism. However, its sequences were not very well annotated. Our system was applied to annotate the Xenopus contig sequences from TIGR. Through our approach, we annotated 50.5% of all contig sequences available at present and associated a confidence value with each prediction, yielding roughly three times as many annotated sequences as the currently available GO annotation. However, the coverage of annotation for a new organism like Xenopus is crucial. We were able to attain . This compares to the applied databases that contained 53% satisfactory annotation for their sequences (not regarding sequences with unknown function terms), and better than the organism specific databases (36%). Obviously, improving the quality and quantity of annotation within the available databases goes along with the coverage that machine learning algorithms can exploit for new organisms. In future we want to extend our method with information from other sources such as domain databases and protein family databases.

Conclusions
We developed an automated annotation system to assign functional GO terms to an unknown sequence. We used the well-established technique of Support Vector Machines (SVM) for the classification of correct and incorrect GO terms. Our approach benefited from the broad variety of potential attributes used for the functional transitivity and the vast amount of data used for training and validation. The committee scheme exploited in our system provided a means to assign confidence values in a straightforward manner. Our system performance was robust and organism-independent, and it reproduced the high-quality manual annotation. When applying it to Xenopus laevis contig sequences, we obtained a remarkably enhanced annotation coverage compared to the existing annotation.

Quality criteria for assessing the performance of the classifier
We used the following statistical terms [30,31]. Also used were sensitivity := TP / (TP + FN), specificity := TN / (FP + TN), and false positive rate := 1 - specificity, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. We defined the term "coverage-of-sequences" as the portion of query sequences for which the classifier delivers a prediction, and "precision-per-sequence" as the (average) portion of correct GO terms for a single query sequence, with respect to all GO terms assigned to it. Note that these terms were defined within our model, i.e. a good "accuracy" meant good consistency with respect to our training and test sets.
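The quality criteria can be collected in a small helper. The formulas for accuracy and precision do not appear in the text as it stands, so the sketch below assumes their standard definitions, accuracy = (TP + TN) / (TP + TN + FP + FN) and precision = TP / (TP + FP); the remaining formulas follow the definitions given above. The function name and the example counts are illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Quality criteria from the confusion counts.

    Accuracy and precision use the standard definitions (an assumption, see
    the lead-in). All denominators are assumed to be non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Made-up counts, purely to show the arithmetic.
example_metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)
```

For these counts the helper gives an accuracy of 0.7 and a precision of 0.8, which shows how precision can exceed accuracy when many correct terms are missed (false negatives).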

Defining the GO term relationships
We focused on the molecular function terms from GO, because the information extracted from the gene products is usually more predictive for determining molecular functions than for biological processes or cellular components. The functional terms and their hierarchy were obtained from the web pages of the Gene Ontology Consortium [29] (version of June 2003). In our study, the relationships "is-a" and "part-of" were not distinguished. Note that the "part-of" relationship is rare in the molecular function ontology (26 out of 6521 child-parent relationships). The annotation level varies across databases depending on the curator's individual knowledge about the gene product. To consider varying levels of annotation in the databases for similar gene products, we traced the relationships to match GO terms of different granularity for the same function. To find a relationship between two terms, the whole path of a GO term was traced back to the root (the root is the "molecular function" node, GO:0003674). We defined the distance between two GO terms as the distance of the shortest path. GO terms are organised in directed acyclic graphs, i.e. a child (more specialised term) may have multiple parents (less specialised terms). Therefore, we defined single path and multiple path relationships. In the case of single path relationships, GO terms had only one possible path to the root. The relationship of the term GO1 with respect to GO2 was classified as "parent", "child", "sibling" or "different" (Figure 1) according to the following rules, where $P_i$ denotes the set of nodes from GOi to the root:
GO2 is a "parent" of GO1 if their respective paths $P_2$ and $P_1$ intersect in such a manner that $P_1 \cap P_2 = P_2$.
GO2 is a "child" of GO1 if their paths $P_2$ and $P_1$ intersect such that $P_1 \cap P_2 = P_1$.
GO2 is a "sibling" of GO1 if a common parent exists with a distance of one to GO1 and GO2 (Figure 1E). To avoid ambiguities for less differentiated terms, the sibling relationship was set only if GO1 and GO2 were at least 5 nodes away from the root.
The relationship "different" was set if none of the previously stated criteria was fulfilled.
We could apply the single path relationship for most of the GO terms (3665 out of 5391). However, for the remaining 1726 terms more than one path to the root was found. For these cases we defined multiple path relationships and each path was considered individually. The single path relationship was applied to each possible pair of these paths (one path for GO1 and one for GO2), henceforth referred to as "path-pairs". This method could yield a list of several relations. To select the appropriate relation from this list, we considered the parent relationship to be most relevant, followed by the child relationship, and the sibling relationship to be least relevant. We implemented the following order:
1. The parent relationship was set if at least one of the path-pairs gave a (single path) parent relationship.
2. The child relationship was set if at least one of the path-pairs gave a child relationship. To avoid a bias due to an overwhelming number of path-pairs that did not match, we set a threshold: we considered this relationship only if the number of path-pairs with no child relationship was equal to or less than four times the number of path-pairs with a child relationship.
3. The sibling relationship was set if at least one of the path-pairs gave a sibling relationship. We again set a threshold: we considered this relationship only if the number of path-pairs with no sibling relationship was equal to or less than twice the number of path-pairs with a sibling relationship.
4. If none of these criteria could be applied, the relationship "different" was set.
Note that we also implemented the hierarchy of these relations by tuning the stringencies for the fractions of path-pairs that must match (parent: no threshold, child: 1/4, sibling: 1/2).
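The single path and multiple path rules can be sketched as follows. This is an illustrative reading of the rules, not the authors' code: paths are assumed to be lists of node names running from the term to the root, the two terms are assumed to be distinct, "at least 5 nodes away from the root" is interpreted as a path with at least five edges, and the parent and child tests are written as subset checks on the path node sets. All names are hypothetical.

```python
def single_path_relation(p1, p2, min_depth=5):
    """Relation of GO2 (path p2) with respect to GO1 (path p1).

    Each path lists the nodes from the term itself up to the root.
    Precondition (assumption): the two terms are different.
    """
    s1, s2 = set(p1), set(p2)
    if s2 <= s1:          # GO2 lies on the path of GO1
        return "parent"
    if s1 <= s2:          # GO1 lies on the path of GO2
        return "child"
    shares_parent = len(p1) > 1 and len(p2) > 1 and p1[1] == p2[1]
    deep_enough = len(p1) - 1 >= min_depth and len(p2) - 1 >= min_depth
    if shares_parent and deep_enough:
        return "sibling"
    return "different"


def relation(paths1, paths2):
    """Combine the single path results of all path-pairs in the stated order."""
    results = [single_path_relation(a, b) for a in paths1 for b in paths2]
    total = len(results)
    parents = results.count("parent")
    children = results.count("child")
    siblings = results.count("sibling")
    if parents:
        return "parent"
    if children and (total - children) <= 4 * children:
        return "child"
    if siblings and (total - siblings) <= 2 * siblings:
        return "sibling"
    return "different"


# One matching path-pair plus four non-matching ones still counts as "child".
go1_paths = [["B", "A", "R"]]
go2_paths = [["X", "B", "A", "R"]] + [["X", "C%d" % i, "R"] for i in range(4)]
example_relation = relation(go1_paths, go2_paths)
```

Adding a fifth non-matching path for GO2 in this example would push the pair over the child threshold, and the relation would fall through to "different".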

Data basis used for this study
Since function transitivity is more reliable at the protein level, we used GO-mapped protein databases to search for homologues. Gene association files were obtained from the Gene Ontology Consortium. By combining the gene association files with the corresponding sequence databases, we created unified protein databases. The following organisms were used: yeast, fly, mouse, Arabidopsis, worm, rat, fish, Leishmania, Bacillus, Coxiella, Shewanella, Vibrio, Plasmodium, Oryza sativa, Trypanosoma brucei, and Homo sapiens. In addition, the SWISS-PROT database was included [32]. For SVM training and testing we selected 39,740 cDNA sequences from 13 organisms: yeast, fly, mouse, Arabidopsis, worm, rat, fish, Leishmania, Bacillus, Coxiella, Shewanella, Vibrio and Plasmodium (see Table 1). From all known cDNA sequences of these organisms, we extracted those with GO molecular function terms and discarded incompletely annotated ones, i.e. sequences assigned the GO term "molecular function unknown" (GO:0005554).
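The selection of sequences with molecular function terms can be illustrated with a short sketch. It assumes tab-separated gene association lines in the standard GO annotation file layout (object identifier in column 2, GO identifier in column 5, aspect in column 9, with "F" denoting molecular function); the function names are hypothetical and this is not the code used in this study.

```python
UNKNOWN_MOLECULAR_FUNCTION = "GO:0005554"

def molecular_function_terms(gaf_lines):
    """Collect molecular function GO terms per annotated object.

    Assumes the tab-separated gene association layout: column 2 is the
    object identifier, column 5 the GO identifier, column 9 the aspect.
    """
    terms = {}
    for line in gaf_lines:
        # Skip blank lines and comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        obj_id, go_id, aspect = cols[1], cols[4], cols[8]
        if aspect == "F":  # keep molecular function annotations only
            terms.setdefault(obj_id, set()).add(go_id)
    return terms

def drop_incompletely_annotated(terms):
    """Discard objects assigned "molecular function unknown"."""
    return {obj_id: gos for obj_id, gos in terms.items()
            if UNKNOWN_MOLECULAR_FUNCTION not in gos}
```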

Computing the attributes
Each cDNA sequence was searched across the protein databases using BLASTX within the HUSAR system [33]. A query sequence was not searched within the database of its own organism. In the case of SWISS-PROT, hits corresponding to the query (cDNA) organism were filtered out. The BLAST files were parsed using the BLAST parser modules from W3H [34], and a low-stringency E-value cut-off of 0.01 was applied to yield a high number of possible hits. Multiple high scoring segment pairs were combined as described elsewhere [35] and used for computing the alignment features. GO terms for each database hit were extracted by considering only GO terms corresponding to the molecular function and by discarding GO terms that were prefixed with NOT (annotators state that a particular gene product is NOT associated with a particular GO term) or that corresponded to "molecular function unknown" (GO:0005554). These steps reduced our dataset to 36,771 sequences, contributing 856,632 samples. Each GO term that occurred in the hits represented a sample entry in the feature table; below it is referred to as "sample GO term". If a GO term occurred several times in the hits, it was considered only once. We defined 31 attributes for each GO term, representing 5 major classes of information (A to E). A) GO level and path: the GO structure was exploited to derive the first two attributes [...]. We wanted to reward alignment length and quality by combining features. This was done with respect to the lengths of the query and the hits to offset biases due to different complexities of the query and subject organisms. [...] where S_L denotes the length of the subject sequence. Additionally, we decomposed these attributes into the following further six attributes (B.5 to B.10). For these attributes, we considered the hit with the best coverage score if a GO term occurred in more than one hit (query coverage score for attributes B. [...]). [...], which supports a weighted SVM for unbalanced data.
We used a higher penalty (5 instead of 1) for false positives (FP) during both model selection and training to support a high specificity of the resulting classifiers. Also note that our training set contained a high proportion of negative samples (see Table 1) due to our relaxed E-value threshold. We utilised the radial basis function kernel and set the parameter epsilon (tolerance of the termination criterion) to 0.01. The parameters C (regularisation term, cost of false classification) and gamma (kernel width) of the SVM were optimised using a grid search. The grid search determined the combination of C (log2-range: 13 to 15, step 1) and gamma (log2-range: 10 to 15, step 1) with the lowest classification error according to a five-fold cross-validation, such that each of the 99 data subsets was subdivided into a training set (90%) and a validation set (10%). The validation sets were used to estimate the parameters C and gamma for each of the 99 classifiers individually. Finally, the parameters from the classifier selection were applied to train each of the classifiers with 90% of the respective data set. The testing was based on the same 13 organisms and 856,632 GO terms corresponding to 36,771 sequences as described above. We performed the testing by an organism-wise cross-validation, so that one organism was used as the test set and the remaining ones as the training set.
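The organism-wise cross-validation can be sketched as a leave-one-organism-out split. The following is only an illustration under the assumption that each sample is stored together with the organism of its query sequence; the function name `organism_wise_folds` is hypothetical, and the sketch covers the splitting only, not the SVM training itself.

```python
def organism_wise_folds(samples):
    """Yield one fold per organism for an organism-wise cross-validation.

    `samples` is a list of (organism, sample) pairs. In each fold, the
    samples of one organism form the test set and the samples of all
    remaining organisms form the training set.
    """
    organisms = sorted({organism for organism, _ in samples})
    for held_out in organisms:
        train = [s for organism, s in samples if organism != held_out]
        test = [s for organism, s in samples if organism == held_out]
        yield held_out, train, test
```

With 13 organisms this produces 13 folds, and every sample is used for testing exactly once.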

Data availability
The annotation for Xenopus laevis contig sequences can be downloaded from ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus. The file follows the standard GO annotation style according to the Gene Ontology guidelines. The evidence code is always IEA, and a confidence value is included for each GO term.