GO for gene documents

Background: Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature. The goal in this paper, based on this approach, is to develop automatic annotation methods that can supplement the expensive manual annotation processes currently in place.
Results: Using a set of Support Vector Machine (SVM) classifiers we were able to achieve Fscores of 0.49, 0.41 and 0.33 for codes of the molecular function, cellular component and biological process GO hierarchies respectively. We find that alternative term weighting strategies do not differ from each other in performance and that feature selection strategies reduce performance. The best thresholding strategy is one where a single threshold is picked for each hierarchy. Hierarchy level is important, especially for molecular function and biological process. The cellular component hierarchy stands apart from the other two in many respects. This may be due to fundamental differences in link semantics. This research shows that it is possible to beneficially exploit the hierarchical structures by defining and testing a relaxed criterion for classification correctness. Finally, it is possible to build classifiers for codes with very few associated documents but, as expected, a huge penalty is paid in performance.
Conclusion: The GO annotation problem is complex. Several key observations have been made, for example about topic drift, that may be important to consider in annotation strategies.


Background
Annotating genes and their products with Gene Ontology codes is an important area of research. One approach for doing this is to use the information available about these genes in the biomedical literature. This is in contrast to other annotation methods such as ones involving sequence homology and protein domain analysis (e.g. [1]). Our goal is to contribute to research on literature based automatic annotation strategies.
The importance of this GO annotation problem and the value of computational methods to solve it are well recognized. In the 2004 BioCreAtIve challenge a set of tasks was designed to assess the performance of current systems in supporting GO annotations for specific proteins. In particular, the second task, to identify text passages that provide the evidence for annotation, most resembles the manual process of GO annotation [2]. The participating systems showed a variety of approaches (from heuristics to Support Vector Machine based classification) exploring different levels of text analysis (such as sentences or paragraphs) [3]. In Rice et al. [4], Support Vector Machine (SVM) classification was applied to the relevant documents for each GO code. Features from the documents were selected and conflated as sets of synonymous terms. Their methods worked better when a substantial set of relevant documents was available. In Ray et al. [5], statistical methods were first applied to identify informative n-gram terms from the relevant documents of each GO term. These term models provided hypothesized annotation models which could be applied to the test documents. In Chiang et al. [6], a hybrid method that combined sentence level classification and pattern matching seemed to achieve higher precision with fewer true positive documents. In some of these previous studies, the GO hierarchical structure was explored but to a limited extent. This was done primarily to add information to the classification models.
Genes (or more strictly their products) are annotated with GO codes. Our interest is in predicting annotations from the literature, specifically from MEDLINE records. We approach the annotation problem in three phases. In the first phase we find documents that are relevant to the gene. In the second phase we determine which codes should be assigned to each document. In the third phase we decide which codes should be assigned to a gene/gene product based on its classified documents. In recently completed work we studied phase 1, the problem of retrieving MEDLINE records for genes [7]. In it we consider the special challenges of dealing with gene name and symbol ambiguity. In this research we focus mainly on phase 2. That is, given a document we ask: what GO codes should be assigned to it? We also close this paper with preliminary results for phase 3 using a very simple strategy. Specifically a gene is assigned a code if it is assigned to any of its relevant documents. More sophisticated strategies for phase 3 are left to future research.
The document annotation or classification problem of phase 2 is interesting in that the codes themselves are structured hierarchically. Similar hierarchical classification problems have been addressed [8][9][10] including by our own group [11,12]. When working on GO annotation one may certainly draw from these related papers. However, the three hierarchies of Gene Ontology, molecular function (MF), biological process (BP) and cellular component (CC), may have special characteristics that could be exploited beneficially. Or there may be special properties that must be considered by automatic annotation systems in order to be effective. In fact these hierarchies differ significantly in link semantics. Molecular function is built out of "is_a" links, biological process links are one-fifth "part_of" and four-fifths "is_a", while cellular component is about evenly split between the two link types. Although both link types are asymmetric and transitive, their semantics are very different. A final distinguishing aspect is that with GO, document classification is not the end point but a step toward the goal, which is gene/gene product annotation (i.e., phase 3).
Our research goal is to gain a better understanding of the GO annotation problem using Support Vector Machines classification algorithms. Continuing from earlier work [13], we will study several open issues in the GO context. One is the effect of the hierarchical level on performance. Another is the effect of skewed distributions where the negative examples tend to overwhelm the positives in the training data. Yet another is to explore a more relaxed definition of classification correctness. We will also study the effectiveness of classifiers built for codes associated with very few (less than five) documents. We will pay close attention to differences between the three GO hierarchies. Looking beyond achieving good performance, our aim in this research is to contribute to our understanding of the problem itself. The annotation of genes and their products is an important contribution to developments in bioinformatics. As new genes are discovered and as new functions of genes are identified, these annotations serve as key mechanisms for organizing and providing access to the accumulated knowledge.

Code specific SVM classifiers
We adopt a classifier-based machine learning approach using the open source software SVM Light [14]. In all experiments parameters are set at their default values. The positive instances for a GO code are those records associated with it in our dataset (extracted from Entrez Gene/LocusLink). The negative instances are records assigned to all the other GO codes of the same hierarchy. Document term feature vectors were generated using the "atc" weighting scheme described in the Methods section.
We built a distinct binary SVM classifier for each code (class) where the classifier decides whether a document belongs to the code's class or not. The hierarchy within each GO dimension is not used at this point. The only connection among the codes of a hierarchy is that they share a common dataset of documents, albeit with different positive and negative instances. Each hierarchy's dataset is split into five parts such that the number of positive documents for each code is about evenly distributed. This allows us to follow a 5 split cross validation design. Specifically 4 parts are combined to get the training data and the remaining fifth part is used as test data. This is repeated five times with performance reported as averages of scores across the iterations. Details of the data are given in the Methods section. Results are shown in table 1. The performance measures used are recall, precision and FScore described in the Methods section.
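The one-classifier-per-code setup and the 5-split design can be illustrated with a minimal sketch. It assumes a hypothetical `doc_codes` mapping from PMIDs to GO codes of one hierarchy, and it deals the documents of a single code round-robin across the splits. This per-code dealing is a simplification: the procedure actually used to balance all codes of a hierarchy at once is not spelled out here.

```python
def binary_labels(doc_codes, target_code):
    """Label each document +1 if it carries target_code, else -1.

    doc_codes (hypothetical name) maps a document id such as a PMID to the
    set of GO codes of one hierarchy assigned to it in the dataset.
    """
    return {doc: (1 if target_code in codes else -1)
            for doc, codes in doc_codes.items()}


def round_robin_splits(doc_codes, target_code, n_splits=5):
    """Deal the positives, then the negatives, of one code across n_splits parts."""
    labels = binary_labels(doc_codes, target_code)
    positives = sorted(d for d, y in labels.items() if y == 1)
    negatives = sorted(d for d, y in labels.items() if y == -1)
    splits = [[] for _ in range(n_splits)]
    for i, doc in enumerate(positives):
        splits[i % n_splits].append(doc)
    for i, doc in enumerate(negatives):
        splits[i % n_splits].append(doc)
    return splits


def train_test_pairs(splits):
    """Yield (train, test): each split is held out once, the rest are merged."""
    for held_out in range(len(splits)):
        train = [d for i, part in enumerate(splits) if i != held_out for d in part]
        yield train, splits[held_out]
```

With at least five positive documents per code, every split receives at least one positive, which is the property the cross validation design relies on.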
Unfortunately, this approach yields extremely poor results. We noticed that most of the scores calculated by the SVM classifiers are negative, mainly due to the highly skewed nature of the training data for most codes. As observed by several others this problem may be fixed with judicious thresholding [15]. So in the next experiment we calculate optimal SVM score thresholds using training data.

Hierarchy specific SVM score thresholds
Here we explore a single threshold score for each hierarchy, such that documents with scores assigned by the SVM classifier above this threshold are declared positive. We select the best threshold from the training data identified for each split. In particular, we take the training data of a split and divide it into 4 parts. (We call these 'folds' in order to maintain a distinction from the higher level 'splits'). Cross validation over these four folds is done to generate a single best threshold which is then applied to the test side of the split. The single best threshold was the average of the best thresholds in the four folds [15].
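The threshold selection just described can be sketched as follows. The sketch assumes that each fold supplies a pooled list of SVM scores with matching labels (+1 or -1) over all code-document decisions of the hierarchy, that candidate thresholds are the observed scores, and that a score equal to the threshold counts as positive. These choices and all function names are illustrative assumptions, not details confirmed by the text.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return the observed score that maximises the F-score on one fold."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f = f_score(tp, fp, fn)
        if f > best_f:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f = t, f
    return best_t


def hierarchy_threshold(folds):
    """Average the per-fold best thresholds into a single hierarchy threshold.

    folds is a list of (scores, labels) pairs, one per fold.
    """
    per_fold = [best_threshold(scores, labels) for scores, labels in folds]
    return sum(per_fold) / len(per_fold)
```

The averaged value would then be applied unchanged to the held-out test part of the split.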
Results are presented in table 2. The table shows for each hierarchy, the threshold score selected for each split as well as the recall, precision and Fscore values achieved on both the training and test sets. Averages across the splits are also provided. First we observe that the thresholds selected fall within a small range from -0.87 to -0.82 across all hierarchies. Molecular function has the smallest spread of threshold values (-0.85 to -0.84). We also observe that molecular function offers a relatively easier problem compared to cellular component with biological process being the hardest to solve. Finally, the test set scores are actually better than the training set scores indicating that we have successfully avoided over training our models in each case as these are able to generalize to the unseen test cases. Thus we see that setting the thresholds appropriately for these SVM classifiers offers enormous benefits in performance (when compared to the results in table 1).

Document representations with LTC term weights
We also evaluated the use of the "ltc" weighting scheme (described in the Methods section) for weighting features (terms) in the document vectors. Results are shown in table 3. Comparing these results with the results for atc weights shown in table 2 shows that there is no significant difference between the two strategies. For example the ltc strategy was less than 3% better than the atc strategy for molecular function. Differences were also negligible for the other two hierarchies. Thus most of the remaining results in this paper, except where noted, are presented with atc as the weighting scheme.
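For orientation, the two schemes can be sketched under the assumption that "atc" and "ltc" follow the standard SMART notation: augmented (a) or logarithmic (l) term frequency, multiplied by inverse document frequency (t), followed by cosine normalization (c). The Methods section remains the authoritative definition; the function and argument names below are illustrative.

```python
import math


def smart_weights(term_counts, doc_freq, n_docs, scheme="atc"):
    """Weight one document's terms with the SMART 'atc' or 'ltc' scheme.

    term_counts: raw term frequencies in the document (all >= 1).
    doc_freq: number of collection documents containing each term.
    n_docs: collection size.
    """
    if not term_counts:
        return {}
    max_tf = max(term_counts.values())
    raw = {}
    for term, tf in term_counts.items():
        if scheme == "atc":
            tf_part = 0.5 + 0.5 * tf / max_tf   # augmented term frequency
        elif scheme == "ltc":
            tf_part = 1.0 + math.log(tf)        # logarithmic term frequency
        else:
            raise ValueError("scheme must be 'atc' or 'ltc'")
        idf = math.log(n_docs / doc_freq[term])  # inverse document frequency
        raw[term] = tf_part * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))  # cosine normalization
    if norm == 0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}
```

Both schemes share the idf and normalization steps and differ only in how raw term frequency is damped, which is consistent with the small performance differences reported above.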

Feature selection
It is widely acknowledged that it is important to explore feature selection when building text classifiers. Thus we studied three feature selection strategies for our annotation problem. The first is based on document frequency, which is the number of unique documents in which a term occurs. We computed each term's document frequency in the training data set, and applied a heuristic threshold to eliminate terms that rarely appear in the corpus. The assumption is that terms with low document frequency carry little class-specific information. Essentially we set the threshold as 0.1% of the training document set size. This was decided based on preliminary tests that assessed alternative thresholds.
The second feature selection strategy uses the χ² statistic. This tests the null hypothesis that the observed term frequency in the documents of a certain class is not different from its statistically expected frequency. If the null hypothesis is rejected it implies this term is important in defining the class of the document.
The third strategy, Z(t, c), denotes the degree of independence of the distribution of term t in the documents of class c with respect to its distribution in the documents not belonging to class c. The formal definition of Z(t, c) is:
Table 4 shows the results of feature selection combined with the ltc feature weighting scheme. These results are obtained from a 10% sample of the code set for each hierarchy, with a minimum of 10 codes. We find that the best strategy is χ², which is significantly better than no feature selection. (The ranking of feature selection strategies for atc is similar.) Unfortunately, when the χ² feature selection method is combined with the hierarchy specific thresholding strategy described earlier, the results are not as good as with no feature selection. For example, χ² combined with the ltc strategy drops performance by 23% from 0.4939 (see table 3) to 0.3816. Hence we do not utilize feature selection in the remainder of this paper.
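The first two strategies can be illustrated with a short sketch. It assumes the common 2×2 contingency form of the χ² statistic for a term and a class, and it assumes that terms whose document frequency equals the cutoff are kept. Neither detail is confirmed by the text, and the names are illustrative.

```python
def df_filter(doc_freq, n_train_docs, fraction=0.001):
    """Keep terms whose document frequency reaches fraction * training set size."""
    min_df = fraction * n_train_docs
    return {term for term, df in doc_freq.items() if df >= min_df}


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class from a 2x2 table.

    a: class documents containing the term
    b: other documents containing the term
    c: class documents without the term
    d: other documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class is constant
    return n * (a * d - c * b) ** 2 / denom
```

A term distributed identically inside and outside the class scores 0, while a term that occurs only in class documents scores highest, so ranking terms by this value and keeping the top ones gives a class-specific feature set.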

Code specific SVM score thresholds
In the previous experiment a single threshold score was set for each hierarchy. In this experiment thresholds are set specific to individual GO codes. This strategy is reasonable to explore as it may indeed be that although the average thresholds fall within a small range (see tables 2 and 3), the optimal threshold varies considerably across the codes. The overall structure of the experiment is the same as in the previous experiment. Code specific thresholds are set using a 4-fold cross validation experiment on each training set. The selected threshold is the average of the best threshold for the code across the 4 folds.
Results are presented in table 5. Interestingly, this time the Fscores achieved on the training runs are considerably higher than the Fscores achieved in the test runs of the single threshold experiment (compare with table 2). However, the penalty is clearly paid on the test side, indicating that this code specific strategy over-trains and fails to generalize effectively to new data. The one exception is CC, where the Fscores are about the same in both cases. However, performance for MF and BP drops significantly, by 10.4% and 17.5% respectively. Thus a single threshold over all codes of a hierarchy is superior to code specific thresholding. We also find a pattern similar to that of the previous experiment in that molecular function is easier to work with than cellular component, which in turn is less challenging than biological process.

Analysis of results
We now analyze the results obtained thus far. The results selected for analysis are those obtained using hierarchy specific SVM score thresholding with atc as the feature weighting scheme and with no feature selection. Our goal is to obtain further insights into factors influencing the results.

Recall versus precision
It is well understood that the same Fscore may be obtained from different combinations of recall and precision. In this regard a key point to note (see table 3) is that recall is always considerably higher than precision. Although recall could also be improved, our results indicate that the more serious problem for us lies with precision. That is, in general we are making the correct decisions; the problem is that we are making too many false positive declarations. In other words we need to tighten the constraints and apply some filtering criteria to the positive decisions declared. This angle will be pursued in future research.
It seems that with the MF and BP hierarchies the difficult decisions are closer to the upper levels. This is contrary to common intuition, which suggests that classifying into more general categories (such as animal or plant) should be easier than classifying into more specific categories (such as hawk or eagle). CC is different in that the decisions become more challenging as we descend the hierarchy. The difference between MF and BP on the one hand and CC on the other could be because of differences in the underlying semantics of the links. As mentioned before, CC links are about evenly split between is_a and part_of whereas about 75% of BP links are is_a and MF is almost exclusively is_a. These performance differences observed across the levels of the hierarchies have important implications for the design of automated annotation systems for GO.

Number of positives for training & performance
With the BP hierarchy we again see a similar tendency for performance to drop with increasing numbers of positive examples. The exception is the first row, which has a significantly lower Fscore than the next few ranges.
These observations are interesting especially because they run counter to the generally accepted notion that with a supervised approach we may expect better results with more positive data.

Correlations between level and number of positives for training
Taking this analysis the next logical step forward we explore the relationship between level, positive set size and performance for each code. Table 8 presents the computed correlations.
We find a moderate and significant negative correlation between level and size in the case of MF and BP but interestingly not in the case of CC. So with MF and BP more specific codes tend to have fewer positives in the training data but this is not the case with CC. There is also a moderate and significant positive correlation between level and FScore in the case of MF and BP but again not for CC. That is we tend to get better Fscores with more specific codes in MF and BP hierarchies but not so with CC. Thus with MF and BP we need to pay closer attention to the higher level codes. Once again our efforts indicate that CC is a hierarchy that might require classification methods that are different from those that are appropriate for MF and BP. Again this may be due to the underlying differences in link semantics.
A second observation may be made from the correlations between performance and the other two variables. Specifically, level is far more important than the number of positives available for training, at least in the case of MF and BP. Thus in order to seek improvements in performance it would be prudent to develop methods capable of exploiting the level information for the GO codes. Size of training set on the other hand does not correlate with performance. As mentioned before this is a surprising observation given the commonly accepted notion that larger amounts of (positive) training data tend to yield better performance scores.
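A correlation analysis of this kind can be sketched as below. The text does not say which coefficient was computed, so Pearson's r is an assumption here, and the per-code values used in the example are invented toy numbers rather than results from table 8.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy per-code records: (hierarchy level, number of positives, Fscore).
# These values are invented purely to exercise the function.
toy_codes = [(2, 120, 0.30), (3, 60, 0.42), (4, 35, 0.50), (5, 12, 0.55)]
levels = [c[0] for c in toy_codes]
sizes = [c[1] for c in toy_codes]
fscores = [c[2] for c in toy_codes]

level_vs_size = pearson(levels, sizes)      # negative in this toy data
level_vs_fscore = pearson(levels, fscores)  # positive in this toy data
```

Computing the three pairwise coefficients (level and size, level and Fscore, size and Fscore) separately for each hierarchy would reproduce the layout of the comparison discussed above.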

Level specific thresholds
To explore the effect of level further we adopt a simple strategy of setting the threshold by level. Table 9 shows the effect of this strategy for the MF and BP hierarchies, focussing only on levels 2 and 3. We do not apply this strategy to CC as there was no correlation between level and performance for this hierarchy. Also we consider only levels 2 and 3 as level 1 has too few codes and these are the levels where we seek improvements.
Interestingly, we find improvements at level 2 for both MF and BP (+7.4% and +4.6% improvements in Fscore respectively). However, the strategy does not work for level 3 in either case. We will consider a different approach in future research, one that involves including examples from the neighborhood of the code. This could optionally include weighting by distance to the code.

Relaxing the correctness criteria
Thus far we have not utilized the hierarchical structure in any way. There are at least two major directions in which the hierarchy may be utilized. One is where the hierarchy is used somehow during model building. For example, a node's training data may be augmented with training data from its neighbors [5]. Alternatively, a top down approach for model building may be employed, with examples that filter through higher level nodes participating in lower level decisions [8]. Many variations on these themes have been explored in the general machine learning literature. In this research we explore a second direction that has recently attracted the attention of researchers, especially in the context of bioinformatics problems (e.g. [16]). Specifically, we use the hierarchy to relax the criteria for correctness of a classification decision during evaluation. Essentially we assume that when a document is assigned a GO code it is implicitly assigned the ancestor GO codes as well. This is reasonable since the GO hierarchies encode is_a and part_of semantics along the parent-child links and these are transitive relationships. With this assumption we relax the calculation of recall and precision and therefore also of FScore as follows.
Recall = A/B where B is as usual the number of known correct code-pmid pairs in the dataset. The relaxation is applied to the calculation of A.
Consider a code-pmid pair (C-P) which is known to be correct. If our classifiers assign code C to P then A is increased by 1. Otherwise if our classifiers assign a code C' to P where C' is an ancestor of C then again A is increased by 1.
Precision = E/F where F is as usual the number of positive decisions declared by the classifiers. The relaxation is applied to the calculation of E. Consider a code-pmid pair (C-P) which is declared a positive by our classifiers. If code C is correctly assigned to P then E is increased by 1. Otherwise if there exists a code C" which is known to be assigned to P where C is an ancestor of C" then E is increased by 1.
Note that our relaxed evaluation accepts as correct those decisions that are more general than the correct code and not those decisions that are more specific than the correct code. Thus if the target code is glucoside transport, we will accept as correct classification with the higher level (general) carbohydrate transport code but not classification with the lower level (specific) alpha-glucoside transport or betaglucoside transport codes.
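The relaxed counts A and E defined above can be sketched as follows. The `parents` mapping (child code to its parent codes) and the optional `max_levels` limit on how far up the hierarchy to look are illustrative assumptions, as is the tiny transport hierarchy used in the example.

```python
def ancestors(code, parents, max_levels=None):
    """Collect ancestors of code by walking parent links upward.

    parents maps a code to an iterable of its parent codes (GO is a DAG,
    so a code may have several). max_levels=None means no limit.
    """
    found, frontier, level = set(), {code}, 0
    while frontier and (max_levels is None or level < max_levels):
        frontier = {p for c in frontier for p in parents.get(c, ())} - found
        found |= frontier
        level += 1
    return found


def relaxed_scores(gold, predicted, parents, max_levels=None):
    """Return (recall, precision) where a more general code counts as correct.

    gold and predicted are sets of (code, pmid) pairs.
    """
    pred_by_doc, gold_by_doc = {}, {}
    for code, pmid in predicted:
        pred_by_doc.setdefault(pmid, set()).add(code)
    for code, pmid in gold:
        gold_by_doc.setdefault(pmid, set()).add(code)

    # A: known pairs matched exactly or by a predicted ancestor of the code.
    a = sum(1 for code, pmid in gold
            if ({code} | ancestors(code, parents, max_levels))
            & pred_by_doc.get(pmid, set()))

    # E: predicted pairs that are exact, or ancestors of a known code of the document.
    e = sum(1 for code, pmid in predicted
            if any(code == g or code in ancestors(g, parents, max_levels)
                   for g in gold_by_doc.get(pmid, set())))

    recall = a / len(gold) if gold else 0.0
    precision = e / len(predicted) if predicted else 0.0
    return recall, precision


# Toy hierarchy for illustration only.
toy_parents = {
    "alpha-glucoside transport": ["glucoside transport"],
    "glucoside transport": ["carbohydrate transport"],
}
```

With `gold = {("glucoside transport", "P1")}`, predicting `carbohydrate transport` for P1 is accepted, while predicting `alpha-glucoside transport` is not, matching the asymmetry described above.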
The definition of 'ancestor' can of course be varied depending upon how far up the tree one considers. This is formalized by ANCESTOR_LEVEL, a parameter that can be varied systematically. For example, when set to 1 ancestors are limited to parents. Table 10 presents our results using this relaxed evaluation scheme with ANCESTOR_LEVEL varying from 1 to 5. Unfortunately the results indicate that we do not achieve improvements in FScore even when we consider ancestors 5 levels up the hierarchies. But all is not lost as we see next!
Table 11 takes a different perspective on assessing performance within the context of this experiment. Note first that thus far results have been obtained from averages of scores for each GO code. To explain, we have 5 splits in our experiment design (see section 2), and each GO code appears in each split with a roughly equal number of positive examples. Within a split we first calculate FScore for each code and then average these FScores. Tables 4 and 5 show such averages for each split as well as the global average. This approach to evaluation reflects a 'code' perspective with all codes being considered equally important. A different way to summarize performance is to consider each code-pmid combination as an independent decision that has to be made. Each combination needs to be declared as positive or negative by our classifiers. Thus given N codes and M pmids, N × M decisions are to be made. Averages may then be computed across the set of decisions in a split. In table 11 results are presented from this perspective of individual decisions.
Observe first that we have new baselines identified for each hierarchy. Note also that from the decision perspective, CC is the easier hierarchy followed by MF and then BP. When compared to these baselines we find steady improvements as the definition of ancestor changes. Is the decision perspective useful? The answer is yes. Averaging by the code (as done in the previous experiments) tells us which codes are more challenging than others. While designing annotation systems, we need to know code level differences that may lead to tailored strategies. For example the classifier system may differ by code level in the hierarchies. So the "code perspective" is certainly important. However, the decision perspective is more indicative of performance in terms of our end goal: annotation at the gene product level. The decision perspective implies that each annotation decision, irrespective of code, is equally important.
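The difference between the two perspectives can be made concrete with a small sketch. It reads the code perspective as an average of per-code Fscores and the decision perspective as a single Fscore over the pooled code-pmid decisions. The pooled reading is an interpretation of the description above, and the counts in the example are invented.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def code_perspective(counts):
    """Average of per-code Fscores: every code weighs the same.

    counts maps a code to its (tp, fp, fn) tuple within one split.
    """
    return sum(f_score(*c) for c in counts.values()) / len(counts)


def decision_perspective(counts):
    """Fscore over pooled decisions: every code-pmid decision weighs the same."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_score(tp, fp, fn)


# Invented counts: one frequent, well-classified code and one rare, poorly classified code.
toy_counts = {"GO:frequent": (90, 10, 10), "GO:rare": (0, 5, 5)}
```

In this toy case the code perspective gives 0.45 because the rare code pulls the average down, whereas the decision perspective gives about 0.86 because most individual decisions involve the frequent code.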
Finally, we consider the annotation of the gene/gene product (i.e., the locus id) itself. We test a simple strategy of annotating a gene with a code if the code is assigned by our system of classifiers to a document that is relevant to the gene. Using this strategy we obtain for MF an Fscore of 0.31 (recall = 0.35 and precision = 0.28), for CC an Fscore of 0.36 (recall = 0.47 and precision = 0.29) and an Fscore of 0.22 for BP (recall = 0.26 and precision = 0.191). These scores are on the low side indicating that on the whole the problem of annotation is hard and one that offers many challenges.
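The simple phase 3 strategy amounts to taking the union of the codes assigned to a gene's relevant documents. A minimal sketch, with hypothetical mapping names and invented identifiers:

```python
def annotate_genes(gene_docs, doc_codes):
    """Assign to each gene every code predicted for any of its relevant documents.

    gene_docs maps a gene (locus id) to its relevant document ids.
    doc_codes maps a document id to the set of codes the classifiers assigned to it.
    """
    return {gene: set().union(*(doc_codes.get(doc, set()) for doc in docs))
            for gene, docs in gene_docs.items()}


# Invented example data.
example_gene_docs = {"locus1": ["pmid1", "pmid2"], "locus2": ["pmid3"]}
example_doc_codes = {"pmid1": {"GO:A"}, "pmid2": {"GO:A", "GO:B"}}
example_annotations = annotate_genes(example_gene_docs, example_doc_codes)
```

Because a single positive document decision is enough to annotate the gene, false positives at the document level carry over directly, which is one plausible reason precision is lower than recall in the scores above.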
We observe that the order of difficulty for the hierarchies at the gene product annotation level has CC being easier than MF and then BP. This parallels the order observed with the decision perspective (see table 11). We view these phase 3 results (of the gene annotation problem, see section 2.3) as preliminary. Our focus in this paper is on gaining a better understanding of phase 2, which is document classification with GO codes.

Codes with less than five positive documents
We observe that our dataset contains 1,125, 960 and 239 codes that have less than five associated documents from the molecular function, biological process and cellular component hierarchies respectively. The question we ask is what would be the level of performance if we built classifiers with very few positive examples (1, 2, 3 or 4)?
One challenge in addressing this question is that even if we build a classifier with a training set that has say 3 positive examples, we will then have only 1 positive example at best in our test set (for a code that has only 4 positives). This is insufficient to give a true reading about the quality of the model. Hence we address this question with an experiment that simulates the situation of codes with few positive examples.
We first identify codes that have at least 10 positive documents and then divide the data temporally into two parts such that each part has five positive documents. We use the earlier part to build the classifier model and the later part for testing the model. Figure 1 illustrates the division process. The documents in the dataset are first organized in publication date sequence. Ties are broken randomly. Then the temporal stream from the beginning to the position of the fifth positive document is taken to be the training data while the stream from the next document to the tenth positive document is taken to be the testing data.
Since the temporal partition is code-specific different codes are likely to have different numbers of positives and negatives in their datasets. This approach for splitting the data is realistic in that it reflects the manner in which information (i.e., documents) about a code collects over time which in turn depends upon the timestamp of publications.
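The code-specific temporal partition can be sketched as below. The sketch assumes the document stream has already been sorted by publication date (with ties broken by the caller) and is given as (document id, is positive) pairs; the function name and the error handling are illustrative.

```python
def temporal_split(stream, n_train_pos=5, n_test_pos=5):
    """Split a date-ordered stream for one code into training and testing parts.

    Training runs from the start through the n_train_pos-th positive document.
    Testing runs from the next document through the following n_test_pos-th positive.
    """
    train, test, seen_pos = [], [], 0
    for doc_id, is_pos in stream:
        if seen_pos < n_train_pos:
            train.append((doc_id, is_pos))
        elif seen_pos < n_train_pos + n_test_pos:
            test.append((doc_id, is_pos))
        else:
            break  # documents after the last needed positive are not used
        if is_pos:
            seen_pos += 1
    if seen_pos < n_train_pos + n_test_pos:
        raise ValueError("code has too few positive documents for this split")
    return train, test
```

Each code gets its own cut points, so the number of negatives on either side varies from code to code, as noted above.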
Assuming that we are now testing classifier models built with only one positive document, we generate five datasets from the training set (as labeled in Figure 1). These differ only in the positive document that is included. Each dataset is used to build a classifier which is then tested on the newer testing dataset. The average Fscore for the five classifiers is computed. The same 5-fold strategy is used to simulate codes with only 2 or 3 or 4 positive documents.
Since the initial training portion has 5 positive documents we can generate, at least, 5 different combinations of up to 4 positives which enables five-fold cross validation.
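The reduced training sets can be enumerated with a short sketch. It lists every way of keeping k of the five positives alongside all negatives; how exactly five variants are chosen when more than five combinations exist (for k = 2 or 3) is not stated, so the sketch simply returns all of them. Names are illustrative.

```python
from itertools import combinations


def training_variants(positives, negatives, k):
    """Return every training set that keeps exactly k positives plus all negatives."""
    return [list(subset) + list(negatives)
            for subset in combinations(positives, k)]


# Five positives yield 5, 10, 10 and 5 variants for k = 1, 2, 3 and 4.
example_sizes = [len(training_variants(["p1", "p2", "p3", "p4", "p5"], ["n1", "n2"], k))
                 for k in (1, 2, 3, 4)]
```

Since every k from 1 to 4 offers at least five distinct subsets, averaging over five classifiers per setting is always possible.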
We observe again, as in earlier sections, that we need to set thresholds appropriately. This is because once again the SVM classifiers produce mostly negative scores. Thus we calculate optimal thresholds using a tuning set of codes, with a single threshold set for each hierarchy.
We run the experiment in two modes. In the first mode, labeled FirstFiveTest, we create the test set as described above, i.e., consisting of the temporal sequence that runs up to the fifth new positive document. In the second mode, labeled FullTest, the test set includes all remaining documents in the data stream. Our goal is to see if test sets that are temporally closer to the training data have an advantage in terms of classifier performance. Table 12 shows the results. As expected performance improves for all three hierarchies as the number of positives increases. Going from 1 to 4 positives we see performance at least doubling for MF and BP and about a 50% increase for CC. The table also provides for comparison (in the row labeled GT4) the performance for these same codes when code classifiers are built in the standard way (see section titled "Hierarchy Specific SVM Score Thresholds"). Since those runs were made against the full set of data, we may only make comparisons with the FullTest runs. We see that for each hierarchy we achieved far stronger results in the earlier experiment. Performance is at least halved when we compare GT4 with performance using 4 positives. These observations are limited by the fact that the designs of the two experiments are different. The earlier experiment used randomized 5-fold cross validation while this one is designed along a temporal dimension. And yet a key factor is also that the number of positives used for training is far larger in the earlier experiment.
Figure 1. Temporal Document Stream. The figure illustrates the process used to divide the dataset by time into two portions, one for training and one for testing. @ (%) represents a document that is (is not) associated with the code.
Interestingly, significantly higher performance is achieved when the test set is limited to documents that are temporally close to the training data. Focussing only on the rows with 4 positive documents we see improvements in the range of 19% for molecular function, 17% for biological process and 16% for cellular component hierarchies. These results suggest that there may be a topic drift in the way in which these codes are assigned to documents over time.
However this suggestion is limited by the fact that there were different numbers of documents in the two test sets. We will study this angle further in future research.

Conclusion
We presented a series of experiments designed to explore the value of Support Vector Machine based classifiers for assigning Gene Ontology codes to MEDLINE documents. We find that by using thresholds selected for each hierarchy Fscores of 0.49, 0.41 and 0.33 are obtained for the MF, CC and BP hierarchies respectively (Table 3). This is with a system of SVM classifiers that does not yet capitalize on the hierarchical organization of the codes and does not rely on a relaxed definition of accuracy. We compared the atc and ltc weighting schemes for feature weighting in document vectors. Differences in performance were negligible. Unfortunately our evaluation of feature selection methods did not yield further improvements.
We experimented further with threshold selection strategies. Interestingly, thresholding at the individual code level (as opposed to the full hierarchy) decreases performance due to over training. We explored performance by level and by the number of positives in the training set. The former appears more important, especially for MF and BP. CC in general differs from the other two hierarchies. This may be due to differences in link semantics as almost 50% of links are part_of in CC. In contrast, only a fifth of the links in BP are part_of and there is only 1 such link in MF. Setting level specific thresholds for the second highest level of MF and BP leads to appreciable improvements in Fscore. But this was not the case for level 3. We explored a more relaxed evaluation criterion where classification with a more general code than the target code is considered correct. This yielded appreciable improvements when a decision perspective was taken during evaluation. Finally we presented an experiment studying the effectiveness of classifiers built for GO codes with less than five positive example documents. The loss in performance is severe, at least 50%. We also make an observation that is interesting though tentative: there may be a topic drift in the way in which a code is assigned to a document over time. By implication annotation methods may need to consider this drift to succeed.
From this study we conclude that the hierarchies are different. Also hierarchical level is important. Counter to common intuition more general codes in MF and BP are actually more challenging for classifi-cation. Also counter to common intuition it is not necessarily the case that having more positives in our training data yields better performance. However this intuition is strikingly supported when using less than five positive examples for classification.
There are several other ways in which we will exploit the hierarchical structure in future work. For example, we plan to try an ensemble of classifiers where ensembles are defined through the hierarchy. Finally, we plan on exploring other strategies for phase 3 of the annotation problem which is to determine the codes for a gene/gene product after these codes have been assigned to their relevant documents. The current study has given us a better understanding of the problem of classifying documents with GO codes and prepares us for future work in this direction.

Gene Ontology
Gene Ontology (GO) [17] provides a structured vocabulary that is used to annotate gene products in order to succinctly indicate their molecular functions, biological processes, and cellular components [18]. Although different subsets of GO may be used to annotate different species, the intent is to provide a common annotation infrastructure. We looked at the distribution of the GO codes in our dataset in terms of the number of documents associated with each. The range is 1 to 333 for MF, 1 to 789 for CC and 1 to 579 for BP.
In all experiments (except for the section exploring classifiers for codes with fewer than 5 positive documents) we limited ourselves to only those codes that had at least 5 (unique) documents associated. Thus we get 283 unique codes for BP, 93 for CC and 214 for MF. We used 5 as the threshold given the 5-fold cross validation design of our experiments, which ensures that each code has at least 1 evidence document in each split. Interestingly, some code-pmid combinations occur more than once. This happens when the same document offers two different kinds of evidence, say TAS as well as IDA, for annotation. Limiting these combinations to the unique occurrences gives us 7,200 annotations for BP, 4,391 for CC and 3,877 for MF.
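The filtering described above can be illustrated with a minimal sketch. It assumes the annotations are available as (code, pmid, evidence) tuples; the function name, the record layout and the placeholder identifiers are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def filter_annotations(records, min_docs=5):
    """Collapse repeated code-pmid pairs and keep well-supported codes.

    records: iterable of (code, pmid, evidence) tuples (assumed layout).
    Returns a dict mapping each retained code to its sorted unique pmids.
    """
    # A code-pmid pair may repeat with different evidence types
    # (e.g. TAS and IDA); a set keeps each pair only once.
    unique_pairs = {(code, pmid) for code, pmid, _evidence in records}

    docs_per_code = defaultdict(set)
    for code, pmid in unique_pairs:
        docs_per_code[code].add(pmid)

    # Keep only codes with at least min_docs unique documents.
    return {
        code: sorted(pmids)
        for code, pmids in docs_per_code.items()
        if len(pmids) >= min_docs
    }


# Placeholder data: CODE_A has 5 unique documents (one listed twice with
# different evidence types), CODE_B has only 2.
example_records = [
    ("CODE_A", 101, "TAS"),
    ("CODE_A", 101, "IDA"),
    ("CODE_A", 102, "TAS"),
    ("CODE_A", 103, "IDA"),
    ("CODE_A", 104, "TAS"),
    ("CODE_A", 105, "TAS"),
    ("CODE_B", 201, "IDA"),
    ("CODE_B", 202, "TAS"),
]
kept = filter_annotations(example_records)
```

In this toy input only CODE_A survives the filter, since CODE_B falls below the 5-document threshold.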
The data for each hierarchy was randomly split into 5 splits such that each code appears in each split with near equal numbers of evidence documents. The overall cross validation strategy is to iteratively take 4 splits as training data and test the trained model on the remaining fifth split. As an example, for split 1 we take splits 2-5 as training data and split 1 as testing data. This ensures that there are at least 4 relevant documents for a code on the training side and at least 1 on the test side.
For the experiment with codes having fewer than 5 positive documents we simulated the situation using codes with at least 10 positive documents. Our dataset contained 89, 152 and 50 codes with at least 10 positive documents for the molecular function, biological process and cellular component hierarchies respectively. From this collection we set aside approximately 10% of the codes for each hierarchy, with a minimum of 10 codes, as tuning data. Specifically, we tuned the thresholds with 10, 15 and 10 codes for the three hierarchies respectively.
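As an illustration of this splitting scheme, the sketch below distributes the evidence documents of a single code across 5 splits in round-robin fashion after shuffling, and then enumerates the train/test folds. It is a simplified assumption-laden sketch: it treats one code in isolation and ignores the fact that a document may carry several codes, and the function names and the seed parameter are hypothetical.

```python
import random


def split_code_documents(pmids, n_splits=5, seed=0):
    """Randomly spread one code's documents over n_splits splits.

    Round-robin assignment after shuffling keeps split sizes within
    one document of each other.
    """
    rng = random.Random(seed)
    shuffled = list(pmids)
    rng.shuffle(shuffled)

    splits = [[] for _ in range(n_splits)]
    for i, pmid in enumerate(shuffled):
        splits[i % n_splits].append(pmid)
    return splits


def cross_validation_folds(splits):
    """Yield (train, test) pairs: each split is the test set once."""
    for test_idx in range(len(splits)):
        train = [
            pmid
            for i, split in enumerate(splits)
            if i != test_idx
            for pmid in split
        ]
        yield train, splits[test_idx]


# A code with exactly 5 documents: every fold has 4 training documents
# and 1 test document.
folds = list(cross_validation_folds(split_code_documents([11, 12, 13, 14, 15])))
```

With the minimum of 5 documents per code, each split receives exactly one document, which is why the 5-document threshold guarantees at least 4 training and 1 test document per fold.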

Document representation
In information retrieval research, the most widely used document representation method is the "bag of words" approach where all the terms are used to form a vector representation. Functional or connective words are considered stop words and are generally removed since they are assumed to have no information content. The term features can be weighted, for example, with TF × IDF weights or boolean weights. Alternative methods of defining terms have been explored, but with little significant improvement in text classification performance. Recent research by Moschitti and Basili [20] suggests that the elementary textual representation based on words applied to SVM models is very effective in text classification. More complex linguistic features such as part-of-speech information and word senses did not contribute to the predictive accuracy of SVMs.
We used the title, abstract, RN and MeSH fields of the MEDLINE records. Stemmed words from these fields (after removing stop words) were used to generate vector representations for documents. These were produced using the SMART system [21]. The "atc" [22] variant of the TF × IDF weighting scheme was applied to the terms for most of the experiments. This representation has worked well in our previous research [23]. We also tested the "ltc" weighting scheme. These schemes are described below.
Here tf is the number of times a term occurs in a document. maxtf is the highest tf observed. N is the number of documents in the dataset and n is the number of documents in which term i occurs.
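The two schemes can be sketched as follows, assuming the conventional SMART definitions rather than formulas taken from this paper: the "a" component is the augmented term frequency 0.5 + 0.5 × tf / maxtf, the "l" component is the logarithmic term frequency 1 + ln(tf), the "t" component is the inverse document frequency ln(N / n), and the "c" component is cosine normalization of the document vector. The sketch further assumes that maxtf is taken within the document and that natural logarithms are used; the function and parameter names are hypothetical.

```python
import math


def weight_vector(term_counts, doc_freq, n_docs, scheme="atc"):
    """Compute atc or ltc weights for one document.

    term_counts: dict term -> tf in this document
    doc_freq:    dict term -> n, number of documents containing the term
    n_docs:      N, number of documents in the dataset
    """
    if scheme not in ("atc", "ltc"):
        raise ValueError("scheme must be 'atc' or 'ltc'")
    if not term_counts:
        return {}

    # Assumption: maxtf is the largest tf within this document.
    max_tf = max(term_counts.values())

    raw = {}
    for term, tf in term_counts.items():
        if tf <= 0:
            continue
        if scheme == "atc":
            tf_part = 0.5 + 0.5 * tf / max_tf  # augmented tf ("a")
        else:
            tf_part = 1.0 + math.log(tf)       # logarithmic tf ("l")
        idf = math.log(n_docs / doc_freq[term])  # idf ("t")
        raw[term] = tf_part * idf

    # Cosine normalization ("c"): scale the vector to unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


# Toy document with made-up counts and document frequencies.
counts = {"kinase": 3, "bind": 1}
dfs = {"kinase": 10, "bind": 100}
atc = weight_vector(counts, dfs, n_docs=1000, scheme="atc")
ltc = weight_vector(counts, dfs, n_docs=1000, scheme="ltc")
```

Both schemes share the idf and normalization steps and differ only in how the raw term frequency is dampened, which is consistent with the small performance differences reported between them.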