In computational biology, critical assessment of algorithms plays an important role in keeping the field honest about utility by ensuring progress is measurable and in a direction that is helpful in solving biological problems. The recognition of the need for assessment dates back to the first Critical Assessment of Structure Prediction (CASP), which aimed to determine the state of the art in protein structure prediction. CASP's ongoing assessment has proven highly successful in characterizing progress, and 20 years later CASP largely defines the field of protein structure prediction. CASP has a number of features that are important to its success, some of which were built in from the start and others of which were the result of lessons learned along the way. Among those features are forcing participants to make true predictions rather than blinded post-dictions (limiting over-training), the use of fully representative evaluation metrics (limiting artifactual performance), and the recognition of sub-problems that are treated as distinct tasks (allowing for different strategies, e.g. "template-free" vs. "template-based" prediction). In addition to these inherent lessons, CASP has taught the field that progress in "template-free" (ab initio) prediction, while substantial, is slower than progress in prediction that can directly leverage existing protein structures thanks to sequence similarity. CASP's results have also shown that the aggregation of algorithms is an effective way to reduce the effect of "noisy" low-quality predictions.
CASP has inspired numerous other critical assessments, including the topic of this paper, the first Critical Assessment of Functional Annotation (CAFA). CAFA was aimed at assessing the ability of computational methods to predict gene function, starting from protein sequences. The general approach for predicting gene function is often referred to as "guilt by association" (GBA). In a computational GBA framework, the input data take the form of similarities among genes (sometimes treated as a graph or gene network) and some initial functional labeling (often based on the Gene Ontology or a related scheme). Genes that are in some sense "near" genes with a given label might be proposed to share that label (function) with them. While a common measure of "guilt" uses sequence similarity, numerous other data types have been used alone or in combination, such as coexpression, protein interactions, patterns of conservation and genetic interactions. All of these have been shown to be predictive of gene function to varying degrees when tested by cross-validation, and numerous algorithms of varying levels of sophistication have been proposed. However, independent assessment of computational GBA is not routine. CAFA represented an attempt to fill this gap. This is a very challenging goal because of problems in defining gold standards and evaluation metrics.
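The GBA idea described above can be illustrated with a minimal neighbor-voting sketch. This is only one simple instance of GBA, not the method of any CAFA participant; the function name `neighbor_vote`, the gene identifiers and the similarity weights are all hypothetical and chosen purely for illustration.

```python
def neighbor_vote(similarity, labeled, gene):
    """Score a gene for one function by neighbor voting.

    similarity: dict mapping gene -> {neighbor: weight} (hypothetical toy network)
    labeled:    set of genes already annotated with the function
    Returns the fraction of the gene's total similarity weight that
    falls on neighbors carrying the label (0.0 if it has no neighbors).
    """
    total = 0.0
    on_label = 0.0
    for neighbor, weight in similarity.get(gene, {}).items():
        total += weight
        if neighbor in labeled:
            on_label += weight
    return on_label / total if total else 0.0


# Toy data (assumed, not from the paper): g2 and g3 carry the label.
similarity = {
    "g1": {"g2": 1.0, "g3": 1.0},
    "g4": {"g2": 0.5, "g5": 1.5},
}
labeled = {"g2", "g3"}

score_g1 = neighbor_vote(similarity, labeled, "g1")  # all neighbors labeled
score_g4 = neighbor_vote(similarity, labeled, "g4")  # a quarter of the weight
```

In this toy network, g1 would be ranked above g4 as a candidate for the function, since all of its similarity weight points at labeled genes.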
In the first CAFA, in which we were only observers, approximately 47,000 protein sequences from UniProt were selected as targets. These sequences were chosen because they lacked high-quality evidence codes ("EXP", "TAS" or "IC") on any of their Gene Ontology (GO) annotations (if they had any), and thus were considered to have "unknown function" (http://biofunctionprediction.org/node/262). Participants were asked to assign GO terms to the proteins, with no more than 1000 terms assigned to any given target. Importantly, the GO terms that would eventually be used for assessment were not known in advance, and participants were left to decide which data to use as input to their algorithms, without restrictions. The large number of targets in CAFA, together with the unpredictability of which sequences would ultimately be used for assessment, ensured that it would be difficult for participants to "game" the system.
After participants submitted their predictions, a six-month waiting period ensued. At the end of this period, UniProt/GOA was checked for updates to the GO annotations of the targets. New GO annotations that had "EXP", "TAS" or "IC" evidence codes were treated as the gold standard "truth" to which each participant's predictions would be compared. Such new GO annotations were available for ~750 sequences. As set out by the CAFA rules, performance for a submission was to be measured for each target by comparing the GO annotation predicted by the algorithm with the truth. To capture the idea of "near misses", novel measures of precision and recall were devised by the organizers, based on the number of (up-propagated) GO terms shared by the truth and the prediction.
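One plausible reading of this per-target measure can be sketched as follows. The exact definition used by the CAFA organizers is not given here, so this is an assumption: both term sets are closed under ancestors, and precision and recall are computed from the overlap. The toy ontology, its term names, and the decision to count the root term are all hypothetical.

```python
# Toy ontology (hypothetical term names, not real GO identifiers):
# each term maps to the set of its direct parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
    "protein_kinase": {"kinase"},
}


def up_propagate(terms, parents=PARENTS):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents[term])
    return closed


def precision_recall(predicted, truth, parents=PARENTS):
    """Precision and recall from shared up-propagated terms (assumed form)."""
    pred_closed = up_propagate(predicted, parents)
    true_closed = up_propagate(truth, parents)
    shared = len(pred_closed & true_closed)
    precision = shared / len(pred_closed) if pred_closed else 0.0
    recall = shared / len(true_closed) if true_closed else 0.0
    return precision, recall


# A "near miss": the prediction is the parent of the true term.
near_miss = precision_recall({"kinase"}, {"protein_kinase"})
```

Under these assumptions, predicting "kinase" when the truth is "protein_kinase" gives full precision but incomplete recall, which is the sense in which a near miss receives partial credit. Note that counting the root term means even unrelated predictions share one term with the truth; a real evaluation might exclude it.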
CAFA was structured differently from an earlier assessment that had similar motivations, Mousefunc. Mousefunc provided participants with a blinded set of prepared data (gene networks and the like), and an honor system was used to prevent the nine participating groups from reverse-engineering the coding. In addition to classifying a training set of genes against the target GO terms, participants made predictions for a held-out 10% of the genes, which were unmasked for the assessment. The Mousefunc assessors concluded that, in general, methods performed fairly similarly (combined yielding 41% precision at 20% recall), with one method standing out as the "winner" by a modest margin, and that "molecular function" terms were easier to predict than those for the "biological process" aspect of GO. By far the most informative data sets were those based on protein sequence (while sequences were not directly used, two of the data sets were founded on protein sequence patterns). A set of 36 novel predictions (that is, high-scoring apparent "false positives") was evaluated by hand and found to have literature support at an encouraging rate of around 60%. Recently, we reanalyzed the Mousefunc data and many other gene networks and showed that much of the learnability of gene function in cross-validation is explained by node degree effects (to a first approximation, assigning all functions to "hubs" is a surprisingly effective strategy). We hypothesized that the problem of prediction specificity would play a similar role in CAFA results.
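The node degree effect mentioned above can be made concrete with a small sketch of a degree-only baseline: every function receives the same gene ranking, ordered purely by the number of network connections. This is an illustrative simplification rather than the analysis performed in our earlier work; the function name `degree_ranking` and the toy edge list are hypothetical.

```python
from collections import Counter


def degree_ranking(edges):
    """Rank genes by node degree, highest first (ties broken by name).

    The same ranking is used as the "prediction" for every function,
    so it carries no function-specific information at all.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree, key=lambda gene: (-degree[gene], gene))


# Toy network (assumed): one hub connected to three other genes.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
ranking = degree_ranking(edges)
```

If such a function-agnostic ranking performs well in cross-validation, good performance cannot be taken as evidence that an algorithm has learned anything specific about individual functions.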
In this paper, we report the results of our independent assessment of a portion of the CAFA results made available to us. Our intention was to assist the CAFA organizers in making the most of the assessment, and to gain insight into how gene function prediction performs "in the wild". Our results suggest that most prediction algorithms struggle to beat BLAST. An evaluation based on an information-based metric suggests that informative predictions are made at a rate of at best 15%, and that many of the informative predictions made could be inferred from the pre-existing literature and GO annotations. In agreement with our previous results on multifunctionality, informative predictions tended to be made to GO groups containing highly multifunctional genes. We find a comparison of the CASP and CAFA tasks to be informative, in terms of some of the lessons learned through CASP and how they might be applied to CAFA in the future. However, the evidence suggests that many of the challenges are fundamentally due to reliance on an imperfect gold standard.