Task 1 was divided into two sub-tasks, reflecting different sources of data. Task 1a focused on the identification of gene or protein names in running text. The data for this task were provided by Lorrie Tanabe and John Wilbur (NCBI) and were derived from annotation of single sentences taken from MEDLINE abstracts. This task was very close to the "named entity tagging" task that has been used extensively in the natural language processing community. This made it easy for groups whose main expertise was in natural language processing to participate: this was the most heavily subscribed BioCreAtIvE task, with 15 teams participating.
An example sentence is shown below:
Furthermore, as in the human gene, the 3' end of the Cacna1f gene maps within 5 kb of the 5' end of the mouse synaptophysin gene in a region orthologous to Xp11/23.
In this example, the system must identify the gene/protein names Cacna1f gene (or Cacna1f) and mouse synaptophysin gene (or minimally, synaptophysin). However, a phrase like "the human gene" is not marked because it is not the name of a particular gene. The answer key provides for alternative forms, e.g., Cacna1f gene or Cacna1f.
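The role of alternative forms in the answer key can be illustrated with a minimal sketch. It assumes a simplified scoring scheme in which each gold name is a set of acceptable strings and a predicted mention is correct if it matches any one of them; the function name and data layout are hypothetical, and the actual evaluation may have compared text positions rather than strings.

```python
def score_sentence(predicted, answer_key):
    """Count true positives, false positives and false negatives.

    predicted  -- list of name strings returned by a system (hypothetical format)
    answer_key -- list of sets; each set holds the acceptable forms of one name
    """
    matched = set()  # indices of gold names already credited
    tp = fp = 0
    for mention in predicted:
        hit = next(
            (i for i, forms in enumerate(answer_key)
             if i not in matched and mention in forms),
            None,
        )
        if hit is None:
            fp += 1          # no acceptable form matches this prediction
        else:
            matched.add(hit)
            tp += 1
    fn = len(answer_key) - len(matched)  # gold names never found
    return tp, fp, fn


# Alternative forms as listed for the example sentence above.
answer_key = [
    {"Cacna1f gene", "Cacna1f"},
    {"mouse synaptophysin gene", "synaptophysin"},
]
print(score_sentence(["Cacna1f", "the human gene"], answer_key))  # (1, 1, 1)
```

Here "Cacna1f" is credited through its shorter alternative form, "the human gene" counts as a false positive, and the missed synaptophysin name counts as a false negative.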
Participants were given 10,000 annotated training sentences and were tested on an additional 5,000 blind test sentences. The main finding from task 1a was that four different teams, using techniques such as Hidden Markov Models and Support Vector Machines, were able to achieve F-measures over 0.80 (F-measure is the harmonic mean of precision and recall). This is somewhat lower than figures for similar tasks from the news wire domain. For example, extraction of organization names has been done at over 0.90 F-measure. The article by Yeh et al. provides an analysis of these differences, attributing about half the difference in F-measure to two facts: systems show lower performance for longer names (also noted in ), and gene and protein names skew longer than organization names.
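Since F-measure is defined as the harmonic mean of precision and recall, it can be computed as in the short sketch below. The precision and recall values are illustrative only and are not results reported by any participating team.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Illustrative values, not taken from the evaluation.
print(round(f_measure(0.85, 0.76), 3))  # 0.802
print(round(f_measure(0.99, 0.55), 3))  # 0.707
```

The second call shows why the harmonic mean is used: a large gap between precision and recall pulls the score well below their arithmetic mean (0.77 in that case).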
Data preparation for task 1a [1, 2] had several interesting features. In particular, the data were annotated by biologists, without explicit annotation guidelines. This is a novel approach to annotation: annotation of named entities for news wire (e.g., person, organization, location) for the Message Understanding Conference tasks required extensive multi-page annotation guidelines. For task 1a, no systematic inter-annotator agreement studies were carried out to assess the quality of the test data. However, some post-evaluation analysis indicated that there may have been inconsistencies in how compound terms, such as "Mek-Erk1/2 pathway", were annotated.
These inconsistencies made it difficult to learn generalizations from the training data, thus reducing scores; this may also account for some of the discrepancy between performance on the gene/protein name extraction task and performance on the news wire tasks.
Task 1a was viewed as a "building block" task – a task that could be treated as a natural language processing task that required no significant biological expertise. It also constitutes a first step for more complex tasks, such as gene name normalization (task 1b) or functional annotation of genes (task 2).
Task 1b focused on creating normalized gene lists; this is a task that is currently performed (manually) by curators for various model organism databases. This meant that there was a readily available data set for both training and testing. We chose three model organism databases (fly, mouse, yeast) as sources of gene lists associated with papers. Our goal in choosing several model organisms was to encourage approaches that could be readily applied to different vocabularies.
We were committed to providing large training and test sets for this task. Due to the difficulties of obtaining large quantities of full text articles, we chose to provide only the abstracts of articles from MEDLINE for the evaluation. This meant that we had to edit the gene lists to make them correspond to genes mentioned in the abstract, rather than all the genes curated in the full text article. We developed a procedure to automatically remove genes not found in the abstract and were able to provide a large quantity of "noisy" training data for the three organisms, together with small collections of carefully corrected development and test data. We estimated the quality of the noisy training data for the three organisms. Yeast training data quality appeared to be quite good (precision 0.99, recall 0.86); fly training data was a little noisier (precision 0.92, recall 0.86); and mouse training data had poor recall (precision 0.99, recall 0.55). We also provided synonym lists for each organism, consisting of the unique gene identifier and its alternate names, as listed in the resources provided by each model organism database.
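The idea of pruning a curated gene list down to the genes mentioned in the abstract can be sketched as follows. This is not the organizers' actual procedure, which is not described here; it is a minimal illustration that assumes a gene is kept when any of its synonyms appears in the abstract as a whole token, matched case-insensitively. The function name, the identifiers and the synonym entries are all hypothetical.

```python
import re


def filter_gene_list(abstract, curated_ids, synonyms):
    """Keep only the curated gene identifiers with a synonym found in the abstract.

    abstract    -- abstract text
    curated_ids -- identifiers curated for the full text article
    synonyms    -- dict mapping identifier -> list of alternate names
    """
    text = abstract.lower()
    kept = []
    for gene_id in curated_ids:
        for name in synonyms.get(gene_id, []):
            # Require non-word characters (or text boundaries) around the name
            # so that short names do not match inside longer words.
            pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
            if re.search(pattern, text):
                kept.append(gene_id)
                break
    return kept


# Hypothetical identifiers and synonym lists, for illustration only.
synonyms = {
    "GENE_A": ["Cacna1f"],
    "GENE_B": ["synaptophysin"],
    "GENE_C": ["rhodopsin"],
}
abstract = ("The 3' end of the Cacna1f gene maps within 5 kb of the 5' end "
            "of the mouse synaptophysin gene.")
print(filter_gene_list(abstract, ["GENE_A", "GENE_B", "GENE_C"], synonyms))
# ['GENE_A', 'GENE_B']
```

A string-matching filter of this kind would be expected to produce noisy lists: genes referred to by an unlisted variant are dropped, which lowers recall, while ambiguous names that match unrelated text are kept, which lowers precision.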
Figure 2 shows a sample abstract with the associated unique gene identifiers, plus an excerpt from the lexicon, showing the many alternate names associated with genes. Although genes may be mentioned more than once in an abstract, the gene list consists of the set of unique mouse genes mentioned in the abstract.
There were eight groups participating in task 1b. The results varied considerably, from a high for yeast of 0.92 F-measure, to somewhat lower scores for fly (high F-measure of 0.82) and mouse (high F-measure of 0.79). Our analysis showed that the differences among organisms could be attributed to a variety of factors, including extensive ambiguity in names and overlap of gene names with English terms (fly); complex multi-word gene names (mouse); and quality of the training data, especially for mouse, where recall on the training data was estimated at 0.55.
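Because the gene list for an abstract is a set of unique identifiers, scoring a system's output reduces to a set comparison. The sketch below assumes per-abstract scoring with hypothetical identifiers; how scores were aggregated across abstracts in the actual evaluation is not specified here.

```python
def score_gene_list(predicted_ids, gold_ids):
    """Return (precision, recall, F-measure) for one abstract's gene list."""
    predicted, gold = set(predicted_ids), set(gold_ids)  # repeated mentions collapse
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical identifiers; "G2" is predicted twice but counted once.
p, r, f = score_gene_list(["G1", "G2", "G2", "G3"], ["G1", "G2", "G4"])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```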
These results lead us to believe that tools for automated gene name identification and normalization may be ready to be incorporated into the curation process, at least where organism nomenclature is highly regular, such as yeast, and authors adhere to the model organism database conventions in the literature. However, in many cases the real task is more complicated, for example when papers for several organisms are analyzed simultaneously, since the same names are used for different genes in different species.
Task 2 focused on the automatic assignment of GO annotations to human proteins, based on full text articles. There were several parts to task 2, corresponding to ascending degrees of difficulty. For task 2, the organizers made a conscious decision to provide data "as is" to reflect the realities of a biological application. The training set consisted of around 800 full text journal articles and their associated annotations (protein and GO code) taken from GOA (http://www.ebi.ac.uk/GOA/). These were released to participants with no further annotations; that is, it was left to participants to determine the evidence passages that supported the GO annotations. The test set consisted of approximately 200 articles that were curated by the GOA team specifically for the assessment; these were not released until after the assessment was complete, to keep the data blind. In contrast to task 1, the participants also had to find their own lexical resources, such as synonyms for GO terms as well as protein name synonyms.
The input for task 2.1 consisted of triples made up of a pointer to a full text article, a protein (SWISS-PROT ID) and a GO code. The task was to return a short text passage providing evidence for the GO code assigned to that protein. Ideally, the text passage was to contain a mention of the protein and the evidence for the GO code assignment. These passages were judged for correctness by expert curators from the EBI GOA team. There were approximately 1000 triples presented to the systems for task 2.1.
Figure 3 shows three examples of triples and the corresponding text passages. Figure 3a is relatively easy, because both the protein and a description of the function or process appear in a single sentence. Figure 3b illustrates why this task is hard. The first sentence provides the information that the protein of interest (RGS16) is an RGS protein: "We report that calmodulin binds in a Ca2+-dependent manner to all RGS proteins we tested, including RGS1, RGS2, RGS4, RGS10, RGS16, and GAIP...". This knowledge then makes it possible to identify evidence in a later sentence that supports the GO annotation (regulation of G-protein coupled receptor protein signaling pathway): "To investigate the role of Ca2+ in feedback regulation of G protein signaling by RGS proteins, we characterized..." Finally, Figure 3c is harder still, requiring some reasoning to determine evidence for the annotation of MIP-1alpha. The first sentence establishes that CCR1 is related to a G-protein coupled receptor pathway and the second sentence states that MIP-1alpha binds to this receptor, which supports the deduction that it is also related to this process.
As these examples show, task 2.1 was a very difficult task. It required not only name extraction and normalization for proteins (as in task 1), but also the ability to recognize different ways of phrasing GO terms, without any training data. It also required an understanding of the connections among multiple sentences in an article, including the handling of co-reference and reasoning about connections among entities mentioned in those sentences.
We found it encouraging that several systems were able to return over 300 answers (out of approximately 1000 possible) that were judged correct by the assessors. The different systems used a wide range of strategies. For example, several systems returned answers only where the evidence was very strong; these systems returned very few answers but with a higher proportion correct.
Task 2.2 was more difficult still: for this task, the test data consisted of triples of text, protein, and number of GO codes (but not the actual GO codes). The systems were required to return not only evidence passages as before, but also GO code assignments for the protein. Performance dropped by roughly a factor of two relative to task 2.1.
Overall, the performance on task 2 is not surprising. Task 2.1 involves three subtasks: identification of the protein, identification of the GO term, and correct association of the two. Identification of the protein mention in text would be roughly comparable to task 1b, and we would expect the best systems to achieve between 70% and 80% accuracy. Identification of mentions of GO terms would be substantially harder. Analysis of the results revealed that GO terms for cellular location turned out to be easier than terms for biological process. This may be related to the fact that terms for cellular location were shorter and more "concrete". By contrast, terms for biological process are abstract and complex, e.g., "cytokine and chemokine mediated signaling pathway". We would expect performance on GO term identification to be significantly lower than performance on protein name identification. Furthermore, finding the correct association between protein and GO annotation, especially where the association requires integration of information across multiple sentences, constitutes an additional difficulty. If each of these three steps were accomplished at around 70% accuracy, the expected overall accuracy would be about 0.7 × 0.7 × 0.7 ≈ 34%, close to the observed overall accuracy for task 2.1 of about 30%.
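The compounding of errors across the three subtasks can be made concrete with a small calculation. It assumes that the steps succeed independently, so that the overall accuracy is the product of the per-step accuracies; the mixed figures in the second call are hypothetical and chosen only to show that different per-step values can give a similar overall result.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Overall accuracy of a pipeline whose steps succeed independently."""
    return math.prod(step_accuracies)


# Three steps at about 70% each, as in the estimate above.
print(round(pipeline_accuracy([0.7, 0.7, 0.7]), 3))  # 0.343

# Hypothetical mix: better protein identification, weaker GO term identification.
print(round(pipeline_accuracy([0.8, 0.6, 0.7]), 3))  # 0.336
```

Under this independence assumption, raising the overall figure substantially requires improving every step, since the weakest step bounds the product.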
The results for task 2 demonstrate that current systems are not yet able to produce satisfactory results for the extraction of biological information, especially where it requires complex extrapolation and integration. However, this assessment represents an important baseline. We expect that these results will improve with the availability of training data generated from the task 2.1 and 2.2 submissions. Also, creation of lexical resources for GO terms and paraphrases should make it easier to recognize GO terms in text.