Volume 6 Supplement 1
GENETAG: a tagged corpus for gene/protein named entity recognition
© Tanabe et al; licensee BioMed Central Ltd 2005
Published: 24 May 2005
Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE® sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition.
To ensure heterogeneity of the corpus, MEDLINE sentences were first scored for term similarity to documents with known gene names, and 10K high- and 10K low-scoring sentences were chosen at random. The original 20K sentences were run through a gene/protein name tagger, and the results were modified manually to reflect a wide definition of gene/protein names subject to a specificity constraint, a rule that required the tagged entities to refer to specific entities. Each sentence in GENETAG was annotated with acceptable alternatives to the gene/protein names it contained, allowing for partial matching with semantic constraints. Semantic constraints are rules requiring the tagged entity to contain its true meaning in the sentence context. Application of these constraints results in a more meaningful measure of the performance of an NER system than unrestricted partial matching.
The annotation of GENETAG required intricate manual judgments by annotators, which hindered tagging consistency. The data were pre-segmented into words to provide indices supporting comparison of system responses to the "gold standard". However, character-based indices would have been more robust than word-based indices. GENETAG Train, Test and Round1 data and ancillary programs are freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz. A newer version, GENETAG-05, will be released later this year.
The automatic identification of gene and protein names in the MEDLINE® database of literature citations is a challenging named entity recognition (NER) task. Biomedical NER has been an active research area for some time. Systems capable of high performance on this task are desirable because NER precedes other tasks including information extraction and text mining. The apparent simplicity of the gene/protein NER task conceals its inherent complexity stemming from an often unconventional and ambiguous genetic nomenclature.
We have previously developed AbGene, a gene/protein name tagger trained on MEDLINE abstracts using a combination of statistical and rule-based strategies. Due to the difficulty of manually evaluating AbGene results, we needed to create a tagged corpus for evaluating the performance of AbGene applied to full text articles. The GENIA corpus version 3.0 contains a total of 93,293 biological terms annotated by two domain experts. However, it was not suitable for our purposes because we ran AbGene on unrestricted text, and the GENIA corpus is restricted to text retrieved using the search terms human, blood cell and transcription factor. Additionally, the entities in GENIA are allowed to be generic, whereas AbGene was designed to extract specific gene/protein names only.
One fundamental problem in corpus annotation is the definition of what constitutes an entity to be tagged. For example, the MUC-7 Named Entity Task to identify organizations, persons and locations in text necessitated the lengthy MUC-7 Named Entity Task Definition, which specifies the rules for annotating each entity. The following excerpts from the MUC-7 Named Entity Task Definition exemplify the complexity of the annotation process:
A.1.1 Entity-expressions that modify non-entities
Entity names used as modifiers in complex NPs that are not proper names are to be tagged when it is clear to the annotator from context or the annotator's knowledge of the world that the name is that of an organization, person, or location.
A.1.3 Entity-strings embedded in entity-expressions
In some cases, multi-word strings that are proper names will contain entity name substrings; such strings are not decomposable; therefore, the substrings are not to be tagged. (See A.1.2 re special cases involving prenominal modifiers of person identifiers.)
A.1.6.2 The definite article in an alias
When a definite article is commonly associated with an alias, it also must be tagged.
<ENAMEX TYPE="PERSON">The Godfather</ENAMEX>
However, the scoring program ignores a certain list of premodifiers as specified in section 3.3 which may make the scoring in some of these cases more lenient than this rule implies. The scorer does *not* ignore those premodifiers within quotation marks such as inside the tags in A.1.6.1 above.
The developers of the GENIA corpus followed a less exacting annotation strategy that did not allow determiners, ordinals or cardinals to appear in tagged entities, but left qualifiers, including adjectives, as somewhat arbitrary judgement calls.
For GENETAG annotation, we chose a wide definition of a gene/protein entity, but added a constraint that requires the tagged entity to refer to a specific entity, hereafter called the "specificity constraint." The specificity constraint allows for entities like tat dna sequence but not dna sequence. No distinctions were made between genes, proteins, RNA, domains, complexes, sequences, fusion proteins, etc. A finer-grained definition is possible; for example, proteins, genes and RNA can be distinguished as separate entities using machine learning with 78–84% accuracy. However, most biomedical NER systems do not make these distinctions. Also, Hatzivassiloglou et al. found that their machine learning algorithms did not perform well against a human baseline model, suggesting that either the humans were correct, and the decreased performance was due to classification difficulty, or the machine-learned programs were penalized for being more consistent than humans. Because humans agreed only about 77% of the time on protein, gene and RNA labels, the inclusion of these distinctions in a gold standard would be an additional source of significant ambiguity.
Our decision to include domains, complexes, subunits and promoters (but only if they refer to a specific gene/protein) was based on gene names in GenBank. (Domain: A discrete portion of a protein with its own function. The combination of domains in a single protein determines its overall function. [Source: DOE Genome Glossary http://www.ornl.gov/sci/techresources/Human_Genome/glossary/index.shtml]; Complex: In chemistry, the relatively stable combination of two or more compounds into a larger molecule without covalent binding; Subunit: A single biopolymer separated from a larger multimeric structure [Source: Stedman's Online Medical Dictionary, 27th Edition http://www.stedmans.com]; Promoter: a segment of DNA located at the "front" end of a gene, which provides a site where the enzymes involved in the transcription process can bind on to a DNA molecule, and initiate transcription [Source: Genomics Online Terms http://www.biojudiciary.org/glossary/index.asp?flt=p])
(1) Sf3b4, splicing factor 3b, subunit 4
(2) Mus musculus transaldolase gene, promoter region
(3) bHLH transcription factor mRNA
(4) Xenopus laevis similar to POU domain gene
By defining a gene based on gene names in GenBank, but requiring only a partial match, we have addressed the fact that gene names in text are often not exact matches to their official names. This is an advantage of manually annotating a corpus instead of relying on lists of official gene names for a gold standard – each entity in each context can be expertly evaluated and revised if necessary. The above examples illustrate some of the motivation behind the compilation of a list of acceptable alternative gene/protein names. In (1), many systems would probably not extract the entire entity, and would be penalized. A more flexible evaluation would be possible if it were recognized that "Sf3b4" and "Sf3b4, splicing factor 3b," are acceptable alternatives to the full form. It would also allow systems to delete the organism name in (2), as well as the fact that it refers to the promoter region. The acceptable alternatives are always subject to the specificity constraint so that the important parts of gene/protein names are preserved. In addition to the specificity constraint described here, we applied semantic constraints to define gene/protein entities.
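The effect of acceptable alternatives on evaluation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BioCreAtIvE scoring program: the data layout (each gold entity stored as a word-index span with a set of alternative spans), the function name `score_sentence` and the tokenization of the example are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (first word index, last word index), inclusive


def score_sentence(
    gold: Dict[Span, Set[Span]],
    predicted: List[Span],
) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    `gold` maps each gold-standard span to its acceptable alternative spans.
    A prediction is correct if it equals the gold span or one of its
    alternatives; each gold entity can be matched at most once.
    """
    matched: Set[Span] = set()
    tp = fp = 0
    for span in predicted:
        hit = None
        for gold_span, alternatives in gold.items():
            if gold_span in matched:
                continue
            if span == gold_span or span in alternatives:
                hit = gold_span
                break
        if hit is None:
            fp += 1  # arbitrary partial match or spurious entity
        else:
            matched.add(hit)
            tp += 1
    fn = len(gold) - len(matched)  # gold entities no prediction matched
    return tp, fp, fn


# Hypothetical tokenization of example (1):
# 0:Sf3b4 1:, 2:splicing 3:factor 4:3b 5:, 6:subunit 7:4
gold = {(0, 7): {(0, 0), (0, 4)}}
print(score_sentence(gold, [(0, 0)]))  # "Sf3b4" is an acceptable alternative
print(score_sentence(gold, [(3, 4)]))  # "factor 3b" is an arbitrary partial match
```

Under these assumptions the first call counts one true positive, while the second counts one false positive and one false negative, because the partial match is not among the listed alternatives.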
rabies immunoglobulin (RIG)
Tumor necrosis factor 1
GENETAG consists of 20K sentences that have been run through AbGene and manually annotated with gene and protein names (via a web interface) by experts in biochemistry, genetics and molecular biology. It is a heterogeneous set of sentences that contains many true positive gene names, and also many non-gene entities that are morphologically similar to gene names. There are approximately 24K instances of gene/protein names in the 20K sentences. 15K of the sentences were used in BioCreAtIvE-2004 Task 1A. Previous biomedical NER systems were difficult to compare because there were few large gene-tagged corpora available. Although GENETAG was not originally intended to be widely distributed, in releasing the corpus to the larger biomedical NER community through the BioCreAtIvE Evaluation, we hoped to stimulate interest in this area and provide a means to evaluate multiple systems on unrestricted biomedical text.
GENETAG annotation guidelines were designed to define true positive gene/protein names in terms of their specificity and semantics. Each sentence in GENETAG is annotated with acceptable alternatives to the gene/protein names it contains, allowing for partial matching with semantic constraints, a more meaningful measure of the performance of an NER system than unrestricted partial matching. This paper provides some background on the corpus including 1) sentence selection, 2) definition of a gene/protein name, 3) tokenization and partial matching and 4) tagging consistency.
GENETAG corpus statistics. The 20K sentences were split into four subsets called Train, Test, Round1 and Round2.
Number of Sentences
Number of Words
Number of Tagged Genes = G
Total Number of Alternative Forms of Gene Names in G
Number of Gene Names in G with Alternative Forms = N
Average Number of Alternatives per Gene Name in N
human alpha 1, 2-mannosidase
D. melanogaster Surf-3 / rpL7a
Inflammation has been inferred to play a major role in stimulating TGF-beta1 production since high concentrations of TGF-beta1 have been found in the lungs of patients with various diffuse inflammatory lung diseases.
RIP3-deficient cells showed normal sensitivity to a variety of apoptotic stimuli and were indistinguishable from wild-type cells in their ability to activate NF-kappa B signaling in response to the following: human tumor necrosis factor (TNF), which selectively engages mouse TNF receptor 1 ; cross-linking of the B- or T-cell antigen receptors; peptidoglycan, which activates Toll-like receptor 2 ; and lipopolysaccharide (LPS), which stimulates Toll-like receptor 4.
src homology 2 and 3
wild-type or mutant ICP34 . 5 promoters
Rab1B, -5, -7, -8, or -11A
alpha, beta, or gamma PKC
stress-activated protein kinase-Jun N-terminal kinase
tumor necrosis factor (TNF) receptor-associated factor (TRAF)
E2 RAD5 (UBC2)
Even in unambiguous cases, tagging inconsistencies can appear due to human error. In particular, the partial matching alternatives are sensitive to inconsistencies because the names and indices were input manually into a text box on an annotation web page (see Fig. 1).
We have described the GENETAG corpus of tagged gene/protein names in MEDLINE text which was used in BioCreAtIvE Task 1A. The corpus was designed to contain both true and false positive gene/protein names in a variety of contexts. Gene/protein names are defined widely, but are subject to specificity and semantic constraints. The annotation guidelines were designed with the goal of allowing flexible matching to the gold standard, while retaining the true meaning of the tagged entities. Arbitrary partial matches not corresponding to a complete and meaningful entity fail to meet the annotation guidelines and are scored as false positives and/or false negatives. A more detailed definition of a gene/protein name, as well as additional annotation rules, could improve interannotator agreement and help solve some of the tagging inconsistencies. Subtle tokenization issues exist in the corpus, and the requirement that the gold standard and test sets have the same tokenization is disadvantageous (see discussion in ). However, a positional approach is necessary to disambiguate sentences which contain adjoined, repeated and/or nested gene names, and for future NLP applications. A more robust approach would use character-based rather than word-based indices to allow for a wider diversity of tokenization.
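The difference between word-based and character-based indices can be illustrated with a short sketch. It assumes whitespace-separated tokens and character offsets that count non-space characters only, which is one possible convention for making offsets independent of tokenization; the function name and this convention are assumptions for illustration and are not taken from the GENETAG distribution.

```python
from typing import List, Tuple


def word_span_to_char_span(
    tokens: List[str], first: int, last: int
) -> Tuple[int, int]:
    """Convert an inclusive word-index span to an inclusive character span.

    Offsets count non-space characters only, so the result does not change
    when the same text is split into tokens differently.
    """
    if not 0 <= first <= last < len(tokens):
        raise ValueError("span out of range")
    start = sum(len(tok) for tok in tokens[:first])
    end = start + sum(len(tok) for tok in tokens[first:last + 1]) - 1
    return start, end


# The same text under two tokenizations; the span "ICP34.5" is chosen
# purely for illustration.
coarse = "wild-type or mutant ICP34.5 promoters".split()
fine = "wild-type or mutant ICP34 . 5 promoters".split()

print(word_span_to_char_span(coarse, 3, 3))  # (17, 23)
print(word_span_to_char_span(fine, 3, 5))    # (17, 23)
```

The word indices differ between the two tokenizations (3 to 3 versus 3 to 5), but the character span is identical, so a system response would not need to share the gold standard's tokenization to be compared against it.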
- Kim J-D, Ohta T, Tateisi Y, Tsujii J: GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 2003, (Suppl 1):i180–2. 10.1093/bioinformatics/btg1023
- MUC-7: Proceedings of the Seventh Message Understanding Conference (MUC-7): Defense Advanced Research Projects Agency. 1998. [http://www.itl.nist.gov/iaui/894.02/related_projects/muc/]
- Hatzivassiloglou V, Duboue PA, Rzhetsky A: Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 2001, (Suppl 1):S97–106.
- Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text. Bioinformatics 2002, 18: 1124–32. 10.1093/bioinformatics/18.8.1124
- Valencia A, Blaschke C, Hirschman L, Yeh A, Morgan A, Colosimo M, Colombe M: A critical assessment of text mining methods in molecular biology. 2004. [http://www.pdg.cnb.uam.es/BioLINK/workshop_BioCreative_04/handout/index.html]
- Langley P: Elements of Machine Learning. San Francisco, Morgan Kaufmann; 1996.
- Mitchell TM: Machine Learning. Boston, WCB/McGraw-Hill; 1996.
- Wilbur WJ: Boosting naive Bayesian learning on a large subset of MEDLINE. American Medical Informatics Annual Symposium 2000, 918–922.
- Marcus M, Santorini S, Marcinkiewicz M: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19: 313–330.
- Yeh A, Hirschman L, Morgan A, Colosimo M: BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2. 10.1186/1471-2105-6-S1-S2
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.