Cross-species gene normalization by species inference
© Wei and Kao; licensee BioMed Central Ltd. 2011
Published: 3 October 2011
To access and utilize the rich information contained in the biomedical literature, the ability to recognize and normalize gene mentions referenced in the literature is crucial. In this paper, we focus on improvements to the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous, in that they can refer to the genes of many species. Therefore, gene normalization is a difficult challenge.
We define “gene normalization” as a series of tasks involving several issues, including gene name recognition, species assignation and species-specific gene normalization. We propose an integrated method, GenNorm, consisting of three modules to handle the issues of this task. Every issue can affect overall performance, though the most important is species assignation. Clearly, correct identification of the species can decrease the ambiguity of orthologous genes.
In experiments, the proposed model attained the top-1 threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against 50 articles that had been selected for their difficulty and the most divergent results from pooled team submissions. In the silver-standard-507 evaluation, our TAP-k scores are 0.4591 for k=5, 10, and 20 and were ranked 2nd, 2nd, and 3rd respectively.
A web service and the input and output formats of GenNorm are available at http://ikmbio.csie.ncku.edu.tw/GN/.
In recent years, the amount of biological literature has increased rapidly. Text-mining techniques for extracting information from this work are not completely reliable. Extracting information on proteins automatically and precisely is both very important and difficult. Many methods have been developed, and they mainly address two tasks. Relation extraction identifies the relationships among biomedical entities in the literature. In extracting relations, each biomedical entity in an article, such as a gene, protein or disease, is mapped to its database identifier; this task is called named entity normalization. Normalization is particularly challenging given the high ambiguity of many entity names in biology and biomedicine, such as gene/protein names [2–7], species names [8–10], and chemical/compound names [11, 12].
The Critical Assessment of Information Extraction Systems in Biology (BioCreative), a renowned competition in the field of biological text mining, covers a variety of important issues. BioCreative III addressed three text-mining tasks in the domain of molecular biology: gene normalization (GN), protein-protein interactions (PPI), and an interactive demonstration task for gene indexing and retrieval (IAT). The GN task in BioCreative III was similar to the GN tasks in previous BioCreative competitions [4–6], in that the goal was to map genes or proteins mentioned in the literature to standard database identifiers. The GN tasks of BioCreative II.5 & III were more difficult than those in preceding challenges. In particular, species information was not provided, and the normalization targets were changed to full-text articles. The GN tasks of BioCreative I & II had been limited to specific species, such as human, fly, yeast and mouse, and extracted information from abstracts. These two differences resulted in a more challenging evaluation than in the previous competitions.
Many gene normalization studies have focused on GN tasks in which species information is provided. Hakenberg et al. developed an outstanding gene-name normalization system, winning the BioCreative II named entity normalization task. ProMiner is a well-known gene-name normalization system that employs a dictionary-based approach and relies on manual curation. GENO is a high-performing and efficient gene-name normalization system. It applies the TF-IDF weighting scheme and calculates semantic similarity scores to resolve ambiguous terms. All of these systems perform well, obtaining F-measures of 80%. Later, Hakenberg et al. developed a cross-species normalization system, GNAT, which considers 13 different species and obtains an F-measure of 81.4%.
The interactor normalization task (INT) of BioCreative II.5 did not provide species information and focused on full-text articles. Owing to these two difficult characteristics of the task, the normalization results might seem surprisingly low. Hakenberg et al. modified their previous work, including GNAT and a gene mention recognition system (BANNER), and obtained the highest precision (F-measure 43.4%). They disambiguated species and assigned candidate identifiers to protein mentions. Chen et al. developed the Biological Literature Miner (BioLMiner) system to handle the INT and the interaction pair task (IPT). Two of its subsystems are involved in the INT task: the gene mention recognizer (GMRer) and the gene normalizer (GNer). These two subsystems were developed based on a support vector machine (SVM) and a conditional random field with designed informative features. Verspoor et al. introduced an approach using fuzzy dictionary lookup to detect mentions of proteins. They also described several strategies for disambiguating the species associated with gene mentions; these strategies operated globally (throughout the document) and locally (in the immediate vicinity of a protein mention). Sætre et al. used many subcomponents to produce the AkaneRE system, which obtained the highest recall (68.3%). AkaneRE is provided by the U-Compare system, and it includes sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named-entity recognition, generation of potential relations, generation of features for each relation, and finally, assignment of confidence scores and ranking of candidate relations. Dai et al. used a three-stage normalization algorithm with a ranking method to handle this task. For interactor ranking, candidate identifiers are ranked by an SVM classifier with baseline features and template features.
The purpose of the GN task in BioCreative III is to produce a list of the Entrez Gene identifiers, across all species, for the gene mentions in full-text articles. This task is a complicated challenge involving three issues: gene-mention variation, orthologous gene ambiguity and intra-species gene ambiguity. Gene-mention variation occurs when a gene in a dictionary has multiple names. Gene names in the literature also show high variation, including orthographical variation (e.g., “TLR7” and “TLR-7”), morphological variation (e.g., “GHF-1 transcriptional factor” and “GHF-1 transcription factor”), syntactic variation, variation with abbreviations, and variation in enumeration (e.g., “TLR7/8” and “TLR7, TLR8”) [20, 21]. Orthologous genes usually belong to several species. To solve the GN task accurately, gene mentions must be assigned to the correct species and normalized to their own database identifiers. This is the most complicated step, and its difficulty arises from the extreme ambiguity of orthologous genes. For example, the ESR1 gene is associated with 22 Entrez Gene Ids, which belong to several different species. It is very difficult to normalize gene mentions that lack species information. Intra-species gene ambiguity can occur when different genes have the same name. For example, the name “CAS” may refer to multiple distinct identifiers, such as Entrez Gene Id:1434 (“Cellular apoptosis susceptibility protein”) or Entrez Gene Id:9564 (“Breast cancer anti-estrogen resistance 1”).
In this study, we propose an integrative method, GenNorm, to handle the three issues of the GN task. Our approach uses three modules: the gene name recognition (GNR) module, the species assignation (SA) module and the species-specific gene normalization (SGN) module. Good GNR processing can ensure high-quality gene mentions and reduce gene mention variation. SA is critical: given a gene mention, it is essential to know to which species the gene belongs. Species information can help identify orthologous genes and help resolve the species discussed in each article. Ignoring SA can lead to severe problems with gene ambiguity. SGN is the last module and is also significant for GN; it is responsible for resolving intra-species gene ambiguity.
Gene name recognition module
Our system uses the named-entity tagging tool AIIA-GMT. This tool is an XML-RPC client of a web server that recognizes named entities in biomedical articles. A general entity recognition process cannot collect sufficient information for entity normalization. Such systems usually treat lightly or ignore informal, simplified naming descriptions, such as abbreviated names, enumerated mentions, and names with conjunctions. Thus, we propose a post-processing step that enhances the ability of general-purpose recognition systems.
Several steps of post-processing of the GNR module.
Roman → Arabic: I, II, III → “1, 2, 3”
Greek → English: alpha, beta, gamma → “a”, “b”, “g”
Split entity by and/or: GABARAP and light chain 3 → “GABARAP” and “light chain 3”; furin or proprotein convertase → “furin” and “proprotein convertase”
Enumeration → gene mentions: Robo 1/2 → “Robo 1” and “Robo 2”; SMADs 1, 5 and 8 → “SMAD 1”, “SMAD 5” and “SMAD 8”
Split entity by parentheses: fibroblast growth factor-2 (FGF-2)-interacting-factor → “FGF-2” and “fibroblast growth factor 2 interacting factor”; gamma carboxyglutamic acid (Gla) → “gamma carboxyglutamic acid” and “Gla”
The first rule concerns number types. Numbers of different types used to denote subunits, e.g., Roman, Arabic and Latin, are unified. Second, entities with conjunctions are split: sometimes two or more gene names are combined into one mention by conjunctions, and such a mention is split into several mentions. In the enumeration step, we extract the Arabic and Roman numbers by splitting at conjunctions and combine the shared family name with each number. An enumeration entity with sequential numbers represents several gene names belonging to the same family, so the entity is separated into several gene names sharing that family name. In the last rule, abbreviations in the parentheses of a gene entity are isolated. The words “protein(s)” or “gene(s)” are then removed from the mentions, e.g., “MURF 3 protein” becomes “MURF 3”.
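The post-processing rules can be illustrated with a minimal Python sketch. It is not the GenNorm implementation: the function names, the regular expressions and the tiny symbol maps are assumptions chosen only to reproduce the examples listed for these rules, and the enumeration rule here handles Arabic numbers only.

```python
import re

# Only the symbols shown in the examples; a real mapping would be larger (assumption).
ROMAN_TO_ARABIC = {"I": "1", "II": "2", "III": "3"}
GREEK_TO_LATIN = {"alpha": "a", "beta": "b", "gamma": "g"}

# Family name, optional plural "s", then numbers joined by "/", ",", "and" or "or".
ENUMERATION = re.compile(
    r"(?P<family>.+?)s?\s+(?P<numbers>\d+(?:\s*(?:/|,|and|or)\s*\d+)+)"
)

def unify_symbols(text):
    """Rule 1: Roman numerals -> Arabic, Greek letter names -> Latin letters."""
    text = re.sub(r"\b(III|II|I)\b", lambda m: ROMAN_TO_ARABIC[m.group(1)], text)
    return re.sub(r"\b(alpha|beta|gamma)\b",
                  lambda m: GREEK_TO_LATIN[m.group(1).lower()],
                  text, flags=re.IGNORECASE)

def split_conjunction(mention):
    """Rule 2: split a mention joined by 'and' / 'or' into separate mentions."""
    return [p.strip() for p in re.split(r"\s+(?:and|or)\s+", mention) if p.strip()]

def expand_enumeration(mention):
    """Rule 3: combine the shared family name with each enumerated number."""
    m = ENUMERATION.fullmatch(mention)
    if not m:
        return [mention]
    return [f"{m.group('family')} {n}" for n in re.findall(r"\d+", m.group("numbers"))]

def split_parentheses(mention):
    """Rule 4: isolate the abbreviation in parentheses from the long form."""
    m = re.search(r"\(([^()]+)\)", mention)
    if not m:
        return [mention]
    long_form = mention[:m.start()] + " " + mention[m.end():]
    long_form = " ".join(long_form.replace("-", " ").split())
    return [long_form, m.group(1).strip()]

def strip_generic_suffix(mention):
    """Remove a trailing 'protein(s)' or 'gene(s)' from a mention."""
    return re.sub(r"\s+(?:proteins?|genes?)$", "", mention, flags=re.IGNORECASE)
```

For example, `expand_enumeration("SMADs 1, 5 and 8")` yields the three mentions "SMAD 1", "SMAD 5" and "SMAD 8". Enumerations should be expanded before conjunctions are split, since an enumeration may itself contain "and".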
After the post-processing, the method applies a distillation strategy to prevent false tagging. This step removes protein family, group and complex names, as well as non-gene terms, from the tagged entities. In our observations on the training data, filtering the protein family, group and complex names improved precision at the cost of recall. The F-measure therefore did not change, but the TAP-k score improved. Thus, these equivocal terms are filtered in this strategy.
Regular expressions used in the distillation strategy
Lastly, the extracted entities are dumped into a “bag of words,” split at punctuation, symbols, and spaces. For example, “Hypoxia-inducible factor-1 alpha” is split into “Hypoxia”, “inducible”, “factor”, “1” and “alpha”, each of which is stored in the bag-of-words list.
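As a small illustration of this splitting step, the following Python sketch breaks an entity at every run of non-alphanumeric characters. The function name and the exact character class are assumptions; the original system may treat some symbols differently.

```python
import re

def bag_of_words(entity):
    """Split an entity at punctuation, symbols and spaces, dropping empty pieces."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", entity) if token]

print(bag_of_words("Hypoxia-inducible factor-1 alpha"))
# ['Hypoxia', 'inducible', 'factor', '1', 'alpha']
```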
Species assignation module
To assign a suitable species to each gene mention, the first step is to extract species mentions from articles. We propose two robust inference strategies combined with a dictionary-based matching method. The species name collection aggregates three different species name lexicons: the NCBI Taxonomy, the list of cell lines from Wikipedia and the corpus of Linnaeus. Some species do not map to any Entrez Gene Id; for example, Escherichia (tax_id:561) does not map to any Entrez Gene Id. Such species are removed. The original lexicons contain 570,679 species; after filtering, the combined lexicon contains 6,764 species.
Synonyms of every species in the lexicon are used to detect species names using dictionary-based matching. A species may have a variety of abbreviated names, e.g., “Escherichia coli strain k-12” is the same as “E.coli k-12” and “E. coli k-12”. Some species have an especially large number of synonyms. To handle this case, a dictionary extension strategy is used. We automatically generate additional synonyms from each species name by replacing the genus name with its first letter and a dot, optionally followed by a white space, e.g., “E. coli”. These species synonyms are then added if they do not already occur in the collection of lexicons. All uppercase letters are then changed to lowercase. For example, Taxonomy ID (tax_id): 83333 contains several synonyms, including “escherichia coli k-12”, “escherichia coli k12”, “e.coli k-12” and “e. coli k-12”.
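The dictionary extension can be sketched as follows in Python. This is only an illustration of the idea under simple assumptions: the function names are hypothetical, the genus is taken to be the first whitespace-delimited word, and names that already look abbreviated are skipped.

```python
def abbreviated_synonyms(name):
    """Replace the genus with its initial plus a dot, with and without a space."""
    name = name.lower()
    genus, _, rest = name.partition(" ")
    if not rest or "." in genus:
        return set()  # single-word name, or genus already abbreviated
    initial = genus[0]
    return {f"{initial}.{rest}", f"{initial}. {rest}"}

def extend_lexicon(synonyms):
    """Lowercase all names and add the generated abbreviations if missing."""
    extended = {s.lower() for s in synonyms}
    for name in list(extended):
        extended |= abbreviated_synonyms(name)
    return extended

names = extend_lexicon({"Escherichia coli K-12", "Escherichia coli K12"})
print(sorted(names))
```

Because the result is a set, an abbreviation that already exists in the lexicon is not duplicated.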
Checking all synonyms of a species is a time-consuming task. To enhance the computation speed, all synonymous names of a species are integrated and transformed into a regular expression automatically. In other words, the regular expression of a tax_id contains all existing names of the species. For the given example, tax_id:83333 would lead to the expression “e(?:\. ?coli k\-12|scherichia coli k\-?12)”.
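A minimal Python sketch of this idea is shown below. It builds a plain alternation of escaped synonyms rather than the factored expression given above, so the pattern text differs although it accepts the same names. The function names and the word-boundary anchors are assumptions.

```python
import re

def compile_species_pattern(synonyms):
    """Combine all synonyms of one tax_id into a single compiled alternation."""
    # Longest names first, so that a longer synonym wins over its own prefix.
    ordered = sorted({s.lower() for s in synonyms}, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b")

def find_species(text, patterns):
    """Return (tax_id, matched name) pairs found in the lowercased text."""
    lowered = text.lower()
    return [(tax_id, m.group(0))
            for tax_id, pattern in patterns.items()
            for m in pattern.finditer(lowered)]

patterns = {
    "83333": compile_species_pattern(
        ["escherichia coli k-12", "escherichia coli k12", "e.coli k-12", "e. coli k-12"]
    )
}
print(find_species("Plasmids were grown in E. coli K-12 cells.", patterns))
# [('83333', 'e. coli k-12')]
```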
The NCBI Taxonomy is a hierarchical structure of species types. Our collection only includes four types: “species”, “no rank”, “subspecies” and “variants”. Additionally, we add the genus, taken from the first word of each species’ scientific name, to the species dictionary. These genus names will be used to infer the correct tax_id, as will be discussed below. We set the priorities for the disambiguation of species names as “species” > “no rank” > “subspecies” > “variants” > “genus name”, where a type on the left has higher priority than a type on its right. When two species have the same name, the lower-priority one is eliminated.
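The priority rule can be expressed as a short Python sketch. The data layout (name, tax_id, type) and the function name are assumptions, and the entries in the usage example are hypothetical placeholders rather than real taxonomy records.

```python
# Priority order stated in the text, highest priority first.
PRIORITY = ["species", "no rank", "subspecies", "variants", "genus name"]

def resolve_name_conflicts(entries):
    """Keep, for each name, only the tax_id whose type has the highest priority.

    entries: iterable of (name, tax_id, type) tuples.
    """
    best = {}
    for name, tax_id, rank_type in entries:
        rank = PRIORITY.index(rank_type)
        if name not in best or rank < best[name][1]:
            best[name] = (tax_id, rank)
    return {name: tax_id for name, (tax_id, _) in best.items()}

# Hypothetical entries sharing one name.
entries = [
    ("examplename", "tax-A", "subspecies"),
    ("examplename", "tax-B", "species"),
    ("othername", "tax-C", "genus name"),
]
print(resolve_name_conflicts(entries))
# {'examplename': 'tax-B', 'othername': 'tax-C'}
```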
Species family of Escherichia coli (tax_id:562) including 45 species.
Species scientific name
escherichia coli cft073
escherichia coli str. k-12 substr. dh10b
escherichia coli etec h10407
escherichia coli e24377a
escherichia coli hs
escherichia coli 53638
escherichia coli b str. rel606
escherichia coli se15
escherichia coli k-12
escherichia coli str. k-12 substr. mg1655
Types of species mentions extracted by the two inference strategies.
Number of species mentions
General species name
3a (215167), t7 (10760), cat (9685), ass (9793), j1 (1829)
yeast two hybrid (4932), yeast 2 hybrid (4932)
anti-, antibodies, polyclonal, igg, serum, monoclonal, antibody
There are 5,933,419 Entrez Gene Ids belonging to more than 6,000 species. Genes are ambiguous among the other genes of the same species and among orthologs. Therefore, we developed a rule-based species assignation strategy to reduce orthologous gene ambiguity.
Species indicators of the species assignation strategy, adapted from Wang et al.
Rule: The gene is assigned a species when the gene is extracted by gene identifier extraction.
Example: A bar for Cat8 is not included, as only one gene (YOR019W, factor change 2.1) has a Cat8 site alone.
Rule: The first lowercase letter of the gene name is an abbreviation of its species.
Example: h ZIP 2 (human gene)
Rule: The tax_id is assigned to a gene entity if the species entity appears in front of the gene entity.
Rule: The tax_id is assigned to a gene entity if the species entity is in front of the gene entity in the same sentence. The nearest species is used for assignment.
Example: Conversely, ISWI is a unique and essential gene in Drosophila, highlighting a possible divergent role for ISWI in flies and a distinct mechanism of interaction with the Sin3A/Rpd3 complex in higher eukaryotes.
Rule: The tax_id is assigned to a gene entity if the species entity follows the gene entity in the same sentence. The nearest species is used for assignment.
Example: We identify the shady gene as encoding a cell signaling receptor, leukocyte tyrosine kinase (Ltk), that has recently been associated with human auto-immune disease.
Rule: The most frequently mentioned tax_id is assigned to the gene entity if it cannot be assigned by previous rules.
Example: In PMCID: 2880583, the most frequently mentioned tax_id is 9606 (human), as shown in Table 7.
An example of the frequency of species mentions by PMCID: 2880583.
Species and cell name
293(1), a549(34), human(6)
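The rule cascade can be sketched in Python as follows. This is a simplified illustration, not the GenNorm code: it omits the identifier rule, folds the rule for a species directly in front of the gene into the nearest-preceding rule, and assumes that mentions carry a sentence index and a character offset. The `Mention` class, the function name and the prefix map are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    sentence: int      # index of the sentence containing the mention
    offset: int        # character offset inside that sentence
    tax_id: str = ""   # set for species mentions only

# Assumed prefix map: 9606 is human, 10090 is mouse.
PREFIX_TO_TAX = {"h": "9606", "m": "10090"}

def assign_species(gene: Mention, species: List[Mention]) -> Optional[str]:
    # Prefix rule: a leading lowercase letter before an uppercase name, e.g. "hZIP2".
    # Note that this naive test would also fire on names such as "mTOR".
    m = re.match(r"([a-z])\s?[A-Z]", gene.text)
    if m and m.group(1) in PREFIX_TO_TAX:
        return PREFIX_TO_TAX[m.group(1)]
    same_sentence = [s for s in species if s.sentence == gene.sentence]
    # Nearest species in front of the gene in the same sentence.
    before = [s for s in same_sentence if s.offset < gene.offset]
    if before:
        return max(before, key=lambda s: s.offset).tax_id
    # Nearest species after the gene in the same sentence.
    after = [s for s in same_sentence if s.offset > gene.offset]
    if after:
        return min(after, key=lambda s: s.offset).tax_id
    # Majority rule: the most frequently mentioned tax_id in the article.
    counts = Counter(s.tax_id for s in species)
    return counts.most_common(1)[0][0] if counts else None

species_mentions = [
    Mention("human", 0, 50, "9606"),
    Mention("a549", 1, 5, "9606"),
    Mention("drosophila", 2, 40, "7227"),
]
print(assign_species(Mention("Ltk", 0, 10), species_mentions))   # 9606
print(assign_species(Mention("ISWI", 2, 60), species_mentions))  # 7227
print(assign_species(Mention("ESR1", 9, 0), species_mentions))   # 9606
```

Counting cell-line mentions such as a549 under the human tax_id follows the frequency example given for PMCID: 2880583.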
Species-specific gene normalization module
After species assignation, the proposed SGN module, based on previous work, measures the inference scores of candidate Entrez Gene Ids in articles. The previous study applied an inference network model to the GN task, but it did not handle orthologous gene ambiguity. We applied the inference network model to collect the exact matches and partial matches between tagged entities from articles and gene name entities from the gene dictionary. Inference by exact match is called entity inference, and inference by partial match is called bag-of-words inference. Exact match means that the tagged entity and the gene name entity must be the same term, and partial match means that at least one word must be shared between the bag of words of a tagged entity and that of a gene name entity. The original model applied entities and bags of words to the same estimation and gave them equal weight, and thus could not distinguish the relative importance of entities and of words from the bags of words. This design disadvantaged GN performance. We reorganized the previous design by splitting the inference network model into two inference estimations.
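The distinction between exact and partial match can be illustrated with a small Python sketch. It only collects the matched terms per candidate identifier and does not reproduce the inference scores. The function names, the dictionary layout and the case-insensitive comparison are assumptions; the two dictionary entries reuse the “CAS” example from the text.

```python
import re

def bag_of_words(name):
    """Lowercased set of tokens split at punctuation, symbols and spaces."""
    return {t for t in re.split(r"[^0-9a-z]+", name.lower()) if t}

def collect_matches(tagged_entities, gene_dictionary):
    """Collect exact (entity) and partial (bag-of-words) matches per identifier.

    gene_dictionary: {gene_id: [gene names]}.
    """
    matches = {}
    for gene_id, names in gene_dictionary.items():
        exact, partial = set(), set()
        for entity in tagged_entities:
            for name in names:
                if entity.lower() == name.lower():
                    exact.add(entity)              # whole terms are identical
                else:
                    # at least one shared word counts as a partial match
                    partial |= bag_of_words(entity) & bag_of_words(name)
        if exact or partial:
            matches[gene_id] = {"exact": exact, "partial": partial}
    return matches

gene_dictionary = {
    "1434": ["CAS", "Cellular apoptosis susceptibility protein"],
    "9564": ["CAS", "Breast cancer anti-estrogen resistance 1"],
}
result = collect_matches(["CAS", "apoptosis susceptibility"], gene_dictionary)
print(result["1434"]["exact"], result["1434"]["partial"])
print(result["9564"]["exact"], result["9564"]["partial"])
```

Keeping the two kinds of evidence in separate sets mirrors the idea of estimating them separately instead of giving entities and single words equal weight.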
To disambiguate confusable candidate identifiers (Cids) and decrease the computation cost, we discard many irrelevant Cids with our intersection filtering method. All pairs of Cids are compared, considering for each Cid all terms extracted by exact and partial match. A Cid is removed when its term list is a subset of the term list of another Cid.
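A minimal Python sketch of the intersection filter follows. The function name and data layout are assumptions. The sketch also assumes a proper-subset test, so that two Cids with identical term lists do not eliminate each other; the text does not state how that case is handled. The third identifier in the usage example is a hypothetical placeholder.

```python
def intersection_filter(term_lists):
    """Drop every Cid whose term set is a proper subset of another Cid's term set.

    term_lists: {cid: set of terms matched for that cid}.
    """
    kept = {}
    for cid, terms in term_lists.items():
        dominated = any(
            other != cid and terms < other_terms   # '<' is the proper-subset test
            for other, other_terms in term_lists.items()
        )
        if not dominated:
            kept[cid] = terms
    return kept

candidates = {
    "1434": {"cas", "apoptosis", "susceptibility"},
    "9564": {"cas"},                       # subset of 1434, so it is removed
    "hypothetical-id": {"cas", "kinase"},  # not a subset of any other, so it is kept
}
print(sorted(intersection_filter(candidates)))
# ['1434', 'hypothetical-id']
```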
Statistics of annotations of full-text articles
Training (fully annotated)
Training (partially annotated)
Test (gold standard)
Test (silver standard)
Total number of full-text articles
Total gene IDs of each annotation
Avg. gene IDs of each annotation
Performance on the gene normalization task by the top 4 performing teams in the BioCreative III competition
Gold standard (50 selected articles)
Silver standard (50 selected articles)
Silver standard (All 507 articles)
Kuo et al. (Team 74)
Our method (Team 83)
Liu et al. (Team 98)
Lai et al. (Team 101)
The annotation of the silver standard depends on the submissions of all teams. Because the best submission might be dissimilar to other teams’ submissions, its relative performance on the silver standard might suffer. Nevertheless, our two best runs on the gold standard are still among those with the highest TAP-k scores on the silver standard. In addition, the four teams (numbers 83, 74, 98 and 101) that performed best on the gold standard consistently remained in the top tier in the silver-standard evaluations. Relative rankings evidently tended to be preserved in this comparison. Evaluation with the silver-standard annotation indicates that the automatic annotation works.
Performance statistics evaluated by TAP-k and F-measure on the test and training data sets.
Test data (1st run)
50 (gold standard)
Test data (2nd run)
50 (gold standard)
Test data (3rd run)
50 (gold standard)
Test data (1st run)
50 (silver standard)
Test data (2nd run)
50 (silver standard)
Test data (3rd run)
50 (silver standard)
Test data (1st run)
Test data (2nd run)
Test data (3rd run)
32 (gold standard)
Contribution of each component of modules
Post-processing of AIIA-GMT 
Two robust inference strategies
Intersection filtering method
In the following section, we first discuss the impact of the GNR module in the context of three extension components of the gene mention recognition system (AIIA-GMT). We then describe the impact of the two robust inference strategies, the filtering process and five assignation rules of the SA module (because identifier extraction is repeated in the GNR module, we do not perform the same experiment again). Finally, we analyze the performance of the SGN module in terms of the intersection filtering method and the inference network.
First, we found that the post-processing step is a useful component of the GNR module. If the output of AIIA-GMT were not followed by post-processing, performance would clearly decrease. The most effective component is identifier extraction; several popular and high-performance tagging systems, such as AIIA-GMT, ABNER, and GENIA Tagger, usually cannot recognize gene identifiers well.
The three rows under the SA module (Table 11) show the performance when one of three components (the two robust inference strategies, the filtering process, or the six species assignation rules) was unused. The disambiguation of species mentions is critical, which is why we designed two robust inference strategies to handle it. Using only one of the two robust inference strategies caused a decline in the performance of the species assignation rules and an approximately 6%-7% decrease in TAP-k scores. The effect of the two robust inference strategies is not obvious; the ratio of species mentions that are unguaranteed species names (i.e., genus and cell line) and sub-types is just 11.3% (Table 4). The filtering of the false-positive set is very useful. The next most helpful indicators of species assignation are the “Forward indicator” and the “Majority indicator”. For the “Majority indicator” in particular, the detection of the species covered in the article is an important issue [8, 10]. The Majority indicator is the most commonly used. Furthermore, the precise extraction of species mentions leads to good performance of the Majority indicator.
Finally, the SGN module includes two major parts: intersection filtering and inference estimation. The contribution of the intersection filtering method is to enhance computing speed, and thus the effect of this method on performance is not remarkable. To evaluate the contribution of the inference network model of two estimations, we replaced this model by a simple vector space model. The performance decreased substantially.
We evaluated our method using the full-text articles provided in the BioCreative III competition. Our best result had TAP-k scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) under the gold-standard evaluation.
Our method approaches the challenges of GN as a series of tasks, with several issues handled by respective modules. Disadvantageous design choices in any module may lead to a decline in overall performance. The major goal of this work is to present the architecture of our method in a clear way and analyze the effective components, which we do through systematic removals. This analysis can help in the redesign of each segment to create a better system.
The GN task of BioCreative III was more difficult than previous GN tasks, and the chief reason is orthologous gene ambiguity. In this study, we focused on the issue of gene normalization in species assignation and developed an integrated method for mapping a biomedical entity to the correct Entrez Gene Id. To obtain good performance, we focused on ameliorating the effects of gene mention variation, orthologous gene ambiguity and intra-species gene ambiguity. The integrated method consists of three modules, GNR, SA and SGN, which function serially to handle these three issues. We participated in the GN task of the BioCreative III competition by adopting an integrated method based on our previous work to handle intra-species gene ambiguity. Results demonstrated that our method worked well, ranking at the top level of performance among all teams. Our proposed method makes sufficient use of gene/species information in context and of a thesaurus of gene/species.
Nonetheless, the current, state-of-the-art performance on the GN task is not good enough. The mining of full-text articles and cross-species normalization are big challenges for GN. To improve future performance, the contexts of articles will be used, e.g., chromosomal locations, families, functions.
This research was partially supported by the BioCreative III workshop and the CNIO institute. The authors would like to thank Zhiyong Lu of the BioCreative GN task for his patience in responding to myriad questions about the evaluation. They would also like to thank Chun-Nan Hsu, who provided the AIIA-GMT system for GNR.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 8, 2011: The Third BioCreative – Critical Assessment of Information Extraction in Biology Challenge. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S8.
- Alex B, Grover C, Haddow B, Kabadjov M, Klein E, Matthews M, Roebuck S, Tobin R, Wang X: Assisted curation: does text mining really help? Pac Symp Biocomput 2008, 556–567.
- Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(ECCB):i126-i132.
- Heinz JF, Mevissen T, Dach H, Oster M, Hofmann-Apitius M: ProMiner: Recognition of Human Gene and Protein Names using regularly updated Dictionaries. The Second BioCreative Challenge Evaluation Workshop 2007, 149–151.
- Hirschman L, Colosimo M, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics 2005, 6(Suppl 1):S11. doi:10.1186/1471-2105-6-S1-S11
- Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An Overview of BioCreative II.5. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):385–399.
- Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, et al.: Overview of BioCreative II gene normalization. Genome Biology 2008, 9(Suppl 2):S3. doi:10.1186/gb-2008-9-s2-s3
- Wermter J, Tomanek K, Hahn U: High-Performance Gene Name Normalization with GENO. Bioinformatics 2009.
- Wang X, Tsujii Ji, Ananiadou S: Disambiguating the Species of Biomedical Named Entities using Natural Language Parsers. Bioinformatics 2010, 26(5):661–667. doi:10.1093/bioinformatics/btq002
- Gerner M, Nenadic G, Bergman CM: LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 2010, 11:85.
- Kappeler T, Kaljurand K, Rinaldi F: TX Task: Automatic Detection of Focus Organisms in Biomedical Publications. Proceedings of the Workshop on BioNLP 2009, 80–88.
- Klinger R, Kolarik C, Fluck J, Hofmann-Apitius M, Friedrich CM: Detection of IUPAC and IUPAC-like chemical names. Bioinformatics 2008, 24(ISMB2008):i268-i276.
- Corbett P, Batchelor C, Teufel S: Annotation of Chemical Named Entities. BioNLP 2007: Biological, translational, and clinical language processing 2007, 57–64.
- Hakenberg J, Plake C, Royer L, Strobelt H, Leser U, Schroeder M: Gene mention normalization and interaction extraction with context models and sentence motifs. Genome Biology 2008, 9(Suppl 2):S14. doi:10.1186/gb-2008-9-s2-s14
- Hakenberg J, Leaman R, Vo NH, Jonnalagadda S, Miller RSC, Tari L, Baral C, Gonzalez G: Efficient Extraction of Protein-Protein Interactions from Full-Text Articles. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):481–494.
- Chen Y, Liu F, Manderick B: BioLMiner System: Interaction Normalization Task and Interaction Pair Task in the BioCreative II.5 Challenge. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):428–441.
- Verspoor K, Roeder C, Johnson HL, Cohen KB, Baumgartner WA Jr., Hunter LE: Exploring Species-Based Strategies for Gene Normalization. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):462–471.
- Saetre R, Yoshida K, Miwa M, Matsuzaki T, Kano Y, Tsujii Ji: Extracting Protein Interactions from Text with the Unified AkaneRE Event Extraction System. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):442–453.
- Dai HJ, Lai PT, Tsai RTH: Multistage Gene Normalization and SVM-Based Ranking for Protein Interactor Extraction in Full-Text Articles. IEEE/ACM Transactions On Computational Biology And Bioinformatics 2010, 7(3):412–420.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research 2006, 00(Database issue):D1-D6.
- Hirschman L, Morgan AA, Yeh AS: Rutabaga by any other name: extracting biological names. J of Biomedical Informatics 2002, 35(4):247–259. doi:10.1016/S1532-0464(03)00014-5
- Tuason O, Chen L, Liu H, Blake JA, Friedman C: Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity. Proc Pacific Symp on Biocomputing 2004, 238–249.
- Hsu CN, Chang YM, Kuo CJ, Lin YS, Huang HS, Chung IF: Integrating High Dimensional Bi-directional Parsing Models for Gene Mention Tagging. Bioinformatics 2008, 24(ISMB2008):i286-i294.
- Wei CH, Huang IC, Hsu YY, Kao HY: Normalizing Biomedical Name Entities by Similarity-Based Inference Network and De-ambiguity Mining. Ninth IEEE International Conference on Bioinformatics and Bioengineering Workshop: Semantic Biomedical Computing; Taichung, Taiwan 2009, 461–466.
- Lu Z, Wilbur WJ: Overview of BioCreative III Gene Normalization. In BioCreative III Workshop. Bethesda, Maryland; 2010.
- Carroll HD, Kann MG, Sheetlin SL, Spouge JL: Threshold Average Precision (TAP-k): A Measure of Retrieval Designed for Bioinformatics. Bioinformatics 2010, 26(14):1708–1713. doi:10.1093/bioinformatics/btq270
- Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 2005, 21(14):3191–3192. doi:10.1093/bioinformatics/bti475
- Kim JD, Ohta T, Tateisi Y, Tsujii J: GENIA corpus-a semantically annotated corpus for bio-textmining. Bioinformatics 2003, 19(Suppl. 1):i180-i182.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.